Templates are used to define chemical spaces using easily calculable properties

Overview

Templates are a core concept in MRL used to define chemical spaces. A template collects a series of molecular heuristics and validates whether a molecule meets those criteria. For example:

Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass

Templates can also be used to assign a score for meeting heuristic criteria. This allows us to define different criteria for must-have molecular properties versus nice-to-have chemical properties. In a reinforcement learning context, this translates into giving a score bonus to molecules that fit the nice-to-have criteria. Scores can also be negative to allow for penalizing a molecule that still passes the must-have criteria.

Must Have:
Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass

Nice To Have:
Molecular weight: 350-400 (+1)
TPSA: Less than 80 (+1)
Substructure Match: '[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1' (+3)
Substructure Match: '[#6]1:[#6]:[#7]:[#7]:[#7]:[#6]:1' (-1)

Based on the above criteria, a molecule that passes the must-have criteria could get a score between -1 and +5 based on meeting the nice-to-have criteria.
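
The score range follows from simple arithmetic: the lowest possible score is the sum of the penalties and the highest is the sum of the bonuses. A minimal plain-Python sketch of that calculation is shown here. The nice_to_have dictionary and the score_range helper are illustrative names only and are not part of the MRL API.

```python
# Illustrative only: a plain dictionary standing in for the nice-to-have list.
# Keys are descriptions, values are the score awarded when the criterion is met.
nice_to_have = {
    "Molecular weight: 350-400": 1,
    "TPSA: Less than 80": 1,
    "Substructure Match: [#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1": 3,
    "Substructure Match: [#6]1:[#6]:[#7]:[#7]:[#7]:[#6]:1": -1,
}


def score_range(criteria):
    """Return (lowest, highest) total score over all subsets of met criteria."""
    lowest = sum(s for s in criteria.values() if s < 0)   # only penalties met
    highest = sum(s for s in criteria.values() if s > 0)  # only bonuses met
    return lowest, highest


print(score_range(nice_to_have))  # (-1, 5)
```

This treats every criterion as independent. Whether a real molecule can actually hit each extreme depends on the chemistry.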

Templates are instantiated through the Template class. A template is a collection of filters, created through the Filter class. See Filter for more details on defining filter functions.

Templates contain two sets of filters - hard filters and soft filters. Hard filters contain the must-have criteria for a molecule, while the soft filters contain the nice-to-have criteria.

During model training, a generative model creates a batch of compounds. These compounds are first screened against the hard filters. Compounds that fail the hard filters can then be excluded from the training batch or assigned a default failure score. Compounds that pass the hard filters are then scored using the soft filters. Soft filters can provide a small score bonus or penalty for a molecule in addition to the main score function.
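
The screening flow described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not MRL's implementation: molecules are represented as dictionaries of precomputed properties, filters are ordinary functions, and score_batch is a hypothetical helper name.

```python
from typing import Callable, Dict, List

Molecule = Dict[str, float]  # assumption: precomputed properties per molecule


def score_batch(
    molecules: List[Molecule],
    hard_filters: List[Callable[[Molecule], bool]],
    soft_filters: List[Callable[[Molecule], float]],
    fail_score: float = 0.0,
) -> List[float]:
    """Screen with hard filters first, then sum soft filter scores for passes."""
    scores = []
    for mol in molecules:
        if all(f(mol) for f in hard_filters):
            # Passed every must-have criterion: add up bonuses and penalties
            scores.append(sum(f(mol) for f in soft_filters))
        else:
            # Failed a must-have criterion: assign the default failure score
            scores.append(fail_score)
    return scores


batch = [
    {"mol_wt": 380.0, "rot_bonds": 5, "tpsa": 70.0},
    {"mol_wt": 520.0, "rot_bonds": 5, "tpsa": 70.0},
    {"mol_wt": 300.0, "rot_bonds": 3, "tpsa": 95.0},
]
hard = [lambda m: 250 <= m["mol_wt"] <= 450, lambda m: m["rot_bonds"] < 8]
soft = [
    lambda m: 1.0 if 350 <= m["mol_wt"] <= 400 else 0.0,
    lambda m: 1.0 if m["tpsa"] < 80 else 0.0,
]

scores = score_batch(batch, hard, soft, fail_score=-1.0)
print(scores)  # [2.0, -1.0, 0.0]
```

In a training loop, these values would be added to the main score function. The alternative mentioned above, dropping failed compounds from the batch, would replace the else branch with a skip.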

Soft filters incentivise a generative model to maximize the soft filter conditions without making them a hard requirement. This allows soft filters to be highly targeted towards narrow property ranges or highly specific substructures. If these highly targeted criteria were set as hard filters, they might invalidate too many compounds and cause the model to struggle during training.

Templates

The Template class holds a collection of hard filters and soft filters and manages screening a molecule against those filters. With logging enabled (log=True), templates record the filter results in detail, allowing you to inspect which filters a molecule failed.

Templates can be merged by adding them together, i.e. new_template = template1 + template2. Adding two templates merges the hard filters and the soft filters of each template.
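
One way to picture the merge is as list concatenation inside an __add__ method. The ToyTemplate class below is a hypothetical stand-in written for illustration. How the real Template class handles details such as duplicate filters or logged data on merge is not covered here.

```python
class ToyTemplate:
    """Hypothetical stand-in for a template: just two lists of filters."""

    def __init__(self, hard_filters, soft_filters=None):
        self.hard_filters = list(hard_filters)
        self.soft_filters = list(soft_filters or [])

    def __add__(self, other):
        # Merging concatenates the hard lists and the soft lists separately
        return ToyTemplate(
            self.hard_filters + other.hard_filters,
            self.soft_filters + other.soft_filters,
        )


# Strings stand in for filter objects to keep the sketch self-contained
t1 = ToyTemplate(["MolWt <= 500"], ["TPSA <= 110 (+1)"])
t2 = ToyTemplate(["HBD <= 5", "LogP <= 5"])

merged = t1 + t2
print(merged.hard_filters)  # ['MolWt <= 500', 'HBD <= 5', 'LogP <= 5']
print(merged.soft_filters)  # ['TPSA <= 110 (+1)']
```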

When logging is enabled, a template logs all molecules passed through the hard and soft filters to an internal dataframe and builds a lookup table of {smile : final_score}. The first time a molecule is screened, it is added to the internal dataframe and lookup table. If that molecule is seen again (as identified by its SMILES string), the lookup value is returned. Passing log=False disables all logging and lookup, while use_lookup=False keeps logging but avoids using the lookup table. Note that the constructor signature below lists both log and use_lookup as False by default.
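
The lookup behavior can be pictured as a dictionary keyed on the SMILES string. The sketch below is a simplified illustration, not the library's code: CachedScorer and its attributes are hypothetical names, and the scoring function is a placeholder.

```python
class CachedScorer:
    """Hypothetical wrapper that caches scores by SMILES string."""

    def __init__(self, score_fn, use_lookup=True):
        self.score_fn = score_fn
        self.use_lookup = use_lookup
        self.lookup = {}  # {smile: final_score}
        self.calls = 0    # number of times score_fn was actually run

    def __call__(self, smile):
        if self.use_lookup and smile in self.lookup:
            return self.lookup[smile]  # seen before: return the stored value
        self.calls += 1
        score = self.score_fn(smile)
        if self.use_lookup:
            self.lookup[smile] = score
        return score


# Placeholder score function: string length, purely for demonstration
scorer = CachedScorer(lambda s: float(len(s)))
first = scorer("c1ccccc1")
second = scorer("c1ccccc1")  # served from the lookup table
print(first, second, scorer.calls)  # 8.0 8.0 1
```

Because the key is the raw string, two different SMILES strings for the same molecule would be stored as separate entries in this sketch.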

class Template[source]

Template(hard_filters, soft_filters=None, log=False, use_lookup=False, fail_score=0.0, cpus=None, mode='smile')

Template - class for managing hard and soft filters

Inputs:

  • hard_filters list[Filter]: list of Filter objects used for pass/fail screening

  • soft_filters list[Filter]: list of Filter objects used for soft scoring

  • log bool: if True, template will log screened compounds

  • use_lookup bool: if True, filter results are stored in a lookup table. If a compound is re-screened, the lookup value is returned

  • fail_score float: placeholder score for compounds that fail to pass hard filters

  • cpus Optional[int]: number of CPUs to use. If None, defaults to os.environ['ncpus']

  • mode str['smile', 'protein', 'dna', 'rna']: determines how inputs are converted to Mol objects

class BlankTemplate[source]

BlankTemplate(**template_kwargs) :: Template

Empty template (no hard or soft filters)

class ValidMoleculeTemplate[source]

ValidMoleculeTemplate(hard=True, soft=False, **template_kwargs) :: Template

Template for checking if an input is a single valid chemical structure

class RuleOf5Template[source]

RuleOf5Template(hard=True, soft=False, **template_kwargs) :: Template

Template for Lipinski's rule of 5 (en.wikipedia.org/wiki/Lipinski%27s_rule_of_five)

class GhoseTemplate[source]

GhoseTemplate(hard=True, soft=False, **template_kwargs) :: Template

Template for Ghose filters (doi.org/10.1021/cc9800071)

class VeberTemplate[source]

VeberTemplate(hard=True, soft=False, **template_kwargs) :: Template

Template for Veber filters (doi.org/10.1021/jm020017n)

class REOSTemplate[source]

REOSTemplate(hard=True, soft=False, **template_kwargs) :: Template

Template for REOS filters (doi.org/10.1016/s0169-409x(02)00003-0)

class RuleOf3Template[source]

RuleOf3Template(hard=True, soft=False, **template_kwargs) :: Template

Template for rule of 3 filter (doi.org/10.1016/S1359-6446(03)02831-9)

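import os
import pandas as pd
# The MRL names used below (to_mol, the Filter classes, Template, RuleOf3Template)
# are assumed to be already imported from the mrl package.

# example molecules as SMILES strings
smiles = [
    'c1ccccc1',
    'Cc1cc(NC)ccc1',
    'Cc1cc(NC)cnc1',
    'Cc1cccc(NCc2ccccc2)c1'
]

# convert the SMILES strings to Mol objects
mols = [to_mol(i) for i in smiles]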

# hard filters
hard_filters = [
    ValidityFilter(),
    SingleCompoundFilter(),
    MolWtFilter(None, 500),
    HBDFilter(None, 5),
    HBAFilter(None, 10),
    LogPFilter(None, 5)
    ]

# soft filters
soft_filters = [
    TPSAFilter(None, 110, score=1),
    RotBondFilter(None, 8, score=1),
    StructureFilter(['[*]-[#6]1:[#6]:[#6]:[#6]2:[#7]:[#6]:[#7H]:[#6]:2:[#6]:1'],
                    exclude=False, score=1)
    ]

template = Template(hard_filters, soft_filters)

# screen a single SMILES string against the hard filters (hf), the soft filters (sf),
# and the template as a whole
smile = 'CC1=CN=C(C(=C1OC)C)CS(=O)C2=NC3=C(N2)C=C(C=C3)OC'
assert template.hf(smile)[0]
assert template.sf(smile)[0]==3.0
assert template(smile)

# splitting the filters across two templates and adding them back together
# gives the same results as the original template
t1 = Template(hard_filters[:3], soft_filters[:2])
t2 = Template(hard_filters[3:], soft_filters[2:])

assert (t1+t2)(mols) == template(mols)
assert (t1+t2)(mols, filter_type='soft') == template(mols, filter_type='soft')

# screen a dataset with logging enabled, then sample from the hard filter log
template = RuleOf3Template(log=True)
df = pd.read_csv('files/smiles.csv')
passes, fails = template.screen_mols(df.smiles.values)
assert len(passes) == 92
assert all(template.sample(50, log='hard').final.values==True)

# save without the logged data and reload: the reloaded hard log is empty
template.save('files/test_temp.template', with_data=False)
template2 = Template.from_file('files/test_temp.template')
assert template2.hard_log.shape[0]==0
assert template.hard_log.shape[0]==2000
os.remove('files/test_temp.template')

# a template with no filters passes every molecule
template = Template([])
assert template(mols) == [True, True, True, True]

# templates can also screen other input types, here protein sequences
template = RuleOf3Template(log=True, mode='protein')
assert template(['MAAR', 'A']) == [False, True]