Overview
Templates are a core concept in MRL used to define chemical spaces. A template collects a series of molecular heuristics and checks whether a molecule meets those criteria. For example:
Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass
Templates can also be used to assign a score for meeting heuristic criteria. This allows us to define separate criteria for must-have molecular properties and nice-to-have molecular properties. In a reinforcement learning context, this translates into giving a score bonus to molecules that fit the nice-to-have criteria. Scores can also be negative, which penalizes a molecule that still passes the must-have criteria.
Must Have:
Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass
Nice To Have:
Molecular weight: 350-400 (+1)
TPSA: Less than 80 (+1)
Substructure Match: '[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1' (+3)
Substructure Match: '[#6]1:[#6]:[#7]:[#7]:[#7]:[#6]:1' (-1)
Based on the above criteria, a molecule that passes the must-have criteria could get a score between -1 and +5 based on meeting the nice-to-have criteria.
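The arithmetic behind that range can be checked with a small, self-contained sketch. It does not use MRL: `SOFT_CRITERIA` and `soft_score` are hypothetical names, and each criterion is represented only by a label and its score rather than by a real filter.

```python
# Hypothetical stand-in for the nice-to-have criteria above: (label, score).
SOFT_CRITERIA = [
    ("Molecular weight 350-400", 1),
    ("TPSA < 80", 1),
    ("Substructure [#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1", 3),
    ("Substructure [#6]1:[#6]:[#7]:[#7]:[#7]:[#6]:1", -1),
]


def soft_score(matched):
    """Sum the scores of the criteria whose labels appear in `matched`."""
    return sum(score for label, score in SOFT_CRITERIA if label in matched)


# Best case: every positive criterion is met and the penalized one is not.
max_score = sum(score for _, score in SOFT_CRITERIA if score > 0)
# Worst case: only the penalized criterion is met.
min_score = sum(score for _, score in SOFT_CRITERIA if score < 0)

print(min_score, max_score)
```

A molecule that meets only some of the criteria lands between these bounds; for example, `soft_score({"TPSA < 80"})` gives 1.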
Templates are instantiated through the Template class. A template is a collection of filters, created through the Filter class. See Filter for more details on defining filter functions.
Templates contain two sets of filters - hard filters and soft filters. Hard filters contain the must-have criteria for a molecule, while the soft filters contain the nice-to-have criteria.
During model training, a generative model creates a batch of compounds. These compounds are first screened against the hard filters. Compounds that fail the hard filters can then be excluded from the training batch or assigned a default failure score. Compounds that pass the hard filters are then scored using the soft filters. Soft filters can provide a small score bonus or penalty for a molecule in addition to the main score function.
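The screening flow described above can be sketched without MRL. Everything here is an assumption made for illustration: molecules are plain dicts of precomputed properties, `score_batch` is a hypothetical helper, and `FAIL_SCORE` is an arbitrary value, since the text does not state what the default failure score is.

```python
FAIL_SCORE = -1.0  # assumed value; the actual default failure score is not stated


def score_batch(batch, hard_pass, soft_score, main_score):
    """Score each molecule: failure score if it fails the hard filters,
    otherwise the main score plus the soft filter bonus or penalty."""
    scores = []
    for mol in batch:
        if not hard_pass(mol):
            scores.append(FAIL_SCORE)
        else:
            scores.append(main_score(mol) + soft_score(mol))
    return scores


# Toy molecules described by precomputed properties (illustrative numbers only).
batch = [
    {"mw": 300, "tpsa": 70},   # passes hard filter, earns soft bonus
    {"mw": 600, "tpsa": 70},   # fails hard filter
    {"mw": 400, "tpsa": 120},  # passes hard filter, no soft bonus
]

scores = score_batch(
    batch,
    hard_pass=lambda m: 250 <= m["mw"] <= 450,           # must-have
    soft_score=lambda m: 1.0 if m["tpsa"] < 80 else 0.0,  # nice-to-have
    main_score=lambda m: 0.5,                             # placeholder main score
)
print(scores)
```

The alternative mentioned above, excluding failed compounds from the batch, would replace the `FAIL_SCORE` branch with a filter step before scoring.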
Soft filters incentivise a generative model to maximize the soft filter conditions without making them a hard requirement. This allows soft filters to be highly targeted towards narrow property ranges or highly specific substructures. If these highly targeted criteria were set as hard filters, they might invalidate too many compounds and cause the model to struggle during training.
Templates
The Template class holds a collection of hard filters and soft filters and manages screening a molecule against those filters. Templates by default log the filter results in detail, allowing you to inspect which filters a molecule failed.
Templates can be merged by adding them together, i.e. new_template = template1 + template2. Adding two templates merges the hard filters and the soft filters of both templates.
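One way to picture the merge is a toy class whose `__add__` concatenates the two filter lists. `MiniTemplate` is a hypothetical stand-in that stores filter names as strings; how the real Template class merges filters internally (ordering, duplicates, logs) is not specified here and may differ.

```python
class MiniTemplate:
    """Toy stand-in for a template: two lists of filter names."""

    def __init__(self, hard_filters, soft_filters):
        self.hard_filters = list(hard_filters)
        self.soft_filters = list(soft_filters)

    def __add__(self, other):
        # Merge by concatenation and return a new object;
        # neither input template is modified.
        return MiniTemplate(
            self.hard_filters + other.hard_filters,
            self.soft_filters + other.soft_filters,
        )


t1 = MiniTemplate(["mol_wt", "hbd"], ["tpsa"])
t2 = MiniTemplate(["logp"], ["rot_bonds"])
merged = t1 + t2

print(merged.hard_filters)
print(merged.soft_filters)
```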
By default, templates log every molecule passed through the hard and soft filters to an internal dataframe and build a lookup table of {smile : final_score}. The first time a molecule is screened, it is added to the internal dataframe and the lookup table. If that molecule is seen again (identified by its SMILES string), the lookup value is returned instead of screening it again. If this behavior isn't desired, pass log=False to disable all logging and lookup, or use_lookup=False to keep logging but avoid using the lookup table.
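The interaction between the two flags can be illustrated with a small cache wrapper. This is a sketch under stated assumptions, not MRL's implementation: `CachedScorer` is a hypothetical class, a list of tuples stands in for the internal dataframe, and it assumes a repeated molecule is logged only once.

```python
class CachedScorer:
    """Toy model of the log / use_lookup behavior described above."""

    def __init__(self, score_fn, log=True, use_lookup=True):
        self.score_fn = score_fn
        self.log = log
        self.use_lookup = use_lookup
        self.records = []  # stands in for the internal dataframe
        self.lookup = {}   # {smile: final_score}
        self.calls = 0     # how many times score_fn actually ran

    def __call__(self, smile):
        # The lookup is only consulted when logging and lookup are both on.
        if self.log and self.use_lookup and smile in self.lookup:
            return self.lookup[smile]
        self.calls += 1
        score = self.score_fn(smile)
        # Assumption: a molecule is recorded the first time it is seen.
        if self.log and smile not in self.lookup:
            self.records.append((smile, score))
            self.lookup[smile] = score
        return score


# len() is used as a placeholder scoring function.
default = CachedScorer(len)
no_lookup = CachedScorer(len, use_lookup=False)
no_log = CachedScorer(len, log=False)

for scorer in (default, no_lookup, no_log):
    scorer("c1ccccc1")
    scorer("c1ccccc1")  # same SMILES string seen a second time

print(default.calls, no_lookup.calls, no_log.calls)
```

With the defaults, the second call is served from the lookup table, so the scoring function runs once. With `use_lookup=False` it runs twice but the molecule is still logged, and with `log=False` nothing is recorded.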
smiles = [
    'c1ccccc1',
    'Cc1cc(NC)ccc1',
    'Cc1cc(NC)cnc1',
    'Cc1cccc(NCc2ccccc2)c1'
]
mols = [to_mol(i) for i in smiles]

# hard filters
hard_filters = [
    ValidityFilter(),
    SingleCompoundFilter(),
    MolWtFilter(None, 500),
    HBDFilter(None, 5),
    HBAFilter(None, 10),
    LogPFilter(None, 5)
]

# soft filters
soft_filters = [
    TPSAFilter(None, 110, score=1),
    RotBondFilter(None, 8, score=1),
    StructureFilter(['[*]-[#6]1:[#6]:[#6]:[#6]2:[#7]:[#6]:[#7H]:[#6]:2:[#6]:1'],
                    exclude=False, score=1)
]

template = Template(hard_filters, soft_filters)
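# hf screens against the hard filters; the first element is the pass/fail result
assert template.hf('CC1=CN=C(C(=C1OC)C)CS(=O)C2=NC3=C(N2)C=C(C=C3)OC')[0]
# sf scores against the soft filters; all three soft filters match, giving 3.0
assert template.sf('CC1=CN=C(C(=C1OC)C)CS(=O)C2=NC3=C(N2)C=C(C=C3)OC')[0]==3.0
assert template('CC1=CN=C(C(=C1OC)C)CS(=O)C2=NC3=C(N2)C=C(C=C3)OC')

# Splitting the filters across two templates and adding them back together
# gives the same results as the original template
t1 = Template(hard_filters[:3], soft_filters[:2])
t2 = Template(hard_filters[3:], soft_filters[2:])
assert (t1+t2)(mols) == template(mols)
assert (t1+t2)(mols, filter_type='soft') == template(mols, filter_type='soft')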
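import os
import pandas as pd

# Screen a file of SMILES strings with a predefined template, logging the results
template = RuleOf3Template(log=True)
df = pd.read_csv('files/smiles.csv')
passes, fails = template.screen_mols(df.smiles.values)
assert len(passes) == 92
# Rows sampled from the hard filter log all have final == True
assert all(template.sample(50, log='hard').final.values==True)

# Save the template without its logged data, reload it, then clean up
template.save('files/test_temp.template', with_data=False)
template2 = Template.from_file('files/test_temp.template')
assert template2.hard_log.shape[0]==0     # reloaded copy has an empty log
assert template.hard_log.shape[0]==2000   # original keeps its log
os.remove('files/test_temp.template')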
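
# A template with no hard filters passes every molecule
template = Template([])
assert template(mols) == [True, True, True, True]

# Protein mode: inputs are amino acid sequences rather than SMILES strings
template = RuleOf3Template(log=True, mode='protein')
assert template(['MAAR', 'A']) == [False, True]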