This is a basic overview of using the Template class to filter compounds. This tutorial looks under the hood at how templates function and how you can create your own templates.
import sys
sys.path.append('..')
from mrl.imports import *
from mrl.core import *
from mrl.chem import *
from mrl.templates.all import *
First, let's get some compounds.
df = pd.read_csv('../files/smiles.csv')
# if in Colab
# download_files()
# df = pd.read_csv('files/smiles.csv')
smiles = df.smiles.values
mols = to_mols(smiles)
len(mols)
Basic Templates
Now we create our template. We'll use the RuleOf3Template class, which implements the "Rule of 3" constraints (doi.org/10.1016/S1359-6446(03)02831-9). The Rule of 3 imposes the following criteria:
- Molecular weight < 300
- LogP ≤ 3
- Hydrogen Bond Donors ≤ 3
- Hydrogen Bond Acceptors ≤ 3
- Rotatable bonds ≤ 3
This is a specific case of the Template class.
template = RuleOf3Template(log=True)
Passing our list of mols to the template returns a list of bools:
outputs = template(mols)
We can use this list of bools to keep only the mols that pass:
passing = [mols[i] for i in range(len(mols)) if outputs[i]]
len(passing)
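The index-based comprehension above can also be written with zip or itertools.compress. The following is a minimal sketch, not part of the library: the strings are placeholders standing in for mol objects, and the hand-written mask stands in for the list of bools a template returns.

```python
from itertools import compress

# Placeholder data (assumption): strings stand in for mol objects, and the
# mask stands in for the list of bools returned by a template.
items = ["mol_a", "mol_b", "mol_c", "mol_d"]
mask = [True, False, True, False]

# Pair each item with its flag and keep the ones flagged True.
passing_zip = [item for item, keep in zip(items, mask) if keep]

# itertools.compress performs the same selection in a single call.
passing_compress = list(compress(items, mask))

print(passing_zip)
print(passing_compress)
```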
Under the hood, the Template contains a list of filters, implemented with the Filter class. These filters take in a mol and assign it a True/False output.
Take for example the MolWtFilter class. If we create it with the inputs min_val=50, max_val=300, it will return True for any molecule that has a molecular weight between 50 and 300.
filt = MolWtFilter(min_val=50, max_val=300)
sum(filt(mols)) # number of passing compounds
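To make the range check concrete, here is a minimal sketch of the logic a bounded property filter might apply. It is an illustration only, not the library's implementation: the in_range helper and the weights list are hypothetical, and treating both bounds as inclusive is an assumption.

```python
from typing import Optional


def in_range(value: float, min_val: Optional[float], max_val: Optional[float]) -> bool:
    """Return True if value lies in [min_val, max_val]; a bound of None is ignored.

    Hypothetical helper: inclusive bounds are an assumption of this sketch.
    """
    above_min = True if min_val is None else value >= min_val
    below_max = True if max_val is None else value <= max_val
    return above_min and below_max


# Hypothetical precomputed molecular weights standing in for real molecules.
weights = [46.07, 180.16, 300.0, 342.30]

results = [in_range(w, 50, 300) for w in weights]
print(results)       # one bool per molecule
print(sum(results))  # number of passing molecules, as with sum(filt(mols))
```

Passing None for min_val, as in in_range(w, None, 300), turns the check into an upper bound only.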
A Template is basically a list of filters with some added functions for logging data, saving/loading templates and parallel processing.
We can re-create the RuleOf3Template we used earlier by specifying the individual filters for the Rule of 3:
filters = [
    MolWtFilter(None, 300),
    LogPFilter(None, 3),
    HBDFilter(None, 3),
    HBAFilter(None, 3),
    RotBondFilter(None, 3)
]
template = Template(filters, log=True)
This gives the same result as before
sum(template(mols))
Templates can be saved and loaded as follows:
template.save('my_template.template')
new_template = Template.from_file('my_template.template')
os.remove('my_template.template')
Templates also hold a log of all compounds screened
log = template.hard_log
log.columns = template.hard_col_names
log.head()
In the log we can see the results from each individual filter. A molecule will only pass if all filters return True.
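The "all filters must return True" rule and the per-filter log can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not the library's code: the compounds are hypothetical dictionaries of precomputed properties, and the filters are simple named predicates.

```python
# Hypothetical precomputed properties standing in for real molecules.
compounds = [
    {"name": "a", "mol_wt": 150.0, "logp": 1.2},
    {"name": "b", "mol_wt": 320.0, "logp": 2.0},
    {"name": "c", "mol_wt": 250.0, "logp": 4.1},
]

# Each hard filter is a named predicate returning True or False.
hard_filters = {
    "mol_wt": lambda c: c["mol_wt"] <= 300,
    "logp": lambda c: c["logp"] <= 3,
}

log_rows = []
for compound in compounds:
    # Record the outcome of every individual filter for this compound.
    row = {name: check(compound) for name, check in hard_filters.items()}
    # The compound passes only if every filter returned True.
    row["final"] = all(row.values())
    log_rows.append({"name": compound["name"], **row})

final_results = [row["final"] for row in log_rows]

for row in log_rows:
    print(row)
```

Keeping the per-filter columns alongside the final result makes it easy to see which constraint rejected a given compound.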
So far we have used Templates and Filters to assign a True/False pass/fail criterion to a molecule. Filters that return a True/False output are called Hard Filters.
We can also use templates to assign a score to compounds. Filters that return a score rather than True/False are called Soft Filters.
We can use a combination of hard and soft filters to precisely define our desired chemical space. Hard filters can be thought of as must-have criteria, while soft filters can be thought of as nice-to-have criteria.
In a reinforcement learning context, compounds that fail the hard filters can be removed from training. Passing compounds can then be scored with the soft filters to give a score bonus to highly desirable molecules. This incentivises the model to generate compounds that meet the criteria of the soft filters.
Hard filters are best used to define large ranges of easy to calculate chemical properties to roughly filter compounds. Soft filters are best used to express preferences for specific substructures or narrow property ranges.
For example, let's use the same Rule of 3 hard filters we've been using thus far with some new soft filters.
hard_filters = [
    MolWtFilter(None, 300), # note that `None` means that bound is ignored
    LogPFilter(None, 3),
    HBDFilter(None, 3),
    HBAFilter(None, 3),
    RotBondFilter(None, 3)
]
soft_filters = [
    StructureFilter(['[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1'], exclude=False, score=1, fail_score=-1),
    StructureFilter(['[#6]1:[#6]:[#7]:[#7]:[#6]:[#6]:1', '[#6]1:[#6]:[#7]:[#7]:[#7]:[#6]:1'],
                    exclude=True, criteria='any', score=0., fail_score=-1),
    MolWtFilter(200, 250, score=1),
    TPSAFilter(80, 120, score=1)
]
template = Template(hard_filters, soft_filters)
Let's go through the soft filters we created.
The first filter is a structure filter that looks for a desirable structure: a 6-membered aromatic ring with one nitrogen. We want to give a score bonus to compounds that meet this criterion. The filter is a StructureFilter. The main argument is a list of SMARTS to look for, in this case just the one SMARTS. The exclude=False argument denotes that the filter will return True if a mol matches the given SMARTS. score=1 means a compound that matches the SMARTS gets a score of 1. fail_score=-1 means a compound that fails to match the SMARTS gets a score of -1.
The second filter looks at substructure filtering from a different perspective. Now we want to exclude undesirable structures. In this case, we have SMARTS for an aromatic ring with an N-N feature and an N-N-N feature. We pass exclude=True and criteria='any' to denote that the filter will return False if a molecule matches any of the SMARTS given. We set score=0. and fail_score=-1 so that compounds that don't match the SMARTS get no score, while compounds that match any of the SMARTS get -1.
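The interaction of exclude, criteria, score and fail_score described for the two structure filters can be sketched as a small function. This is an illustration of the semantics as described in this tutorial, not the library's implementation: structure_score is a hypothetical helper, the match flags are supplied by hand rather than computed from SMARTS, and the meaning of any criteria value other than 'any' (here taken to be "all patterns must match") is an assumption.

```python
def structure_score(matches, exclude=False, criteria="any", score=1.0, fail_score=0.0):
    """Hypothetical sketch of soft structure-filter scoring.

    matches: one bool per pattern, True if the molecule contains that pattern.
    """
    # Combine per-pattern matches; treating non-'any' as 'all' is an assumption.
    combine = any if criteria == "any" else all
    matched = combine(matches)
    # With exclude=True a match is a failure; otherwise a match is a pass.
    passed = (not matched) if exclude else matched
    return score if passed else fail_score


# First filter: reward the desired ring, penalize its absence.
has_ring = structure_score([True], exclude=False, score=1, fail_score=-1)
no_ring = structure_score([False], exclude=False, score=1, fail_score=-1)

# Second filter: no bonus for clean molecules, penalty if any pattern matches.
clean = structure_score([False, False], exclude=True, criteria="any", score=0.0, fail_score=-1)
flagged = structure_score([False, True], exclude=True, criteria="any", score=0.0, fail_score=-1)

print(has_ring, no_ring, clean, flagged)
```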
The next two filters, MolWtFilter and TPSAFilter, are property filters for molecular weight and TPSA, giving a compound a score of 1 for meeting each criterion.
For a detailed look at different filter functions and their arguments, see the Filter page.
Now we can filter compounds using the Template.screen_mols function. This returns two lists, passes and fails. passes contains tuples of (mol, score) for compounds that pass the hard filters. fails contains a list of mols that failed the hard filters.
passes, fails = template.screen_mols(mols)
passes[:6]
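The two-list result can be pictured with the following sketch of a screening loop. It is a conceptual stand-in, not the library's screen_mols: the screen function and the compound dictionaries are hypothetical, summing the soft filter scores into one total is an assumption, and so is a score of 0 for a soft property filter that is not met.

```python
def screen(compounds, hard_filters, soft_filters):
    """Hypothetical sketch: split compounds into (compound, score) passes and fails."""
    passes, fails = [], []
    for compound in compounds:
        if all(check(compound) for check in hard_filters):
            # Assumption: the soft score is the sum of the individual soft scores.
            total = sum(score_fn(compound) for score_fn in soft_filters)
            passes.append((compound, total))
        else:
            fails.append(compound)
    return passes, fails


# Hypothetical precomputed properties standing in for real molecules.
compounds = [
    {"name": "a", "mol_wt": 220.0, "tpsa": 90.0},
    {"name": "b", "mol_wt": 150.0, "tpsa": 100.0},
    {"name": "c", "mol_wt": 400.0, "tpsa": 90.0},
]

hard_filters = [lambda c: c["mol_wt"] <= 300]
soft_filters = [
    lambda c: 1 if 200 <= c["mol_wt"] <= 250 else 0,  # weight bonus
    lambda c: 1 if 80 <= c["tpsa"] <= 120 else 0,     # TPSA bonus
]

passes, fails = screen(compounds, hard_filters, soft_filters)
print([(c["name"], s) for c, s in passes])
print([c["name"] for c in fails])
```

In a reinforcement learning setting, the fails list corresponds to compounds dropped from training, while the scores in passes act as the bonus for desirable molecules.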
We can also easily merge templates through addition. template1 + template2 will return a template that contains the hard and soft filters from both input templates. For example:
t1 = ValidMoleculeTemplate() # returns True for valid single compounds, good check for generative models
t2 = RuleOf5Template() # rule of 5
t3 = t1 + t2
We can see the filters in the __repr__ for the templates:
t1
t2
t3
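As a rough mental model of template addition, the sketch below defines a tiny class whose + operator concatenates the hard and soft filter lists of both operands. MiniTemplate is hypothetical and only illustrates the idea; the filter names are plain strings, and nothing here reflects how the library stores or prints its filters.

```python
class MiniTemplate:
    """Hypothetical stand-in for a template: two lists of filter names."""

    def __init__(self, hard_filters, soft_filters=()):
        self.hard_filters = list(hard_filters)
        self.soft_filters = list(soft_filters)

    def __add__(self, other):
        # Build a new object holding the filters of both operands, in order.
        return MiniTemplate(
            self.hard_filters + other.hard_filters,
            self.soft_filters + other.soft_filters,
        )

    def __repr__(self):
        return f"MiniTemplate(hard={self.hard_filters}, soft={self.soft_filters})"


validity = MiniTemplate(["valid_molecule"])
properties = MiniTemplate(["mol_wt_max_500", "logp_max_5"], ["tpsa_bonus"])
merged = validity + properties

print(merged)
```

Returning a new object from __add__ leaves both input templates unchanged, so they can still be used on their own.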