Basic overview of using templates

This is a basic overview of using the Template class to filter compounds. This tutorial looks under the hood at how templates function and how you can create your own templates.

import sys
sys.path.append('..')

from mrl.imports import *
from mrl.core import *
from mrl.chem import *
from mrl.templates.all import *

First, let's get some compounds

df = pd.read_csv('../files/smiles.csv')

# if in Colab
# download_files()
# df = pd.read_csv('files/smiles.csv')
smiles = df.smiles.values
mols = to_mols(smiles)
len(mols)
2000

Basic Templates

Now we create our template. We'll use the RuleOf3Template class, which implements the "Rule of 3" constraints (doi.org/10.1016/S1359-6446(03)02831-9). The Rule of 3 imposes the following criteria:

  • Molecular weight < 300
  • LogP ≤ 3
  • Hydrogen Bond Donors ≤ 3
  • Hydrogen Bond Acceptors ≤ 3
  • Rotatable bonds ≤ 3

RuleOf3Template is a specific case of the Template class

template = RuleOf3Template(log=True)

Passing our list of mols to the template, we get out a list of bools

outputs = template(mols)

Which we can use to filter our list of mols for those that pass

passing = [mol for mol, passed in zip(mols, outputs) if passed]
len(passing)
92

Under the hood, the Template contains a list of filters, implemented with the Filter class. These filters take in a mol and assign it a True/False output.

Take for example the MolWtFilter class. If we create it with the inputs min_val=50, max_val=300, it will return True for any molecule that has a molecular weight between 50 and 300

filt = MolWtFilter(min_val=50, max_val=300)
sum(filt(mols)) # number of passing compounds
503
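To make the min_val/max_val behavior concrete, here is a minimal pure-Python sketch of a range filter. It is not the MRL implementation: RangeFilter is a hypothetical class, it works on plain numbers rather than mols, the weights are made up, and it assumes that both bounds are inclusive and that a bound of None is ignored.

```python
class RangeFilter:
    """Toy stand-in for a property filter (hypothetical, not part of MRL)."""

    def __init__(self, min_val=None, max_val=None):
        self.min_val = min_val
        self.max_val = max_val

    def __call__(self, value):
        # Assumption: bounds are inclusive; a bound set to None is skipped
        above_min = self.min_val is None or value >= self.min_val
        below_max = self.max_val is None or value <= self.max_val
        return above_min and below_max


# Made-up molecular weights, for illustration only
weights = [18.0, 151.2, 299.9, 412.5]

bounded = RangeFilter(min_val=50, max_val=300)
upper_only = RangeFilter(None, 300)

bounded_results = [bounded(w) for w in weights]
upper_only_results = [upper_only(w) for w in weights]
print(bounded_results)     # [False, True, True, False]
print(upper_only_results)  # [True, True, True, False]
```

With only an upper bound, the lightest value passes as well, since no minimum is checked.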

A Template is basically a list of filters with some added functions for logging data, saving/loading templates and parallel processing.

We can re-create the RuleOf3Template we used earlier by specifying the individual filters for the Rule of 3

filters = [
    MolWtFilter(None, 300),
    LogPFilter(None, 3),
    HBDFilter(None, 3),
    HBAFilter(None, 3),
    RotBondFilter(None, 3)
]

template = Template(filters, log=True)

This gives the same result as before

sum(template(mols))
92

Templates can be saved and loaded as follows:

template.save('my_template.template')
new_template = Template.from_file('my_template.template')
os.remove('my_template.template')

Templates also hold a log of all compounds screened

log = template.hard_log
log.columns = template.hard_col_names
log.head()
   smiles                                             molwt  logp   hbd    hba    rotbond  final
0  CNc1nc(SCC(=O)Nc2cc(Cl)ccc2OC)nc2ccccc12           False  False  True   False  False    False
1  COc1ccc(C(=O)Oc2ccc(/C=C3\C(=N)N4OC(C)=CC4=NC3...  False  False  True   False  False    False
2  Cc1sc(NC(=O)c2ccccc2)c(C(N)=O)c1C                  True   True   True   True   True     True
3  COc1ccc(NCc2noc(-c3ccoc3)n2)cc1OC(F)F              False  False  True   False  False    False
4  O=C(COC(=O)c1cccc(Br)c1)c1ccc2c(c1)OCCCO2          False  False  True   False  False    False

In the log we can see the results from each individual filter. A molecule passes only if all filters return True.
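The way per-filter results combine into the final column can be sketched in plain Python. This is an illustration only: screen is a hypothetical helper, the property values are made up, and the inclusive upper bounds are an assumption rather than a statement about how the Template class is written.

```python
def screen(properties, filters):
    """Return one log row per compound: each filter's result plus 'final'."""
    rows = []
    for props in properties:
        row = {name: check(props[name]) for name, check in filters.items()}
        # A compound passes only if every individual filter passed
        row["final"] = all(row.values())
        rows.append(row)
    return rows


# Hypothetical upper-bound checks (inclusive bounds are an assumption)
filters = {
    "molwt": lambda value: value <= 300,
    "hbd": lambda value: value <= 3,
}

# Made-up property values for two compounds
properties = [
    {"molwt": 372.9, "hbd": 2},  # fails molwt, passes hbd
    {"molwt": 260.3, "hbd": 2},  # passes both
]

log_rows = screen(properties, filters)
for row in log_rows:
    print(row)
```

The first row fails overall even though its hbd check passes, which is the same pattern seen in most rows of the log.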

So far we have used Templates and Filters to assign a pass/fail (True/False) result to each molecule. Filters that return a True/False output are called Hard Filters.

We can also use templates to assign a score to compounds. Filters that return a score rather than True/False are called Soft Filters.

We can use a combination of hard and soft filters to precisely define our desired chemical space. Hard filters can be thought of as must-have criteria, while soft filters can be thought of as nice-to-have criteria.

In a reinforcement learning context, compounds that fail the hard filters can be removed from training. Passing compounds can then be scored with the soft filters to give a score bonus to highly desirable molecules. This incentivises the model to generate compounds that meet the criteria of the soft filters.

Hard filters are best used to define large ranges of easy-to-calculate chemical properties to roughly filter compounds. Soft filters are best used to express preferences for specific substructures or narrow property ranges.
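The division of labor between the two filter types can be sketched as a small gating function. Everything here is hypothetical (the filter_and_score helper, the sample dictionaries, and the thresholds); it only illustrates dropping compounds that fail a must-have check and scoring the rest with a nice-to-have check.

```python
def filter_and_score(samples, hard_check, soft_score):
    """Drop samples that fail the hard check; score the rest."""
    kept = []
    for sample in samples:
        if not hard_check(sample):
            continue  # failing compounds are removed entirely
        kept.append((sample, soft_score(sample)))
    return kept


def hard_check(sample):
    # Must-have: a broad, cheap property range
    return sample["molwt"] <= 300


def soft_score(sample):
    # Nice-to-have: a bonus for a narrow property range
    return 1.0 if 80 <= sample["tpsa"] <= 120 else 0.0


# Made-up samples, for illustration only
samples = [
    {"name": "a", "molwt": 450.0, "tpsa": 95.0},
    {"name": "b", "molwt": 240.0, "tpsa": 95.0},
    {"name": "c", "molwt": 280.0, "tpsa": 40.0},
]

training_batch = filter_and_score(samples, hard_check, soft_score)
named_scores = [(sample["name"], score) for sample, score in training_batch]
print(named_scores)  # [('b', 1.0), ('c', 0.0)]
```

Sample a never receives a score because it is removed by the hard check, while b earns the bonus and c is kept without one.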

For example, let's combine the same Rule of 3 hard filters we've been using thus far with some new soft filters.

hard_filters = [
    MolWtFilter(None, 300), # note that `None` means that bound is ignored
    LogPFilter(None, 3),
    HBDFilter(None, 3),
    HBAFilter(None, 3),
    RotBondFilter(None, 3)
]

soft_filters = [
    StructureFilter(['[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1'], exclude=False, score=1, fail_score=-1),
    StructureFilter(['[#6]1:[#6]:[#7]:[#7]:[#6]:[#6]:1', '[#6]1:[#6]:[#7]:[#7]:[#7]:[#6]:1'], 
                    exclude=True, criteria='any', score=0., fail_score=-1),
    MolWtFilter(200,250, score=1),
    TPSAFilter(80, 120, score=1)
]

template = Template(hard_filters, soft_filters)

Let's go through the soft filters we created.

The first filter is a StructureFilter that looks for a desirable structure: a 6-membered aromatic ring containing one nitrogen. We want to give a score bonus to compounds that meet this criterion. The main argument is a list of SMARTS patterns to look for, in this case a single pattern. exclude=False means the filter returns True if a mol matches the given SMARTS. score=1 means a compound that matches the SMARTS gets a score of 1, and fail_score=-1 means a compound that fails to match gets a score of -1.

The second filter approaches substructure filtering from the opposite direction: now we want to exclude undesirable structures. In this case, we have SMARTS for an aromatic ring with an N-N feature and one with an N-N-N feature. We pass exclude=True and criteria='any' so that the filter returns False if a molecule matches any of the given SMARTS. We set score=0. and fail_score=-1 so that compounds that match none of the SMARTS get no score, while compounds that match any of them get -1.

The next two filters, MolWtFilter and TPSAFilter, are property filters for molecular weight and TPSA, giving a compound a score of 1 for meeting each criterion.
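As a rough illustration of how score and fail_score could combine across these four filters, here is a pure-Python sketch. It rests on labeled assumptions: that a compound's score is the sum of its soft filter scores, that fail_score defaults to 0 when not given, and that bounds are inclusive. The soft_filter helper and the precomputed compound dictionaries are hypothetical stand-ins for real SMARTS matching and property calculation.

```python
def soft_filter(predicate, score, fail_score=0.0):
    """Build a scorer: `score` if the predicate holds, else `fail_score`."""
    def scorer(compound):
        return score if predicate(compound) else fail_score
    return scorer


soft = [
    # Bonus for containing the desired ring, penalty otherwise
    soft_filter(lambda c: c["has_wanted_ring"], score=1.0, fail_score=-1.0),
    # Exclusion: nothing when clean, penalty when an unwanted ring is present
    soft_filter(lambda c: not c["has_unwanted_ring"], score=0.0, fail_score=-1.0),
    # Property ranges: bonus inside the range (fail_score assumed to be 0)
    soft_filter(lambda c: 200 <= c["molwt"] <= 250, score=1.0),
    soft_filter(lambda c: 80 <= c["tpsa"] <= 120, score=1.0),
]


def total_score(compound):
    # Assumption: the overall score is the sum of the soft filter scores
    return sum(scorer(compound) for scorer in soft)


# Made-up, precomputed properties for two compounds
compounds = [
    {"has_wanted_ring": False, "has_unwanted_ring": False,
     "molwt": 280.0, "tpsa": 50.0},
    {"has_wanted_ring": True, "has_unwanted_ring": False,
     "molwt": 230.0, "tpsa": 60.0},
]

scores = [total_score(c) for c in compounds]
print(scores)  # [-1.0, 2.0]
```

The first compound is penalized only for lacking the desired ring; the second collects the ring bonus and the molecular weight bonus.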

For a detailed look at different filter functions and their arguments, see the Filter page.

Now we can filter compounds using the Template.screen_mols function. This returns two lists, passes and fails. passes contains one tuple per compound that passes the hard filters; in the output below, each tuple holds the mol, its soft filter score, and what appears to be the mol's index in the input list. fails contains the mols that failed the hard filters.

passes, fails = template.screen_mols(mols)
passes[:6]
[(<rdkit.Chem.rdchem.Mol at 0x7ff6235dfa80>, -1.0, 2),
 (<rdkit.Chem.rdchem.Mol at 0x7ff6235eaf80>, -1.0, 68),
 (<rdkit.Chem.rdchem.Mol at 0x7ff6235ec0d0>, -1.0, 120),
 (<rdkit.Chem.rdchem.Mol at 0x7ff6235ec350>, -1.0, 128),
 (<rdkit.Chem.rdchem.Mol at 0x7ff6235ec440>, 0.0, 131),
 (<rdkit.Chem.rdchem.Mol at 0x7ff6235ec760>, 2.0, 141)]

We can also easily merge templates through addition. template1 + template2 will return a template that contains the hard and soft filters from both input templates. For example:

t1 = ValidMoleculeTemplate() # returns True for valid single compounds, good check for generative models
t2 = RuleOf5Template() # rule of 5
t3 = t1 + t2

We can see the filters in the __repr__ for the templates

t1
Template
	Hard Filter:
		Vaidity Filter
		Single Compound Filter
	Soft Filter:
		
t2
Template
	Hard Filter:
		hbd (None, 5)
		hba (None, 10)
		molwt (None, 500)
		logp (None, 5)
	Soft Filter:
		
t3
Template
	Hard Filter:
		Vaidity Filter
		Single Compound Filter
		hbd (None, 5)
		hba (None, 10)
		molwt (None, 500)
		logp (None, 5)
	Soft Filter:
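The merging behavior seen in t3 can be mimicked with a few lines of plain Python. ToyTemplate is a hypothetical class that stores filter names only; the assumption, based on the order shown in the t3 output, is that addition concatenates the filter lists with the left operand's filters first.

```python
class ToyTemplate:
    """Hypothetical stand-in holding filter names (not the MRL Template)."""

    def __init__(self, hard_filters=None, soft_filters=None):
        self.hard_filters = list(hard_filters or [])
        self.soft_filters = list(soft_filters or [])

    def __add__(self, other):
        # Assumption: addition concatenates filter lists, left operand first
        return ToyTemplate(
            self.hard_filters + other.hard_filters,
            self.soft_filters + other.soft_filters,
        )


validity = ToyTemplate(["validity", "single compound"])
rule_of_5 = ToyTemplate(["hbd", "hba", "molwt", "logp"])
merged = validity + rule_of_5

print(merged.hard_filters)
print(merged.soft_filters)  # empty: neither input had soft filters
```

Because new lists are built on each addition, the two input templates in this sketch are left unchanged.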