Intermediate overview of using templates

This tutorial follows from the Basic Template Tutorial. This tutorial looks more under the hood at filter functions and how they can be extended

import sys
sys.path.append('..')

from mrl.imports import *
from mrl.core import *
from mrl.chem import *
from mrl.templates.all import *
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
/home/dmai/miniconda3/envs/mrl/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: to-Python converter for boost::shared_ptr<RDKit::FilterCatalogEntry const> already registered; second conversion method ignored.
  return f(*args, **kwds)

First lets get some compounds

df = pd.read_csv('../files/smiles.csv')

# if in Collab
# download_files
# df = pd.read_csv('files/smiles.csv')
smiles = df.smiles.values
mols = to_mols(smiles)
len(mols)
2000

Previously we looked at building templates from existing filters in the library. Now we'll look at creating custom filters.

Property Filters

Property filters calculate some property of a compound and determine if it fallls within a specific range. This is impemented in the PropertyFilter class. A PropertyFilter takes in a mol_function, which is any function that converts an RDKit Mol to a numeric value.

For example, say we wanted to create a filter for the number of amide bonds in a molecule. We first need to define a function that computes our desired property - in this case the number of amide bonds - and choose a range

def my_property_function(mol):
    return rdMolDescriptors.CalcNumAmideBonds(mol)

my_filter = PropertyFilter(my_property_function, min_val=None, max_val=2, name='amide bonds')
my_filter
amide bonds (None, 2)

Now we have a filter that will return True if a molecule has 2 or fewer amide bonds.

One important point to note is creating my_property_function as a wrapper around the rdMolDescriptors.CalcNumAmideBonds. Why don't we just pass rdMolDescriptors.CalcNumAmideBonds like my_filter = PropertyFilter(rdMolDescriptors.CalcNumAmideBonds, ...)

Filters and templates automatically use parallel processing to evauate large numbers of molecules. RDKit functions break python's multiprocessing because RDKit functions can't be pickled. Creating a wrapper around RDKit functions fixes this issue.

Now we can filter compounds based on the 2 or fewer amide bond criteria.

outputs = my_filter(mols)
passes = []
fails = []
for i, mol in enumerate(mols):
    if outputs[i]:
        passes.append(mol)
    else:
        fails.append(mol)
        
print(len(passes), len(fails))
1894 106

Property and Criteria Functions

Under the hood, the filter function performs two steps. The first is to compute PropertyFilter.property_function, which for this filter computes the value of my_property_function, the number of amide bonds. The result of the property function is then sent to PropertyFilter.criteria_function, which converts the output of the property function to a boolean value based on the min_val, max_val arguments we passed.

Under the hood, it looks like this:

class PropertyFilter(Filter):
    def __init__(self, mol_function, min_val=None, max_val=None, score=None, fail_score=0., name=None):

        self.mol_function = mol_function
        self.min_val = min_val
        self.max_val = max_val

        if name is None:
            name = mol_function.__name__

        super().__init__(score, name, fail_score=fail_score)

    def property_function(self, mol):
        return self.mol_function(mol)

    def criteria_function(self, property_output):
        lower_bound = (property_output>=self.min_val) if self.min_val is not None else True
        upper_bound = (property_output<=self.max_val) if self.max_val is not None else True
        output = lower_bound and upper_bound

        return output

Custom Filters

To make custom filters, subclass the main Filter function and define your own property_function and criteria_function. For a simple example, look at the SingleCompoundFilter class. This filter determines if a molecule is a single compound. In SMILES strings, different compounds are separated by a period .. To filter based on the single compound criteria, we need to set up our property_function and criteria_function to return True if a period is not found in the SMILES string.

class SingleCompoundFilter(Filter):
    def __init__(self, score=None, name=None, fail_score=0.):        
        if name is None:
            name = 'Single Compound Filter'

        super().__init__(score, name, fail_score=fail_score)
        self.priority=1

    def property_function(self, mol):
        smile = to_smile(mol)
        return smile

    def criteria_function(self, property_output):
        return not '.' in property_output

Score Criteria

The above code uses the filter we created as a hard filter that returns True/False pass/fail outputs. By adding a score, we can convert the filter to a soft filter that returns a float value

my_filter = PropertyFilter(my_property_function, min_val=None, max_val=2, name='amide bonds',
                          score=1, fail_score=0)

Passing some mols to the filter with with_score=True will return the soft score rather than the hard boolean values

my_filter(mols[:10], with_score=True)
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Now the filter returns 1 for passing molecules and 0 for failing molecules. For most MPO functions, this will be sufficient. However, we can use fancier score methods if we want to

Score Functions

Under the hood, scores are determined by the ScoreFunction class. A score function has a __call__ method that takes in the output of a filter's property_function and criteria_function and returns a numeric value.

Lets say for the amide bond example we want our score to actually return 1/(1+n) where n is the number of amide bonds. We can implement that as follows:

class MyScore(ScoreFunction):
    def __call__(self, property_output, criteria_output):
        
        return 1/(1+property_output)

Now we can pass an instance of MyScore to the score keyword argument in PropertyFilter. The Filter will verify the passes score function is an instance of the main ScoreFunction class, so it's important to subclass

score = MyScore()
my_filter = PropertyFilter(my_property_function, min_val=None, max_val=2, name='amide bonds',
                          score=score)

Now we can get pass some mols to our filter and see the score defined by our custom score function

my_filter(mols[:10], with_score=True)
[0.5,
 0.5,
 0.3333333333333333,
 1.0,
 1.0,
 1.0,
 0.5,
 0.3333333333333333,
 0.5,
 0.5]

If your property or score calculation is complex, it can be easier to abstract the property calculation and score computation entirely. This can be done with the PassThroughScore class, which just returns the output of the property function.

class PassThroughScore(ScoreFunction):
    "Pass through for no score"
    def __call__(self, property_output, criteria_output):
        return property_output

We can set up a filter using this like so

my_filter = PropertyFilter(my_property_function, min_val=None, max_val=None, name='my filter',
                          score=PassThroughScore())

The above filter when used as a soft filter will simply return the output of my_property_function. When used as a hard filter will always return True since min_val and max_val are None.