This tutorial follows from the Basic Template Tutorial. This tutorial looks more under the hood at filter functions and how they can be extended
import sys
sys.path.append('..')
from mrl.imports import *
from mrl.core import *
from mrl.chem import *
from mrl.templates.all import *
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
First lets get some compounds
df = pd.read_csv('../files/smiles.csv')
# if in Collab
# download_files
# df = pd.read_csv('files/smiles.csv')
smiles = df.smiles.values
mols = to_mols(smiles)
len(mols)
Previously we looked at building templates from existing filters in the library. Now we'll look at creating custom filters.
Property Filters
Property filters calculate some property of a compound and determine if it fallls within a specific range. This is impemented in the PropertyFilter
class. A PropertyFilter
takes in a mol_function
, which is any function that converts an RDKit Mol
to a numeric value.
For example, say we wanted to create a filter for the number of amide bonds in a molecule. We first need to define a function that computes our desired property - in this case the number of amide bonds - and choose a range
def my_property_function(mol):
return rdMolDescriptors.CalcNumAmideBonds(mol)
my_filter = PropertyFilter(my_property_function, min_val=None, max_val=2, name='amide bonds')
my_filter
Now we have a filter that will return True
if a molecule has 2 or fewer amide bonds.
One important point to note is creating my_property_function
as a wrapper around the rdMolDescriptors.CalcNumAmideBonds
. Why don't we just pass rdMolDescriptors.CalcNumAmideBonds
like my_filter = PropertyFilter(rdMolDescriptors.CalcNumAmideBonds, ...)
Filters and templates automatically use parallel processing to evauate large numbers of molecules. RDKit functions break python's multiprocessing because RDKit functions can't be pickled. Creating a wrapper around RDKit functions fixes this issue.
Now we can filter compounds based on the 2 or fewer amide bond criteria.
outputs = my_filter(mols)
passes = []
fails = []
for i, mol in enumerate(mols):
if outputs[i]:
passes.append(mol)
else:
fails.append(mol)
print(len(passes), len(fails))
Property and Criteria Functions
Under the hood, the filter function performs two steps. The first is to compute PropertyFilter.property_function
, which for this filter computes the value of my_property_function
, the number of amide bonds. The result of the property function is then sent to PropertyFilter.criteria_function
, which converts the output of the property function to a boolean value based on the min_val
, max_val
arguments we passed.
Under the hood, it looks like this:
class PropertyFilter(Filter):
def __init__(self, mol_function, min_val=None, max_val=None, score=None, fail_score=0., name=None):
self.mol_function = mol_function
self.min_val = min_val
self.max_val = max_val
if name is None:
name = mol_function.__name__
super().__init__(score, name, fail_score=fail_score)
def property_function(self, mol):
return self.mol_function(mol)
def criteria_function(self, property_output):
lower_bound = (property_output>=self.min_val) if self.min_val is not None else True
upper_bound = (property_output<=self.max_val) if self.max_val is not None else True
output = lower_bound and upper_bound
return output
Custom Filters
To make custom filters, subclass the main Filter
function and define your own property_function
and criteria_function
. For a simple example, look at the SingleCompoundFilter
class. This filter determines if a molecule is a single compound. In SMILES strings, different compounds are separated by a period .
. To filter based on the single compound criteria, we need to set up our property_function
and criteria_function
to return True
if a period is not found in the SMILES string.
class SingleCompoundFilter(Filter):
def __init__(self, score=None, name=None, fail_score=0.):
if name is None:
name = 'Single Compound Filter'
super().__init__(score, name, fail_score=fail_score)
self.priority=1
def property_function(self, mol):
smile = to_smile(mol)
return smile
def criteria_function(self, property_output):
return not '.' in property_output
my_filter = PropertyFilter(my_property_function, min_val=None, max_val=2, name='amide bonds',
score=1, fail_score=0)
Passing some mols to the filter with with_score=True
will return the soft score rather than the hard boolean values
my_filter(mols[:10], with_score=True)
Now the filter returns 1
for passing molecules and 0
for failing molecules. For most MPO functions, this will be sufficient. However, we can use fancier score methods if we want to
Score Functions
Under the hood, scores are determined by the ScoreFunction
class. A score function has a __call__
method that takes in the output of a filter's property_function
and criteria_function
and returns a numeric value.
Lets say for the amide bond example we want our score to actually return 1/(1+n)
where n
is the number of amide bonds. We can implement that as follows:
class MyScore(ScoreFunction):
def __call__(self, property_output, criteria_output):
return 1/(1+property_output)
Now we can pass an instance of MyScore
to the score
keyword argument in PropertyFilter
. The Filter will verify the passes score function is an instance of the main ScoreFunction
class, so it's important to subclass
score = MyScore()
my_filter = PropertyFilter(my_property_function, min_val=None, max_val=2, name='amide bonds',
score=score)
Now we can get pass some mols to our filter and see the score defined by our custom score function
my_filter(mols[:10], with_score=True)
If your property or score calculation is complex, it can be easier to abstract the property calculation and score computation entirely. This can be done with the PassThroughScore
class, which just returns the output of the property function.
class PassThroughScore(ScoreFunction):
"Pass through for no score"
def __call__(self, property_output, criteria_output):
return property_output
We can set up a filter using this like so
my_filter = PropertyFilter(my_property_function, min_val=None, max_val=None, name='my filter',
score=PassThroughScore())
The above filter when used as a soft filter will simply return the output of my_property_function
. When used as a hard filter will always return True
since min_val
and max_val
are None
.