Overview
A core concept in MRL is using molecular templates, expressed with the Template class, to define chemical spaces. A template contains a set of filters that define desirable property ranges, such as:
Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass
These property specifications are expressed through the Filter class. The primary function of a filter is to define pass/fail criteria for a molecule. This is done through the property_function and criteria_function methods. property_function computes some value based on the input molecule. criteria_function converts the output of property_function to a single boolean value.
Filters follow the convention that True means the input Mol has passed the criteria_function, while False means the Mol has failed it.
We can also use filters to express a soft preference for chemical properties by adding a score. If a score is provided, the outputs of property_function and criteria_function are sent to a ScoreFunction subclass, which returns a numeric value.
This allows us to use filters to define both the must-have chemical properties as well as nice-to-have properties. For example:
Must Have:
Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass
Nice To Have:
Molecular weight: 350-400 (+1)
TPSA: Less than 80 (+1)
Substructure Match: '[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1' (+3)
Substructure Match: '[#6]1:[#6]:[#7]:[#7]:[#7]:[#6]:1' (-1)
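The split between must-have and nice-to-have properties can be sketched in plain Python. This is an illustration only, not MRL's implementation: the property values are supplied as a precomputed dictionary rather than computed from a Mol, the substructure entries are left out, and the names hard_filters, soft_scores, and evaluate are hypothetical.

```python
# Illustrative sketch only: properties are precomputed, not derived from a Mol.

# Must-have criteria: every check has to pass.
hard_filters = {
    "mol_wt": lambda v: 250 <= v <= 450,
    "rotatable_bonds": lambda v: v < 8,
    "pains_pass": lambda v: v is True,
}

# Nice-to-have criteria: (check, score added when the check passes).
soft_scores = {
    "mol_wt": (lambda v: 350 <= v <= 400, 1.0),
    "tpsa": (lambda v: v < 80, 1.0),
}


def evaluate(props):
    """Return (passes_all_hard_filters, total_soft_score) for a property dict."""
    passed = all(check(props[name]) for name, check in hard_filters.items())
    score = sum(bonus for name, (check, bonus) in soft_scores.items()
                if check(props[name]))
    return passed, score


# Hand-written property values for a hypothetical candidate molecule.
candidate = {"mol_wt": 380.0, "rotatable_bonds": 5, "pains_pass": True, "tpsa": 95.0}
result = evaluate(candidate)
```

Here the candidate satisfies every must-have check and earns only the molecular weight bonus, since its TPSA is above 80.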
Score Functions
ScoreFunction classes take the outputs of a filter (both property_function and criteria_function; see Filter for more details) and return a numeric score. This can be used to incentivise a generative model to produce molecules with specific properties without making those properties must-have constraints.
ConstantScore returns a fixed value depending on whether criteria_output is True or False. For more sophisticated scores like those seen in multi-parameter optimization (MPO) functions, something like LinearDecayScore can be used, which returns a constant score within a certain range and decays the score outside that range.
ScoreFunction can be subclassed with any variant that takes property_output and criteria_output as input and returns a numeric value.
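As a sketch of what such a variant might look like, the class below follows the same calling convention (property output and criteria output in, number out) without depending on MRL. The class name DistanceToTargetScore and its parameters are hypothetical, and the real ScoreFunction base class may require a different method layout.

```python
class DistanceToTargetScore:
    """Hypothetical score: penalize the distance between a property and a target.

    Stand-alone illustration; it does not subclass MRL's ScoreFunction.
    """

    def __init__(self, target, weight=1.0, fail_score=0.0):
        self.target = target
        self.weight = weight
        self.fail_score = fail_score

    def __call__(self, property_output, criteria_output):
        # A failed filter gets the flat failure score.
        if not criteria_output:
            return self.fail_score
        # A passing filter scores higher the closer it is to the target.
        return -self.weight * abs(property_output - self.target)


score = DistanceToTargetScore(target=375.0, weight=0.01, fail_score=-1.0)
on_target = score(375.0, True)    # no penalty at the target value
off_target = score(425.0, True)   # penalized in proportion to the distance
failed = score(600.0, False)      # criteria failed, so fail_score is returned
```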
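import numpy as np
import matplotlib.pyplot as plt
# Linear decay with all four bounds specified
score = LinearDecayScore(1, 1, 5, 10, 15, fail_score=-1)
plt.plot(np.linspace(0, 16), [score(i, True) for i in np.linspace(0, 16)])
# Same score with the last two bounds set to None
score = LinearDecayScore(1, 1, 5, None, None, fail_score=-1)
plt.plot(np.linspace(0, 16), [score(i, True) for i in np.linspace(0, 16)])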
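The general shape of such a score (flat inside a range, falling off linearly outside it) can be sketched in plain Python. This is a hypothetical re-implementation for illustration: the function name linear_decay, its parameter names and order, and the choice to decay towards fail_score are all assumptions and may not match LinearDecayScore.

```python
def linear_decay(value, low_end, low_start, high_start, high_end,
                 pass_score=1.0, fail_score=-1.0):
    """Trapezoid score (hypothetical parameter names).

    Returns pass_score on [low_start, high_start] and decays linearly
    to fail_score at low_end on the left and high_end on the right.
    Requires low_end < low_start <= high_start < high_end.
    """
    if low_start <= value <= high_start:
        return pass_score
    if value < low_start:
        if value <= low_end:
            return fail_score
        frac = (value - low_end) / (low_start - low_end)
    else:
        if value >= high_end:
            return fail_score
        frac = (high_end - value) / (high_end - high_start)
    # frac runs from 0 at the outer bound to 1 at the edge of the flat region.
    return fail_score + frac * (pass_score - fail_score)


inside = linear_decay(7, 1, 5, 10, 15)       # within the flat region
halfway_up = linear_decay(3, 1, 5, 10, 15)   # midway between low_end and low_start
outside = linear_decay(16, 1, 5, 10, 15)     # beyond high_end
```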
Filters
As described above, filters define pass/fail criteria for a given molecule. A filter contains a property_function, which computes some property of a molecule, and a criteria_function, which converts the output of the property function to a boolean value, following the convention that True denotes a pass.
Filters can optionally contain a score, which can be any of (None, int, float, ScoreFunction). A score of None is converted to NoScore, while a numeric score (int or float) is converted to ConstantScore.
The eval_mol function evaluates the filter on a given input. If with_score=True is passed, the output of self.score_function is returned; if with_score=False is passed, the boolean output of criteria_function is returned.
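The flow described in this section can be sketched with a small stand-in class. It is not MRL's Filter: the class name SketchFilter is hypothetical, it works on plain strings instead of Mol objects, and the score of 0 used for a missing score or a failed filter is an assumption.

```python
class SketchFilter:
    """Minimal stand-in for a filter; not MRL's Filter class."""

    def __init__(self, property_function, criteria_function, score=None):
        self.property_function = property_function
        self.criteria_function = criteria_function
        if score is None:
            # No score requested: always contribute 0 (assumed behaviour).
            self.score_function = lambda prop, passed: 0.0
        elif isinstance(score, (int, float)):
            # Numeric score: constant when passing, 0 otherwise (assumed).
            self.score_function = lambda prop, passed: float(score) if passed else 0.0
        else:
            # Anything else is treated as a callable score function.
            self.score_function = score

    def eval_mol(self, mol, with_score=False):
        prop = self.property_function(mol)      # compute a value from the input
        passed = self.criteria_function(prop)   # reduce it to pass/fail
        if with_score:
            return self.score_function(prop, passed)
        return passed


# Toy example on SMILES strings: count 'C' characters, require 1 to 3 of them.
f = SketchFilter(lambda smi: smi.count("C"), lambda n: 1 <= n <= 3, score=5)
passes = f.eval_mol("CCO")
scored = f.eval_mol("CCO", with_score=True)
fails = f.eval_mol("N")
```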
ValidityFilter and SingleCompoundFilter are general molecule quantity filters. Generative models may produce invalid structures, or multiple compounds when a single compound is desired. These filters can be used to eliminate those outputs.
f = ValidityFilter()
assert f('CC')
assert not f('cc') # invalid smiles
f = ValidityFilter(mode='protein')
assert f('MAARG')
assert not f('MXRA')
f = SingleCompoundFilter()
assert f('CC')
assert not f('CC.CC')
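# Count of 'C' characters must be between 1 and 3
f = CharacterCountFilter(['C'], min_val=1, max_val=3)
assert f('CC')
assert not f('N')
# Protein mode: at most 3 'A' residues
f = CharacterCountFilter(['A'], min_val=0, max_val=3, mode='protein')
assert f('MMM')
assert not f('MAMAMAMA')
# per_length=True: bounds apply to the count as a fraction of the input length
f = CharacterCountFilter(['C'], min_val=0.1, max_val=0.4, per_length=True)
assert f('CCNNN')
assert not f('N')
# Several characters can be checked at once
f = CharacterCountFilter(['D', 'A', 'M'], min_val=0, max_val=2, mode='protein')
assert f('D')
assert f('DAM')
assert not f('DDDAM')
# Number of attachment points ('*') must be exactly 2
f = AttachmentFilter(2, 2)
assert f('*CC*')
assert not f('*CC')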
The most common type of filter determines whether a specific molecular property is within a certain range. This is implemented with the PropertyFilter class. PropertyFilter works with any mol_function that takes a Mol object and returns a numeric output. The numeric output is then compared to min_val and max_val. Unspecified bounds (i.e. max_val=None) are ignored.
For convenience, a number of PropertyFilter variants named after specific properties are provided, such as MolWtFilter and HeteroatomFilter.
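The range comparison itself is simple enough to sketch without MRL. The helper name in_range is hypothetical, treating both bounds as inclusive is an assumption, and the molecular weight is entered by hand rather than computed from a Mol.

```python
def in_range(value, min_val=None, max_val=None):
    """Return True if min_val <= value <= max_val, skipping any bound that is None."""
    if min_val is not None and value < min_val:
        return False
    if max_val is not None and value > max_val:
        return False
    return True


aspirin_mol_wt = 180.16  # approximate molecular weight of aspirin, entered by hand

within = in_range(aspirin_mol_wt, 100, 300)       # both bounds satisfied
no_bounds = in_range(aspirin_mol_wt, None, None)  # nothing to check, so it passes
too_light = in_range(aspirin_mol_wt, 400, 500)    # below the lower bound
```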
f = PropertyFilter(molwt, 100, 300)
assert f('O=C(C)Oc1ccccc1C(=O)O')
f = PropertyFilter(molwt, None, None, score=5)
assert f('O=C(C)Oc1ccccc1C(=O)O', with_score=True) == 5
f = MolWtFilter(100, 500, score=WeightedPropertyScore(2.))
assert f('O=C(C)Oc1ccccc1C(=O)O', with_score=True) == 2*molwt(to_mol('O=C(C)Oc1ccccc1C(=O)O'))
f = MolWtFilter(100, 500, mode='protein')
assert f('MAAR')
f = MolWtFilter(400, 500)
assert f('O=C(C)Oc1ccccc1C(=O)O') == False
f = HeteroatomFilter(2, 4)
assert f('O=C(C)Oc1ccccc1C(=O)O')
Another common filter is based on substructure matching. Substructure filtering is typically applied as a hard filter to remove compounds (e.g. excluding all compounds with PAINS structures).
Substructure filters can also be used in a soft filter fashion to express a preference for molecular substructures. For example, if you would like (but not require) your compound to have a 3-ring scaffold system, that can be implemented through structural filtering as well.
Structure filters take a list of SMARTS to filter against (or any subclass of Catalog), as well as a criteria value (any, all, float, or int).
If criteria=any, property_function will return True if any filters are matched.
If criteria=all, property_function will return True if all filters are matched.
If criteria=float, property_function will return True if at least that fraction of the filters is matched (inclusive).
If criteria=int, property_function will return True if at least int filters are matched (inclusive).
criteria_function will then evaluate the property_function output based on criteria.
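The four criteria modes can be sketched as a single function over per-pattern match results. This is a plain-Python illustration: the function name resolve_criteria is hypothetical, the match list is written by hand rather than produced by substructure matching, and the float case is read here as a fraction between 0 and 1.

```python
def resolve_criteria(matches, criteria):
    """Collapse per-pattern match results into one boolean.

    matches  : non-empty list of booleans, one per pattern
    criteria : 'any', 'all', a float fraction, or an int count
    """
    n_matched = sum(matches)
    if criteria == "any":
        return n_matched > 0
    if criteria == "all":
        return n_matched == len(matches)
    if isinstance(criteria, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise ValueError(f"unsupported criteria: {criteria!r}")
    if isinstance(criteria, float):
        # Fraction of patterns that must match (inclusive).
        return n_matched / len(matches) >= criteria
    if isinstance(criteria, int):
        # Number of patterns that must match (inclusive).
        return n_matched >= criteria
    raise ValueError(f"unsupported criteria: {criteria!r}")


# Hand-written example: three of five patterns matched.
matches = [False, True, True, True, False]
by_any = resolve_criteria(matches, "any")
by_all = resolve_criteria(matches, "all")
by_fraction = resolve_criteria(matches, 0.3)
by_count_3 = resolve_criteria(matches, 3)
by_count_4 = resolve_criteria(matches, 4)
```

Any other criteria value, such as an unrecognized string, raises a ValueError in this sketch.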
The exclude parameter defines how the filter treats structure matches. Substructure matching returns True when a match is found. If exclude=True, the filter will return False when a match is found. If exclude=False, the filter will return True when a match is found.
To make this more explicit, the ExclusionFilter class always has the exclusion behavior and the KeepFilter class always has the inclusion behavior.
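The effect of exclude amounts to optionally inverting the match result, which can be sketched as follows. The names apply_exclude, exclusion_result, and keep_result are hypothetical and only mirror the behavior described above; they are not part of MRL.

```python
from functools import partial


def apply_exclude(match_found, exclude):
    """Return the filter result for a substructure match outcome."""
    # exclude=True: a match fails the filter; exclude=False: a match passes it.
    return (not match_found) if exclude else match_found


# Fixed-behavior variants, analogous in spirit to ExclusionFilter and KeepFilter.
exclusion_result = partial(apply_exclude, exclude=True)
keep_result = partial(apply_exclude, exclude=False)

excluded_on_match = exclusion_result(True)   # match found, molecule rejected
kept_on_match = keep_result(True)            # match found, molecule kept
```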
smarts = [
'[*]-[#6]1:[#6]:[#6](-[#0]):[#6]:[#6](-[*]):[#6]:1',
'[*]-[#6]1:[#6]:[#6](-[*]):[#6]:[#6]:[#6]:1',
'[*]-[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1',
'[*]-[#6]1:[#6]:[#6](-[#7]-[*]):[#6]:[#6]:[#6]:1',
'[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1'
]
smiles = [
'c1ccccc1',
'Cc1cc(NC)ccc1',
'Cc1cc(NC)cnc1',
'Cc1cccc(NCc2ccccc2)c1'
]
mols = [to_mol(i) for i in smiles]
f = StructureFilter(smarts, exclude=False, criteria='any')
assert f(mols[1]) == True
catalog = SmartsCatalog(smarts)
f = StructureFilter(catalog, exclude=False, criteria='all')
assert f(mols[1]) == False
f = StructureFilter(smarts, exclude=True, criteria='any')
assert f(mols[1]) == False
f = StructureFilter(smarts, exclude=False, criteria=0.3)
assert f(mols[1]) == True
f = StructureFilter(smarts, exclude=False, criteria=3)
assert f(mols[1]) == True
f = StructureFilter(smarts, exclude=False, criteria=4)
assert f(mols[1]) == False
# An unrecognized criteria value should raise an error
try:
    StructureFilter(smarts, exclude=False, criteria='bla')
    output = False
except Exception:
    output = True
assert output
Some wrappers for PAINS filters are also provided:
filt = PAINSAFilter(criteria=5)
assert all(filt(mols))
# Fingerprint-based filter built from reference SMILES; fp_thresh sets the similarity cutoff
f = FPFilter.from_smiles(smiles, fp_thresh=0.6)
assert f(mols) == [True, True, True, True]
f = FPFilter.from_smiles(smiles, fp_thresh=0.6, criteria='all')
assert f(mols) == [False, False, False, False]
f = FPFilter.from_smiles(smiles[:1], fp_thresh=0.6)
assert f(mols)==[True, False, False, False]
f = FPFilter.from_smiles(smiles[:2], fp_thresh=0.38, criteria=0.3)
assert f(mols) == [True, True, False, True]
f = FPFilter.from_smiles(smiles[:2], fp_thresh=0.07, criteria=2)
assert f(mols) == [True, True, False, True]