Overview
A core concept in MRL is using molecular templates, expressed with the Template class, to define chemical spaces. A template contains a set of filters that define desirable property ranges, such as:
Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass
These property specifications are expressed through the Filter class. The primary function of a filter is to define pass/fail criteria for a molecule. This is done through the property_function and criteria_function methods: property_function computes a value from the input molecule, and criteria_function converts that value to a single boolean.
Filters follow the convention that True means the input Mol has passed the criteria_function, while False means it has failed.
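The two-step evaluation described above can be sketched in plain Python. This is a minimal illustration, not MRL's Filter class: the name SimpleFilter and the toy carbon-counting property are hypothetical, and the input is a plain string rather than a Mol.

```python
from typing import Any, Callable


class SimpleFilter:
    """Hypothetical stand-in for a filter; not MRL's Filter class."""

    def __init__(self,
                 property_function: Callable[[Any], Any],
                 criteria_function: Callable[[Any], bool]):
        self.property_function = property_function
        self.criteria_function = criteria_function

    def __call__(self, mol: Any) -> bool:
        # Step 1: compute a property of the input.
        value = self.property_function(mol)
        # Step 2: reduce that property to a single pass/fail boolean.
        return bool(self.criteria_function(value))


# Toy property: count 'C' characters in a SMILES-like string.
carbon_count = SimpleFilter(
    property_function=lambda smiles: smiles.upper().count("C"),
    criteria_function=lambda n: 1 <= n <= 3,
)

print(carbon_count("CCO"))    # True
print(carbon_count("CCCCO"))  # False
```

Keeping the property computation separate from the pass/fail decision means the same property value can later be reused by a score function.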
We can also use filters to express a soft preference for chemical properties by adding a score. If a score is provided, the outputs of property_function and criteria_function are sent to a ScoreFunction subclass, which returns a numeric value.
This allows us to use filters to define both the must-have chemical properties as well as nice-to-have properties. For example:
Must Have:
Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass
Nice To Have:
Molecular weight: 350-400 (+1)
TPSA: Less than 80 (+1)
Substructure Match: '[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1' (+3)
Substructure Match: '[#6]1:[#6]:[#7]:[#7]:[#7]:[#6]:1' (-1)
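As a rough illustration of how must-have and nice-to-have criteria differ, the sketch below checks hard constraints with all() and sums the points from soft constraints. How MRL's Template actually combines filters is not shown in this section, so the evaluate function, the property dictionary, and its values are assumptions made for the example.

```python
# Precomputed properties for a hypothetical molecule (illustrative values).
props = {"molwt": 380.0, "rotatable_bonds": 5, "pains_pass": True, "tpsa": 95.0}

# Must-have checks: each returns True (pass) or False (fail).
hard_checks = [
    lambda p: 250 <= p["molwt"] <= 450,
    lambda p: p["rotatable_bonds"] < 8,
    lambda p: p["pains_pass"],
]

# Nice-to-have checks, each paired with the points awarded on a pass.
soft_checks = [
    (lambda p: 350 <= p["molwt"] <= 400, 1.0),
    (lambda p: p["tpsa"] < 80, 1.0),
]


def evaluate(p):
    """Return (passes_all_hard_checks, summed_soft_score)."""
    passed = all(check(p) for check in hard_checks)
    score = sum(points for check, points in soft_checks if check(p))
    return passed, score


print(evaluate(props))  # (True, 1.0)
```

Here the molecule satisfies every must-have check and earns one point for molecular weight, but none for TPSA.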
Score Functions
ScoreFunction classes take the outputs of a filter (both property_function and criteria_function, see Filter for more details) and return a numeric score. This can be used to incentivise a generative model to produce molecules with specific properties without making those properties must-have constraints.
ConstantScore returns a fixed value depending on whether criteria_output is True or False. For more sophisticated scores like those seen in MPO functions, something like LinearDecayScore can be used, which returns a constant score within a certain range and decays the score outside that range.
ScoreFunction can be subclassed with any variant that takes property_output and criteria_output as input and returns a numeric value.
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(0, 16)
# Decay on both sides of the constant-score range
score = LinearDecayScore(1, 1, 5, 10, 15, fail_score=-1)
plt.plot(xs, [score(x, True) for x in xs])
# Upper bounds set to None
score = LinearDecayScore(1, 1, 5, None, None, fail_score=-1)
plt.plot(xs, [score(x, True) for x in xs])
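The shape of such a score can be sketched without MRL. The function below is one plausible reading of "constant within a range, decaying outside it": a plateau at pass_score with linear ramps down to fail_score on either side. The function name and parameter names are hypothetical and are not taken from LinearDecayScore, whose exact argument semantics are not documented in this section.

```python
def linear_decay(x, pass_score, ramp_up_start, plateau_start,
                 plateau_end, ramp_down_end, fail_score=0.0):
    """Plateau at pass_score on [plateau_start, plateau_end], with linear
    ramps to fail_score outside. A None plateau bound disables that side."""
    if plateau_start is not None and x < plateau_start:
        if x <= ramp_up_start:
            return fail_score
        # Fraction of the way from ramp_up_start to plateau_start.
        frac = (x - ramp_up_start) / (plateau_start - ramp_up_start)
        return fail_score + frac * (pass_score - fail_score)
    if plateau_end is not None and x > plateau_end:
        if x >= ramp_down_end:
            return fail_score
        # Fraction of the way from ramp_down_end back to plateau_end.
        frac = (ramp_down_end - x) / (ramp_down_end - plateau_end)
        return fail_score + frac * (pass_score - fail_score)
    return pass_score


print(linear_decay(7, 1, 1, 5, 10, 15, fail_score=-1))         # 1
print(linear_decay(3, 1, 1, 5, 10, 15, fail_score=-1))         # 0.0
print(linear_decay(16, 1, 1, 5, None, None, fail_score=-1))    # 1
```

With the upper bounds set to None, values above the plateau keep the full score, which is the behaviour assumed here for a one-sided decay.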
Filters
As described before, Filters serve the function of defining some pass/fail criteria for a given molecule. Filters contain a property_function, which computes some property of a molecule, and a criteria_function which converts the output of the property function to a boolean value, following the convention where True denotes a pass.
Filters can optionally contain a score, which can be any of (None, int, float, ScoreFunction). A score of None is converted to NoScore, while a numeric score (int or float) is converted to ConstantScore.
The eval_mol method evaluates the filter on a given input. If with_score=True is passed, the output of self.score_function is returned; if with_score=False is passed, the boolean output of criteria_function is returned.
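The score conversion and the with_score switch can be sketched as follows. The class and function names are hypothetical stand-ins, and two details are assumptions rather than documented MRL behaviour: that the no-score case returns 0.0, and that a constant score returns 0.0 on failure.

```python
class NoScoreSketch:
    """Stand-in for 'no score': assumed here to always return 0.0."""

    def __call__(self, property_output, criteria_output):
        return 0.0


class ConstantScoreSketch:
    """Stand-in for a constant score: pass_score on pass, fail_score otherwise."""

    def __init__(self, pass_score, fail_score=0.0):
        self.pass_score = pass_score
        self.fail_score = fail_score  # assumed default

    def __call__(self, property_output, criteria_output):
        return self.pass_score if criteria_output else self.fail_score


def coerce_score(score):
    """Map None / int / float / callable to a score callable."""
    if score is None:
        return NoScoreSketch()
    if isinstance(score, (int, float)):
        return ConstantScoreSketch(score)
    return score  # assumed to already be a score-function-like callable


def eval_sketch(value, criteria_function, score=None, with_score=False):
    """Return the boolean criteria result, or the score if with_score=True."""
    passed = criteria_function(value)
    if with_score:
        return coerce_score(score)(value, passed)
    return passed


in_window = lambda v: 100 <= v <= 300
print(eval_sketch(180.0, in_window))                            # True
print(eval_sketch(180.0, in_window, score=5, with_score=True))  # 5
```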
ValidityFilter and SingleCompoundFilter are general molecule quantity filters. Generative models may produce invalid structures or multiple compounds when a single compound is desired. These filters can be used to eliminate those outputs.
f = ValidityFilter()
assert f('CC')
assert not f('cc') # invalid smiles
f = ValidityFilter(mode='protein')
assert f('MAARG')
assert not f('MXRA')
f = SingleCompoundFilter()
assert f('CC')
assert not f('CC.CC')
f = CharacterCountFilter(['C'], min_val=1, max_val=3)
assert f('CC')
assert not f('N')
f = CharacterCountFilter(['A'], min_val=0, max_val=3, mode='protein')
assert f('MMM')
assert not f('MAMAMAMA')
f = CharacterCountFilter(['C'], min_val=0.1, max_val=0.4, per_length=True)
assert f('CCNNN')
assert not f('N')
f = CharacterCountFilter(['D', 'A', 'M'], min_val=0, max_val=2, mode='protein')
assert f('D')
assert f('DAM')
assert not f('DDDAM')
f = AttachmentFilter(2, 2)
assert f('*CC*')
assert not f('*CC')
The most common type of filter is one that determines whether a specific molecular property is within a certain range. This is implemented with the PropertyFilter class. PropertyFilter works with any mol_function that takes a Mol object and returns a numeric output. The numeric output is then compared to min_val and max_val. Unspecified bounds (i.e. max_val=None) are ignored.
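The bound handling described above can be sketched as a small helper. The in_range function is hypothetical, and treating both bounds as inclusive is an assumption; the section does not state whether MRL's comparison is inclusive.

```python
def in_range(value, min_val=None, max_val=None):
    """Return True if value lies within the specified bounds.

    A bound of None is ignored. Bounds are treated as inclusive here,
    which is an assumption made for this sketch.
    """
    lower_ok = min_val is None or value >= min_val
    upper_ok = max_val is None or value <= max_val
    return lower_ok and upper_ok


print(in_range(180.2, 100, 300))    # True
print(in_range(180.2, 400, 500))    # False
print(in_range(180.2, None, None))  # True
print(in_range(180.2, 100, None))   # True
```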
For convenience, a number of PropertyFilter variants named after specific properties are provided.
f = PropertyFilter(molwt, 100, 300)
assert f('O=C(C)Oc1ccccc1C(=O)O')
f = PropertyFilter(molwt, None, None, score=5)
assert f('O=C(C)Oc1ccccc1C(=O)O', with_score=True) == 5
f = MolWtFilter(100, 500, score=WeightedPropertyScore(2.))
assert f('O=C(C)Oc1ccccc1C(=O)O', with_score=True) == 2*molwt(to_mol('O=C(C)Oc1ccccc1C(=O)O'))
f = MolWtFilter(100, 500, mode='protein')
assert f('MAAR')
f = MolWtFilter(400, 500)
assert f('O=C(C)Oc1ccccc1C(=O)O') == False
f = HeteroatomFilter(2, 4)
assert f('O=C(C)Oc1ccccc1C(=O)O')
Another common filter is based on substructure matching. Substructure filtering is typically used as a hard filter to remove compounds (i.e. exclude all compounds with PAINS structures).
Substructure filters can also be used in a soft filter fashion to express a preference for molecular substructures. For example, if you would like (but not require) your compound to have a 3-ring scaffold system, that can be implemented through structural filtering as well.
Structure filters take in a list of SMARTS to filter against (or any subclass of Catalog), as well as a criteria (any, all, float, or int).
If criteria=any, property_function will return True if any filters are matched.
If criteria=all, property_function will return True if all filters are matched.
If criteria=float, property_function will return True if at least that fraction of filters is matched.
If criteria=int, property_function will return True if at least that many filters are matched.
criteria_function will then evaluate the property_function output based on criteria.
The exclude parameter defines how the filter treats structure matches. Substructure matching returns True when a match is found. If exclude=True, the filter will return False when a match is found. If exclude=False, the filter will return True when a match is found.
To make this more explicit, the ExclusionFilter class always has the exclusion behavior and the KeepFilter class always has the inclusion behavior.
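The criteria and exclude rules above can be sketched over a plain list of per-pattern match results. The function names are hypothetical, and the input is a list of booleans rather than SMARTS matches computed on a Mol. Float criteria are read as a minimum fraction and int criteria as a minimum count.

```python
def matches_criteria(matches, criteria):
    """Reduce per-pattern match booleans to one boolean.

    matches must be non-empty when criteria is a float.
    """
    if criteria == "any":
        return any(matches)
    if criteria == "all":
        return all(matches)
    if isinstance(criteria, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise ValueError(f"unsupported criteria: {criteria!r}")
    if isinstance(criteria, float):
        # Minimum fraction of patterns matched (inclusive).
        return sum(matches) / len(matches) >= criteria
    if isinstance(criteria, int):
        # Minimum number of patterns matched (inclusive).
        return sum(matches) >= criteria
    raise ValueError(f"unsupported criteria: {criteria!r}")


def structure_filter_sketch(matches, criteria, exclude):
    """Apply the exclude flag: a criteria hit fails when exclude=True."""
    hit = matches_criteria(matches, criteria)
    return not hit if exclude else hit


# A hypothetical molecule that matches 3 of 5 patterns.
matches = [False, True, True, True, False]
print(structure_filter_sketch(matches, "any", exclude=False))  # True
print(structure_filter_sketch(matches, "all", exclude=False))  # False
print(structure_filter_sketch(matches, "any", exclude=True))   # False
print(structure_filter_sketch(matches, 0.3, exclude=False))    # True
print(structure_filter_sketch(matches, 4, exclude=False))      # False
```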
smarts = [
'[*]-[#6]1:[#6]:[#6](-[#0]):[#6]:[#6](-[*]):[#6]:1',
'[*]-[#6]1:[#6]:[#6](-[*]):[#6]:[#6]:[#6]:1',
'[*]-[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1',
'[*]-[#6]1:[#6]:[#6](-[#7]-[*]):[#6]:[#6]:[#6]:1',
'[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1'
]
smiles = [
'c1ccccc1',
'Cc1cc(NC)ccc1',
'Cc1cc(NC)cnc1',
'Cc1cccc(NCc2ccccc2)c1'
]
mols = [to_mol(i) for i in smiles]
f = StructureFilter(smarts, exclude=False, criteria='any')
assert f(mols[1]) == True
catalog = SmartsCatalog(smarts)
f = StructureFilter(catalog, exclude=False, criteria='all')
assert f(mols[1]) == False
f = StructureFilter(smarts, exclude=True, criteria='any')
assert f(mols[1]) == False
f = StructureFilter(smarts, exclude=False, criteria=0.3)
assert f(mols[1]) == True
f = StructureFilter(smarts, exclude=False, criteria=3)
assert f(mols[1]) == True
f = StructureFilter(smarts, exclude=False, criteria=4)
assert f(mols[1]) == False
# An unsupported criteria value should raise an error
try:
    StructureFilter(smarts, exclude=False, criteria='bla')
    output = False
except Exception:
    output = True
assert output
Some wrappers for PAINS filters are also provided.
filt = PAINSAFilter(criteria=5)
assert all(filt(mols))
f = FPFilter.from_smiles(smiles, fp_thresh=0.6)
assert f(mols) == [True, True, True, True]
f = FPFilter.from_smiles(smiles, fp_thresh=0.6, criteria='all')
assert f(mols) == [False, False, False, False]
f = FPFilter.from_smiles(smiles[:1], fp_thresh=0.6)
assert f(mols)==[True, False, False, False]
f = FPFilter.from_smiles(smiles[:2], fp_thresh=0.38, criteria=0.3)
assert f(mols) == [True, True, False, True]
f = FPFilter.from_smiles(smiles[:2], fp_thresh=0.07, criteria=2)
assert f(mols) == [True, True, False, True]