Filters are used to define pass/fail criteria for screening molecules

Overview

A core concept in MRL is using molecular templates, expressed with the Template class, to define chemical spaces. A template contains a set of filters that define desirable property ranges, such as:

Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass

These property specifications are expressed through the Filter class. The primary function of a filter is to define some pass/fail criteria for a molecule. This is done through the property_function and criteria_function methods. property_function computes some value based on the input molecule. criteria_function converts the output of property_function to a single boolean value.

Filters follow the convention that True means the input Mol has passed the criteria_function, while False means it has failed.
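The two-step evaluation can be sketched in plain Python. This is a toy illustration rather than MRL's implementation: `ToyFilter`, its string-length "property", and the bounds are hypothetical stand-ins for a real molecular property.

```python
# Toy illustration only; ToyFilter is a hypothetical stand-in, not an MRL class.
class ToyFilter:
    def __init__(self, min_val, max_val):
        self.min_val = min_val
        self.max_val = max_val

    def property_function(self, mol):
        # Stand-in property: the length of the input string
        return len(mol)

    def criteria_function(self, property_output):
        # Collapse the property value to a single pass/fail boolean
        return self.min_val <= property_output <= self.max_val

    def __call__(self, mol):
        return self.criteria_function(self.property_function(mol))


f = ToyFilter(2, 4)
print(f('CCO'))     # True: length 3 is within [2, 4]
print(f('CCCCCC'))  # False: length 6 is outside [2, 4]
```

Keeping the property computation separate from the pass/fail decision is what lets the same property value later feed a score function as well.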

We can also use filters to express a soft preference for chemical properties by adding a score. If a score is provided, the outputs of property_function and criteria_function are passed to a ScoreFunction subclass, which returns a numeric value.

This allows us to use filters to define both the must-have chemical properties as well as nice-to-have properties. For example:

Must Have:
Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass

Nice To Have:
Molecular weight: 350-400 (+1)
TPSA: Less than 80 (+1)
Substructure Match: '[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1' (+3)
Substructure Match: '[#6]1:[#6]:[#7]:[#7]:[#7]:[#6]:1' (-1)
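A rough sketch of how must-have and nice-to-have criteria might combine is shown below. The property values are made-up numbers, and summing the soft scores is an assumption made for illustration; this section does not say how Template aggregates them.

```python
# Hypothetical precomputed properties for one molecule (illustrative numbers only)
props = {'molwt': 380.0, 'rotatable_bonds': 5, 'pains_pass': True, 'tpsa': 95.0}

# Must-have criteria: every one has to pass
hard_filters = [
    lambda p: 250 <= p['molwt'] <= 450,
    lambda p: p['rotatable_bonds'] < 8,
    lambda p: p['pains_pass'],
]

# Nice-to-have criteria: each contributes its score only when it passes
soft_filters = [
    (lambda p: 350 <= p['molwt'] <= 400, 1.0),
    (lambda p: p['tpsa'] < 80, 1.0),
]

passes = all(f(props) for f in hard_filters)
# Assumption: soft scores are simply summed
score = sum(s for f, s in soft_filters if f(props))
print(passes, score)  # True 1.0
```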

Score Functions

ScoreFunction classes take the outputs of a filter (both property_function and criteria_function, see Filter for more details) and return a numeric score. This can be used to incentivise a generative model to produce molecules with specific properties without making those properties must-have constraints.

ConstantScore returns a fixed value depending on whether criteria_output is True or False. For more sophisticated scores, like those used in multi-parameter optimization (MPO) functions, LinearDecayScore can be used. It returns a constant score within a given range and decays the score outside that range.

ScoreFunction can be subclassed with any variant that takes property_output and criteria_output as input and returns a numeric value.
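As a sketch of what a custom variant could look like, the class below follows the same calling convention (property_output, criteria_output) but is written stand-alone so it runs without MRL. `GaussianScore` and its parameters are hypothetical; in practice it would subclass ScoreFunction.

```python
import math

# Hypothetical stand-alone score; in MRL this would subclass ScoreFunction instead.
class GaussianScore:
    def __init__(self, target, width, fail_score=0.0):
        self.target = target
        self.width = width
        self.fail_score = fail_score

    def __call__(self, property_output, criteria_output):
        if not criteria_output:
            return self.fail_score
        # Score peaks at 1.0 when property_output == target and falls off smoothly
        z = (property_output - self.target) / self.width
        return math.exp(-0.5 * z * z)


score = GaussianScore(target=375.0, width=25.0, fail_score=-1.0)
print(score(375.0, True))   # 1.0
print(score(375.0, False))  # -1.0
```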

class ScoreFunction[source]

ScoreFunction()

Base score function

class NoScore[source]

NoScore() :: ScoreFunction

Pass through for no score

class PassThroughScore[source]

PassThroughScore() :: ScoreFunction

Pass through for property_output

class ModifiedScore[source]

ModifiedScore(fail_score=0.0) :: ScoreFunction

Base class for scores where property_output is modified by some function

class ConstantScore[source]

ConstantScore(pass_score, fail_score) :: ModifiedScore

Returns pass_score if criteria_output, else fail_score

class WeightedPropertyScore[source]

WeightedPropertyScore(weight, fail_score=0.0) :: ModifiedScore

Returns weight*property_output if criteria_output, else fail_score

class PropertyFunctionScore[source]

PropertyFunctionScore(function, fail_score=0.0) :: ModifiedScore

Returns the output of function(property_output)

class LinearDecayScore[source]

LinearDecayScore(pass_score, low_start, low_end, high_start, high_end, fail_score=0.0) :: ScoreFunction

LinearDecayScore - score with linear decay. low_start<low_end<high_start<high_end

Returns pass_score if criteria_output=True and low_end<=property_output<=high_start. If low_start<=property_output<=low_end or high_start<=property_output<=high_end, the score is a linear interpolation between pass_score and fail_score. Otherwise, returns fail_score.

At least one of low_end and high_start must not be None.

If one of low_end, high_start is None, the corresponding bound is ignored.

If low_start or high_end is None, there is no decay region on that side and the score drops immediately to fail_score.
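The rules above can be sketched as a plain function. This is an independent re-implementation for illustration, not MRL's code; `linear_decay` is a hypothetical name, and returning fail_score whenever criteria_output is False is an assumption.

```python
def linear_decay(p, pass_score, low_start, low_end, high_start, high_end,
                 fail_score=0.0, criteria_output=True):
    # Assumption: a failed criteria always yields fail_score
    if not criteria_output:
        return fail_score
    if low_end is not None and p < low_end:
        # Below the plateau: ramp from fail_score (at low_start) to pass_score (at low_end)
        if low_start is None or p < low_start:
            return fail_score
        frac = (p - low_start) / (low_end - low_start)
        return fail_score + frac * (pass_score - fail_score)
    if high_start is not None and p > high_start:
        # Above the plateau: ramp from pass_score (at high_start) to fail_score (at high_end)
        if high_end is None or p > high_end:
            return fail_score
        frac = (high_end - p) / (high_end - high_start)
        return fail_score + frac * (pass_score - fail_score)
    return pass_score


args = dict(pass_score=1, low_start=1, low_end=5, high_start=10, high_end=15, fail_score=-1)
print(linear_decay(7, **args))     # 1 (inside the plateau)
print(linear_decay(3, **args))     # 0.0 (halfway up the lower ramp)
print(linear_decay(12.5, **args))  # 0.0 (halfway down the upper ramp)
print(linear_decay(20, **args))    # -1 (outside all ranges)
```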

score = LinearDecayScore(1, 1,5,10,15, fail_score=-1)
plt.plot(np.linspace(0,16),[score(i, True) for i in np.linspace(0,16)])
score = LinearDecayScore(1, 1,5,None, None, fail_score=-1)
plt.plot(np.linspace(0,16),[score(i, True) for i in np.linspace(0,16)])

Filters

As described before, Filters serve the function of defining some pass/fail criteria for a given molecule. Filters contain a property_function, which computes some property of a molecule, and a criteria_function which converts the output of the property function to a boolean value, following the convention where True denotes a pass.

Filters can optionally contain a score, which can be any of (None, int, float, ScoreFunction). A score of None is converted to NoScore, while a numeric score (int or float) is converted to ConstantScore.

The eval_mol function evaluates the filter on a given input. If with_score=True, the output of self.score_function is returned; if with_score=False, the boolean output of criteria_function is returned.
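The score conversion and the with_score switch can be sketched as follows. `to_score_function` and `eval_toy` are hypothetical helpers, not MRL's set_score or eval_mol, and treating NoScore as returning 0.0 is an assumption.

```python
# Hypothetical helper mirroring the conversion described above; not MRL's set_score.
def to_score_function(score, fail_score=0.0):
    if score is None:
        # No score: stands in for NoScore (assumed here to contribute 0.0)
        return lambda property_output, criteria_output: 0.0
    if isinstance(score, (int, float)):
        # Numeric score: behaves like ConstantScore(score, fail_score)
        return lambda property_output, criteria_output: score if criteria_output else fail_score
    # Anything else is assumed to already be a callable score function
    return score


def eval_toy(property_output, criteria_output, score=None, fail_score=0.0, with_score=False):
    if with_score:
        return to_score_function(score, fail_score)(property_output, criteria_output)
    return criteria_output


print(eval_toy(180.2, True, score=5, with_score=True))                  # 5
print(eval_toy(180.2, False, score=5, fail_score=-1, with_score=True))  # -1
print(eval_toy(180.2, True))                                            # True
```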


class Filter[source]

Filter(score=None, name=None, fail_score=0.0, mode='smile')

Filter - base filter function class

Inputs:

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

  • mode str['smile', 'protein', 'dna', 'rna']: determines how inputs are converted to Mol objects

ValidityFilter and SingleCompoundFilter are general molecule quantity filters. Generative models may produce invalid structures, or multiple compounds when a single compound is desired. These filters can be used to eliminate those outputs.

class ValidityFilter[source]

ValidityFilter(score=None, name=None, fail_score=0.0, mode='smile') :: Filter

ValidityFilter - checks to see if a given Mol is a valid compound

Inputs:

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

  • mode str['smile', 'protein', 'dna', 'rna']: determines how inputs are converted to Mol objects

class SingleCompoundFilter[source]

SingleCompoundFilter(score=None, name=None, fail_score=0.0, mode='smile') :: Filter

SingleCompoundFilter - checks to see if a given Mol is a single compound

Inputs:

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

  • mode str['smile', 'protein', 'dna', 'rna']: determines how inputs are converted to Mol objects

f = ValidityFilter()
assert f('CC')
assert not f('cc') # invalid smiles

f = ValidityFilter(mode='protein')
assert f('MAARG')
assert not f('MXRA')

f = SingleCompoundFilter()
assert f('CC')
assert not f('CC.CC')

class CharacterCountFilter[source]

CharacterCountFilter(chars, min_val=None, max_val=None, per_length=False, score=None, name=None, fail_score=0.0, mode='smile') :: Filter

CharacterCountFilter - validates Mol based on the count of the specified character

Inputs:

  • chars list[str]: characters to count

  • min_val Optional[float, int]: min value for count

  • max_val Optional[float, int]: max value for count

  • per_length bool: if True, counts are normalized by string length

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

  • mode str['smile', 'protein', 'dna', 'rna']: determines how inputs are converted to Mol objects

class AttachmentFilter[source]

AttachmentFilter(min_val=None, max_val=None, per_length=False, score=None, name=None, fail_score=0.0, mode='smile') :: CharacterCountFilter

AttachmentFilter - validates Mol based on the number of * attachment points

Inputs:

  • min_val Optional[float, int]: min attachment value

  • max_val Optional[float, int]: max attachment value

  • per_length bool: if True, counts are normalized by string length

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

  • mode str['smile', 'protein', 'dna', 'rna']: determines how inputs are converted to Mol objects

f = CharacterCountFilter(['C'], min_val=1, max_val=3)
assert f('CC')
assert not f('N')

f = CharacterCountFilter(['A'], min_val=0, max_val=3, mode='protein')
assert f('MMM')
assert not f('MAMAMAMA')

f = CharacterCountFilter(['C'], min_val=0.1, max_val=0.4, per_length=True)
assert f('CCNNN')
assert not f('N')

f = CharacterCountFilter(['D', 'A', 'M'], min_val=0, max_val=2, mode='protein')
assert f('D')
assert f('DAM')
assert not f('DDDAM')

f = AttachmentFilter(2, 2)
assert f('*CC*')
assert not f('*CC')
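The examples above are consistent with each listed character being counted and range-checked separately (for instance, 'DAM' passes with max_val=2 even though it contains three listed characters in total). A minimal sketch under that assumption, using the hypothetical helper `char_count_ok`:

```python
def char_count_ok(sequence, chars, min_val=None, max_val=None, per_length=False):
    # Assumption: every character in chars must individually fall within the bounds.
    # Assumes a non-empty sequence when per_length=True.
    for c in chars:
        count = sequence.count(c)
        if per_length:
            count = count / len(sequence)
        if min_val is not None and count < min_val:
            return False
        if max_val is not None and count > max_val:
            return False
    return True


print(char_count_ok('CC', ['C'], 1, 3))                        # True
print(char_count_ok('CCNNN', ['C'], 0.1, 0.4, per_length=True))  # True (2/5 = 0.4)
print(char_count_ok('DDDAM', ['D', 'A', 'M'], 0, 2))           # False (three D's)
print(char_count_ok('*CC*', ['*'], 2, 2))                      # True
```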

The most common type of filter is one that determines whether a specific molecular property is within a certain range. This is implemented with the PropertyFilter class. PropertyFilter will work for any mol_function that takes in a Mol object and returns a numeric output. The numeric output is then compared to min_val and max_val. Unspecified bounds (e.g., max_val=None) are ignored.

For convenience, a number of PropertyFilter subclasses named after specific properties are provided.
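The range comparison itself can be sketched in a few lines. `in_range` is a hypothetical helper written for illustration; it assumes inclusive bounds, as stated in the PropertyFilter input descriptions.

```python
def in_range(value, min_val=None, max_val=None):
    # Bounds are inclusive; a bound of None is skipped
    if min_val is not None and value < min_val:
        return False
    if max_val is not None and value > max_val:
        return False
    return True


print(in_range(180.16, 100, 300))    # True
print(in_range(180.16, 400, None))   # False
print(in_range(180.16, None, None))  # True
```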

class PropertyFilter[source]

PropertyFilter(mol_function, min_val=None, max_val=None, score=None, fail_score=0.0, name=None, mode='smile') :: Filter

PropertyFilter - filters mols based on mol_function

Inputs:

  • mol_function Callable: any function that takes as input a Mol object and returns a single numeric value

  • min_val Optional[float, int]: inclusive lower bound for filter (ignored if None)

  • max_val Optional[float, int]: inclusive upper bound for filter (ignored if None)

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

  • mode str['smile', 'protein', 'dna', 'rna']: determines how inputs are converted to Mol objects

class MolWtFilter[source]

MolWtFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Molecular weight filter

class HBDFilter[source]

HBDFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Hydrogen bond donor filter

class HBAFilter[source]

HBAFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Hydrogen bond acceptor filter

class TPSAFilter[source]

TPSAFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

TPSA filter

class RotBondFilter[source]

RotBondFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Rotatable bond filter

class SP3Filter[source]

SP3Filter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Fraction sp3 filter

class LogPFilter[source]

LogPFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

LogP filter

class PenalizedLogPFilter[source]

PenalizedLogPFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Penalized LogP filter

class RingFilter[source]

RingFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Ring filter

class HeteroatomFilter[source]

HeteroatomFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Heteroatom filter

class AromaticRingFilter[source]

AromaticRingFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Aromatic ring filter

class HeavyAtomsFilter[source]

HeavyAtomsFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Number of heavy atoms filter

class MRFilter[source]

MRFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Molar refractivity of atoms filter

class ChargeFilter[source]

ChargeFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Formal charge of atoms filter

class TotalAtomFilter[source]

TotalAtomFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Total number of atoms filter (includes H)

class QEDFilter[source]

QEDFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

QED (quantitative estimate of drug-likeness) filter

class SAFilter[source]

SAFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

SA Score filter

class LooseRotBondFilter[source]

LooseRotBondFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Loose Rotatable bond filter

class MaxRingFilter[source]

MaxRingFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Max ring size filter

class MinRingFilter[source]

MinRingFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Min ring size filter

class BridgeheadFilter[source]

BridgeheadFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Number of bridgehead carbons filter

class SpiroFilter[source]

SpiroFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Spiro carbon filter

class ChiralFilter[source]

ChiralFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Chiral center filter

class RotChainFilter[source]

RotChainFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Longest rotatable bond chain filter

class RadicalFilter[source]

RadicalFilter(min_val, max_val, score=None, name=None, **kwargs) :: PropertyFilter

Number of radical electrons filter

f = PropertyFilter(molwt, 100, 300)
assert f('O=C(C)Oc1ccccc1C(=O)O')

f = PropertyFilter(molwt, None, None, score=5)
assert f('O=C(C)Oc1ccccc1C(=O)O', with_score=True) == 5

f = MolWtFilter(100, 500, score=WeightedPropertyScore(2.))
assert f('O=C(C)Oc1ccccc1C(=O)O', with_score=True) == 2*molwt(to_mol('O=C(C)Oc1ccccc1C(=O)O'))

f = MolWtFilter(100, 500, mode='protein')
assert f('MAAR')

f = MolWtFilter(400, 500)
assert f('O=C(C)Oc1ccccc1C(=O)O') == False

f = HeteroatomFilter(2, 4)
assert f('O=C(C)Oc1ccccc1C(=O)O')

Another common filter is based on substructure matching. Substructure filtering is typically used as a hard filter to remove compounds (e.g., excluding all compounds with PAINS structures).

Substructure filters can also be used in a soft filter fashion to express a preference for molecular substructures. For example, if you would like (but not require) your compound to have a 3-ring scaffold system, that can be implemented through structural filtering as well.

Structure filters take in a list of SMARTS to filter against (or any subclass of Catalog), as well as a criteria (any, all, float).

If criteria=any, property_function will return True if any filters are matched.

If criteria=all, property_function will return True if all filters are matched.

If criteria=float, property_function will return True if at least float fraction of the filters (inclusive) are matched.

If criteria=int, property_function will return True if at least int filters (inclusive) are matched.

criteria_function will then evaluate the property_function output based on criteria.

The exclude parameter defines how the filter treats structure matches. Substructure matching returns True when a match is found. If exclude=True, the filter will return False when a match is found. If exclude=False, the filter will return True when a match is found.

To make this more explicit, the ExclusionFilter class always has the exclusion behavior and the KeepFilter class always has the inclusion behavior.
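The interaction between criteria and exclude can be sketched without any chemistry by working on a list of per-pattern match results. `criteria_met` and `structure_filter` are hypothetical helpers, not MRL functions, and the match vector is illustrative.

```python
def criteria_met(matches, criteria):
    # matches: one boolean per pattern, True where the pattern matched
    n_matched = sum(matches)
    if criteria == 'any':
        return n_matched >= 1
    if criteria == 'all':
        return n_matched == len(matches)
    if isinstance(criteria, float):
        # Fraction of patterns matched, inclusive
        return n_matched / len(matches) >= criteria
    if isinstance(criteria, int):
        # Number of patterns matched, inclusive
        return n_matched >= criteria
    raise ValueError(f"unsupported criteria: {criteria!r}")


def structure_filter(matches, criteria='any', exclude=True):
    hit = criteria_met(matches, criteria)
    # exclude=True inverts the result: a match means the molecule fails
    return not hit if exclude else hit


# Illustrative case: three of five patterns matched
matches = [False, True, True, True, False]
print(structure_filter(matches, 'any', exclude=False))  # True
print(structure_filter(matches, 'all', exclude=False))  # False
print(structure_filter(matches, 0.3, exclude=False))    # True
print(structure_filter(matches, 4, exclude=False))      # False
print(structure_filter(matches, 'any', exclude=True))   # False
```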

criteria_check[source]

criteria_check(criteria)

class StructureFilter[source]

StructureFilter(smarts, exclude=True, criteria='any', score=None, name=None, fail_score=0.0) :: Filter

StructureFilter - filters mols based on structures in smarts

Inputs:

  • smarts [list, SmartsCatalog]: list of smarts strings for filtering or SmartsCatalog

  • exclude bool: if True, filter returns False when a structure match is found

  • criteria ['any', 'all', float, int]: match criteria. (match any filter, match all filters, match float percent of filters, match int number of filters)

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

class ExclusionFilter[source]

ExclusionFilter(smarts, criteria='any', score=None, name=None, fail_score=0.0) :: StructureFilter

ExclusionFilter - excludes mols with substructure matches to smarts

Inputs:

  • smarts [list, SmartsCatalog]: list of smarts strings for filtering or SmartsCatalog

  • criteria ['any', 'all', float, int]: match criteria. (match any filter, match all filters, match float percent of filters, match int number of filters)

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

class KeepFilter[source]

KeepFilter(smarts, criteria='any', score=None, name=None, fail_score=0.0) :: StructureFilter

KeepFilter - keeps mols with substructure matches to smarts

Inputs:

  • smarts [list, SmartsCatalog]: list of smarts strings for filtering or SmartsCatalog

  • criteria ['any', 'all', float, int]: match criteria. (match any filter, match all filters, match float percent of filters, match int number of filters)

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

smarts = [
    '[*]-[#6]1:[#6]:[#6](-[#0]):[#6]:[#6](-[*]):[#6]:1',
    '[*]-[#6]1:[#6]:[#6](-[*]):[#6]:[#6]:[#6]:1',
    '[*]-[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1',
    '[*]-[#6]1:[#6]:[#6](-[#7]-[*]):[#6]:[#6]:[#6]:1',
    '[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1'
]

smiles = [
    'c1ccccc1',
    'Cc1cc(NC)ccc1',
    'Cc1cc(NC)cnc1',
    'Cc1cccc(NCc2ccccc2)c1'
]

mols = [to_mol(i) for i in smiles]

f = StructureFilter(smarts, exclude=False, criteria='any')
assert f(mols[1]) == True

catalog = SmartsCatalog(smarts)
f = StructureFilter(catalog, exclude=False, criteria='all')
assert f(mols[1]) == False

f = StructureFilter(smarts, exclude=True, criteria='any')
assert f(mols[1]) == False

f = StructureFilter(smarts, exclude=False, criteria=0.3)
assert f(mols[1]) == True

f = StructureFilter(smarts, exclude=False, criteria=3)
assert f(mols[1]) == True

f = StructureFilter(smarts, exclude=False, criteria=4)
assert f(mols[1]) == False
try:
    StructureFilter(smarts, exclude=False, criteria='bla')
    output = False
except Exception:
    # an unsupported criteria value is expected to raise at construction
    output = True

assert output

The following classes are convenience wrappers for PAINS substructure filters.

class PAINSFilter[source]

PAINSFilter(criteria='any', score=None, name=None, fail_score=0.0) :: ExclusionFilter

PAINSFilter - excludes mols with substructure matches to PAINS filters

Inputs:

  • criteria ['any', 'all', float, int]: match criteria. (match any filter, match all filters, match float percent of filters, match int number of filters)

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

class PAINSAFilter[source]

PAINSAFilter(criteria='any', score=None, name=None, fail_score=0.0) :: ExclusionFilter

PAINSAFilter - excludes mols with substructure matches to PAINS_A filters

Inputs:

  • criteria ['any', 'all', float, int]: match criteria. (match any filter, match all filters, match float percent of filters, match int number of filters)

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

class PAINSBFilter[source]

PAINSBFilter(criteria='any', score=None, name=None, fail_score=0.0) :: ExclusionFilter

PAINSBFilter - excludes mols with substructure matches to PAINS_B filters

Inputs:

  • criteria ['any', 'all', float, int]: match criteria. (match any filter, match all filters, match float percent of filters, match int number of filters)

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

class PAINSCFilter[source]

PAINSCFilter(criteria='any', score=None, name=None, fail_score=0.0) :: ExclusionFilter

PAINSCFilter - excludes mols with substructure matches to PAINS_C filters

Inputs:

  • criteria ['any', 'all', float, int]: match criteria. (match any filter, match all filters, match float percent of filters, match int number of filters)

  • score [None, int, float, ScoreFunction]: see Filter.set_score

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

filt = PAINSAFilter(criteria=5)
assert all(filt(mols))

FPFilter allows for filtering based on fingerprint similarity. For a given molecule, a fingerprint of fp_type is generated and compared to reference_fps based on fp_metric. Fingerprint similarity scores greater than fp_thresh evaluate to True.

See FP for fingerprint types and similarity metrics.
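To make the thresholding concrete, the sketch below computes Tanimoto similarity on fingerprints represented as plain Python sets of "on" bit indices. `tanimoto` and `fp_match` are hypothetical helpers and the bit sets are made up; real fingerprints would come from the FP utilities.

```python
def tanimoto(a, b):
    # a, b: sets of "on" bit indices; similarity = |intersection| / |union|
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fp_match(query, references, fp_thresh=0.0, criteria='any'):
    # A reference counts as a match when similarity is greater than fp_thresh
    hits = [tanimoto(query, ref) > fp_thresh for ref in references]
    if criteria == 'any':
        return any(hits)
    if criteria == 'all':
        return all(hits)
    raise ValueError(f"unsupported criteria: {criteria!r}")


query = {1, 2, 3, 4}
references = [{1, 2, 3, 4}, {3, 4, 5, 6}]  # similarities: 1.0 and 2/6
print(fp_match(query, references, fp_thresh=0.6))                  # True
print(fp_match(query, references, fp_thresh=0.6, criteria='all'))  # False
```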

class FPFilter[source]

FPFilter(reference_fps, fp_type, fp_metric, criteria='any', fp_thresh=0.0, score=None, name=None, fail_score=0.0) :: Filter

FPFilter - filters mols based on fingerprint similarity to reference_smiles

Inputs:

  • reference_smiles list: list of smiles or Mol objects for comparison

  • fp_type str: fingerprint function. see FP for available functions

  • fp_metric str: fingerprint similarity metric. see FP for available metrics

  • criteria ['any', 'all', float, int]: match criteria. (match any filter, match all filters, match float percent of filters, match int number of filters)

  • fp_thresh float: fingerprint similarity cutoff for defining a match

  • name Optional[str]: filter name used for repr

  • fail_score [float, int]: used in Filter.set_score if score_function is (int, float)

  • score [None, int, float, ScoreFunction]: see Filter.set_score

FPFilter.from_smiles[source]

FPFilter.from_smiles(reference_smiles, fp_type='ECFP6', fp_metric='tanimoto', criteria='any', fp_thresh=0.0, score=None, name=None, fail_score=0)

creates FPFilter from reference_smiles

reference_smiles can be a list of smiles or a list of Mols

f = FPFilter.from_smiles(smiles, fp_thresh=0.6)
assert f(mols) == [True, True, True, True]

f = FPFilter.from_smiles(smiles, fp_thresh=0.6, criteria='all')
assert f(mols) == [False, False, False, False]

f = FPFilter.from_smiles(smiles[:1], fp_thresh=0.6)
assert f(mols)==[True, False, False, False]

f = FPFilter.from_smiles(smiles[:2], fp_thresh=0.38, criteria=0.3)
assert f(mols) == [True, True, False, True]

f = FPFilter.from_smiles(smiles[:2], fp_thresh=0.07, criteria=2)
assert f(mols) == [True, True, False, True]