Templates
A core concept in MRL is the ability to control the chemical space explored by a generative model. When applying generative design to a drug design program, an essential requirement is that compounds generated by the model be relevant to the drug design program with respect to chemical properties and structure.
What it means for a compound to be relevant depends a lot on the specifics of the program and what stage of development the program is in. Compound requirements could be a set of property heuristics, like having a molecular weight or TPSA within a certain range, or a required substructure like a scaffold or specific ring configuration.
MRL uses the Template
class to express these requirements. Templates are used to constrain chemical spaces using a set of pass/fail criteria based on easy to calculate chemical properties, such as
Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass
When training a generative model with reinforcement learning, compounds that fail these filters can be removed from training or given a large score penalty.
Templates can also be used to assign a score for meeting heuristic criteria. This allows us to define different criteria for must-have molecular properties versus nice-to-have_ chemical properties. In a reinforcement learning context, this translates into giving a score bonus to molecules that fit the nice-to-have criteria. Scores can also be negative to allow for penalizing a molecule that still passes the must-have criteria. For example:
Must Have:
Molecular weight: 250-450,
Rotatable bonds: Less than 8
PAINS Filter: Pass
Nice To Have:
Molecular weight: 350-400 (+1),
TPSA: Less than 80 (+1)
Substructure Match: '[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1' (+3)
Substructure Match: '[#6]1:[#6]:[#7]:[#7]:[#7]:[#6]:1' (-1)
Based on the above criteria, a molecule that passes the must-have criteria could get a score between -1 and +5 based on meeting the nice-to-have criteria. During reinforcement learning training, a generative model will be incentivized to favor compounds that both pass the must-have requirements and match the nice-to-have requirements. This allows the nice-to-have requirements to be highly targeted towards narrow property ranges or highly specific substructures without causing issues during training. If these highly targeted criteria were set as hard filters, they might invalidate too many compounds and cause the model to struggle during training.
Templates can also be used to screen training datasets to bias initial models towards desired structures.
Template Structure
Templates are created from the Template
class. Templates contain two sets of filters - hard filters and soft filters. Hard filters denote the must have criteria, while soft filters denote the nice to have criteria. Hard filters are used to assign a True/False pass/fail score to a molecule. Soft filters assign a numeric score to molecules. Hard and soft filters are created with the Filter
class, described below.
For more info on Templates, see the Template page.
Filter Structure
A Filter
expresses some property specification. The primary function of a filter is to define some pass/fail criteria for a molecule. Filters contain a property_function
and a criteria_function
. property_function
computes some value based on the input molecule. criteria_function
converts the output of property_function
to a single boolean value. Filters follow the convention that True
means the input Mol
has passed the criteria_function
function, while False
means the Mol
has failed the criteria_function
.
Optionally, filters can contain a ScoreFunction
, which maps the results of property_function
and criteria_function
to a numeric score. This can be something as simple as returning a constant score when criteria_function=True
to some complex function of the property calculated.
Score functions should be used for soft filters that apply some score bonus/penalty to a compound. Score functions are not necessary for hard filters, which use the output of criteria_function
to determine if a molecule passes or fails.
For more info on Filters, see the Filter page.
Block Templates
The templates described so far deal with evaluating whole molecules. For finer control, we may wish to apply structural constraints at different scales of the molecule.
Say we have compounds of the form R1-scaffold-R2
, and we want to apply different constraints to R1
, the scaffold
and R2
. With the Block
class and some slight changes to molecular representation, we can do this.
First we need to change how molecules are represented to be able to definitively determine which sections of a compound correspond to R1
, R2
and the scaffold
. We convert the full molecule R1-scaffold-R2
to a sequence of fragments *R1.*scaffold*.*R2
. To determine which fragment corresponds to what part of the molecule, we add isotope and map numbers to the wildcard *
atoms. We convert *
to [{isotope}:{map_number}]
. The map_number
determines which wildcards link together, and the isotope
is used to differentiate atoms with the same map number. This gives us our final fragment representation of the form [1*:1]R1.[2*:1]scaffold[2*:2].[1*:1]R2
.
Now we can use the Block
class to construct a set of nested templates, like so:
Block 1 - full molecule template
Block 2 - scaffold template
Block 3 - R1 Template
Block 4 - R2 Template
When a fragment string is processed, each region R1
, R2
and scaffold
are sent to their separate templates and evaluated. Then the fragments are fused into a single compound and evaluated by the full molecule template.
This framework allows us to have greater control over chemical space. We can use this convention to specify different desired structures and properties at R1
, R2
and scaffold
For more info on Blocks and fragment representation, see the Block page.