Overview of the role of Templates in MRL

Templates

A core concept in MRL is the ability to control the chemical space explored by a generative model. When applying generative design to a drug design program, an essential requirement is that compounds generated by the model be relevant to the drug design program with respect to chemical properties and structure.

What it means for a compound to be relevant depends heavily on the specifics of the program and its stage of development. Compound requirements could be a set of property heuristics, like having a molecular weight or TPSA within a certain range, or a required substructure like a scaffold or specific ring configuration.

MRL uses the Template class to express these requirements. Templates constrain chemical space using a set of pass/fail criteria based on easy-to-calculate chemical properties, such as:

Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass

When training a generative model with reinforcement learning, compounds that fail these filters can be removed from training or given a large score penalty.
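As a rough illustration of the pass/fail idea, the sketch below applies the three example criteria to precomputed property values. It is not MRL's API: the dictionary keys, the helper names and the penalty value are assumptions made for this example, the ranges are assumed to be inclusive, and in practice the properties would be computed from the molecule with a cheminformatics toolkit.

```python
# Hypothetical sketch: hard pass/fail criteria over precomputed properties.
# The dict keys and the penalty value are illustrative, not part of MRL.

HARD_CRITERIA = {
    "mol_wt": lambda v: 250 <= v <= 450,   # Molecular weight: 250-450 (assumed inclusive)
    "rotatable_bonds": lambda v: v < 8,    # Rotatable bonds: less than 8
    "pains_pass": lambda v: v is True,     # PAINS filter: pass
}

FAIL_PENALTY = -10.0  # assumed stand-in for the "large score penalty"


def passes_hard_filters(props):
    """Return True only if every hard criterion is satisfied."""
    return all(check(props[name]) for name, check in HARD_CRITERIA.items())


def reward(props, base_score):
    """Keep the base score for passing compounds, penalize failing ones."""
    return base_score if passes_hard_filters(props) else FAIL_PENALTY


ok = {"mol_wt": 312.4, "rotatable_bonds": 5, "pains_pass": True}
too_heavy = {"mol_wt": 512.7, "rotatable_bonds": 5, "pains_pass": True}

print(passes_hard_filters(ok))         # True
print(passes_hard_filters(too_heavy))  # False
```

Removing failing compounds from training instead of penalizing them would amount to dropping every sample for which `passes_hard_filters` returns False.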

Templates can also be used to assign a score for meeting heuristic criteria. This allows us to define different criteria for must-have molecular properties versus nice-to-have chemical properties. In a reinforcement learning context, this translates into giving a score bonus to molecules that fit the nice-to-have criteria. Scores can also be negative, which allows penalizing a molecule that still passes the must-have criteria. For example:

Must Have:
Molecular weight: 250-450
Rotatable bonds: Less than 8
PAINS Filter: Pass

Nice To Have:
Molecular weight: 350-400 (+1)
TPSA: Less than 80 (+1)
Substructure Match: '[#6]1:[#6]:[#7]:[#6]:[#6]:[#6]:1' (+3)
Substructure Match: '[#6]1:[#6]:[#7]:[#7]:[#7]:[#6]:1' (-1)

Based on the above criteria, a molecule that passes the must-have criteria could get a score between -1 and +5 based on meeting the nice-to-have criteria. During reinforcement learning training, a generative model will be incentivized to favor compounds that both pass the must-have requirements and match the nice-to-have requirements. This allows the nice-to-have requirements to be highly targeted towards narrow property ranges or highly specific substructures without causing issues during training. If these highly targeted criteria were set as hard filters, they might invalidate too many compounds and cause the model to struggle during training.
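The score range quoted above can be checked with a small worked sketch. The code below sums the nice-to-have scores over precomputed values. The key names are hypothetical, the substructure matches are supplied as booleans rather than computed from SMARTS, and the ranges are assumed to be inclusive; none of this is MRL's API.

```python
# Hypothetical sketch: soft (nice-to-have) scoring over precomputed values.
# Substructure matches are given as booleans; the key names are made up.

SOFT_CRITERIA = [
    # (description, check, score applied when the check is met)
    ("MW 350-400", lambda p: 350 <= p["mol_wt"] <= 400, 1),
    ("TPSA < 80", lambda p: p["tpsa"] < 80, 1),
    ("first SMARTS pattern", lambda p: p["matches_bonus_smarts"], 3),
    ("second SMARTS pattern", lambda p: p["matches_penalty_smarts"], -1),
]


def soft_score(props):
    """Sum the scores of every nice-to-have criterion the molecule meets."""
    return sum(score for _, check, score in SOFT_CRITERIA if check(props))


best = {"mol_wt": 380.0, "tpsa": 65.0,
        "matches_bonus_smarts": True, "matches_penalty_smarts": False}
worst = {"mol_wt": 300.0, "tpsa": 95.0,
         "matches_bonus_smarts": False, "matches_penalty_smarts": True}

print(soft_score(best))   # 5
print(soft_score(worst))  # -1
```

A molecule meeting none of the nice-to-have criteria scores 0 here, so it is still usable for training as long as it passes the must-have criteria.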

Templates can also be used to screen training datasets to bias initial models towards desired structures.

Template Structure

Templates are created from the Template class. Templates contain two sets of filters: hard filters and soft filters. Hard filters denote the must-have criteria, while soft filters denote the nice-to-have criteria. Hard filters assign a True/False pass/fail result to a molecule. Soft filters assign a numeric score to a molecule. Hard and soft filters are created with the Filter class, described below.

For more info on Templates, see the Template page.

Filter Structure

A Filter expresses a property specification. The primary function of a filter is to define pass/fail criteria for a molecule. Filters contain a property_function and a criteria_function. property_function computes a value from the input molecule. criteria_function converts the output of property_function to a single boolean value. Filters follow the convention that True means the input Mol has passed the criteria_function, while False means the Mol has failed it.

Optionally, filters can contain a ScoreFunction, which maps the results of property_function and criteria_function to a numeric score. This can range from something as simple as returning a constant score when criteria_function returns True to a complex function of the calculated property.

Score functions should be used for soft filters that apply some score bonus/penalty to a compound. Score functions are not necessary for hard filters, which use the output of criteria_function to determine if a molecule passes or fails.
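To make the division of labor concrete, here is a minimal sketch of a filter-like object. It is an illustrative stand-in, not MRL's Filter class: the `SketchFilter` name, the `(value, passed)` signature of the score function, the default score of 0.0 and the use of plain dicts in place of molecules are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class SketchFilter:
    """Illustrative stand-in for a filter; not MRL's Filter class."""
    property_function: Callable[[Any], Any]    # molecule -> property value
    criteria_function: Callable[[Any], bool]   # property value -> pass/fail
    score_function: Optional[Callable[[Any, bool], float]] = None

    def passes(self, mol):
        """Hard-filter use: True means the molecule passed."""
        return self.criteria_function(self.property_function(mol))

    def score(self, mol):
        """Soft-filter use: map the property value and pass/fail to a score."""
        value = self.property_function(mol)
        passed = self.criteria_function(value)
        if self.score_function is None:
            return 0.0  # assumed default: no score function, no contribution
        return self.score_function(value, passed)


# Plain dicts stand in for molecules with precomputed properties.
mol = {"mol_wt": 380.0}

hard_mw = SketchFilter(lambda m: m["mol_wt"], lambda v: 250 <= v <= 450)
soft_mw = SketchFilter(
    lambda m: m["mol_wt"],
    lambda v: 350 <= v <= 400,
    lambda v, passed: 1.0 if passed else 0.0,  # constant bonus when criteria met
)

print(hard_mw.passes(mol))  # True
print(soft_mw.score(mol))   # 1.0
```

The hard filter only ever needs `passes`, which is why it can omit the score function, while the soft filter relies on `score`.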

For more info on Filters, see the Filter page.

Block Templates

The templates described so far deal with evaluating whole molecules. For finer control, we may wish to apply structural constraints at different scales of the molecule.

Say we have compounds of the form R1-scaffold-R2, and we want to apply different constraints to R1, the scaffold and R2. With the Block class and some slight changes to molecular representation, we can do this.

First we need to change how molecules are represented so that we can definitively determine which sections of a compound correspond to R1, R2 and the scaffold. We convert the full molecule R1-scaffold-R2 to a sequence of fragments *R1.*scaffold*.*R2. To determine which fragment corresponds to what part of the molecule, we add isotope and map numbers to the wildcard * atoms. We convert * to [{isotope}*:{map_number}]. The map_number determines which wildcards link together, and the isotope is used to differentiate atoms with the same map number. This gives us our final fragment representation of the form [1*:1]R1.[2*:1]scaffold[2*:2].[1*:2]R2.
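The sketch below shows how attachment points in this notation could be read back out of a fragment string using only a regular expression. The `attachment_points` helper and the example fragments are made up for illustration, and the code does no chemistry: it simply groups wildcard atoms by map number to show which fragments would link together.

```python
import re
from collections import defaultdict

# Matches attachment points written as [{isotope}*:{map_number}]
ATTACHMENT = re.compile(r"\[(\d+)\*:(\d+)\]")


def attachment_points(fragment_string):
    """Map each map number to the (fragment index, isotope) pairs carrying it."""
    links = defaultdict(list)
    for index, fragment in enumerate(fragment_string.split(".")):
        for isotope, map_number in ATTACHMENT.findall(fragment):
            links[int(map_number)].append((index, int(isotope)))
    return dict(links)


# Made-up fragments standing in for R1, the scaffold and R2.
fragments = "[1*:1]CC.[2*:1]c1ccc([2*:2])cc1.[1*:2]N"

print(attachment_points(fragments))
# {1: [(0, 1), (1, 2)], 2: [(1, 2), (2, 1)]}
```

In the output, map number 1 joins fragment 0 to fragment 1 and map number 2 joins fragment 1 to fragment 2, while the isotope tells the two ends of each link apart.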

Now we can use the Block class to construct a set of nested templates, like so:

Block 1 - full molecule template
    Block 2 - scaffold template
    Block 3 - R1 Template
    Block 4 - R2 Template

When a fragment string is processed, each region (R1, R2 and the scaffold) is sent to its own template and evaluated. The fragments are then fused into a single compound, which is evaluated by the full molecule template.

This framework gives us greater control over chemical space. We can use this convention to specify different desired structures and properties at R1, R2 and the scaffold.

For more info on Blocks and fragment representation, see the Block page.