Overview
Templates define a chemical space using a set of hard and soft filters to evaluate a molecule. For greater control, we may wish to apply different templates to different parts of the molecule. With the Block
and BlockTemplate
classes and some modifications to how we represent SMILES strings, we can do this.
For a detailed example of using Blocks, see the Block Tutorial page
Motivating Example: R-group Optimization
Say we have a compound of the form R1-scaffold-R2
, where the scaffold is a defined structure. We want to run a generative screen on the two R-groups while keeping the scaffold constant. We also have preferences on the chemical structures present at each R-group. We want R1
to contain at least 1 ring, and we want R2
to have no rings.
One way to approach this would be to build a dataset of compounds with the format R1-scaffold-R2
where all R1
and R2
structures conform to our desired properties. We would then train a generative model on this dataset and run a standard generative screen. However, this runs the risk of generated compounds deviating from our desired chemotype. This means we need to create a template that enforces the R1-scaffold-R2
chemotype with our stated constraints on R1
and R2
.
This leads to the problem of factoring arbitrary chemical structures into our R1-scaffold-R2
, with all sorts of nasty problems and edge cases cropping up if the model deviates from this chemotype during training. Doing so would require defining a wide variety of SMARTS strings and fragmentation rules to deal with matching arbitrary compounds to the R1-scaffold-R2
chemotype.
Comparatively, it is much easier to generate a string of molecular fragments and assemble them into a single compound. Rather than generating compounds of the form R1-scaffold-R2
, we generate fragments of the form *R1.*scaffold*.*R2
. Since our scaffold is defined, we can simplify this to generating *R1.*R2
, which is helpful because shorter sequence spaces are easier to learn. In doing so, we have embedded our chemotype prior into the generation process.
Now we can evaluate and assemble compounds to compare them to our desired constraints. We create different templates for R1
, R2
and the full compound R1-scaffold-R2
.
The model generates some sequence of the form *R1.*R2
. We split this into *R1
and *R2
, and evaluate each R group with their respective template. Then we add in the scaffold, fuse the fragments into the full molecule, and evaluate the full molecule template.
When we generate and evaluate compounds in this way, we can build in a strong guarantee that compounds passing this tiered templating process will conform to our specified chemotype.
The Block Class
The process described above is implemented using the Block
class. A block is a general abstraction for a chunk of a molecule. The Block
class holds a Template
describing desired molecular properties and additional information specifying how the block links to other blocks.
Blocks are assembled into a tree structure, with some blocks nested within other blocks. For the above example, we would have four blocks in total. We would have two MolBlock
blocks for R1
and R2
and a ConstantMolBlock
for the pre-defined scaffold. The block structure would look like this:
Block tree:
Block 1 - full molecule
Block 2 - scaffold (constant)
Block 3 - R1 + R1 Template
Block 4 - R2 + R2 Template
If we passed our *R1.*R2
generated sequence to the head block, Block 1, the sequence would be broken down into fragments and routed to subblocks. R1
would be screened against the R1 Template in Block 3. R2
would be screened against the R2 template in Block 4. Then R1
and R2
would be merged with the constant scaffold in Block 2 to create the full R1-scaffold-R2
compound. Then the full compound would be screened by the full molecule template in Block 1.
Fragment Routing
One important detail is how fragments are routed to different blocks. If the model generates *R1.*R2
, how do we know that R1
should be routed to one block and R2
should be routed to a different block? We specify block assignments using isotope and atom map numbers on the wildcard atoms.
Map numbers are used to determine which atoms are fused together. If we had a fragment sequence *X.*Y*.*Z
, we don't know how to assemble the fragments. If we add atom map numbers to get [*:2]X.[*:1]Y[*:2].[*:1]Z
, then we can definitely assemble the sequence into [*:2]X.[*:1]Y[*:2].[*:1]Z >> Z[*:1].[*:1]Y[*:2].[*:2]X >> Z-Y-X
.
Isotope numbers are used to differentiate fragments with the same map number. Say we have a fragment sequence of [*:1]X.[*:1]Y
, and we want X
to be evaluated by one template and Y
to be evaluated by another. We need a way of uniquely distinguishing one fragment from another. By adding isotopes to wildcards in addition to map numbers, ie [*:1]X >> [1*:1]X
, we can uniquely match fragments to blocks.
Going back to the previous example, our framework of *R1.*scaffold*.*R2
would become [2*:1]R1.[1*:1]scaffold[1*:2].[2*:2]R2
, and our generative model would create sequences of the form [2*:1]R1.[2*:2]R2
. This is easily implemented by adding new tokens to the vocabulary of our generative model.
These links would be specified in the block structure:
Block tree:
Block 1 - full molecule, links=[]
Block 2 - scaffold (constant), links=['[1*:1]', '[1*:2]']
Block 3 - R1 + R1 Template, links=['[2*:1]']
Block 4 - R2 + R2 Template, links=['[2*:2]']
When analyzing a fragment sequence, a block will extract the isotope/map number links of {isotope}*:{map_num}
from each fragment. The links present in a fragment are compared to the links specified in the Block, as well as any subblocks contained within the block.
Fragment Routing Convention
Wildcards should be expressed as [{isotope}*:{map_number}]
. The isotope
value sould be 1
or 2
. 0
should not be used as RDKit canonicalization removes zero isotope values from wildcards.
The BlockTemplate Class
Once the block tree is assembled, it can be used to create a BlockTemplate
. The BlockTemplate
class is a wrapper for a series of nested Blocks
designed to behave like the Template
class. BlockTemplate
has the same __call__
API as Template
, and provides saving/loading functions and data logging, as well as easy access to nodes via unnested lists and dictionaries.
Code Example
For the R1-scaffold-R2
example, setting up the blocks would look something like this
r1_template = Template(r1_hard_filters, r1_soft_filters)
r1_links = ['2*:1']
r1_block = MolBlock(r1_template, r1_links, name='r1')
r2_template = Template(r2_hard_filters, r2_soft_filters)
r2_links = ['2*:2']
r2_block = MolBlock(r2_template, r2_links, name='r2')
scaffold_links = ['1*:1', '1*:2']
scaffold_block = ConstantMolBlock(scaffold_smile, name='scaffold', links=scaffold_links)
full_molecule_template = Template(full_hard_filters, full_soft_filters)
main_block = MolBlock(full_molecule_template, [], name='full_molecule', subblocks=[r1_block, r2_block, scaffold_block])
block_template = BlockTemplate(main_block)
# scaffod
scaffold_smile = 'c1nc2c([1*:2])cncc2cc1[1*:1]'
scaffold_block = ConstantMolBlock(scaffold_smile, 'scaffold')
# R1, must have ring, be between 50-250 g/mol. must have 1 ring. ideally less thn 100-200 g/mol
r1_template = Template(
[MolWtFilter(50, 250),
RingFilter(1,1)],
[MolWtFilter(100, 200, 1)],
fail_score=-1
)
# R2, must have no rings, be between 0-200 g/mol. must have 0 rings. ideally less thn 50-150 g/mol
r2_template = Template(
[MolWtFilter(0, 200),
RingFilter(None,0)],
[MolWtFilter(50,150,1)],
fail_score=-1
)
# full compound, must be between 200 and 550 g/mol
full_template = Template(
[MolWtFilter(200, 550)],
fail_score=-1)
block_template = DoubleRGroupBlockTemplate(scaffold_smile, r1_template, r2_template, full_template)
frag_strings = ['Cc1c(CS(C)(=O)=O)cccc1NC(=O)N[2*:1].O=C(Cl)[2*:2]',
'CC(C)Oc1ccc(C[NH2+][2*:1])cc1.CC(C)(CCO)C[2*:2]',
'Cc1ccc(NC(=O)[2*:1])cc1S(C)(=O)=O.CC(C)CCNC(=O)N[2*:2]',
'NS(=O)(=O)Cc1cccc([2*:1])c1.CC(C)(O)CC[2*:2]']
fused = ['Cc1c(CS(C)(=O)=O)cccc1NC(=O)Nc1cnc2c(C(=O)Cl)cncc2c1',
'CC(C)Oc1ccc(C[NH2+]c2cnc3c(CC(C)(C)CCO)cncc3c2)cc1',
'Cc1ccc(NC(=O)c2cnc3c(NC(=O)NCCC(C)C)cncc3c2)cc1S(C)(=O)=O',
'CC(C)(O)CCc1cncc2cc(-c3cccc(CS(N)(=O)=O)c3)cnc12']
outputs = block_template.recurse_fragments(frag_strings, add_constant=True)
assert [i[1] for i in outputs] == fused
assert [i[2] for i in outputs] == [True, True, True, True]
assert [i[3] for i in outputs] == [1.0, 2.0, 1.0, 2.0]
# one constant and one variabe
# scaffod
scaffold_smile = 'c1nc2c([1*:2])cncc2cc1[1*:1]'
scaffold_block = ConstantMolBlock(scaffold_smile, 'scaffold')
# R1 has 3 groups - constant carbonyl, variable ring, variabe ring attachment
r1_carbonyl_block = ConstantMolBlock('C(O)(=O)[1*:4]', 'carbonyl')
r1_ring_substitution_tempate = Template(
[RotBondFilter(0,3),
RingFilter(None,0)],
[RotBondFilter(0,2,1)],
fail_score=-1,
log=True
)
r1_ring_substitution_block = MolBlock(r1_ring_substitution_tempate, ['1*:3'], 'r1 ring substitution')
r1_ring_template = Template(
[RingFilter(1,1),
RotBondFilter(0,1)],
[RingFilter(1,1,1)],
fail_score=-1,
log=True
)
r1_ring_block = MolBlock(r1_ring_template, ['2*:3', '2*:4', '2*:2'], 'r1_ring')
r1_full_group_template = Template(
[MolWtFilter(50,350)],
[MolWtFilter(100,200,1)],
fail_score=-1,
log=True
)
r1_block = MolBlock(r1_full_group_template, ['2*:2'], 'r1_full',
subblocks=[r1_carbonyl_block, r1_ring_substitution_block, r1_ring_block])
# R1, must have no rings, be between 0-200 g/mol. must have 0 rings. ideally less thn 50-150 g/mol
r2_template = Template(
[MolWtFilter(0,200)],
[RingFilter(0,0)],
fail_score=-1,
log=True
)
r2_block = MolBlock(r2_template, ['2*:1'], 'r2')
# full compound, must be between 200 and 550 g/mol
full_template = Template(
[MolWtFilter(200,550),
StructureFilter(['[#6](-[#8])(=[#8])-[*]',
'[#6]1:[#7]:[#6]2:[#6](-[*]):[#6]:[#7]:[#6]:[#6]:2:[#6]:[#6]:1-[*]'
], criteria='all', exclude=False)],
[MolWtFilter(250,500,1)],
fail_score=-1,
log=True
)
main_block = MolBlock(full_template, [], 'full_molecule', subblocks=[scaffold_block, r1_block, r2_block])
block_template = BlockTemplate(main_block)
frag_strings = ['CC(C)(C)SC[1*:3].Cc1cc([2*:3])c([2*:2])n1C[2*:4].Cc1noc(C)c1CO[2*:1]',
'C=C(C)C[1*:3].Nc1c(N([2*:2])[2*:4])ncnc1[2*:3].Cc1ccc([2*:1])cc1[N+](=O)[O-]',
'CCSCC[1*:3].Cc1ccc(N(C(=O)[2*:4])[2*:3])c([2*:2])c1.Clc1ncccc1N[2*:1]',
'C#CCO[1*:3].O=c1n([2*:3])nc([2*:2])n1[2*:4].O=c1ccccn1C[2*:1]']
fused = ['Cc1noc(C)c1COc1cnc2c(-c3c(CSC(C)(C)C)cc(C)n3CC(=O)O)cncc2c1',
'C=C(C)Cc1ncnc(N(C(=O)O)c2cncc3cc(-c4ccc(C)c([N+](=O)[O-])c4)cnc23)c1N',
'CCSCCN(C(=O)C(=O)O)c1ccc(C)cc1-c1cncc2cc(Nc3cccnc3Cl)cnc12',
'C#CCOn1nc(-c2cncc3cc(Cn4ccccc4=O)cnc23)n(C(=O)O)c1=O']
outputs = block_template.recurse_fragments(frag_strings, add_constant=True)
assert [i[1] for i in outputs] == fused
assert [i[2] for i in outputs] == [True, True, True, True]
assert [i[3] for i in outputs] == [3.0, 3.0, 1.0, 4.0]
outputs[0]
# scaffod
scaffold_smile = '*c1cnc2ccncc2c1'
# R-group, must have ring, be between 50-250 g/mol. ideally less thn 100-200 g/mol
r_template = Template(
[MolWtFilter(50, 250),
RingFilter(1,1)],
[MolWtFilter(100, 200, 1)],
fail_score=-1, log=True
)
full_template = Template(
[MolWtFilter(200, 550)],
fail_score=-1, log=True)
block_template = RGroupBlockTemplate(scaffold_smile, r_template,
full_molecule_template=full_template)
test = block_template(['Cc1c(CS(C)(=O)=O)cccc1NC(=O)N*',
'Cc1c(CS(C)(=O)=O)ccc1NC(=O)N*'], filter_type='hard')
assert test == [True, False]
block_template.recurse_fragments(['Cc1c(CS(C)(=O)=O)cccc1NCN[2*:1]'], add_constant=True)
block_template.recurse_fragments(['Cc1c(CS(C)(=O)=O)cccc1NCN*'], add_constant=True)
block_template.recurse_fragments(['Cc1c(CS(C)(=O)=O)cccc1NC(=O)N*'], add_constant=True)