/home/dmai/miniconda3/envs/mrl/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: to-Python converter for boost::shared_ptr<RDKit::FilterCatalogEntry const> already registered; second conversion method ignored.
  return f(*args, **kwds)

Overview

Templates define a chemical space using a set of hard and soft filters to evaluate a molecule. For greater control, we may wish to apply different templates to different parts of the molecule. With the Block and BlockTemplate classes and some modifications to how we represent SMILES strings, we can do this.

For a detailed example of using Blocks, see the Block Tutorial page

Motivating Example: R-group Optimization

Say we have a compound of the form R1-scaffold-R2, where the scaffold is a defined structure. We want to run a generative screen on the two R-groups while keeping the scaffold constant. We also have preferences on the chemical structures present at each R-group. We want R1 to contain at least 1 ring, and we want R2 to have no rings.

One way to approach this would be to build a dataset of compounds with the format R1-scaffold-R2 where all R1 and R2 structures conform to our desired properties. We would then train a generative model on this dataset and run a standard generative screen. However, this runs the risk of generated compounds deviating from our desired chemotype. This means we need to create a template that enforces the R1-scaffold-R2 chemotype with our stated constraints on R1 and R2.

This leads to the problem of factoring arbitrary chemical structures into our R1-scaffold-R2, with all sorts of nasty problems and edge cases cropping up if the model deviates from this chemotype during training. Doing so would require defining a wide variety of SMARTS strings and fragmentation rules to deal with matching arbitrary compounds to the R1-scaffold-R2 chemotype.

Comparatively, it is much easier to generate a string of molecular fragments and assemble them into a single compound. Rather than generating compounds of the form R1-scaffold-R2, we generate fragments of the form *R1.*scaffold*.*R2. Since our scaffold is defined, we can simplify this to generating *R1.*R2, which is helpful because shorter sequence spaces are easier to learn. In doing so, we have embedded our chemotype prior into the generation process.

Now we can evaluate and assemble compounds to compare them to our desired constraints. We create different templates for R1, R2 and the full compound R1-scaffold-R2.

The model generates some sequence of the form *R1.*R2. We split this into *R1 and *R2, and evaluate each R group with their respective template. Then we add in the scaffold, fuse the fragments into the full molecule, and evaluate the full molecule template.

When we generate and evaluate compounds in this way, we can build in a strong guarantee that compounds passing this tiered templating process will conform to our specified chemotype.

The Block Class

The process described above is implemented using the Block class. A block is a general abstraction for a chunk of a molecule. The Block class holds a Template describing desired molecular properties and additional information specifying how the block links to other blocks.

Blocks are assembled into a tree structure, with some blocks nested within other blocks. For the above example, we would have four blocks in total. We would have two MolBlock blocks for R1 and R2 and a ConstantMolBlock for the pre-defined scaffold. The block structure would look like this:

Block tree:

Block 1 - full molecule
    Block 2 - scaffold (constant)
    Block 3 - R1 + R1 Template
    Block 4 - R2 + R2 Template

If we passed our *R1.*R2 generated sequence to the head block, Block 1, the sequence would be broken down into fragments and routed to subblocks. R1 would be screened against the R1 Template in Block 3. R2 would be screened against the R2 template in Block 4. Then R1 and R2 would be merged with the constant scaffold in Block 2 to create the full R1-scaffold-R2 compound. Then the full compound would be screened by the full molecule template in Block 1.

Fragment Routing

One important detail is how fragments are routed to different blocks. If the model generates *R1.*R2, how do we know that R1 should be routed to one block and R2 should be routed to a different block? We specify block assignments using isotope and atom map numbers on the wildcard atoms.

Map numbers are used to determine which atoms are fused together. If we had a fragment sequence *X.*Y*.*Z, we don't know how to assemble the fragments. If we add atom map numbers to get [*:2]X.[*:1]Y[*:2].[*:1]Z, then we can definitely assemble the sequence into [*:2]X.[*:1]Y[*:2].[*:1]Z >> Z[*:1].[*:1]Y[*:2].[*:2]X >> Z-Y-X.

Isotope numbers are used to differentiate fragments with the same map number. Say we have a fragment sequence of [*:1]X.[*:1]Y, and we want X to be evaluated by one template and Y to be evaluated by another. We need a way of uniquely distinguishing one fragment from another. By adding isotopes to wildcards in addition to map numbers, ie [*:1]X >> [1*:1]X, we can uniquely match fragments to blocks.

Going back to the previous example, our framework of *R1.*scaffold*.*R2 would become [2*:1]R1.[1*:1]scaffold[1*:2].[2*:2]R2, and our generative model would create sequences of the form [2*:1]R1.[2*:2]R2. This is easily implemented by adding new tokens to the vocabulary of our generative model.

These links would be specified in the block structure:

Block tree:

Block 1 - full molecule, links=[]
    Block 2 - scaffold (constant), links=['[1*:1]', '[1*:2]']
    Block 3 - R1 + R1 Template, links=['[2*:1]']
    Block 4 - R2 + R2 Template, links=['[2*:2]']

When analyzing a fragment sequence, a block will extract the isotope/map number links of {isotope}*:{map_num} from each fragment. The links present in a fragment are compared to the links specified in the Block, as well as any subblocks contained within the block.

Fragment Routing Convention

Wildcards should be expressed as [{isotope}*:{map_number}]. The isotope value sould be 1 or 2. 0 should not be used as RDKit canonicalization removes zero isotope values from wildcards.

The BlockTemplate Class

Once the block tree is assembled, it can be used to create a BlockTemplate. The BlockTemplate class is a wrapper for a series of nested Blocks designed to behave like the Template class. BlockTemplate has the same __call__ API as Template, and provides saving/loading functions and data logging, as well as easy access to nodes via unnested lists and dictionaries.

Code Example

For the R1-scaffold-R2 example, setting up the blocks would look something like this

r1_template = Template(r1_hard_filters, r1_soft_filters)
r1_links = ['2*:1']
r1_block = MolBlock(r1_template, r1_links, name='r1')

r2_template = Template(r2_hard_filters, r2_soft_filters)
r2_links = ['2*:2']
r2_block = MolBlock(r2_template, r2_links, name='r2')

scaffold_links = ['1*:1', '1*:2']
scaffold_block = ConstantMolBlock(scaffold_smile, name='scaffold', links=scaffold_links)

full_molecule_template = Template(full_hard_filters, full_soft_filters)
main_block = MolBlock(full_molecule_template, [], name='full_molecule', subblocks=[r1_block, r2_block, scaffold_block])

block_template = BlockTemplate(main_block)

1. Fragments are decomposed based on `self.decompose_fragments`
2. Fragments are routed to subblocks if present using `self.match_fragment_recursive`
3. Any fragments matching a subblock are first evaluated by that subblock's template
4. If `add_constant=True`, constant sequences from any [`ConstantBlock`](/mrl/template.blocks.html#ConstantBlock) subblocks are added
5. Fragments are joined and fused using `self.join_fragments` and `self.fuse_fragments`
6. The fused fragments are processed by `self.eval_mol`

# scaffod
scaffold_smile = 'c1nc2c([1*:2])cncc2cc1[1*:1]'
scaffold_block = ConstantMolBlock(scaffold_smile, 'scaffold')

# R1, must have ring, be between 50-250 g/mol. must have 1 ring. ideally less thn 100-200 g/mol

r1_template = Template(
                    [MolWtFilter(50, 250),
                     RingFilter(1,1)],
                    [MolWtFilter(100, 200, 1)],
                    fail_score=-1
                    )

# R2, must have no rings, be between 0-200 g/mol. must have 0 rings. ideally less thn 50-150 g/mol

r2_template = Template(
                    [MolWtFilter(0, 200),
                     RingFilter(None,0)],
                    [MolWtFilter(50,150,1)],
                    fail_score=-1
                    )

# full compound, must be between 200 and 550 g/mol

full_template = Template(
                    [MolWtFilter(200, 550)],
                    fail_score=-1)


block_template = DoubleRGroupBlockTemplate(scaffold_smile, r1_template, r2_template, full_template)

frag_strings = ['Cc1c(CS(C)(=O)=O)cccc1NC(=O)N[2*:1].O=C(Cl)[2*:2]',
             'CC(C)Oc1ccc(C[NH2+][2*:1])cc1.CC(C)(CCO)C[2*:2]',
             'Cc1ccc(NC(=O)[2*:1])cc1S(C)(=O)=O.CC(C)CCNC(=O)N[2*:2]',
             'NS(=O)(=O)Cc1cccc([2*:1])c1.CC(C)(O)CC[2*:2]']

fused = ['Cc1c(CS(C)(=O)=O)cccc1NC(=O)Nc1cnc2c(C(=O)Cl)cncc2c1',
     'CC(C)Oc1ccc(C[NH2+]c2cnc3c(CC(C)(C)CCO)cncc3c2)cc1',
     'Cc1ccc(NC(=O)c2cnc3c(NC(=O)NCCC(C)C)cncc3c2)cc1S(C)(=O)=O',
     'CC(C)(O)CCc1cncc2cc(-c3cccc(CS(N)(=O)=O)c3)cnc12']

outputs = block_template.recurse_fragments(frag_strings, add_constant=True)

assert [i[1] for i in outputs] == fused
assert [i[2] for i in outputs] == [True, True, True, True]
assert [i[3] for i in outputs] == [1.0, 2.0, 1.0, 2.0]

# one constant and one variabe

# scaffod
scaffold_smile = 'c1nc2c([1*:2])cncc2cc1[1*:1]'
scaffold_block = ConstantMolBlock(scaffold_smile, 'scaffold')

# R1 has 3 groups - constant carbonyl, variable ring, variabe ring attachment

r1_carbonyl_block = ConstantMolBlock('C(O)(=O)[1*:4]', 'carbonyl')

r1_ring_substitution_tempate = Template(
                                        [RotBondFilter(0,3),
                                         RingFilter(None,0)],
                                        [RotBondFilter(0,2,1)],
                                        fail_score=-1,
                                        log=True
                                        )

r1_ring_substitution_block = MolBlock(r1_ring_substitution_tempate, ['1*:3'], 'r1 ring substitution')

r1_ring_template = Template(
                            [RingFilter(1,1),
                             RotBondFilter(0,1)],
                            [RingFilter(1,1,1)],
                            fail_score=-1,
                            log=True
                            )

r1_ring_block = MolBlock(r1_ring_template, ['2*:3', '2*:4', '2*:2'], 'r1_ring')


r1_full_group_template = Template(
                            [MolWtFilter(50,350)],
                            [MolWtFilter(100,200,1)],
                            fail_score=-1,
                            log=True
                            )

r1_block = MolBlock(r1_full_group_template, ['2*:2'], 'r1_full', 
                     subblocks=[r1_carbonyl_block, r1_ring_substitution_block, r1_ring_block])


# R1, must have no rings, be between 0-200 g/mol. must have 0 rings. ideally less thn 50-150 g/mol

r2_template = Template(
                        [MolWtFilter(0,200)],
                        [RingFilter(0,0)],
                        fail_score=-1,
                        log=True
                        )

r2_block = MolBlock(r2_template, ['2*:1'], 'r2')


# full compound, must be between 200 and 550 g/mol

full_template = Template(
                        [MolWtFilter(200,550),
                         StructureFilter(['[#6](-[#8])(=[#8])-[*]',
                             '[#6]1:[#7]:[#6]2:[#6](-[*]):[#6]:[#7]:[#6]:[#6]:2:[#6]:[#6]:1-[*]'
                             ], criteria='all', exclude=False)],
                        [MolWtFilter(250,500,1)],
                        fail_score=-1,
                        log=True
                        )

main_block = MolBlock(full_template, [], 'full_molecule', subblocks=[scaffold_block, r1_block, r2_block])

block_template = BlockTemplate(main_block)


frag_strings = ['CC(C)(C)SC[1*:3].Cc1cc([2*:3])c([2*:2])n1C[2*:4].Cc1noc(C)c1CO[2*:1]',
 'C=C(C)C[1*:3].Nc1c(N([2*:2])[2*:4])ncnc1[2*:3].Cc1ccc([2*:1])cc1[N+](=O)[O-]',
 'CCSCC[1*:3].Cc1ccc(N(C(=O)[2*:4])[2*:3])c([2*:2])c1.Clc1ncccc1N[2*:1]',
 'C#CCO[1*:3].O=c1n([2*:3])nc([2*:2])n1[2*:4].O=c1ccccn1C[2*:1]']

fused = ['Cc1noc(C)c1COc1cnc2c(-c3c(CSC(C)(C)C)cc(C)n3CC(=O)O)cncc2c1',
 'C=C(C)Cc1ncnc(N(C(=O)O)c2cncc3cc(-c4ccc(C)c([N+](=O)[O-])c4)cnc23)c1N',
 'CCSCCN(C(=O)C(=O)O)c1ccc(C)cc1-c1cncc2cc(Nc3cccnc3Cl)cnc12',
 'C#CCOn1nc(-c2cncc3cc(Cn4ccccc4=O)cnc23)n(C(=O)O)c1=O']

outputs = block_template.recurse_fragments(frag_strings, add_constant=True)

assert [i[1] for i in outputs] == fused
assert [i[2] for i in outputs] == [True, True, True, True]
assert [i[3] for i in outputs] == [3.0, 3.0, 1.0, 4.0]

outputs[0]

['CC(C)(C)SC[1*:3].Cc1cc([2*:3])c([2*:2])n1C[2*:4].Cc1noc(C)c1CO[2*:1]',
 'Cc1noc(C)c1COc1cnc2c(-c3c(CSC(C)(C)C)cc(C)n3CC(=O)O)cncc2c1',
 True,
 3.0]

# scaffod
scaffold_smile = '*c1cnc2ccncc2c1'


# R-group, must have ring, be between 50-250 g/mol. ideally less thn 100-200 g/mol

r_template = Template(
                    [MolWtFilter(50, 250),
                     RingFilter(1,1)],
                    [MolWtFilter(100, 200, 1)],
                    fail_score=-1, log=True
                    )


full_template = Template(
                    [MolWtFilter(200, 550)],
                    fail_score=-1, log=True)

block_template = RGroupBlockTemplate(scaffold_smile, r_template, 
                                    full_molecule_template=full_template)

test = block_template(['Cc1c(CS(C)(=O)=O)cccc1NC(=O)N*', 
                         'Cc1c(CS(C)(=O)=O)ccc1NC(=O)N*'], filter_type='hard')
assert test == [True, False]

block_template.recurse_fragments(['Cc1c(CS(C)(=O)=O)cccc1NCN[2*:1]'], add_constant=True)

[['Cc1c(CS(C)(=O)=O)cccc1NCN[2*:1]',
  'Cc1c(CS(C)(=O)=O)cccc1NCNc1cnc2ccncc2c1',
  True,
  0.0]]

block_template.recurse_fragments(['Cc1c(CS(C)(=O)=O)cccc1NCN*'], add_constant=True)

[['Cc1c(CS(C)(=O)=O)cccc1NCN[2*:1]',
  'Cc1c(CS(C)(=O)=O)cccc1NCNc1cnc2ccncc2c1',
  True,
  0.0]]

block_template.recurse_fragments(['Cc1c(CS(C)(=O)=O)cccc1NC(=O)N*'], add_constant=True)

[['Cc1c(CS(C)(=O)=O)cccc1NC(=O)N[2*:1]',
  'Cc1c(CS(C)(=O)=O)cccc1NC(=O)Nc1cnc2ccncc2c1',
  True,
  0.0]]

Blocks

Overview

Motivating Example: R-group Optimization

The Block Class

Fragment Routing

Fragment Routing Convention

The BlockTemplate Class

Code Example

`class` `Block`[source]

`class` `MolBlock`[source]

`MolBlock.recurse_fragments`[source]

`class` `ConstantBlock`[source]

`class` `ConstantMolBlock`[source]

`class` `BlockTemplate`[source]

`class` `RGroupBlockTemplate`[source]

`class` `DoubleRGroupBlockTemplate`[source]

`class` `LinkerBlockTemplate`[source]

`class` `ScaffoldBlockTemplate`[source]

Blocks

Overview

Motivating Example: R-group Optimization

The Block Class

Fragment Routing

Fragment Routing Convention

The BlockTemplate Class

Code Example

class Block[source]

class MolBlock[source]

MolBlock.recurse_fragments[source]

class ConstantBlock[source]

class ConstantMolBlock[source]

class BlockTemplate[source]

class RGroupBlockTemplate[source]

class DoubleRGroupBlockTemplate[source]

class LinkerBlockTemplate[source]

class ScaffoldBlockTemplate[source]

`class` `Block`[source]

`class` `MolBlock`[source]

`MolBlock.recurse_fragments`[source]

`class` `ConstantBlock`[source]

`class` `ConstantMolBlock`[source]

`class` `BlockTemplate`[source]

`class` `RGroupBlockTemplate`[source]

`class` `DoubleRGroupBlockTemplate`[source]

`class` `LinkerBlockTemplate`[source]

`class` `ScaffoldBlockTemplate`[source]