Blocks are used for advanced templating where different templates are applied to different sections of the molecule
/home/dmai/miniconda3/envs/mrl/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: to-Python converter for boost::shared_ptr<RDKit::FilterCatalogEntry const> already registered; second conversion method ignored.
  return f(*args, **kwds)

Overview

Templates define a chemical space using a set of hard and soft filters to evaluate a molecule. For greater control, we may wish to apply different templates to different parts of the molecule. With the Block and BlockTemplate classes and some modifications to how we represent SMILES strings, we can do this.

For a detailed example of using Blocks, see the Block Tutorial page

Motivating Example: R-group Optimization

Say we have a compound of the form R1-scaffold-R2, where the scaffold is a defined structure. We want to run a generative screen on the two R-groups while keeping the scaffold constant. We also have preferences on the chemical structures present at each R-group. We want R1 to contain at least 1 ring, and we want R2 to have no rings.

One way to approach this would be to build a dataset of compounds with the format R1-scaffold-R2 where all R1 and R2 structures conform to our desired properties. We would then train a generative model on this dataset and run a standard generative screen. However, this runs the risk of generated compounds deviating from our desired chemotype. This means we need to create a template that enforces the R1-scaffold-R2 chemotype with our stated constraints on R1 and R2.

This leads to the problem of factoring arbitrary chemical structures into our R1-scaffold-R2, with all sorts of nasty problems and edge cases cropping up if the model deviates from this chemotype during training. Doing so would require defining a wide variety of SMARTS strings and fragmentation rules to deal with matching arbitrary compounds to the R1-scaffold-R2 chemotype.

Comparatively, it is much easier to generate a string of molecular fragments and assemble them into a single compound. Rather than generating compounds of the form R1-scaffold-R2, we generate fragments of the form *R1.*scaffold*.*R2. Since our scaffold is defined, we can simplify this to generating *R1.*R2, which is helpful because shorter sequence spaces are easier to learn. In doing so, we have embedded our chemotype prior into the generation process.

Now we can evaluate and assemble compounds to compare them to our desired constraints. We create different templates for R1, R2 and the full compound R1-scaffold-R2.

The model generates some sequence of the form *R1.*R2. We split this into *R1 and *R2, and evaluate each R group with their respective template. Then we add in the scaffold, fuse the fragments into the full molecule, and evaluate the full molecule template.

When we generate and evaluate compounds in this way, we can build in a strong guarantee that compounds passing this tiered templating process will conform to our specified chemotype.

The Block Class

The process described above is implemented using the Block class. A block is a general abstraction for a chunk of a molecule. The Block class holds a Template describing desired molecular properties and additional information specifying how the block links to other blocks.

Blocks are assembled into a tree structure, with some blocks nested within other blocks. For the above example, we would have four blocks in total. We would have two MolBlock blocks for R1 and R2 and a ConstantMolBlock for the pre-defined scaffold. The block structure would look like this:

Block tree:

Block 1 - full molecule
    Block 2 - scaffold (constant)
    Block 3 - R1 + R1 Template
    Block 4 - R2 + R2 Template

If we passed our *R1.*R2 generated sequence to the head block, Block 1, the sequence would be broken down into fragments and routed to subblocks. R1 would be screened against the R1 Template in Block 3. R2 would be screened against the R2 template in Block 4. Then R1 and R2 would be merged with the constant scaffold in Block 2 to create the full R1-scaffold-R2 compound. Then the full compound would be screened by the full molecule template in Block 1.

Fragment Routing

One important detail is how fragments are routed to different blocks. If the model generates *R1.*R2, how do we know that R1 should be routed to one block and R2 should be routed to a different block? We specify block assignments using isotope and atom map numbers on the wildcard atoms.

Map numbers are used to determine which atoms are fused together. If we had a fragment sequence *X.*Y*.*Z, we don't know how to assemble the fragments. If we add atom map numbers to get [*:2]X.[*:1]Y[*:2].[*:1]Z, then we can definitely assemble the sequence into [*:2]X.[*:1]Y[*:2].[*:1]Z >> Z[*:1].[*:1]Y[*:2].[*:2]X >> Z-Y-X.

Isotope numbers are used to differentiate fragments with the same map number. Say we have a fragment sequence of [*:1]X.[*:1]Y, and we want X to be evaluated by one template and Y to be evaluated by another. We need a way of uniquely distinguishing one fragment from another. By adding isotopes to wildcards in addition to map numbers, ie [*:1]X >> [1*:1]X, we can uniquely match fragments to blocks.

Going back to the previous example, our framework of *R1.*scaffold*.*R2 would become [2*:1]R1.[1*:1]scaffold[1*:2].[2*:2]R2, and our generative model would create sequences of the form [2*:1]R1.[2*:2]R2. This is easily implemented by adding new tokens to the vocabulary of our generative model.

These links would be specified in the block structure:

Block tree:

Block 1 - full molecule, links=[]
    Block 2 - scaffold (constant), links=['[1*:1]', '[1*:2]']
    Block 3 - R1 + R1 Template, links=['[2*:1]']
    Block 4 - R2 + R2 Template, links=['[2*:2]']

When analyzing a fragment sequence, a block will extract the isotope/map number links of {isotope}*:{map_num} from each fragment. The links present in a fragment are compared to the links specified in the Block, as well as any subblocks contained within the block.

Fragment Routing Convention

Wildcards should be expressed as [{isotope}*:{map_number}]. The isotope value sould be 1 or 2. 0 should not be used as RDKit canonicalization removes zero isotope values from wildcards.

The BlockTemplate Class

Once the block tree is assembled, it can be used to create a BlockTemplate. The BlockTemplate class is a wrapper for a series of nested Blocks designed to behave like the Template class. BlockTemplate has the same __call__ API as Template, and provides saving/loading functions and data logging, as well as easy access to nodes via unnested lists and dictionaries.

Code Example

For the R1-scaffold-R2 example, setting up the blocks would look something like this

r1_template = Template(r1_hard_filters, r1_soft_filters)
r1_links = ['2*:1']
r1_block = MolBlock(r1_template, r1_links, name='r1')

r2_template = Template(r2_hard_filters, r2_soft_filters)
r2_links = ['2*:2']
r2_block = MolBlock(r2_template, r2_links, name='r2')

scaffold_links = ['1*:1', '1*:2']
scaffold_block = ConstantMolBlock(scaffold_smile, name='scaffold', links=scaffold_links)

full_molecule_template = Template(full_hard_filters, full_soft_filters)
main_block = MolBlock(full_molecule_template, [], name='full_molecule', subblocks=[r1_block, r2_block, scaffold_block])

block_template = BlockTemplate(main_block)

class Block[source]

Block(template, links, name, subblocks=None)

Block - base class for Blocks

Inputs:

  • template Template: Template subclass

  • links list[str]: list, defines links between this block and other blocks

  • name str: block name

  • subblocks list[Block]: list of Block classes nested within this block

class MolBlock[source]

MolBlock(template, links, name, subblocks=None) :: Block

MolBlock - Block subclass specific to working with smiles strings. This class expects links between SMILES fragments to be wildcard atoms of the form {isotope}*:{map_number}. Note that '0' should not be used as an isotope number because RDKit removes '0' isotopes from SMILES strings automatically.

MolBlock.recurse_fragments[source]

MolBlock.recurse_fragments(fragments, add_constant=True)

recurse_fragments - recursively evlauates fragments against self.template and all blocks in self.subblocks

Inputs:

  • fragments [str, list[str]]: fragments to process. Can either be a single string of the form 'f1.f2.f3' or a list of the form ['f1','f2','f3']. All items in fragments should correspond to the same final molecule.

  • add_constant bool: If True, constant sequences in any ConstantBlock subclasses are added to fragments during evaluation. Should be True if constant sequences are missing from fragments or False if they are present

Returns:

  • fused str: fragments fused at this stage

  • total_pass bool: True if fragments passed all subblock templates and fused passed self.template

  • total_score float: sum of scores from self.template.soft_filters and subblock template soft filters

  • output_dicts list[dict]: list of dictionaries holding information from this block and subblocks

Recurse fragments works in the following way:

1. Fragments are decomposed based on `self.decompose_fragments`
2. Fragments are routed to subblocks if present using `self.match_fragment_recursive`
3. Any fragments matching a subblock are first evaluated by that subblock's template
4. If `add_constant=True`, constant sequences from any [`ConstantBlock`](/mrl/template.blocks.html#ConstantBlock) subblocks are added
5. Fragments are joined and fused using `self.join_fragments` and `self.fuse_fragments`
6. The fused fragments are processed by `self.eval_mol`

class ConstantBlock[source]

ConstantBlock(constant, name)

ConstantBlock - base block class for constant sequence

class ConstantMolBlock[source]

ConstantMolBlock(smile, name, links=None) :: ConstantBlock

ConstantMolBlock - constant block for SMILES sequence

class BlockTemplate[source]

BlockTemplate(head_block, lookup=True)

BlockTemplate - base class for handling nested blocks. Takes care of running fragment strings and logging outputs

class RGroupBlockTemplate[source]

RGroupBlockTemplate(base_smile, rgroup_template, full_molecule_template=None, replace_wildcard=True, lookup=True) :: BlockTemplate

RGroupBlockTemplate - block template for r-group screening

Inputs:

  • base_smile str: base smile to attach r-group to. Should have a single unmapped wildcard atom, ie *CCCC

  • rgroup_template Template: template for screening r-groups

  • full_molecule_template Optional[Template]: Optional template for full molecule

class DoubleRGroupBlockTemplate[source]

DoubleRGroupBlockTemplate(base_smile, r1_template, r2_template, full_molecule_template=None, lookup=True) :: BlockTemplate

DoubleRGroupBlockTemplate - block template for screening two r-groups

Inputs:

  • base_smile str: base smile to attach r-group to. Should have two mapped wildcard atoms, ie 'c1nc2c([1*:1])cncc2cc1[1*:2]'. Rgroup1 will be fused to wildcard 1*:1 and Rgroup 2 will be fused to wildcard 1*:2

  • r1_template Template: template for screening rgroup 1

  • r2_template Template: template for screening rgroup 2

  • full_molecule_template Optional[Template]: Optional template for full molecule

class LinkerBlockTemplate[source]

LinkerBlockTemplate(smile1, smile2, linker_template, full_molecule_template=None, lookup=True) :: BlockTemplate

LinkerBlockTemplate - block template for screening linkers

Inputs:

  • smile1 str: left-side linker attachment. Should have a single unmapped wildcard, ie *CCCC

  • smile2 str: right-side linker attachment. Should have a single unmapped wildcard, ie *CCCC

  • linker_template Template: template for screening the linker

  • full_molecule_template Optional[Template]: Optional template for full molecule

class ScaffoldBlockTemplate[source]

ScaffoldBlockTemplate(attachments, scaffold_template, full_molecule_template=None, lookup=True) :: BlockTemplate

ScaffoldBlockTemplate - block template for screening scaffolds or rings with multiple attachments

Inputs:

  • attachments list: list of mapped attachments. All attachments should have a single wildcard atom mapped following the format [{isotope}*:{map_num}], ie ['[1:*1]CC', '[1:*2]CCC']

  • scaffold_template Template: template for screening the scaffold

  • full_molecule_template Optional[Template]: Optional template for full molecule

# scaffod
scaffold_smile = 'c1nc2c([1*:2])cncc2cc1[1*:1]'
scaffold_block = ConstantMolBlock(scaffold_smile, 'scaffold')

# R1, must have ring, be between 50-250 g/mol. must have 1 ring. ideally less thn 100-200 g/mol

r1_template = Template(
                    [MolWtFilter(50, 250),
                     RingFilter(1,1)],
                    [MolWtFilter(100, 200, 1)],
                    fail_score=-1
                    )

# R2, must have no rings, be between 0-200 g/mol. must have 0 rings. ideally less thn 50-150 g/mol

r2_template = Template(
                    [MolWtFilter(0, 200),
                     RingFilter(None,0)],
                    [MolWtFilter(50,150,1)],
                    fail_score=-1
                    )

# full compound, must be between 200 and 550 g/mol

full_template = Template(
                    [MolWtFilter(200, 550)],
                    fail_score=-1)


block_template = DoubleRGroupBlockTemplate(scaffold_smile, r1_template, r2_template, full_template)

frag_strings = ['Cc1c(CS(C)(=O)=O)cccc1NC(=O)N[2*:1].O=C(Cl)[2*:2]',
             'CC(C)Oc1ccc(C[NH2+][2*:1])cc1.CC(C)(CCO)C[2*:2]',
             'Cc1ccc(NC(=O)[2*:1])cc1S(C)(=O)=O.CC(C)CCNC(=O)N[2*:2]',
             'NS(=O)(=O)Cc1cccc([2*:1])c1.CC(C)(O)CC[2*:2]']

fused = ['Cc1c(CS(C)(=O)=O)cccc1NC(=O)Nc1cnc2c(C(=O)Cl)cncc2c1',
     'CC(C)Oc1ccc(C[NH2+]c2cnc3c(CC(C)(C)CCO)cncc3c2)cc1',
     'Cc1ccc(NC(=O)c2cnc3c(NC(=O)NCCC(C)C)cncc3c2)cc1S(C)(=O)=O',
     'CC(C)(O)CCc1cncc2cc(-c3cccc(CS(N)(=O)=O)c3)cnc12']

outputs = block_template.recurse_fragments(frag_strings, add_constant=True)

assert [i[1] for i in outputs] == fused
assert [i[2] for i in outputs] == [True, True, True, True]
assert [i[3] for i in outputs] == [1.0, 2.0, 1.0, 2.0]
# one constant and one variabe

# scaffod
scaffold_smile = 'c1nc2c([1*:2])cncc2cc1[1*:1]'
scaffold_block = ConstantMolBlock(scaffold_smile, 'scaffold')

# R1 has 3 groups - constant carbonyl, variable ring, variabe ring attachment

r1_carbonyl_block = ConstantMolBlock('C(O)(=O)[1*:4]', 'carbonyl')

r1_ring_substitution_tempate = Template(
                                        [RotBondFilter(0,3),
                                         RingFilter(None,0)],
                                        [RotBondFilter(0,2,1)],
                                        fail_score=-1,
                                        log=True
                                        )

r1_ring_substitution_block = MolBlock(r1_ring_substitution_tempate, ['1*:3'], 'r1 ring substitution')

r1_ring_template = Template(
                            [RingFilter(1,1),
                             RotBondFilter(0,1)],
                            [RingFilter(1,1,1)],
                            fail_score=-1,
                            log=True
                            )

r1_ring_block = MolBlock(r1_ring_template, ['2*:3', '2*:4', '2*:2'], 'r1_ring')


r1_full_group_template = Template(
                            [MolWtFilter(50,350)],
                            [MolWtFilter(100,200,1)],
                            fail_score=-1,
                            log=True
                            )

r1_block = MolBlock(r1_full_group_template, ['2*:2'], 'r1_full', 
                     subblocks=[r1_carbonyl_block, r1_ring_substitution_block, r1_ring_block])


# R1, must have no rings, be between 0-200 g/mol. must have 0 rings. ideally less thn 50-150 g/mol

r2_template = Template(
                        [MolWtFilter(0,200)],
                        [RingFilter(0,0)],
                        fail_score=-1,
                        log=True
                        )

r2_block = MolBlock(r2_template, ['2*:1'], 'r2')


# full compound, must be between 200 and 550 g/mol

full_template = Template(
                        [MolWtFilter(200,550),
                         StructureFilter(['[#6](-[#8])(=[#8])-[*]',
                             '[#6]1:[#7]:[#6]2:[#6](-[*]):[#6]:[#7]:[#6]:[#6]:2:[#6]:[#6]:1-[*]'
                             ], criteria='all', exclude=False)],
                        [MolWtFilter(250,500,1)],
                        fail_score=-1,
                        log=True
                        )

main_block = MolBlock(full_template, [], 'full_molecule', subblocks=[scaffold_block, r1_block, r2_block])

block_template = BlockTemplate(main_block)


frag_strings = ['CC(C)(C)SC[1*:3].Cc1cc([2*:3])c([2*:2])n1C[2*:4].Cc1noc(C)c1CO[2*:1]',
 'C=C(C)C[1*:3].Nc1c(N([2*:2])[2*:4])ncnc1[2*:3].Cc1ccc([2*:1])cc1[N+](=O)[O-]',
 'CCSCC[1*:3].Cc1ccc(N(C(=O)[2*:4])[2*:3])c([2*:2])c1.Clc1ncccc1N[2*:1]',
 'C#CCO[1*:3].O=c1n([2*:3])nc([2*:2])n1[2*:4].O=c1ccccn1C[2*:1]']

fused = ['Cc1noc(C)c1COc1cnc2c(-c3c(CSC(C)(C)C)cc(C)n3CC(=O)O)cncc2c1',
 'C=C(C)Cc1ncnc(N(C(=O)O)c2cncc3cc(-c4ccc(C)c([N+](=O)[O-])c4)cnc23)c1N',
 'CCSCCN(C(=O)C(=O)O)c1ccc(C)cc1-c1cncc2cc(Nc3cccnc3Cl)cnc12',
 'C#CCOn1nc(-c2cncc3cc(Cn4ccccc4=O)cnc23)n(C(=O)O)c1=O']

outputs = block_template.recurse_fragments(frag_strings, add_constant=True)

assert [i[1] for i in outputs] == fused
assert [i[2] for i in outputs] == [True, True, True, True]
assert [i[3] for i in outputs] == [3.0, 3.0, 1.0, 4.0]
outputs[0]
['CC(C)(C)SC[1*:3].Cc1cc([2*:3])c([2*:2])n1C[2*:4].Cc1noc(C)c1CO[2*:1]',
 'Cc1noc(C)c1COc1cnc2c(-c3c(CSC(C)(C)C)cc(C)n3CC(=O)O)cncc2c1',
 True,
 3.0]
# scaffod
scaffold_smile = '*c1cnc2ccncc2c1'


# R-group, must have ring, be between 50-250 g/mol. ideally less thn 100-200 g/mol

r_template = Template(
                    [MolWtFilter(50, 250),
                     RingFilter(1,1)],
                    [MolWtFilter(100, 200, 1)],
                    fail_score=-1, log=True
                    )


full_template = Template(
                    [MolWtFilter(200, 550)],
                    fail_score=-1, log=True)

block_template = RGroupBlockTemplate(scaffold_smile, r_template, 
                                    full_molecule_template=full_template)

test = block_template(['Cc1c(CS(C)(=O)=O)cccc1NC(=O)N*', 
                         'Cc1c(CS(C)(=O)=O)ccc1NC(=O)N*'], filter_type='hard')
assert test == [True, False]
block_template.recurse_fragments(['Cc1c(CS(C)(=O)=O)cccc1NCN[2*:1]'], add_constant=True)
[['Cc1c(CS(C)(=O)=O)cccc1NCN[2*:1]',
  'Cc1c(CS(C)(=O)=O)cccc1NCNc1cnc2ccncc2c1',
  True,
  0.0]]
block_template.recurse_fragments(['Cc1c(CS(C)(=O)=O)cccc1NCN*'], add_constant=True)
[['Cc1c(CS(C)(=O)=O)cccc1NCN[2*:1]',
  'Cc1c(CS(C)(=O)=O)cccc1NCNc1cnc2ccncc2c1',
  True,
  0.0]]
block_template.recurse_fragments(['Cc1c(CS(C)(=O)=O)cccc1NC(=O)N*'], add_constant=True)
[['Cc1c(CS(C)(=O)=O)cccc1NC(=O)N[2*:1]',
  'Cc1c(CS(C)(=O)=O)cccc1NC(=O)Nc1cnc2ccncc2c1',
  True,
  0.0]]