Overview
An important requirement in using generative models for compound design is that the structures produced by the model should fit the desired compound profile. The best way to meet this requirement is to curate training datasets of compounds that match the compound profile.
One alternative is to penalize the generative model for creating compounds that don't match the profile, but this process tends to be slow, and trying to "learn" a compound profile this way can be very unstable. If the generative model is trained on a very general dataset while the compound profile is very refined, 99+% of generated compounds may fail to match the profile. This makes learning very slow and will likely cause the model to suffer mode collapse.
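To get a feel for why pass rates collapse, consider a hypothetical profile built from 16 independent filters, each of which passes 75% of generated compounds (the numbers here are purely illustrative, not measured):

```python
# Hypothetical numbers for illustration: 16 independent filters,
# each passing 75% of generated compounds.
n_filters = 16
per_filter_pass = 0.75
joint_pass_rate = per_filter_pass ** n_filters  # roughly 1% pass all filters
```

Even with individually permissive filters, the joint pass rate drops to around 1%, which is why curating a matching training set up front is preferable to penalizing failures during training.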
This tutorial shows how to use MRL functions to build and curate chemical datasets.
import sys
sys.path.append('..')
from mrl.imports import *
from mrl.core import *
from mrl.chem import *
from mrl.templates.all import *
from rdkit import Chem
Template Design
The first step in building a targeted dataset is to design a template that matches the target compound profile. For this tutorial, we set the following constraints:
- Compounds must be chemically valid
- Compounds must be single molecules
- Compounds must have 8 or fewer rotatable bonds
- Compounds must have 8 or fewer heteroatoms
- Compounds must have 0 net charge
- Compounds must have no rings larger than 6 atoms or smaller than 5 atoms
- Compounds must have 5 or fewer hydrogen bond donors
- Compounds must have 10 or fewer hydrogen bond acceptors
- Compounds must be less than 500 g/mol
- Compounds must have a calculated LogP of 5 or lower
- Compounds must have a synthetic accessibility score of 7 or lower
- Compounds must have no bridge head carbons
- Compounds must pass PAINS A filters
- Compounds must not contain any of the listed exclusion smarts
- Compounds must not have a flexible chain section of longer than 7 atoms
smarts = ['[#6](=[#16])(-[#7])-[#7]',
          '[H]-[#6](-[H])=[#6]-[*]',
          '[#6]=[#6]=[#6]',
          '[#7][F,Cl,Br,I]',
          '[#6;!R]=[#6;!R]-[#6;!R]=[#6;!R]']
template = Template([ValidityFilter(),
                     SingleCompoundFilter(),
                     RotBondFilter(None, 8),
                     HeteroatomFilter(None, 8),
                     ChargeFilter(None, 0),
                     MaxRingFilter(None, 6),
                     MinRingFilter(5, None),
                     HBDFilter(None, 5),
                     HBAFilter(None, 10),
                     MolWtFilter(None, 500),
                     LogPFilter(None, 5),
                     SAFilter(None, 7),
                     BridgeheadFilter(None, 0),
                     PAINSAFilter(),
                     ExclusionFilter(smarts, criteria='any'),
                     RotChainFilter(None, 7)],
                    [],
                    fail_score=-1., log=True, use_lookup=True)
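Conceptually, the first list passed to Template holds hard filters: a compound passes only if every filter accepts it. A minimal sketch of that behavior (this is a toy illustration, not the MRL API):

```python
# Toy stand-ins for hard filters; real MRL filters operate on parsed molecules
filters = [
    lambda smile: '.' not in smile,   # single-compound check
    lambda smile: len(smile) <= 40,   # crude size proxy
]

def hard_pass(smile):
    # a compound passes only if every filter returns True
    return all(f(smile) for f in filters)
```

A compound failing any single filter is rejected outright, which is what makes hard filters well suited to dataset curation.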
df = pd.read_csv('../files/smiles.csv')
# if in Colab:
# download_files()
# df = pd.read_csv('files/smiles.csv')
df.shape
hp = template(df.smiles.values)
df = df[hp]
df.shape
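Calling the template on an array of SMILES returns one boolean per compound (the "hard pass" mask), and indexing the DataFrame with that mask keeps only the passing rows. The same pattern in plain Python, with a stand-in mask:

```python
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
hard_pass = [True, False, True]  # stand-in for template(smiles)
# keep only the compounds whose mask entry is True
kept = [s for s, ok in zip(smiles, hard_pass) if ok]
```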
This gives us a refined dataset of compounds that match our profile. Training a generative model on this refined dataset will yield much higher quality compounds.
Fragment Datasets
Often we want to optimize a section of a molecule while keeping the rest constant. The best way to do this is to have the model generate only the part of the compound that needs to be changed.
If you want to use a generative screen to optimize an R-group, it is more efficient to train an R-group specific model rather than a full-molecule model.
We will look at curating datasets for R-group, linker and scaffold generation tasks.
We will use the fragment_smile function to slice compounds into fragments. Then we will screen the fragments.
df = pd.read_csv('../files/smiles.csv')
# if in Colab:
# download_files()
# df = pd.read_csv('files/smiles.csv')
cuts = [1,2,3,4,5,6]
fragment_function = partial(fragment_smile, cuts=cuts)
fragments = new_pool_parallel(fragment_function, df.smiles.values, cpus=0)
fragments = flatten_list_of_lists(fragments)
fragments = list(set(fragments))
len(fragments)
draw_mols(to_mols(fragments[:3]))
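The post-processing above (flatten the per-molecule fragment lists, then deduplicate) can be sketched in plain Python; fragment_stub below is a hypothetical stand-in for fragment_smile:

```python
from itertools import chain

def fragment_stub(smile):
    # stand-in for fragment_smile: pretend each molecule yields two fragments
    return [smile + '_a', smile + '_b']

per_molecule = [fragment_stub(s) for s in ['CCO', 'CCC']]
flat = list(chain.from_iterable(per_molecule))  # flatten_list_of_lists
unique = sorted(set(flat))                      # deduplicate
```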
This gives us ~87,000 unique fragments with a wide variety of sizes and attachment configurations. Next we want to sort these fragments into R-groups, linkers and scaffolds.
First we define the filter criteria:
R-groups:
- One * attachment point
- Less than 200 g/mol

Linkers:
- Two * attachment points
- No rings
- 10 or fewer heavy atoms

Scaffolds:
- At least two * attachment points
- At least one ring
- No flexible chains longer than 5 atoms
We'll define a new property filter to check the number of attachment points.
def count_attachments(smile):
    "Counts the number of * attachment points in a SMILES string"
    smile = to_smile(smile)
    return smile.count('*')

class AttachmentFilter(PropertyFilter):
    "Attachment Filter"
    def __init__(self, min_val, max_val, score=None, name=None, **kwargs):
        super().__init__(count_attachments, min_val=min_val, max_val=max_val,
                         score=score, name=name, **kwargs)
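As a sanity check on the counting logic, an R-group fragment carries exactly one * while a linker carries two. A plain-string version (skipping the SMILES conversion step, for illustration only):

```python
def count_attachments_str(smile):
    # each '*' in a fragment SMILES marks an open attachment point
    return smile.count('*')

rgroup_count = count_attachments_str('*CC(=O)N')  # R-group: one attachment
linker_count = count_attachments_str('*CCOCC*')   # linker: two attachments
```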
Now we make templates for the different compound groups
rgroup_template = Template([ValidityFilter(),
                            SingleCompoundFilter(),
                            AttachmentFilter(1, 1),
                            MolWtFilter(None, 200)],
                           [],
                           fail_score=-1,
                           log=True,
                           use_lookup=True)
linker_template = Template([ValidityFilter(),
                            SingleCompoundFilter(),
                            AttachmentFilter(2, 2),
                            RingFilter(None, 0),
                            HeavyAtomsFilter(None, 10)],
                           [],
                           fail_score=-1,
                           log=True,
                           use_lookup=True)
scaffold_template = Template([ValidityFilter(),
                              SingleCompoundFilter(),
                              AttachmentFilter(2, None),
                              RingFilter(1, None),
                              RotChainFilter(None, 5)],
                             [],
                             fail_score=-1,
                             log=True,
                             use_lookup=True)
frag_df = pd.DataFrame(fragments, columns=['smiles'])
frag_df['rgroup_hp'] = rgroup_template(frag_df.smiles.values)
frag_df['linker_hp'] = linker_template(frag_df.smiles.values)
frag_df['scaffold_hp'] = scaffold_template(frag_df.smiles.values)
frag_df[['rgroup_hp', 'linker_hp', 'scaffold_hp']].mean()
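Because the hp columns are boolean, the mean of each column is simply the fraction of fragments passing that template:

```python
# mean of a boolean column == fraction of True entries (the pass rate)
linker_hp = [True, False, False, True]
pass_rate = sum(linker_hp) / len(linker_hp)
```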
draw_mols(to_mols(frag_df[frag_df.rgroup_hp].sample(n=5).smiles.values), mols_per_row=5)
draw_mols(to_mols(frag_df[frag_df.linker_hp].sample(n=5).smiles.values), mols_per_row=5)
draw_mols(to_mols(frag_df[frag_df.scaffold_hp].sample(n=5).smiles.values), mols_per_row=5)