Overview
An important requirement in using generative models for compound design is that the structures produced by the model should fit the desired compound profile. The best way to meet this requirement is to curate training datasets of compounds that match the compound profile.
One alternative is to penalize the generative model for creating compounds that don't match the profile, but this process tends to be slow, and trying to "learn" a compound profile this way can be very unstable. If the generative model is trained on a very general dataset while the compound profile is very refined, 99+% of generated compounds may fail to match the profile. This makes learning very slow and will likely cause the model to suffer mode collapse.
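To get a feel for why pass rates collapse, consider a hypothetical profile built from 16 independent filters, each of which passes 75% of generated compounds (the numbers here are purely illustrative, not measured):

```python
# Hypothetical numbers for illustration: 16 independent filters,
# each passing 75% of generated compounds.
n_filters = 16
per_filter_pass = 0.75
joint_pass_rate = per_filter_pass ** n_filters  # roughly 1% pass all filters
```

Even with individually permissive filters, the joint pass rate drops to around 1%, which is why curating a matching training set up front is preferable to penalizing failures during training.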
This tutorial shows how to use MRL functions to build and curate chemical datasets.
import sys
sys.path.append('..')
from mrl.imports import *
from mrl.core import *
from mrl.chem import *
from mrl.templates.all import *
from rdkit import Chem
Template Design
The first step in building a targeted dataset is to design a template that matches the target compound profile. For this tutorial, we set the following constraints:
- Compounds must be chemically valid
- Compounds must be single molecules
- Compounds must have 8 or fewer rotatable bonds
- Compounds must have 8 or fewer heteroatoms
- Compounds must have 0 net charge
- Compounds must have no rings larger than 6 atoms or smaller than 5 atoms
- Compounds must have 5 or fewer hydrogen bond donors
- Compounds must have 10 or fewer hydrogen bond acceptors
- Compounds must be less than 500 g/mol
- Compounds must have a calculated LogP of 5 or lower
- Compounds must have a synthetic accessibility score of 7 or lower
- Compounds must have no bridge head carbons
- Compounds must pass PAINS A filters
- Compounds must not contain any of the listed exclusion smarts
- Compounds must not have a flexible chain section of longer than 7 atoms
smarts = ['[#6](=[#16])(-[#7])-[#7]',
          '[H]-[#6](-[H])=[#6]-[*]',
          '[#6]=[#6]=[#6]',
          '[#7][F,Cl,Br,I]',
          '[#6;!R]=[#6;!R]-[#6;!R]=[#6;!R]']
template = Template([ValidityFilter(),
                     SingleCompoundFilter(),
                     RotBondFilter(None, 8),
                     HeteroatomFilter(None, 8),
                     ChargeFilter(None, 0),
                     MaxRingFilter(None, 6),
                     MinRingFilter(5, None),
                     HBDFilter(None, 5),
                     HBAFilter(None, 10),
                     MolWtFilter(None, 500),
                     LogPFilter(None, 5),
                     SAFilter(None, 7),
                     BridgeheadFilter(None, 0),
                     PAINSAFilter(),
                     ExclusionFilter(smarts, criteria='any'),
                     RotChainFilter(None, 7)],
                    [],
                    fail_score=-1., log=True, use_lookup=True)
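Conceptually, the first list passed to Template holds hard filters: a compound passes only if every filter accepts it. A minimal sketch of that behavior (this is a toy illustration, not the MRL API):

```python
# Toy stand-ins for hard filters; real MRL filters operate on parsed molecules
filters = [
    lambda smile: '.' not in smile,   # single-compound check
    lambda smile: len(smile) <= 40,   # crude size proxy
]

def hard_pass(smile):
    # a compound passes only if every filter returns True
    return all(f(smile) for f in filters)
```

A compound failing any single filter is rejected outright, which is what makes hard filters well suited to dataset curation.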
df = pd.read_csv('../files/smiles.csv')
# if in Colab:
# download_files()
# df = pd.read_csv('files/smiles.csv')
df.shape
hp = template(df.smiles.values)
df = df[hp]
df.shape
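Calling the template on an array of SMILES returns one boolean per compound (the "hard pass" mask), and indexing the DataFrame with that mask keeps only the passing rows. The same pattern in plain Python, with a stand-in mask:

```python
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
hard_pass = [True, False, True]  # stand-in for template(smiles)
# keep only the compounds whose mask entry is True
kept = [s for s, ok in zip(smiles, hard_pass) if ok]
```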
This gives us a refined dataset of compounds that match our profile. Training a generative model on this refined dataset will yield much higher quality compounds.
Fragment Datasets
Often we want to optimize a section of a molecule while keeping the rest constant. The best way to do this is to have the model generate only the part of the compound that needs to be changed.
If you want to use a generative screen to optimize an R-group, it is more efficient to train an R-group specific model rather than a full-molecule model.
We will look at curating datasets for R-group, linker and scaffold generation tasks.
We will use the fragment_smile function to slice compounds into fragments. Then we will screen the fragments.
df = pd.read_csv('../files/smiles.csv')
# if in Colab:
# download_files()
# df = pd.read_csv('files/smiles.csv')
cuts = [1,2,3,4,5,6]
fragment_function = partial(fragment_smile, cuts=cuts)
fragments = new_pool_parallel(fragment_function, df.smiles.values, cpus=0)
fragments = flatten_list_of_lists(fragments)
fragments = list(set(fragments))
len(fragments)
draw_mols(to_mols(fragments[:3]))
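The post-processing above (flatten the per-molecule fragment lists, then deduplicate) can be sketched in plain Python; fragment_stub below is a hypothetical stand-in for fragment_smile:

```python
from itertools import chain

def fragment_stub(smile):
    # stand-in for fragment_smile: pretend each molecule yields two fragments
    return [smile + '_a', smile + '_b']

per_molecule = [fragment_stub(s) for s in ['CCO', 'CCC']]
flat = list(chain.from_iterable(per_molecule))  # flatten_list_of_lists
unique = sorted(set(flat))                      # deduplicate
```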
This gives us ~87,000 unique fragments with a wide variety of sizes and attachment configurations. Next we want to sort these fragments into R-groups, linkers and scaffolds.
First we define the filter criteria:
R-groups:
- One * attachment point
- Less than 200 g/mol

Linkers:
- Two * attachment points
- No rings
- 10 or fewer heavy atoms

Scaffolds:
- At least two * attachment points
- At least one ring
- No flexible chains longer than 5 atoms
We'll define a new property filter to check the number of attachment points.
def count_attachments(smile):
    "Counts the number of * attachment points in a SMILES string"
    smile = to_smile(smile)
    return smile.count('*')

class AttachmentFilter(PropertyFilter):
    "Attachment Filter"
    def __init__(self, min_val, max_val, score=None, name=None, **kwargs):
        super().__init__(count_attachments, min_val=min_val, max_val=max_val,
                         score=score, name=name, **kwargs)
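As a sanity check on the counting logic, an R-group fragment carries exactly one * while a linker carries two. A plain-string version (skipping the SMILES conversion step, for illustration only):

```python
def count_attachments_str(smile):
    # each '*' in a fragment SMILES marks an open attachment point
    return smile.count('*')

rgroup_count = count_attachments_str('*CC(=O)N')  # R-group: one attachment
linker_count = count_attachments_str('*CCOCC*')   # linker: two attachments
```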
Now we make templates for the different compound groups
rgroup_template = Template([ValidityFilter(),
                            SingleCompoundFilter(),
                            AttachmentFilter(1, 1),
                            MolWtFilter(None, 200)],
                           [],
                           fail_score=-1,
                           log=True,
                           use_lookup=True)
linker_template = Template([ValidityFilter(),
                            SingleCompoundFilter(),
                            AttachmentFilter(2, 2),
                            RingFilter(None, 0),
                            HeavyAtomsFilter(None, 10)],
                           [],
                           fail_score=-1,
                           log=True,
                           use_lookup=True)
scaffold_template = Template([ValidityFilter(),
                              SingleCompoundFilter(),
                              AttachmentFilter(2, None),
                              RingFilter(1, None),
                              RotChainFilter(None, 5)],
                             [],
                             fail_score=-1,
                             log=True,
                             use_lookup=True)
frag_df = pd.DataFrame(fragments, columns=['smiles'])
frag_df['rgroup_hp'] = rgroup_template(frag_df.smiles.values)
frag_df['linker_hp'] = linker_template(frag_df.smiles.values)
frag_df['scaffold_hp'] = scaffold_template(frag_df.smiles.values)
frag_df[['rgroup_hp', 'linker_hp', 'scaffold_hp']].mean()
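Because the hp columns are boolean, the mean of each column is simply the fraction of fragments passing that template:

```python
# mean of a boolean column == fraction of True entries (the pass rate)
linker_hp = [True, False, False, True]
pass_rate = sum(linker_hp) / len(linker_hp)
```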
draw_mols(to_mols(frag_df[frag_df.rgroup_hp].sample(n=5).smiles.values), mols_per_row=5)
draw_mols(to_mols(frag_df[frag_df.linker_hp].sample(n=5).smiles.values), mols_per_row=5)
draw_mols(to_mols(frag_df[frag_df.scaffold_hp].sample(n=5).smiles.values), mols_per_row=5)