This tutorial gives an introduction to the MRL combichem module.
Performance Notes
The workflow in this notebook is CPU-constrained because compounds are generated and evaluated on the CPU. If you have a multi-core machine, it is recommended that you uncomment and run the set_global_pool cells in the notebook. This enables multiprocessing, which typically gives a 2-4x speedup.
This notebook may run slowly on Colab due to CPU limitations.
This notebook does not require a GPU.
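The set_global_pool cells take the following form (the same call appears later in this notebook), capping the pool at 48 workers or the number of available cores:
# uncomment to enable a multiprocessing pool for chemistry and scoring functions
# set_global_pool(min(48, os.cpu_count()))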
import sys
sys.path.append('..')
from mrl.imports import *
from mrl.core import *
from mrl.chem import *
from mrl.templates.all import *
from mrl.combichem import *
# if in the repo
df = pd.read_csv('../files/smiles.csv')
# if in Colab
# download_files()
# df = pd.read_csv('files/smiles.csv')
smiles = df.smiles.values
mols = to_mols(smiles)
len(mols)
Combichem Operations
Combichem methods use stochastic rules-based molecular changes to generate new compounds. Combichem consists of the following steps:
- Library generation - create the next iteration of the library
- Library scoring - apply a numeric score to each item in the library
- Library pruning - remove low scoring compounds
Library generation consists of two main steps: mutation and crossover.
A mutation is a process that maps a single molecule to a new molecule. Custom mutations can be created by subclassing Mutator.
A crossover is a process where a pair of molecules is used to generate a new molecule that contains features of each parent molecule. Custom crossovers can be created by subclassing Crossover.
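As a rough illustration of what a mutation does under the hood, the sketch below uses RDKit directly to append a carbon atom to a molecule and return the result as SMILES. This is a standalone toy example, not the Mutator subclassing API; MRL's built-in mutators such as InsertAtom implement edits like this with SMARTS-based rules.
from rdkit import Chem

def append_carbon(mol, atom_idx=0):
    "Toy mutation: attach a new carbon atom to atom `atom_idx`; returns a SMILES string or None"
    rw = Chem.RWMol(mol)                                  # editable copy of the molecule
    new_idx = rw.AddAtom(Chem.Atom(6))                    # add a carbon atom
    rw.AddBond(atom_idx, new_idx, Chem.BondType.SINGLE)   # bond it to the chosen atom
    try:
        Chem.SanitizeMol(rw)                              # rejects chemically invalid results
    except Exception:
        return None
    return Chem.MolToSmiles(rw)

# example: append_carbon(Chem.MolFromSmiles('c1ccccc1')) returns 'Cc1ccccc1'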
m = InsertAtom()
outputs = m(mols[0])
outputs = to_mols(outputs)
outputs = [i for i in outputs if i is not None]
draw_mols([mols[0]] + outputs[:7], legends=['Input'] + ['Output']*7, mols_per_row=4)
cs = FragmentCrossover()
outputs = cs.crossover([mols[0], mols[1]])
outputs = to_mols(outputs)
outputs = [i for i in outputs if i is not None]
draw_mols([mols[0], mols[1]]+outputs[:6], legends=['Input 1', 'Input 2'] + ['Output']*6,
mols_per_row=4)
MPO Optimization
To see how the combichem module orchestrates different mutations and crossovers, we first need a score function to optimize against. We'll use a multi-parameter optimization (MPO) goal of maximizing a weighted combination of QED, LogP, and SA score. The score function is defined by the scaling functions and template below.
We set the following hard filters:
- ValidityFilter: only valid compounds
- SingleCompoundFilter: only single compounds
- MolWtFilter: molecular weight less than 500 g/mol
- HBDFilter: 5 or fewer hydrogen bond donors
- HBAFilter: 10 or fewer hydrogen bond acceptors
- LogPFilter: LogP less than 5
Next we set up our soft filters, which serve as our score. For each property (QED, LogP, SA), we scale the value:
- QED scaled to [0, 2]
- LogP scaled to [0, 1]
- SA score scaled to [0, 1]
This weights LogP and SA score evenly and gives QED double weight, for a maximum possible score of 4.
def scale_sa(sa):
    # SA score ranges from 1 (easy to synthesize) to 10 (hard); map to [0,1] with higher = better
    return (10-sa)/9

def scale_logp(logp):
    # divide LogP by 5 and clip to [0,1]
    logp = logp/5
    logp = min(max(logp,0),1)
    return logp

def scale_qed(qed):
    # QED is already in [0,1]; double it to give it 2x weight
    return 2*qed
template = Template([ValidityFilter(),
SingleCompoundFilter(),
MolWtFilter(None, 500),
HBDFilter(None, 5),
HBAFilter(None, 10),
LogPFilter(None, 5)
],
[QEDFilter(None, None, score=PropertyFunctionScore(scale_qed)),
SAFilter(None, None, score=PropertyFunctionScore(scale_sa)),
LogPFilter(None, None, score=PropertyFunctionScore(scale_logp))],
fail_score=-1., log=False)
Combichem Module
The CombiChem module holds the mutators, crossovers, template, and rewards associated with a combichem run.
We will use a series of SMARTS-based mutators with fragment-based crossover. Each iteration, we prune 70% of the library and keep a maximum of 500 compounds.
mutators = [
ChangeAtom(),
AppendAtom(),
DeleteAtom(),
ChangeBond(),
InsertAtom(),
AddRing()
]
mc = MutatorCollection(mutators)
crossovers = [FragmentCrossover()]
cbc = CombiChem(mc, crossovers, template=template,
prune_percentile=70, max_library_size=500, log=True, p_explore=0.2)
set_global_pool(min(48, os.cpu_count()))
cbc.add_data(df.smiles.values[:50])
for i in range(12):
    cbc.step()
    print(cbc.library.score.mean())
subset = cbc.library[cbc.library.score>3.657]
samples = subset.smiles.values
values = subset.score.values
mols = [to_mol(i) for i in samples]
qeds = [qed(i) for i in mols]
logps = [logp(i) for i in mols]
sas = [sa_score(i) for i in mols]
legends = [f'QED: {qeds[i]:.3f}\tLogP: {logps[i]:.3f}\nSA Score: {sas[i]:.3f}\tOverall: {values[i]:.3f}'
for i in range(len(samples))]
draw_mols(mols, legends=legends)
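Next we repeat the optimization with SELFIES-based mutators, which insert, replace, or remove tokens in the SELFIES string representation of each compound.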
mutators = [
SelfiesInsert(50),
SelfiesReplace(50),
SelfiesRemove(50)
]
mc = MutatorCollection(mutators)
crossovers = [FragmentCrossover()]
cbc = CombiChem(mc, crossovers, template=template,
prune_percentile=70, max_library_size=500, log=True, p_explore=0.2)
set_global_pool(min(48, os.cpu_count()))
cbc.add_data(df.smiles.values[:50])
for i in range(12):
    cbc.step()
    print(cbc.library.score.mean())
subset = cbc.library[cbc.library.score>3.657]
samples = subset.smiles.values
values = subset.score.values
mols = [to_mol(i) for i in samples]
qeds = [qed(i) for i in mols]
logps = [logp(i) for i in mols]
sas = [sa_score(i) for i in mols]
legends = [f'QED: {qeds[i]:.3f}\tLogP: {logps[i]:.3f}\nSA Score: {sas[i]:.3f}\tOverall: {values[i]:.3f}'
for i in range(len(samples))]
draw_mols(mols, legends=legends)
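Rediscovery
Combichem can also be used to try to rediscover a specific target compound. Here each compound is scored by the Tanimoto similarity of its ECFP4 fingerprint to the fingerprint of the target (celecoxib), so a score of 1 means the target structure has been found.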
target = 'Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1'
target_mol = to_mol(target)
target_mol
template = Template([ValidityFilter(),
SingleCompoundFilter(),
],
[FPFilter([ECFP4(target_mol)], 'ECFP4', 'tanimoto', score=PassThroughScore())],
fail_score=-1., log=False)
mutators = [
ChangeAtom(),
AppendAtom(),
DeleteAtom(),
ChangeBond(),
InsertAtom(),
AddRing()
]
mc = MutatorCollection(mutators)
crossovers = [FragmentCrossover()]
cbc = CombiChem(mc, crossovers, template=template,
prune_percentile=70, max_library_size=500, log=True, p_explore=0.2)
set_global_pool(min(48, os.cpu_count()))
cbc.add_data(df.smiles.values)
cbc.library.score.max()
for i in range(60):
    cbc.step()
    print(cbc.library.score.mean())
    if cbc.library.score.max()==1.:
        print('Rediscovered')
        break
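Since cbc.library is used as a pandas DataFrame with smiles and score columns (as in the cells above), the closest matches found can be inspected by sorting on score, for example:
cbc.library.sort_values('score', ascending=False).head()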
Reward Incorporation
The CombiChem module can also incorporate Reward modules. This section gives an example of how to do this.
For the reward, we will load a scikit-learn linear regression model trained to predict affinity against erbB1 from molecular fingerprints.
This score function is extremely simple and won't translate well to real binding affinity; it is used here only as a lightweight example.
from mrl.train.reward import Reward
import torch

class FP_Regression_Score():
    "Scores a batch of SMILES strings with a fingerprint-based regression model"
    def __init__(self, fname):
        # torch.load unpickles the saved scikit-learn model
        self.model = torch.load(fname)
        self.fp_function = partial(failsafe_fp, fp_function=ECFP6)

    def __call__(self, samples):
        mols = to_mols(samples)
        fps = maybe_parallel(self.fp_function, mols)  # compute ECFP6 fingerprints
        fps = [fp_to_array(i) for i in fps]           # convert fingerprints to numpy arrays
        x_vals = np.stack(fps)
        preds = self.model.predict(x_vals)            # predicted affinity per compound
        return preds
# if in the repo
reward_function = FP_Regression_Score('../files/erbB1_regression.sklearn')
# if in Colab
# download_files()
# reward_function = FP_Regression_Score('files/erbB1_regression.sklearn')
reward = Reward(reward_function, weight=1.)
template = Template([ValidityFilter(),
SingleCompoundFilter(),
MolWtFilter(None, 550)
],
[],
fail_score=-1., log=False)
mutators = [
ChangeAtom(),
AppendAtom(),
DeleteAtom(),
ChangeBond(),
InsertAtom(),
AddRing()
]
mc = MutatorCollection(mutators)
crossovers = [FragmentCrossover()]
cbc = CombiChem(mc, crossovers, template=template, rewards=[reward],
prune_percentile=70, max_library_size=500, log=True, p_explore=0.2)
set_global_pool(min(48, os.cpu_count()))
cbc.add_data(df.smiles.values)
for i in range(5):
    cbc.step()
    print(cbc.library.score.mean())