Overview
Structure enumeration is a useful way to explore constrained chemical spaces. You might want to try swapping a ring in a molecule with every possible 6 member ring, or try every linker with four atoms or less.
The StructureEnumerator
class provides a principled way of enumerating constrained chemical spaces. The gist of it is this. We create a SMARTs template outlining the general structure we want to enumerate. Then, for every atom and bond we want to vary, we define a list of allowed substitutions. The StructureEnumerator
then generates all combinations based on these specifications and filters them for chemical validity.
import sys
sys.path.append('..')
from mrl.imports import *
from mrl.core import *
from mrl.chem import *
from rdkit import Chem
Basic Example
Lets say we want to generate all variants of a 5 or 6 membered ring with 3 attachment points and at most 3 heteroatoms.
First we start with the base smiles string for a 6-membered ring.
C1CCCCC1
We convert this to a smarts string
'[#6]1-[#6]-[#6]-[#6]-[#6]-[#6]-1'
The StructureEnumerator
relies on atom mapping to avoid worrying about atom order. We add a mapping number to every atom in the smarts
'[#6:1]1-[#6:2]-[#6:3]-[#6:4]-[#6:5]-1'
This process can be automated with the add_map_nums
function:
mol = to_mol('C1CCCCC1')
smart = to_smart(add_map_nums(mol))
smart
Now we define the specifications for enumerating atoms and bonds.
Atom specifications are defined in a dictionary of the form map_num : [possible_atoms]
. So atom_spec = {2 : ['C', 'N', 'O']}
would denote enumerating carbon, nitrogen and oxygen at the atom defined by map number 1.
Bond specifications are defined in a dictionary of the form (start_map_num, end_map_num) : [possible_bond_types]
. So bond_spec = {(1,2) : ['single', 'double']}
would denote enumerating single and double bonds at the bond between the atom with map number 1 and the atom with map number 2. The code checks both (start_map_num, end_map_num)
and (end_map_num, start_map_num)
so the index order doesn't matter.
Any atoms or bonds not included in atom_spec
or bond_spec
are left as-is based on the smarts definition. Note that the end state for generated compounds is SMILES strings, so atoms or bonds left unchanged should be SMILES compatible.
To make things easy, the generate_spec_template
function creates a bank atom_spec
and bond_spec
from a given mol
, along with the matching mapped smarts string.
atom_spec, bond_spec, smarts = generate_spec_template(mol)
print(atom_spec)
print(bond_spec)
print(smarts)
For this task, we want to enumerate ['C', 'N', 'O']
at all atom positions and ['aromatic','single','double']
at all bond positions. That gives us the following:
for key in atom_spec.keys():
atom_spec[key] = ['C','N','O']
for key in bond_spec.keys():
bond_spec[key] = ['aromatic','single','double']
print(atom_spec)
print(bond_spec)
We specified earlier we want to enumerate 5-member and 6-member rings. This means we need to denote an atom to be optionally removed during enumeration.
The structure enumerator supports atom removal and bond removal during enumeration by appending -1 (int)
to the matching entry in the atom_spec
or bond_spec
.
Bond removal is implemented simply - the bond is removed and that's that.
Atom removal is a bit more complex. If the atom being removed is bound to only one other heavy atom, the atom is simply removed. If the atom is bound to two heavy atoms, a new bond is created between these atoms, ie R1-atom_to_remove-R2 >> R1-R2
.
This raises the question of what bond type to insert at the new bond. By default, the code generates versions of the coupling with single, double, triple and aromatic bonds. If you have a preference for the substitute bond types, pass a list of desired bond types to the substitute_bonds
parameter, ie substitute_bonds=['single','double']
. Passing substitute_bonds=[-1]
results in no coupling, ie R1-atom_to_remove-R2 >> R1.R2
.
For this example, we select atom 3 to be optionally removed.
atom_spec[3].append(-1)
Why atom 3 instead of some other atom, or multiple other? For this case, it doesn't really matter because we're generating rotationally symmetric rings. Enumerating examples where all atoms are set for optional removal just blows up our enumeration space for no benefit.
When designing enumeration specs, it's important to consider the tradeoffs from expanding a combinatorial space.
StructureEnumerator
follows the convention that all possible atoms should be present in the initial SMARTS, with optional removals specified. This allows greater control over which atoms are removed relative to substitution patterns or other chemical features
Now we use the StructureEnumerator
to enumerate based on our specification.
Note the max_num
parameter. By default, the code generates the Cartesian product of all specification choices. The chemical space represented by the specification can quickly grow to billions of molecules if you're not careful. For this reason, enumeration is best done on chemical spaces less than ~10,000,000 compounds (your compute and memory resources depending). The max_num
parameter puts a limit on how many compounds will be tested.
enum = StructureEnumerator(smart, atom_spec, bond_spec, max_num=10000000)
enum.num_combos # possible combinations
len(enum.combos) # actual combinations after factoring in atom removals
We start with 755973 enumerations. These are expressed as dictionaries with one specification choice per atom map number or bond map number pair
enum.combos[0]
The next step is to create all these enumerations as molecules, and check them for validity. Before we do this, it's a good idea to cut down on the number of enumerations if we can.
For this specific example, we specified we wanted at most 3 heteroatoms in the final rings. This is an easy criteria to filter on. The StructureEnumerator.filter_combos
supports doing this. All we need is a function that takes in one enumeration dictionary and returns True or False
def enum_filter(combo):
nhet = 0.
for k,v in combo.items():
if (v=='O') or (v=='N'): # heteroatoms
nhet += 1
if nhet > 3:
output = False
else:
output = True
return output
enum.filter_combos(enum_filter)
len(enum.combos)
This cuts things down to 397141 combinations. Now we run StructureEnumerator.filter_combos
, which converts the 397141 combinations into a set of unique, chemically valid (by RDKit canonicalization) SMILES strings
smiles = enum.create_mols(max_num=None)
len(smiles) # unique
This results in 269 unique smiles.
From here we want to do some QC on the resulting sequences. Our previous filter should have removed all combinations that violated our 3 or fewer heteroatom rule, but we can double check here. This is also a good step to incorporate other types of filtering if applicable
valid_smiles = []
for s in smiles:
if heteroatoms(to_mol(s))<4:
valid_smiles.append(s)
len(valid_smiles)
This gives us 269 smiles that pass our criteria. We can also see the impact of removing atom map number 3 in enumeration by looking at the number of 5 member rings vs 6 member rings
print(len([i for i in valid_smiles if heavy_atoms(to_mol(i))==5]),
len([i for i in valid_smiles if heavy_atoms(to_mol(i))==6]))
Now we want to add attachment points to the smiles strings. These attachments will be added as wildcard atoms (*
).
We specified earlier we want 3 attachments on each ring. For each smiles string, the code checks to see if the molecule supports that number of attachments, defined by the number of atoms with a least 1 implicit hydrogen. If the molecule supports the attachment number, the code generates all valid attachment combinations. Compounds that cannot support the desired number of attachments are removed from the output.
decorated = decorate_smiles(valid_smiles, 3)
len(decorated)
This gives us a final output of 1272 attachment-decorated 5-member rings with 3 or fewer heteroatoms. Note that the wildcard attachment points are not mapped, so we could further generate 6 configuration variants per output
to_mol('FC1=CC(c2cnccc2)C=N1')
Lets add mapping so we can refer to specific structures
to_mol(add_map_nums(to_mol('FC1=CC(c2cnccc2)C=N1')))
For this compound we want the following:
- Experiment with adding N,O atoms at atoms 3, 11, 12
- Experiment with adding N atoms at 5,6,8,10
- Keep atom 7 N, keep atom 9 C
- Experiment with removing carbon 10 or carbon 8 but not both
- Experiment with F, Cl, Br, None Halogens at atom 1
- Enumerate bond configurations in the 6 member ring but leave the 5 member ring as is
- We also have the following constraints:
- No more than 4 N,O heteroatoms (not counting halogens)
- No
[N,O]-[N,O]
direct linkages
We proceed generating atom and bond spec templates
mol = to_mol('FC1=CC(c2cnccc2)C=N1')
atom_spec, bond_spec, smarts = generate_spec_template(mol)
print(atom_spec)
print(bond_spec)
print(smarts)
And we fill in based on our above criteria
atom_spec[1] = ['F', 'Cl', 'Br', -1]
atom_spec[3] = ['C','N', 'O']
atom_spec[5] = ['C','N']
atom_spec[6] = ['C','N']
atom_spec[8] = ['C','N', -1]
atom_spec[10] = ['C','N', -1]
atom_spec[11] = ['C','N','O']
atom_spec[12] = ['C','N','O']
bond_spec[(5,6)] = ['single', 'double', 'aromatic']
bond_spec[(6,7)] = ['single', 'double', 'aromatic']
bond_spec[(7,8)] = ['single', 'double', 'aromatic']
bond_spec[(8,9)] = ['single', 'double', 'aromatic']
bond_spec[(9,10)] = ['single', 'double', 'aromatic']
bond_spec[(10,5)] = ['single', 'double', 'aromatic']
os.environ['ncpus'] = '3'
enum = StructureEnumerator(smarts, atom_spec, bond_spec, max_num=20000000,
substitute_bonds=['single', 'double', 'aromatic'])
enum.num_combos # possible combinations
len(enum.combos) # after adjusting for removed atoms
Now we make a filter function to remove combinations based on our criteria of no more than 4 non-halogen heteroatoms and at most 1 atom removal
def enum_filter(combo):
n_o_count = 0.
remove_count = 0.
for k,v in combo.items():
if (v=='O') or (v=='N'): # heteroatoms
n_o_count += 1
if v==-1:
remove_count += 1
output = True
if n_o_count>4:
output = False
if remove_count>1:
output = False
return output
gc.collect()
enum.filter_combos(enum_filter)
len(enum.combos)
This leaves us with 2513856 combinations to enumerate
smiles = enum.create_mols_chunks(500000)
len(smiles) # unique
Now we filter the outputs based on check_ring_bonds
(removing SP atoms in rings) and our criteria for no [N,O]-[N,O]
linkages and fewer than 5 N,O heteroatoms
def check_linkage(smile):
mol = to_mol(smile)
pattern = Chem.MolFromSmarts('[#7,#8]~[#7,#8]')
return not mol.HasSubstructMatch(pattern)
def check_heteroatoms(smile):
mol = to_mol(smile)
count = 0.
for atom in mol.GetAtoms():
if atom.GetSymbol() in ['N','O']:
count += 1
return count
valid_smiles = []
for s in smiles:
if check_heteroatoms(s)<5 and check_linkage(s):
valid_smiles.append(s)
len(valid_smiles)
From the enumeration we get 1556 structures, from an initial set of 2513856 possible combinations. This highlights the inefficiencies of structure enumeration. This technique is best left for expoiting highly constrained chemical spaces.
subset = [i for i in valid_smiles if aromaticrings(to_mol(i))>0]
Chem.Draw.MolsToGridImage([Chem.MolFromSmiles(i) for i in np.random.choice(subset, 16)], molsPerRow=4)