Functions related to converting strings into tensors

Tokenization

Tokenization defines how we break text strings (ie SMILES strings) down into subunits that are fed to the model. The standard process goes as follows:

  1. A tokenization process breaks a string down into tokens
  2. Tokens are mapped to integers
  3. The token integers are sent to the model
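
A minimal from-scratch sketch of this pipeline (the toy vocabulary and stoi mapping here are illustrative, not part of the library):

smile = 'CCO'
tokens = list(smile)              # 1. break the string into tokens
stoi = {'C':0, 'O':1}             # toy token-to-integer mapping
ints = [stoi[t] for t in tokens]  # 2. map tokens to integers
ints                              # 3. these integers are what the model consumes
>> [0, 0, 1]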

This brings up the problem of how best to tokenize SMILES. The following methods are implemented out of the box:

Character Tokenization

Character tokenization breaks SMILES down into individual characters. This is implemented with the tokenize_by_character function.

tokenize_by_character('CC[NH]CC')
>> ['C', 'C', '[', 'N', 'H', ']', 'C', 'C']

This form of tokenization is quick and simple. One drawback is that some characters become overloaded. For example, Br is tokenized to ['B', 'r'], so the B token means both boron (in the standard context) and bromine (in the Br context). In practice this isn't much of an issue; language models are adept at learning the co-occurrence patterns of tokens.
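
This is easy to see directly:

tokenize_by_character('CCBr')
>> ['C', 'C', 'B', 'r']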

Character Tokenization with Replacement

Character tokenization with replacement is the same as character tokenization, except we add a dictionary of multi-character tokens to be replaced with single-character tokens. This dictionary has the form {multi_character_token : single_character_token}. Before tokenizing by character, all instances of multi_character_token are replaced with single_character_token. Character tokenization with replacement is implemented with the tokenize_with_replacements function.

replacement_dict = {'Br' : 'R', 'Cl' : 'L'}
tokenize_with_replacements('[Cl]CC[Br]', replacement_dict)
>> ['[', 'L', ']', 'C', 'C', '[', 'R', ']']

Regex Tokenization

Regex tokenization uses a regex pattern to decompose SMILES. This is mainly used to keep bracketed terms (ie [O-]) as single tokens, which avoids character overloading but tends to generate a large number of low frequency tokens. Regex tokenization is implemented with the regex_tokenize function.

SMILE_REGEX = """(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|H|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"""
regex_tokenize('CCC[Br]', re.compile(SMILE_REGEX))
>> ['C', 'C', 'C', '[Br]']
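
Because every distinct bracketed term becomes a single token, annotated atoms like [NH3+] stay intact - and each new annotation adds another rare token to the vocabulary:

regex_tokenize('C[NH3+]', re.compile(SMILE_REGEX))
>> ['C', '[NH3+]']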

pad_vocab[source]

pad_vocab(vocab)

Pads vocab to a length divisible by 8, which improves fp16 performance
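
A quick check of this contract (a sketch that assumes pad_vocab returns the padded token list rather than modifying it in place):

padded = pad_vocab(SMILES_CHAR_VOCAB)  # assumed to return a new, padded list
assert len(padded) % 8 == 0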

These are regex patterns to decompose SMILES into tokens.

SMILE_REGEX is based on this work. The pattern decomposes SMILES into individual characters, but keeps Cl, Br, and any term in brackets (ie [O-]) intact.

MAPPING_REGEX is a derivative of SMILE_REGEX designed to work with the mapping framework used by the Block class. MAPPING_REGEX keeps Cl, Br, and any string of the form [{isotope}*:{map_num}] intact.

tokenize_by_character[source]

tokenize_by_character(input)

Splits input into individual characters

tokenize_with_replacements[source]

tokenize_with_replacements(input, replacement_dict)

Replaces substrings in input using replacement_dict, then tokenizes by character

regex_tokenize[source]

regex_tokenize(input, regex)

Uses regex to tokenize input

tokenize_by_kmer[source]

tokenize_by_kmer(input, kmer, stride)

Splits input into kmers of length kmer, stepping by stride
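
A sketch of the windowing this implies (illustrative only - kmer_sketch is not the library implementation):

def kmer_sketch(input, kmer, stride):
    # slide a window of length kmer across the string, stepping by stride;
    # the trailing fragment is kept so the string can be reconstructed when stride == kmer
    return [input[i:i+kmer] for i in range(0, len(input), stride)]

kmer_sketch('CCOCC', 2, 2)
>> ['CC', 'OC', 'C']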

assert tokenize_by_character('CCC[Br]') == ['C', 'C', 'C', '[', 'B', 'r', ']']
assert tokenize_with_replacements('CCC[Br]', HALOGEN_REPLACE) == ['C', 'C', 'C', '[', 'R', ']']
assert regex_tokenize('CCC[Br]', re.compile(SMILE_REGEX)) == ['C', 'C', 'C', '[Br]']
assert regex_tokenize('[1*:1]CCC[Br]', re.compile(MAPPING_REGEX)) == ['[1*:1]', 'C', 'C', 'C', '[', 'Br', ']']

Vocabulary

The Vocab class handles tokenization. Vocab.tokenize breaks strings down into tokens. Vocab.numericalize maps tokens to integers. Vocab.reconstruct converts integers back into strings.

Vocab holds itos, a list of tokens, and stoi, a dictionary mapping tokens to integers. Vocab automatically adds four special tokens ['bos', 'eos', 'pad', 'unk'] indicating beginning of sentence, end of sentence, padding and unknown.
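
A round trip with the built-in CharacterVocab looks like this (SMILES_CHAR_VOCAB is the character token list used in the tests below):

vocab = CharacterVocab(SMILES_CHAR_VOCAB)
vocab.tokenize('CCO')
>> ['bos', 'C', 'C', 'O', 'eos']
vocab.reconstruct(vocab.numericalize(vocab.tokenize('CCO')))
>> 'CCO'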

Custom Vocabulary

To implement custom tokenization, subclass Vocab and update the tokenize, numericalize and reconstruct methods. Use the test_reconstruction function to verify your custom vocab can successfully reconstruct sequences.

Vocabs also have prefunc and postfunc hooks for added flexibility. prefunc is called on the inputs to tokenize before tokenization. postfunc is called during reconstruct after tokens are joined.
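
As a toy illustration of the prefunc hook (the lambda here is hypothetical, not a library function):

v = CharacterVocab(SMILES_CHAR_VOCAB, prefunc=lambda s: s.strip())
# whitespace is stripped before tokenization, so both calls give the same tokens
assert v.tokenize(' CCO ') == v.tokenize('CCO')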

class Vocab[source]

Vocab(itos, prefunc=None, postfunc=None)

Vocab - base vocabulary class

Inputs:

  • itos list: list of tokens in vocabulary

  • prefunc Optional[Callable]: function applied to input before tokenization

  • postfunc Optional[Callable]: function applied to input after reconstruction

class CharacterVocab[source]

CharacterVocab(itos, prefunc=None, postfunc=None) :: Vocab

CharacterVocab - tokenize by character

Inputs:

  • itos list: list of tokens in vocabulary

  • prefunc Optional[Callable]: function applied to input before tokenization

  • postfunc Optional[Callable]: function applied to input after reconstruction

class FuncVocab[source]

FuncVocab(itos, tok_func, prefunc=None, postfunc=None) :: Vocab

FuncVocab - tokenize by tok_func

Inputs:

  • itos list: list of tokens in vocabulary

  • tok_func Callable: tokenization function

  • prefunc Optional[Callable]: function applied to input before tokenization

  • postfunc Optional[Callable]: function applied to input after reconstruction

class SelfiesVocab[source]

SelfiesVocab(itos) :: FuncVocab

SelfiesVocab - converts SMILES to SELFIES

Inputs:

  • itos list: list of tokens in vocabulary

class CharacterReplaceVocab[source]

CharacterReplaceVocab(itos, replace_dict, prefunc=None, postfunc=None) :: Vocab

CharacterReplaceVocab - tokenize by character with replacement

Inputs:

  • itos list: list of tokens in vocabulary

  • replace_dict dict: replacement dictionary of the form {multi_character_token : single_character_token}. ie replace_dict={'Br':'R', 'Cl':'L'}

  • prefunc Optional[Callable]: function applied to input before tokenization

  • postfunc Optional[Callable]: function applied to input after reconstruction

class RegexVocab[source]

RegexVocab(itos, pattern, prefunc=None, postfunc=None) :: Vocab

RegexVocab - tokenize using pattern

Inputs:

  • itos list: list of tokens in vocabulary

  • pattern str: regex string

  • prefunc Optional[Callable]: function applied to input before tokenization

  • postfunc Optional[Callable]: function applied to input after reconstruction

class KmerVocab[source]

KmerVocab(itos, kmer, stride=None, prefunc=None, postfunc=None) :: Vocab

KmerVocab - Kmer tokenization vocabulary

Inputs:

  • itos list: list of tokens in vocabulary

  • kmer int: kmer size

  • stride Optional[int]: kmer stride. If not passed, stride will be the same as kmer. Using a stride value different from the kmer value will prevent proper reconstruction

  • prefunc Optional[Callable]: function applied to input before tokenization

  • postfunc Optional[Callable]: function applied to input after reconstruction

test_reconstruction[source]

test_reconstruction(vocab, inputs)

Returns all items in inputs that can't be correctly reconstructed using vocab

df = pd.read_csv('files/smiles.csv')
smiles = df.smiles.values
vocab = CharacterVocab(SMILES_CHAR_VOCAB)
assert test_reconstruction(vocab, smiles)==[]
vocab = FuncVocab(SMILES_CHAR_VOCAB, tokenize_by_character)
assert test_reconstruction(vocab, smiles)==[]
vocab = CharacterReplaceVocab(SMILES_CHAR_VOCAB, HALOGEN_REPLACE)
assert vocab.tokenize('CC[Br]') == ['bos', 'C', 'C', '[', 'R', ']', 'eos']
assert test_reconstruction(vocab, smiles)==[]
vocab = RegexVocab(SMILES_CHAR_VOCAB, SMILE_REGEX)
assert vocab.tokenize('CC[Br]') == ['bos', 'C', 'C', '[Br]', 'eos']
vocab.update_vocab_from_data(smiles)
assert test_reconstruction(vocab, smiles)==[]
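
KmerVocab can be exercised the same way. This is a hedged sketch - it assumes update_vocab_from_data is available on Vocab subclasses generally, and relies on the default stride equal to kmer for reconstruction to hold:

vocab = KmerVocab(SMILES_CHAR_VOCAB, kmer=2)
vocab.update_vocab_from_data(smiles)  # assumed to add unseen 2-mer tokens to the vocab
assert test_reconstruction(vocab, smiles)==[]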

Vocab Example - Removing Stereochemistry

Stereochemistry is represented in SMILES strings using the @ character. For example, C[C@H](N)C(=O)O and C[C@@H](N)C(=O)O represent two different stereoisomers for the same compound.

Stereochemistry in SMILES strings can lead to interesting outcomes for ML models. Predictive models can overfit to specific stereocenters present in training data, and generative models can fall into a form of mode collapse, predicting different stereoisomers of the same compound or generating excessive stereocenters.

For these reasons, we may wish to deal with generic SMILES strings without stereochemistry information.

One way to do this is to bulk preprocess datasets to remove stereochemistry information, but this requires storing a copy of the stereochemistry-free data, which can be prohibitive for large datasets.

Another way would be to apply stereochemistry removal as a prefunc in our Vocab, which will remove stereochemistry on the fly before tokenization.

from mrl.chem import *

V1 = CharacterVocab(SMILES_CHAR_VOCAB)
V2 = CharacterVocab(SMILES_CHAR_VOCAB, prefunc=remove_stereo)

assert '@' in V1.tokenize('C[C@H](N)C(=O)O')
assert not '@' in V2.tokenize('C[C@H](N)C(=O)O')
assert V1.reconstruct(V1.numericalize(V1.tokenize('C[C@H](N)C(=O)O'))) == 'C[C@H](N)C(=O)O'
assert V2.reconstruct(V2.numericalize(V2.tokenize('C[C@H](N)C(=O)O'))) == 'CC(N)C(=O)O'

Vocab Example - SELFIES Vocab

SELFIES is a molecular text representation that is an alternative to SMILES strings. One advantage of SELFIES representations is that token swaps in a SELFIES string always result in valid compounds.

MRL is standardized around using SMILES representations for working with compounds, so we can't use SELFIES representations for RL training (unless you want to write a parallel SELFIES-compatible version of all the comp chem functions).

Luckily, we can make use of prefunc and postfunc utilities to use SELFIES representations in our generative models. We can use smile_to_selfie as a prefunc and selfie_to_smile as a postfunc.

This means we can keep all our data in SMILES form for compatibility. When we process SMILES for the model, the prefunc first converts SMILES to SELFIES; we then tokenize, numericalize and train models in SELFIES space. When we reconstruct a sequence, the postfunc converts it back into a SMILES string.

From the outside perspective, the model takes in and produces SMILES strings. But internally, everything is in SELFIES space.

from mrl.chem import *

vocab = FuncVocab(SELFIES_VOCAB, split_selfie, 
                  prefunc=smile_to_selfie, postfunc=selfie_to_smile)

smile = 'COc1ccc2[nH]cc(CCNC(C)=O)c2c1'
print(f'SMILE: {smile}')
selfie = vocab.prefunc(smile)
print(f'SELFIE: {selfie}')
assert vocab.postfunc(vocab.prefunc(smile)) == smile # note - only works if smile is canonicalized

full_recon = vocab.reconstruct(vocab.numericalize(vocab.tokenize(smile)))
partial_recon = vocab.join_tokens(vocab._reconstruct(vocab.numericalize(vocab.tokenize(smile))))

assert full_recon == smile
assert partial_recon == selfie
SMILE: COc1ccc2[nH]cc(CCNC(C)=O)c2c1
SELFIE: [C][O][C][=C][C][=C][NH1][C][=C][Branch1][=Branch2][C][C][N][C][Branch1][C][C][=O][C][Ring1][O][=C][Ring1][#C]