Tokenization
Tokenization defines how we break text strings (i.e. SMILES strings) down into subunits that are fed to the model. The standard process goes as follows:
- A tokenization process breaks a string down into tokens
- Tokens are mapped to integers
- The token integers are sent to the model
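As a minimal sketch of this pipeline (the toy token-to-integer dict here is illustrative, not part of the library):

stoi = {'C' : 0, 'N' : 1, 'H' : 2, '[' : 3, ']' : 4}   # hypothetical vocabulary mapping
tokens = list('CC[NH]CC')               # tokenize: break the string into tokens
ints = [stoi[t] for t in tokens]        # numericalize: map tokens to integers
ints                                    # these integers are what the model consumes
>> [0, 0, 3, 1, 2, 4, 0, 0]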
This brings up the problem of how best to tokenize SMILES. The following methods are implemented out of the box:
Character Tokenization
Character Tokenization is when we break down SMILES by character. This is implemented with the tokenize_by_character function.
tokenize_by_character('CC[NH]CC')
>> ['C', 'C', '[', 'N', 'H', ']', 'C', 'C']
This form of tokenization is quick and simple. One drawback of this approach is that some characters become overloaded. For example, Br is tokenized to ['B', 'r'], leading to the B token meaning both boron (in the standard context) and bromine (in the Br context). In practice, this isn't much of an issue; language models are particularly adept at learning co-location of tokens.
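For reference, a character tokenizer can be as simple as splitting the string into its characters. A sketch (not necessarily the library's exact implementation):

def tokenize_by_character_sketch(smiles):
    # every character becomes its own token
    return list(smiles)

assert tokenize_by_character_sketch('CC[NH]CC') == ['C', 'C', '[', 'N', 'H', ']', 'C', 'C']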
Character Tokenization with Replacement
Character tokenization with replacement is the same as character tokenization, except we add a dictionary of multi-character tokens to be replaced with single-character tokens. This dictionary has the form {multi_character_token : single_character_token}. Before tokenizing by character, all instances of multi_character_token are replaced with single_character_token. Character tokenization with replacement is implemented with the tokenize_with_replacements function.
replacement_dict = {'Br' : 'R', 'Cl' : 'L'}
tokenize_with_replacements('[Cl]CC[Br]', replacement_dict)
>> ['[', 'L', ']', 'C', 'C', '[', 'R', ']']
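Mechanically, this is just string replacement followed by a character split. A sketch under that assumption:

def tokenize_with_replacements_sketch(smiles, replacement_dict):
    # swap each multi-character token for its single-character stand-in
    for multi, single in replacement_dict.items():
        smiles = smiles.replace(multi, single)
    return list(smiles)

assert tokenize_with_replacements_sketch('[Cl]CC[Br]', {'Br' : 'R', 'Cl' : 'L'}) == ['[', 'L', ']', 'C', 'C', '[', 'R', ']']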
Regex Tokenization
Regex tokenization uses a regex pattern to decompose SMILES. This is mainly used to keep bracketed terms (i.e. [O-]) as single tokens. This method avoids character overloading by keeping all bracketed terms as individual tokens, but it tends to generate a large number of low-frequency tokens. Regex tokenization is implemented with the regex_tokenize function.
SMILE_REGEX = """(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|H|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"""
regex_tokenize('CCC[Br]', re.compile(SMILE_REGEX))
>> ['C', 'C', 'C', '[Br]']
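Internally, regex tokenization can be little more than a findall over the compiled pattern. A sketch (the coverage check is my addition, not necessarily part of the library):

def regex_tokenize_sketch(smiles, pattern):
    tokens = pattern.findall(smiles)
    # if the pattern misses any character, the tokens can't rebuild the input
    assert ''.join(tokens) == smiles, 'pattern does not cover the full string'
    return tokens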
These are regex patterns to decompose SMILES into tokens:
SMILE_REGEX is based on this work. The pattern decomposes SMILES into individual characters, but keeps Cl, Br, and any term in brackets (i.e. [O-]) intact.
MAPPING_REGEX is a derivative of SMILE_REGEX designed to work with the mapping framework used with the Block class. MAPPING_REGEX keeps Cl, Br, and any string of the form [{isotope}*:{map_num}] intact.
assert tokenize_by_character('CCC[Br]') == ['C', 'C', 'C', '[', 'B', 'r', ']']
assert tokenize_with_replacements('CCC[Br]', HALOGEN_REPLACE) == ['C', 'C', 'C', '[', 'R', ']']
assert regex_tokenize('CCC[Br]', re.compile(SMILE_REGEX)) == ['C', 'C', 'C', '[Br]']
assert regex_tokenize('[1*:1]CCC[Br]', re.compile(MAPPING_REGEX)) == ['[1*:1]', 'C', 'C', 'C', '[', 'Br', ']']
Vocabulary
The Vocab class handles tokenization. Vocab.tokenize breaks strings down into tokens. Vocab.numericalize maps tokens to integers. Vocab.reconstruct converts integers back into strings.
Vocab holds itos, a list of tokens, and stoi, a dictionary mapping tokens to integers. Vocab automatically adds four special tokens ['bos', 'eos', 'pad', 'unk'] indicating beginning of sentence, end of sentence, padding and unknown.
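A minimal version of this interface might look like the following (a conceptual sketch; the real Vocab class has more machinery):

class MinimalVocab:
    def __init__(self, tokens):
        self.special_tokens = ['bos', 'eos', 'pad', 'unk']
        self.itos = self.special_tokens + list(tokens)           # integer -> token
        self.stoi = {t : i for i, t in enumerate(self.itos)}     # token -> integer

    def tokenize(self, smiles):
        return ['bos'] + list(smiles) + ['eos']

    def numericalize(self, tokens):
        # unknown tokens map to 'unk'
        return [self.stoi.get(t, self.stoi['unk']) for t in tokens]

    def reconstruct(self, ints):
        tokens = [self.itos[i] for i in ints]
        return ''.join(t for t in tokens if t not in self.special_tokens)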
Custom Vocabulary
To implement custom tokenization, subclass Vocab and update the tokenize, numericalize and reconstruct methods. Use the test_reconstruction function to verify your custom vocab can successfully reconstruct sequences.
Vocabs also have prefunc and postfunc hooks for added flexibility. prefunc is called on the inputs to tokenize before tokenization. postfunc is called during reconstruct after tokens are joined.
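Conceptually, the hooks slot into the call order like this (a sketch, not the library's exact code):

class HookedVocabSketch:
    def __init__(self, base_tokenize, prefunc=None, postfunc=None):
        self.base_tokenize = base_tokenize
        self.prefunc = prefunc
        self.postfunc = postfunc

    def tokenize(self, input_str):
        if self.prefunc is not None:
            input_str = self.prefunc(input_str)       # runs before tokenization
        return self.base_tokenize(input_str)

    def reconstruct_from_tokens(self, tokens):
        output_str = ''.join(tokens)                  # stands in for join_tokens
        if self.postfunc is not None:
            output_str = self.postfunc(output_str)    # runs after tokens are joined
        return output_str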
df = pd.read_csv('files/smiles.csv')
smiles = df.smiles.values
vocab = CharacterVocab(SMILES_CHAR_VOCAB)
assert test_reconstruction(vocab, smiles)==[]  # returns the sequences that failed to round-trip; empty means success
vocab = FuncVocab(SMILES_CHAR_VOCAB, tokenize_by_character)
assert test_reconstruction(vocab, smiles)==[]
vocab = CharacterReplaceVocab(SMILES_CHAR_VOCAB, HALOGEN_REPLACE)
assert vocab.tokenize('CC[Br]') == ['bos', 'C', 'C', '[', 'R', ']', 'eos']
assert test_reconstruction(vocab, smiles)==[]
vocab = RegexVocab(SMILES_CHAR_VOCAB, SMILE_REGEX)
assert vocab.tokenize('CC[Br]') == ['bos', 'C', 'C', '[Br]', 'eos']
vocab.update_vocab_from_data(smiles)  # add tokens found in the data so all sequences can be reconstructed
assert test_reconstruction(vocab, smiles)==[]
Vocab Example - Removing Stereochemistry
Stereochemistry is represented in SMILES strings using the @ character. For example, C[C@H](N)C(=O)O and C[C@@H](N)C(=O)O represent two different stereoisomers for the same compound.
Stereochemistry in SMILES strings can lead to interesting outcomes for ML models. Predictive models can overfit to specific stereocenters present in the training data, and generative models can fall into a form of mode collapse, predicting different stereoisomers of the same compound or generating excessive stereocenters.
For these reasons, we may wish to deal with generic SMILES strings without stereochemistry information.
One way to do this is to bulk preprocess datasets to remove stereochemistry information, but this requires storing a copy of the stereochemistry-free data, which can be prohibitive for large datasets.
Another way would be to apply stereochemistry removal as a prefunc in our Vocab, which will remove stereochemistry on the fly before tokenization.
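remove_stereo is provided by mrl.chem; a comparable function can be sketched with plain RDKit (the helper name below is mine, and this is not necessarily the library's implementation):

from rdkit import Chem

def remove_stereo_sketch(smiles):
    mol = Chem.MolFromSmiles(smiles)
    Chem.RemoveStereochemistry(mol)      # clears chiral tags and bond stereo in place
    return Chem.MolToSmiles(mol)

assert remove_stereo_sketch('C[C@H](N)C(=O)O') == 'CC(N)C(=O)O'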
from mrl.chem import *
V1 = CharacterVocab(SMILES_CHAR_VOCAB)
V2 = CharacterVocab(SMILES_CHAR_VOCAB, prefunc=remove_stereo)
assert '@' in V1.tokenize('C[C@H](N)C(=O)O')      # V1 keeps stereochemistry tokens
assert '@' not in V2.tokenize('C[C@H](N)C(=O)O')  # V2's prefunc strips them before tokenizing
assert V1.reconstruct(V1.numericalize(V1.tokenize('C[C@H](N)C(=O)O'))) == 'C[C@H](N)C(=O)O'
assert V2.reconstruct(V2.numericalize(V2.tokenize('C[C@H](N)C(=O)O'))) == 'CC(N)C(=O)O'
Vocab Example - SELFIES Vocab
SELFIES is a molecular text representation that serves as an alternative to SMILES strings. One advantage of SELFIES representations is that token swaps in SELFIES strings always result in valid compounds.
MRL is standardized around using SMILES representations for working with compounds, so we can't use SELFIES representations for RL training (unless you want to write a parallel SELFIES-compatible version of all the comp chem functions).
Luckily, we can make use of the prefunc and postfunc hooks to use SELFIES representations in our generative models. We can use smile_to_selfie as a prefunc and selfie_to_smile as a postfunc.
This means we can keep all our data in SMILES form for compatibility. When we process SMILES for the model, the prefunc first converts SMILES to SELFIES. We then tokenize, numericalize and train models in SELFIES space. When we reconstruct a sequence, the postfunc converts it back into a SMILES string.
From the outside perspective, the model takes in and produces SMILES strings. But internally, everything is in SELFIES space.
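smile_to_selfie and selfie_to_smile come from mrl.chem; equivalent conversions can be sketched with the selfies package (an assumption about what the library wraps):

import selfies as sf

def smile_to_selfie_sketch(smiles):
    return sf.encoder(smiles)            # SMILES -> SELFIES

def selfie_to_smile_sketch(selfie):
    return sf.decoder(selfie)            # SELFIES -> SMILES

smile_to_selfie_sketch('CCO')
>> '[C][C][O]'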
from mrl.chem import *
vocab = FuncVocab(SELFIES_VOCAB, split_selfie,
prefunc=smile_to_selfie, postfunc=selfie_to_smile)
smile = 'COc1ccc2[nH]cc(CCNC(C)=O)c2c1'
print(f'SMILE: {smile}')
selfie = vocab.prefunc(smile)
print(f'SELFIE: {selfie}')
assert vocab.postfunc(vocab.prefunc(smile)) == smile # note - only works if smile is canonicalized
full_recon = vocab.reconstruct(vocab.numericalize(vocab.tokenize(smile)))  # postfunc applied: SMILES out
partial_recon = vocab.join_tokens(vocab._reconstruct(vocab.numericalize(vocab.tokenize(smile))))  # before postfunc: still SELFIES
assert full_recon == smile
assert partial_recon == selfie