Tokenization
Tokenization defines how we break text strings (i.e. SMILES strings) down into subunits that are fed to the model. The standard process, sketched in code after the list below, goes as follows:
- A tokenization process breaks a string down into tokens
- Tokens are mapped to integers
- The token integers are sent to the model
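To make this concrete, here is a minimal sketch of the pipeline in plain Python; the `vocab` dict and `model` call are illustrative placeholders, not library objects:

```python
# Minimal sketch of the standard pipeline; `vocab` and `model` are
# placeholders, not part of the library
smile = 'CCO'
tokens = list(smile)                    # 1. break the string into tokens
vocab = {'C': 0, 'O': 1}                # token-to-integer mapping
token_ids = [vocab[t] for t in tokens]  # 2. map tokens to integers
# preds = model(token_ids)              # 3. send the integers to the model
print(token_ids)
# >> [0, 0, 1]
```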
This brings up the problem of how best to tokenize SMILES. The following methods are implemented out of the box:
Character Tokenization
Character tokenization breaks a SMILES string down into individual characters. It is implemented with the `tokenize_by_character` function.

```python
tokenize_by_character('CC[NH]CC')
# >> ['C', 'C', '[', 'N', 'H', ']', 'C', 'C']
```
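Since each character becomes its own token, an equivalent implementation is essentially a single call to `list`; a minimal sketch (not the library's actual source):

```python
def tokenize_by_character_sketch(smile):
    # every character becomes its own token
    return list(smile)

assert tokenize_by_character_sketch('CC[NH]CC') == ['C', 'C', '[', 'N', 'H', ']', 'C', 'C']
```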
This form of tokenization is quick and simple. One drawback is that some characters become overloaded. For example, `Br` is tokenized to `['B', 'r']`, so the `B` token means both boron (in the standard context) and bromine (in the `Br` context). In practice, this isn't much of an issue; language models are particularly adept at learning co-occurrences of tokens.
Character Tokenization with Replacement
Character tokenization with replacement is the same as character tokenization, except we add a dictionary of multi-character tokens to be replaced with single-character tokens. This dictionary has the form `{multi_character_token: single_character_token}`. Before tokenizing by character, all instances of `multi_character_token` are replaced with `single_character_token`. Character tokenization with replacement is implemented with the `tokenize_with_replacements` function.
```python
replacement_dict = {'Br': 'R', 'Cl': 'L'}
tokenize_with_replacements('[Cl]CC[Br]', replacement_dict)
# >> ['[', 'L', ']', 'C', 'C', '[', 'R', ']']
```
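Functionally this is just string replacement followed by character tokenization; a minimal sketch (not the library's source):

```python
def tokenize_with_replacements_sketch(smile, replacement_dict):
    # swap each multi-character token for its single-character stand-in,
    # then tokenize by character
    for multi, single in replacement_dict.items():
        smile = smile.replace(multi, single)
    return list(smile)

assert tokenize_with_replacements_sketch('[Cl]CC[Br]', {'Br': 'R', 'Cl': 'L'}) == \
       ['[', 'L', ']', 'C', 'C', '[', 'R', ']']
```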
Regex Tokenization
Regex tokenization uses a regex pattern to decompose SMILES. This is mainly used to keep bracketed terms (i.e. `[O-]`) as single tokens. This method avoids character overloading by keeping all bracketed terms as individual tokens, but it tends to generate a large number of low-frequency tokens. Regex tokenization is implemented with the `regex_tokenize` function.
```python
import re

SMILE_REGEX = """(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|H|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"""

regex_tokenize('CCC[Br]', re.compile(SMILE_REGEX))
# >> ['C', 'C', 'C', '[Br]']
```
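Given a compiled pattern, the tokenization step itself is likely just a `findall`; a minimal sketch (the library's actual implementation may differ):

```python
import re

def regex_tokenize_sketch(smile, pattern):
    # return every non-overlapping match of the pattern, in order
    return pattern.findall(smile)

assert regex_tokenize_sketch('CCC[Br]', re.compile(SMILE_REGEX)) == ['C', 'C', 'C', '[Br]']
```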
These are regex patterns to decompose SMILES into tokens. `SMILE_REGEX` is based on this work. The pattern decomposes SMILES into individual characters, but keeps `Cl`, `Br`, and any term in brackets (i.e. `[O-]`) intact.

`MAPPING_REGEX` is a derivative of `SMILE_REGEX` designed to work with the mapping framework used with the `Block` class. `MAPPING_REGEX` keeps `Cl`, `Br`, and any string of the form `[{isotope}*:{map_num}]` intact.
```python
assert tokenize_by_character('CCC[Br]') == ['C', 'C', 'C', '[', 'B', 'r', ']']
assert tokenize_with_replacements('CCC[Br]', HALOGEN_REPLACE) == ['C', 'C', 'C', '[', 'R', ']']
assert regex_tokenize('CCC[Br]', re.compile(SMILE_REGEX)) == ['C', 'C', 'C', '[Br]']
assert regex_tokenize('[1*:1]CCC[Br]', re.compile(MAPPING_REGEX)) == ['[1*:1]', 'C', 'C', 'C', '[', 'Br', ']']
```
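The definition of `MAPPING_REGEX` is not reproduced here. A pattern consistent with the asserts above might look like the following; this is a hypothetical reconstruction, not the library's actual regex:

```python
import re

# Hypothetical reconstruction of MAPPING_REGEX: keeps Cl, Br, and
# [{isotope}*:{map_num}] terms whole, while other bracketed terms
# split into individual characters
MAPPING_REGEX_SKETCH = """(\[[0-9]*\*:[0-9]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|H|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9]|\[|\])"""

assert re.compile(MAPPING_REGEX_SKETCH).findall('[1*:1]CCC[Br]') == \
       ['[1*:1]', 'C', 'C', 'C', '[', 'Br', ']']
```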
Vocabulary
The `Vocab` class handles tokenization. `Vocab.tokenize` breaks strings down into tokens. `Vocab.numericalize` maps tokens to integers. `Vocab.reconstruct` converts integers back into strings.

`Vocab` holds `itos`, a list of tokens, and `stoi`, a dictionary mapping tokens to integers. `Vocab` automatically adds four special tokens, `['bos', 'eos', 'pad', 'unk']`, indicating beginning of sentence, end of sentence, padding, and unknown.
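To picture the round trip, here is a minimal stand-in for `itos`/`stoi` showing how numericalization and its inverse relate; this is an illustration, not the real class:

```python
# Illustrative stand-in for Vocab's internals
itos = ['bos', 'eos', 'pad', 'unk', 'C', 'N', 'O', '[', ']']
stoi = {token: i for i, token in enumerate(itos)}

tokens = ['bos', 'C', 'C', 'O', 'eos']
ints = [stoi.get(t, stoi['unk']) for t in tokens]  # numericalize; unseen tokens map to unk
assert [itos[i] for i in ints] == tokens           # reconstruction reverses the mapping
```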
Custom Vocabulary
To implement custom tokenization, subclass `Vocab` and update the `tokenize`, `numericalize`, and `reconstruct` methods. Use the `test_reconstruction` function to verify your custom vocab can successfully reconstruct sequences.
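For example, a toy subclass that tokenizes two characters at a time might look like the sketch below. The method name follows the description above, but the constructor and any inherited behavior are assumptions about the base class:

```python
# Toy custom vocab; assumes the inherited numericalize/reconstruct work
# off itos/stoi once the new tokens are in the vocabulary
class PairVocab(Vocab):
    def tokenize(self, smile):
        # split into two-character chunks instead of single characters
        tokens = [smile[i:i + 2] for i in range(0, len(smile), 2)]
        return ['bos'] + tokens + ['eos']
```

Running `test_reconstruction` on such a subclass would confirm whether the inherited `numericalize` and `reconstruct` still round-trip correctly.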
Vocabs also have `prefunc` and `postfunc` hooks for added flexibility. `prefunc` is called on the inputs to `tokenize` before tokenization. `postfunc` is called during `reconstruct` after tokens are joined.
```python
import pandas as pd

df = pd.read_csv('files/smiles.csv')
smiles = df.smiles.values

# an empty result from test_reconstruction means every sequence round-tripped
vocab = CharacterVocab(SMILES_CHAR_VOCAB)
assert test_reconstruction(vocab, smiles) == []

vocab = FuncVocab(SMILES_CHAR_VOCAB, tokenize_by_character)
assert test_reconstruction(vocab, smiles) == []

vocab = CharacterReplaceVocab(SMILES_CHAR_VOCAB, HALOGEN_REPLACE)
assert vocab.tokenize('CC[Br]') == ['bos', 'C', 'C', '[', 'R', ']', 'eos']
assert test_reconstruction(vocab, smiles) == []

vocab = RegexVocab(SMILES_CHAR_VOCAB, SMILE_REGEX)
assert vocab.tokenize('CC[Br]') == ['bos', 'C', 'C', '[Br]', 'eos']
vocab.update_vocab_from_data(smiles)
assert test_reconstruction(vocab, smiles) == []
```
Vocab Example - Removing Stereochemistry
Stereochemistry is represented in SMILES strings using the `@` character. For example, `C[C@H](N)C(=O)O` and `C[C@@H](N)C(=O)O` represent two different stereoisomers of the same compound.
Stereochemistry in SMILES strings can lead to interesting outcomes for ML models. Predictive models can overfit to specific stereocenters present in the training data, and generative models can fall into a form of mode collapse, predicting different stereoisomers of the same compound or generating excessive stereocenters.
For these reasons, we may wish to deal with generic SMILES strings without stereochemistry information.
One way to do this is to bulk preprocess datasets to remove stereochemistry information, but this requires storing a copy of the stereochemistry-free data, which can be prohibitive for large datasets.
Another way is to apply stereochemistry removal as a `prefunc` in our `Vocab`, which removes stereochemistry on the fly before tokenization.
```python
from mrl.chem import *

V1 = CharacterVocab(SMILES_CHAR_VOCAB)
V2 = CharacterVocab(SMILES_CHAR_VOCAB, prefunc=remove_stereo)

assert '@' in V1.tokenize('C[C@H](N)C(=O)O')
assert '@' not in V2.tokenize('C[C@H](N)C(=O)O')

assert V1.reconstruct(V1.numericalize(V1.tokenize('C[C@H](N)C(=O)O'))) == 'C[C@H](N)C(=O)O'
assert V2.reconstruct(V2.numericalize(V2.tokenize('C[C@H](N)C(=O)O'))) == 'CC(N)C(=O)O'
```
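`remove_stereo` comes from `mrl.chem`; one common way to implement such a function with RDKit looks like the sketch below (an illustration, not necessarily MRL's implementation):

```python
from rdkit import Chem

def remove_stereo_sketch(smile):
    # round-trip through an RDKit Mol with stereochemistry stripped
    mol = Chem.MolFromSmiles(smile)
    Chem.RemoveStereochemistry(mol)
    return Chem.MolToSmiles(mol)

assert remove_stereo_sketch('C[C@H](N)C(=O)O') == 'CC(N)C(=O)O'
```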
Vocab Example - SELFIES Vocab
SELFIES is a molecule text representation that serves as an alternative to SMILES strings. One advantage of SELFIES representations is that token swaps in SELFIES strings always result in valid compounds.
MRL is standardized around using SMILES representations for working with compounds, so we can't use SELFIES representations for RL training (unless you want to write a parallel SELFIES-compatible version of all the comp chem functions).
Luckily, we can make use of the `prefunc` and `postfunc` utilities to use SELFIES representations in our generative models. We can use `smile_to_selfie` as a prefunc and `selfie_to_smile` as a postfunc.
This means we can keep all our data in SMILES form for compatibility. When we process SMILES for the model, we first use the `prefunc` to convert SMILES to SELFIES. Then we tokenize, numericalize, and train models in SELFIES space. When we reconstruct a sequence, we use the `postfunc` to convert it back into a SMILES string.
From the outside perspective, the model takes in and produces SMILES strings. But internally, everything is in SELFIES space.
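Since SELFIES tokens are always bracket-delimited, a splitter like `split_selfie` can be sketched with a one-line regex (the `selfies` package also provides its own `split_selfies` helper):

```python
import re

def split_selfie_sketch(selfie):
    # every SELFIES token has the form [...], so grab each bracketed term
    return re.findall(r'\[[^\]]*\]', selfie)

assert split_selfie_sketch('[C][O][C]') == ['[C]', '[O]', '[C]']
```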
```python
from mrl.chem import *

vocab = FuncVocab(SELFIES_VOCAB, split_selfie,
                  prefunc=smile_to_selfie, postfunc=selfie_to_smile)

smile = 'COc1ccc2[nH]cc(CCNC(C)=O)c2c1'
print(f'SMILE: {smile}')

selfie = vocab.prefunc(smile)
print(f'SELFIE: {selfie}')

# note: this round trip only works if `smile` is canonicalized
assert vocab.postfunc(vocab.prefunc(smile)) == smile

full_recon = vocab.reconstruct(vocab.numericalize(vocab.tokenize(smile)))
partial_recon = vocab.join_tokens(vocab._reconstruct(vocab.numericalize(vocab.tokenize(smile))))

assert full_recon == smile
assert partial_recon == selfie
```