Sampler Callbacks
Samplers serve two main functions in the fit loop:
- During the build_buffer event, samplers add samples to the Buffer
- During the sample_batch event, samplers add samples to the current BatchState
Samplers can generally toggle which events they add samples to. For example, to do entirely offline RL, you could disable live sampling during the sample_batch event and train only on samples stored in the buffer.
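As a sketch of how per-event toggling might look, here is a toy sampler with boolean flags for each event. The class name, flag names, and method signatures here are illustrative assumptions, not the library's actual API:

```python
import random

class ToySampler:
    """Illustrative sampler with per-event toggles (hypothetical API)."""

    def __init__(self, buffer_active=True, batch_active=True):
        self.buffer_active = buffer_active  # contribute during build_buffer?
        self.batch_active = batch_active    # contribute during sample_batch?

    def sample(self, n):
        return [random.random() for _ in range(n)]

    def build_buffer(self, buffer, n):
        # only add samples to the buffer if buffer sampling is enabled
        if self.buffer_active:
            buffer.extend(self.sample(n))

    def sample_batch(self, batch, n):
        # disabling this event gives fully offline training from the buffer
        if self.batch_active:
            batch.extend(self.sample(n))

# offline RL setup: fill the buffer once, never sample live
offline = ToySampler(buffer_active=True, batch_active=False)
buffer, batch = [], []
offline.build_buffer(buffer, 100)
offline.sample_batch(batch, 16)
print(len(buffer), len(batch))  # 100 0
```

With both flags enabled, the same sampler would contribute to the buffer and to every live batch.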
Sampler Size
Samplers have two main parameters that control sample size.
The buffer_size parameter is an integer that controls how many samples are generated during the build_buffer event.
The p_batch parameter is a float between 0 and 1 that determines what fraction of a batch should be drawn from a specific sampler. When using multiple samplers, the sum of all p_batch values should be less than or equal to 1. Any difference between the sum of the p_batch values and the desired batch size is made up by sampling from the buffer.
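The batch-splitting arithmetic can be made concrete with a small helper. This function is an illustration of the described behavior, not part of the library, and it assumes each sampler contributes int(p_batch * batch_size) samples with the buffer filling the remainder:

```python
def batch_composition(batch_size, p_batch_values):
    """Split a batch between samplers (by p_batch) and the buffer.

    Illustrative helper, not the library's implementation.
    """
    assert sum(p_batch_values) <= 1.0, "p_batch values must sum to <= 1"
    # each sampler's share of the batch, rounded down
    from_samplers = [int(p * batch_size) for p in p_batch_values]
    # the buffer makes up whatever the samplers did not cover
    from_buffer = batch_size - sum(from_samplers)
    return from_samplers, from_buffer

# batch of 128 with two samplers at p_batch=0.25 and p_batch=0.5:
# 32 + 64 live samples, with the remaining 32 drawn from the buffer
print(batch_composition(128, [0.25, 0.5]))  # ([32, 64], 32)
```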
Model Sampler
The ModelSampler can be used to draw samples from any GenerativeModel subclass. By default, it tracks the following sample metrics:
- diversity - how many duplicate samples were generated
- valid - how many samples are left after filtering
- rewards - average rewards from samples generated by the model sampler
- new - how many samples are novel to the training run
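The duplicate, validity, and novelty metrics above reduce to simple set logic. The following is a rough sketch of how such metrics could be computed; the function name, metric definitions, and return format are assumptions, not the library's implementation:

```python
def sample_metrics(samples, valid_filter, seen):
    """Compute rough analogues of the ModelSampler metrics (illustrative).

    samples: raw generated samples (may contain duplicates)
    valid_filter: predicate marking which samples survive filtering
    seen: set of samples already encountered in the training run
    """
    unique = set(samples)
    valid = [s for s in unique if valid_filter(s)]
    new = [s for s in valid if s not in seen]
    return {
        "diversity": len(unique) / len(samples),  # share of non-duplicates
        "valid": len(valid),                      # samples left after filtering
        "new": len(new),                          # samples novel to the run
    }

metrics = sample_metrics(
    samples=["CCO", "CCO", "CCN", "xx"],
    valid_filter=lambda s: s != "xx",  # toy validity check
    seen={"CCN"},
)
print(metrics)  # {'diversity': 0.75, 'valid': 2, 'new': 1}
```

The rewards metric would additionally average the reward assigned to each generated sample, which requires the fit loop's scoring function and is omitted here.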
Prior Sampler
PriorSampler allows for sampling based on latent vectors drawn from a specified prior distribution. If desired, this prior can also be optimized during the fit loop.
Latent Sampler
LatentSampler allows for sampling based on a specific set of latent vectors. If desired, the latent vectors can also be optimized during the fit loop.
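The distinction between the two latent-based samplers can be sketched with plain Python. A real implementation would use learnable tensors so the prior or the latent vectors can be optimized; the classes below are dependency-free illustrations with assumed names, not the library's code:

```python
import random

class ToyPriorSampler:
    """Draw latent vectors from a diagonal Gaussian prior (illustrative).

    In a real sampler, mean and std could be learnable parameters
    optimized during the fit loop; here they are plain floats.
    """

    def __init__(self, mean, std, latent_dim):
        self.mean = mean
        self.std = std
        self.latent_dim = latent_dim

    def sample(self, n):
        # n latent vectors with entries drawn from N(mean, std^2)
        return [[random.gauss(self.mean, self.std)
                 for _ in range(self.latent_dim)] for _ in range(n)]

class ToyLatentSampler:
    """Draw from a fixed set of latent vectors (illustrative).

    The stored vectors themselves could be optimized during training.
    """

    def __init__(self, latents):
        self.latents = latents

    def sample(self, n):
        return random.choices(self.latents, k=n)

prior = ToyPriorSampler(mean=0.0, std=1.0, latent_dim=8)
zs = prior.sample(4)
print(len(zs), len(zs[0]))  # 4 8
```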
Contrastive Sampler
So far samplers have focused on drawing individual samples from a model, dataset, or other source.
Contrastive sampling looks at the task of generating a new sample based on an old one. For example, we might want to train a model that takes in a compound and produces a different compound with high similarity to the original but a better score on some metric.
In these cases, the samples we create will be tuples of the form (source_sample, target_sample). When training a contrastive model, we may have a pre-made dataset of (source, target) pairs to use. However, if such paired data doesn't exist, we need some way to generate it on the fly. This is where the ContrastiveSampler class comes in.
ContrastiveSampler turns any normal Sampler into a contrastive sampler. It uses a base_sampler to generate an initial set of source samples, then uses an output_model to generate target samples on the fly.
This generation process can be run during build_buffer or sample_batch.
Note that the ContrastiveSampler does not do any batch or buffer filtering to ensure (source, target) pairs match external constraints like minimum similarity. This must be handled by other callbacks, such as ContrastiveTemplate.
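The wrapping pattern described above can be sketched as follows. The argument names base_sampler and output_model mirror the description, but the class body is an assumed minimal sketch, not the library's code:

```python
class ToyContrastiveSampler:
    """Wrap a base sampler to emit (source, target) pairs (illustrative)."""

    def __init__(self, base_sampler, output_model):
        self.base_sampler = base_sampler  # generates source samples
        self.output_model = output_model  # maps each source to a target

    def sample(self, n):
        sources = self.base_sampler(n)
        # pair each source with a generated target; note no filtering is
        # done here - constraints like minimum similarity must be enforced
        # by other callbacks
        return [(s, self.output_model(s)) for s in sources]

# toy base sampler and "model" that uppercases a SMILES-like string
sampler = ToyContrastiveSampler(
    base_sampler=lambda n: ["cco"] * n,
    output_model=lambda s: s.upper(),
)
print(sampler.sample(2))  # [('cco', 'CCO'), ('cco', 'CCO')]
```

A real output_model would be a generative model conditioned on the source sample, and the generation could run during either build_buffer or sample_batch.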