Sampling callbacks

Sampler Callbacks

Samplers serve two main functions in the fit loop.

During the build_buffer event, samplers add samples to the Buffer

During the sample_batch event, samplers add samples to the current BatchState

Samplers generally have the ability to toggle which events they add samples to. For example, to do entirely offline RL, you could disable live sampling during the sample_batch event and train only on samples stored in the buffer.

Sampler Size

Samplers have two main parameters that control sample size.

The buffer_size parameter is an integer value that controls how many samples are generated during the build_buffer event.

The p_batch parameter is a float value between 0 and 1 that determines what percentage of a batch should be drawn from a specific sampler. When using multiple samplers, the sum of all p_batch values should be less than or equal to 1. Any remaining fraction of the batch (1 minus the sum of the p_batch values) is filled by sampling from the buffer.
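For example, with a batch size of 100 and two samplers set to p_batch=0.3 and p_batch=0.2, the samplers contribute 30 and 20 samples and the remaining 50 come from the buffer. A minimal sketch of this arithmetic (illustrative only, not the MRL implementation):

```python
# Hypothetical helper showing how p_batch values partition a batch:
# each sampler contributes p_batch * batch_size samples, and the
# remainder is drawn from the buffer.
def batch_composition(batch_size, p_batches):
    """Return (per-sampler counts, number of samples drawn from the buffer)."""
    assert sum(p_batches) <= 1.0, "p_batch values must sum to <= 1"
    counts = [int(batch_size * p) for p in p_batches]
    from_buffer = batch_size - sum(counts)
    return counts, from_buffer

counts, from_buffer = batch_composition(100, [0.3, 0.2])
print(counts, from_buffer)  # [30, 20] 50
```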

class Sampler[source]

Sampler(name, buffer_size=0, p_batch=0.0, track=True) :: Callback

Sampler - base sampler callback

Inputs:

  • name str: sampler name

  • buffer_size int: how many samples to add during build_buffer

  • p_batch float: what percentage of batch samples should come from this sampler

  • track bool: if metrics from this sampler should be tracked

class DatasetSampler[source]

DatasetSampler(data, name, buffer_size=0, p_batch=0.0) :: Sampler

DatasetSampler - adds items from data to either the buffer or the current batch

Inputs:

  • data list: list of data points to sample from

  • name str: sampler name

  • buffer_size int: how many samples to add during build_buffer

  • p_batch float: what percentage of batch samples should come from this sampler
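To make the two sampling events concrete, here is a tiny stand-in with the same constructor parameters as DatasetSampler (illustrative only; the real class hooks into the fit loop's callback events rather than being called directly):

```python
import random

# A minimal stand-in illustrating the behavior DatasetSampler provides.
class TinyDatasetSampler:
    def __init__(self, data, name, buffer_size=0, p_batch=0.0):
        self.data = list(data)
        self.name = name
        self.buffer_size = buffer_size
        self.p_batch = p_batch

    def build_buffer(self):
        # draw buffer_size random items to add to the buffer
        return random.choices(self.data, k=self.buffer_size)

    def sample_batch(self, batch_size):
        # contribute p_batch * batch_size items to the current batch
        return random.choices(self.data, k=int(batch_size * self.p_batch))

s = TinyDatasetSampler(['CCO', 'c1ccccc1', 'CCN'], 'dataset',
                       buffer_size=4, p_batch=0.25)
print(len(s.build_buffer()), len(s.sample_batch(8)))  # 4 2
```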

class CombichemSampler[source]

CombichemSampler(cbc, library_size, percentile, p_explore, start_iter, lookup_name, name) :: Sampler

CombichemSampler - uses combichem to add samples to the buffer

Inputs:

  • cbc CombiChem: Combichem module

  • library_size int: number of compounds to put into combichem

  • percentile int[1,100]: percentile of compounds to sample from

  • p_explore float[0.,1.]: percent of compounds below percentile to sample

  • start_iter int: iteration to start combichem

  • lookup_name str: column in Log to use to calculate percentile

  • name str: callback name

Model Sampler

The ModelSampler sampler can be used to draw samples from any GenerativeModel subclass model. By default, it will track the following sample metrics:

  • diversity - how many duplicate samples were generated
  • valid - how many samples are left after filtering
  • rewards - average rewards from samples generated by the model sampler
  • new - how many samples are novel to the training run
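One way these metrics could be computed from a round of generation is sketched below (illustrative only; expressed here as fractions, though MRL may report raw counts). `seen` stands in for the set of samples observed earlier in the training run:

```python
# Sketch of the tracked sample metrics (hypothetical, not MRL's code).
def sample_metrics(raw_samples, valid_samples, rewards, seen):
    unique = set(valid_samples)
    return {
        'diversity': len(set(raw_samples)) / max(len(raw_samples), 1),
        'valid': len(valid_samples) / max(len(raw_samples), 1),
        'rewards': sum(rewards) / max(len(rewards), 1),
        'new': len(unique - seen) / max(len(unique), 1),
    }

m = sample_metrics(['a', 'a', 'b', 'c'], ['a', 'b'], [1.0, 3.0], seen={'a'})
print(m)  # {'diversity': 0.75, 'valid': 0.5, 'rewards': 2.0, 'new': 0.5}
```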

class ModelSampler[source]

ModelSampler(vocab, model, name, buffer_size, p_batch, genbatch, track=True, temperature=1.0) :: Sampler

ModelSampler - sampler class to draw samples from a GenerativeModel

Inputs:

  • vocab Vocab: vocabulary for reconstructing samples

  • model GenerativeModel: model to sample from

  • name str: sampler name

  • buffer_size int: number of samples to generate during build_buffer

  • p_batch float: what percentage of batch samples should come from this sampler

  • genbatch int: generation batch size

  • track bool: if metrics from this sampler should be tracked

  • temperature float: sampling temperature

Prior Sampler

PriorSampler allows for sampling based on latent vectors from a specific prior distribution. If desired, this prior can also be optimized during the fit loop

class PriorSampler[source]

PriorSampler(vocab, model, prior, name, buffer_size, p_batch, genbatch, track=True, track_losses=True, train=True, train_all=False, prior_loss=None, temperature=1.0, opt_kwargs={}) :: ModelSampler

PriorSampler - sampler class to draw samples from latent vectors generated by prior

Inputs:

  • vocab Vocab: vocabulary for reconstructing samples

  • model GenerativeModel: model to sample from

  • prior nn.Module: prior to sample latent vectors from

  • name str: sampler name

  • buffer_size int: number of samples to generate during build_buffer

  • p_batch float: what percentage of batch samples should come from this sampler

  • genbatch int: generation batch size

  • track bool: if metrics from this sampler should be tracked

  • track_losses bool: if prior losses should be tracked (ignored if no prior loss is given)

  • train bool: if the prior should be trained (requires prior loss)

  • train_all bool: if the prior should be trained on all samples in a batch or just ones from this specific sampler

  • prior_loss Optional: loss function for prior. See PriorLoss for an example

  • temperature float: sampling temperature
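A diagonal Gaussian is a common choice of prior. Sampling latent vectors from one might look like the following sketch (the `decode` function is a hypothetical stand-in for the generative model, not MRL's API):

```python
import random

# Sketch: draw latent vectors from a diagonal Gaussian prior, then hand
# them to a decoder. `decode` is a placeholder, not a real MRL call.
def sample_latents(mu, sigma, n):
    """Draw n latent vectors from N(mu, diag(sigma^2))."""
    return [[random.gauss(m, s) for m, s in zip(mu, sigma)] for _ in range(n)]

def decode(latents):
    # placeholder for model generation conditioned on the latent vectors
    return [f"sample_{i}" for i, _ in enumerate(latents)]

zs = sample_latents(mu=[0.0, 0.0, 0.0], sigma=[1.0, 1.0, 1.0], n=5)
samples = decode(zs)
print(len(zs), len(zs[0]), len(samples))  # 5 3 5
```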

Latent Sampler

LatentSampler allows for sampling based on specific latent vectors. If desired, the latent vectors can also be optimized during the fit loop

class LatentSampler[source]

LatentSampler(vocab, model, latents, name, buffer_size, p_batch, genbatch, track=True, train=True, temperature=1.0, opt_kwargs={}) :: ModelSampler

LatentSampler - sampler class to draw samples from a fixed set of latent vectors

Inputs:

  • vocab Vocab: vocabulary for reconstructing samples

  • model GenerativeModel: model to sample from

  • latents torch.FloatTensor[n_latents, d_latents]: tensor of latent vectors. n_latents can be any value

  • name str: sampler name

  • buffer_size int: number of samples to generate during build_buffer

  • p_batch float: what percentage of batch samples should come from this sampler

  • genbatch int: generation batch size

  • track bool: if metrics from this sampler should be tracked

  • train bool: if the latent vectors should be trained

  • temperature float: sampling temperature

Contrastive Sampler

So far samplers have focused on drawing individual samples from a model, dataset, or other source.

Contrastive sampling looks at the task of generating a new sample based on an old sample. For example, we might want to train a model that takes in a compound and produces a different compound with high similarity to the original but a better score on some metric.

In these cases, the samples we create will be a tuple in the form (source_sample, target_sample). When training a contrastive metric, we may have a pre-made dataset of source, target pairs to use. However, if such paired data doesn't exist, we need some way to generate it on the fly. This is where the ContrastiveSampler class comes in.

ContrastiveSampler turns any normal Sampler into a contrastive sampler. The contrastive sampler uses a base_sampler to generate an initial set of source samples. Then the contrastive sampler uses an output_model to generate target samples on the fly.

This generation process can be run during build_buffer or sample_batch.

Note that the ContrastiveSampler does not do any batch or buffer filtering to ensure source, target pairs match external constraints like minimum similarity. This must be handled by other callbacks, such as ContrastiveTemplate.

class ContrastiveSampler[source]

ContrastiveSampler(base_sampler, vocab, dataset, output_model, bs, repeats=1) :: Sampler

ContrastiveSampler - contrastive sampling wrapper. Uses base_sampler to generate source sequences. Then uses output_model to generate target sequences. Adds tuple pairs of (source, target) to batch/buffer

Inputs:

  • base_sampler Sampler: base sampler to generate source samples

  • vocab Vocab: vocab for reconstruction

  • dataset Base_Dataset: dataset for tensorizing samples

  • output_model GenerativeModel: model to generate target samples

  • bs int: batch size for contrastive generation

  • repeats int: how many target samples to draw from each source sample
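The pairing logic with `repeats` can be sketched as follows (illustrative only; the real sampler draws sources from base_sampler and targets from output_model rather than taking plain functions):

```python
# Sketch of (source, target) pair generation with repeats.
def make_pairs(sources, generate_target, repeats=1):
    """Pair each source with `repeats` generated targets."""
    return [(src, generate_target(src))
            for src in sources
            for _ in range(repeats)]

# Toy target generator standing in for a generative model.
pairs = make_pairs(['CCO', 'CCN'], lambda s: s + 'C', repeats=2)
print(pairs)
# [('CCO', 'CCOC'), ('CCO', 'CCOC'), ('CCN', 'CCNC'), ('CCN', 'CCNC')]
```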

Log Samplers

Log samplers pull high scoring samples out of the log for retraining

class LogSampler[source]

LogSampler(sample_name, lookup_name, start_iter, percentile, buffer_size) :: Sampler

LogSampler - pulls samples from log based on percentile

Inputs:

  • sample_name str: what column in Log.df to pull samples from

  • lookup_name str: what column in Log.df to use to find high scoring samples

  • start_iter int: iteration to start drawing from log

  • percentile int[1,100]: percentile of log data to sample from

  • buffer_size int: number of samples to generate during build_buffer
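A sketch of percentile-based sampling from a log (illustrative only, not MRL's implementation): rank the logged samples by score, keep the top fraction implied by the percentile, then draw from that pool.

```python
import random

# Hypothetical helper: sample from the rows at or above `percentile`.
def sample_from_log(samples, scores, percentile, n):
    ranked = sorted(zip(scores, samples), reverse=True)
    n_top = int(len(scores) * (1 - percentile / 100))
    top = [s for _, s in ranked[:max(n_top, 1)]]
    return random.choices(top, k=n)

scores = [0.1, 0.9, 0.5, 0.8, 0.3, 0.7, 0.2, 0.6, 0.4, 1.0]
picks = sample_from_log([f"mol{i}" for i in range(10)], scores,
                        percentile=90, n=3)
print(picks)  # only the top-scoring sample qualifies at the 90th percentile
```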

class TokenSwapSampler[source]

TokenSwapSampler(sample_name, lookup_name, start_iter, percentile, buffer_size, vocab, swap_percent) :: LogSampler

TokenSwapSampler - samples high scoring samples from Log.df and enumerates variants by swapping tokens. Note that token swapped samples are not guaranteed to be chemically valid. This sampler works best with SELFIES representation

Inputs:

  • sample_name str: what column in Log.df to pull samples from

  • lookup_name str: what column in Log.df to use to find high scoring samples

  • start_iter int: iteration to start drawing from log

  • percentile int[1,100]: percentile of log data to sample from

  • buffer_size int: number of samples to generate during build_buffer

  • vocab Vocab: vocab to numericalize samples

  • swap_percent float: percent of tokens to swap per sample
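The token-swapping step can be sketched like this (illustrative only; as noted above, swapped outputs are not guaranteed to be chemically valid, which is why a robust representation like SELFIES works best):

```python
import random

# Hypothetical token swap: replace swap_percent of a sample's tokens
# with random tokens drawn from the vocab.
def swap_tokens(tokens, vocab, swap_percent):
    n_swap = max(1, int(len(tokens) * swap_percent))
    out = list(tokens)
    for i in random.sample(range(len(tokens)), n_swap):
        out[i] = random.choice(vocab)
    return out

tokens = ['[C]', '[C]', '[O]', '[N]']
variant = swap_tokens(tokens, vocab=['[C]', '[O]', '[N]', '[F]'],
                      swap_percent=0.5)
print(variant)  # same length, with up to 2 positions changed
```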

class LogEnumerator[source]

LogEnumerator(sample_name, lookup_name, start_iter, percentile, buffer_size, atom_types=None) :: LogSampler

LogEnumerator - pulls high scoring samples from Log.df and performs simple enumeration by adding one atom or one bond to the sample. Note that this process can create a large number of samples, so the value of buffer_size should accordingly be kept low (3-5). See add_atom_combi and add_bond_combi for more details on how the enumeration is done

Inputs:

  • sample_name str: what column in Log.df to pull samples from

  • lookup_name str: what column in Log.df to use to find high scoring samples

  • start_iter int: iteration to start drawing from log

  • percentile int[1,100]: percentile of log data to sample from

  • buffer_size int: number of samples to generate during build_buffer

  • atom_types list: list of allowed atom types to swap

class Timeout[source]

Timeout(timeout_length, timeout_function=None, track=True, name='timeout') :: Callback

Timeout - puts samples in "timeout" to prevent training on the same sample too frequently. Samples are only allowed to be trained on once every timeout_length batches

Inputs:

  • timeout_length int: number of batches to put molecule in timeout

  • timeout_function Optional[Callable]: preprocessing function for samples

  • track bool: if metrics from this callback should be tracked

  • name str: callback name
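The timeout bookkeeping can be sketched as a map from sample to the batch it was last trained on (illustrative only, not MRL's implementation):

```python
# Minimal stand-in for the timeout behavior: a sample passes the filter
# at most once every `timeout_length` batches.
class TinyTimeout:
    def __init__(self, timeout_length):
        self.timeout_length = timeout_length
        self.last_seen = {}   # sample -> batch number it last passed
        self.batch = 0

    def filter(self, samples):
        self.batch += 1
        allowed = []
        for s in samples:
            last = self.last_seen.get(s)
            if last is None or self.batch - last >= self.timeout_length:
                allowed.append(s)
                self.last_seen[s] = self.batch
        return allowed

t = TinyTimeout(timeout_length=2)
r1 = t.filter(['a', 'b'])   # batch 1: both pass
r2 = t.filter(['a', 'c'])   # batch 2: 'a' still in timeout, 'c' passes
r3 = t.filter(['a'])        # batch 3: 'a' has served its timeout
print(r1, r2, r3)  # ['a', 'b'] ['c'] ['a']
```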

class MurckoTimeout[source]

MurckoTimeout(timeout_length, generic=False, track=True, name='murcko_timeout') :: Timeout

MurckoTimeout - puts samples in "timeout" to prevent training on the same sample too frequently. Samples are only allowed to be trained on once every timeout_length batches. MurckoTimeout identifies samples by their Murcko scaffold

Inputs:

  • timeout_length int: number of batches to put molecule in timeout

  • generic bool: if True, Murcko scaffolds will be made generic (all carbon, single bonds) before evaluation

  • track bool: if metrics from this callback should be tracked

  • name str: callback name