Rewards - non-differentiable scores for samples

Overview

Rewards are non-differentiable score functions for evaluating samples. Rewards should generally follow the format reward = reward_function(sample)

Rewards in MRL occupy five events in the fit loop (a minimal callback sketch follows the list):

  • before_compute_reward - set up necessary values prior to reward calculation (if needed)
  • compute_reward - compute reward
  • after_compute_reward - compute metrics (if needed)
  • reward_modification - adjust rewards
  • after_reward_modification - compute metrics (if needed)
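
As a rough sketch, a custom callback would implement methods named after whichever of these events it needs. The method-per-event pattern and the Callback base class shown below are assumptions drawn from the callbacks documented later on this page; in practice, RewardCallback and the reward modification callbacks below handle these hooks for you.

# sketch only - not a verbatim MRL API; Callback is assumed to be the library's base callback class
class MyRewardHooks(Callback):
    def before_compute_reward(self):
        # set up values the reward calculation needs (if needed)
        pass

    def compute_reward(self):
        # compute a reward for each sample in the batch
        pass

    def after_compute_reward(self):
        # compute metrics from the raw rewards (if needed)
        pass

    def reward_modification(self):
        # adjust the rewards used for this batch (changes are not logged)
        pass

    def after_reward_modification(self):
        # compute metrics from the adjusted rewards (if needed)
        pass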

Rewards vs Reward Modifications

MRL breaks rewards up into two phases - rewards and reward modifications. The difference between the two is that reward values are saved in the batch log, while reward modification values are not.

In this framework, rewards are absolute scores for samples that are used to evaluate the sample relative to all other samples in the log. Reward modifications are transient scores that depend on the current training context.

A reward modification might be something like adding a score bonus to compounds the first time they are created during training to encourage diversity, or penalizing compounds that appear more than 3 times in the last 5 batches. These types of reward modifications allow us to influence the behavior of the generative model without having these scores affect the true rewards we save in the log.

Reward Class

As mentioned above, rewards generally follow the format reward = reward_function(sample). The Reward class acts as a wrapper around the reward_function to provide some convenience functions. Reward maintains a lookup table of sample : reward values to avoid repeat computation. Reward handles batching novel samples (ie not in the lookup table), sending them to reward_function, and merging the outputs with the lookup table values.

Creating a custom reward involves creating a callable function or object that can take in a list of samples and return a list of reward values. For example:

class MyRewardFunction():
    def __call__(self, samples):
        # do_reward_calculation is a placeholder for your scoring logic;
        # it should return one numeric reward per input sample
        rewards = self.do_reward_calculation(samples)
        return rewards

reward_function = MyRewardFunction()
reward = Reward(reward_function, weight=0.5, log=True)
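
In training, the wrapped Reward is driven by a RewardCallback (described below) during the compute_reward event rather than called by hand, but the caching behavior works roughly as follows. The direct calls and the SMILES samples here are purely illustrative:

scores_1 = reward(['CCO', 'CCN'])    # both samples are novel, so both go to reward_function
scores_2 = reward(['CCO', 'CCC'])    # 'CCO' is served from the lookup table; only 'CCC' is scored
# all returned values are scaled by weight=0.5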

class Reward[source]

Reward(reward_function, weight=1, bs=None, device=True, log=True)

Reward - wrapper for reward_function. Handles batching and value lookup

Inputs:

  • reward_function Callable: function with the format rewards = reward_function(samples)

  • weight float: weight to scale rewards

  • bs Optional[int]: if given, samples will be batched into chunks of size bs and sent to reward_function as batches (see the example after this list)

  • device Optional[bool]: if True, reward function output is mapped to device. see to_device

  • log bool: if True, keeps a lookup table of sample : reward values to avoid repeat computation
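
For example, using the documented signature (my_scoring_fn is a hypothetical callable following rewards = my_scoring_fn(samples)):

# novel samples are scored in chunks of 64; previously seen samples come from the lookup table
batched_reward = Reward(my_scoring_fn, weight=1.0, bs=64, log=True)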

Reward Callback

RewardCallback handles fit loop integration and metric logging for a given Reward.

class RewardCallback[source]

RewardCallback(reward, name, sample_name='samples', order=10, track=True) :: Callback

RewardCallback - callback wrapper for Reward used during compute_reward event

Inputs:

  • reward Reward: reward to use

  • name str: reward name

  • sample_name str: sample name to grab from BatchState to send to reward

  • order int: callback order

  • track bool: if metrics should be tracked from this callback
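
Based on the signature above, a typical setup wraps a Reward and gives it a name; how the name appears in tracked metrics is an assumption here.

# wrap the Reward defined earlier for use during the compute_reward event
reward_cb = RewardCallback(reward, name='my_reward')
# each batch, the callback grabs BatchState.samples (sample_name='samples'),
# scores them with the Reward, and tracks the resulting values under 'my_reward'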

For greater flexibility, GenericRewardCallback will pass the entire BatchState to reward

class GenericRewardCallback[source]

GenericRewardCallback(reward, name, order=10, track=True) :: RewardCallback

GenericRewardCallback - generic reward wrapper

Inputs:

  • reward Callable: reward function. Reward will be passed the entire batch state

  • name str: reward name

  • order int: callback order

  • track bool: if metrics should be tracked from this callback
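
A sketch of a reward that uses the full batch state rather than just the samples; the batch_state.samples attribute accessed here is an assumption about what BatchState exposes, for illustration only.

def state_aware_reward(batch_state):
    # receives the entire BatchState rather than a list of samples
    samples = batch_state.samples          # assumed attribute, for illustration
    return [float(len(s)) for s in samples]

generic_cb = GenericRewardCallback(state_aware_reward, name='state_reward')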

Reward Modification

As discussed above, reward modifications apply changes to rewards based on some sort of transient batch context. These modifications influence the rewards used for a given batch, but not the rewards saved in the log.

Reward modifications should update the value BatchState.rewards_final

class RewardModification[source]

RewardModification() :: Event

RewardModification

This event is used to modify rewards before they are used to compute the model's loss. Reward modifications encompass changes to rewards in the context of the current training cycle. These are things like "give a score bonus to new samples that haven't been seen before" or "penalize the score of samples that have occurred in the last 5 batches".

These types of modifications are kept separate from the core reward for logging purposes. Samples are logged with their respective rewards. These logged scores are referenced later when samples are drawn from the log. This means we need the logged score to be independent of "batch context" type scores.

All reward modifications should be applied to self.environment.batch_state.rewards
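
As a sketch, a modification penalizing heavily repeated samples might look like the following. It assumes a callback-style hook named after the reward_modification event, an indexable batch_state.rewards, and simple total-occurrence counting rather than the sliding 5-batch window described above; treat the details as illustrative rather than as the MRL API.

class RepeatPenalty(Callback):                 # Callback: MRL base callback class (assumed)
    def __init__(self, penalty=0.2):
        super().__init__()
        self.penalty = penalty
        self.counts = {}                       # bookkeeping of how often each sample has appeared

    def reward_modification(self):
        batch_state = self.environment.batch_state
        for i, sample in enumerate(batch_state.samples):
            self.counts[sample] = self.counts.get(sample, 0) + 1
            if self.counts[sample] > 3:
                # transient change: affects the rewards used for this batch,
                # but not the rewards saved in the log
                batch_state.rewards[i] -= self.penalty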

class NoveltyReward[source]

NoveltyReward(weight=1.0, track=True) :: Callback

NoveltyReward - gives a reward bonus for new samples. Rewards are given a bonus of weight

Inputs:

  • weight float: novelty score weight

  • track bool: if metrics should be tracked from this callback
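
For example, giving first-time samples a half-point bonus; the arithmetic below just restates the behavior described above.

novelty_cb = NoveltyReward(weight=0.5)
# a new sample scoring 0.8 from the main reward is treated as 0.8 + 0.5 = 1.3 for this batch;
# the logged reward stays 0.8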

Contrastive Reward

Similar to ContrastiveTemplate, ContrastiveReward provides a wrapper around a RewardCallback to adapt it for the task of contrastive generation.

For contrastive generation, we want the model to ingest a source sample and produce a target sample that receives a higher reward than the source sample. ContrastiveReward takes some base_reward and computes the values of that base reward for both source and target samples, and returns the difference between those rewards.

Optionally, the contrastive reward will scale the relative reward based on a given max_score (ie reward = (target_reward - source_reward)/(max_score - source_reward)). This scales the contrastive reward relative to the maximum possible reward.
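
For example, using the documented signature (reward_cb is a RewardCallback like the one shown earlier; the numbers are illustrative):

contrastive_cb = ContrastiveReward(reward_cb, max_score=1.0)
# if a source sample scores 0.4 and the generated target scores 0.7:
# reward = (0.7 - 0.4) / (1.0 - 0.4) = 0.5
# i.e. the target recovers half of the improvement still available from the source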

class ContrastiveReward[source]

ContrastiveReward(base_reward, max_score=None) :: RewardCallback

ContrastiveReward - contrastive wrapper for reward callbacks

Inputs:

  • base_reward RewardCallback: base reward callback

  • max_score Optional[float]: maximum possible score. If given, contrastive rewards are scaled following reward = (target_reward - source_reward)/(max_score - source_reward)