Overview
Rewards are non-differentiable score functions for evaluating samples. Rewards should generally follow the format reward = reward_function(sample)
Rewards in MRL occupy five events in the fit loop:
- before_compute_reward - set up necessary values prior to reward calculation (if needed)
- compute_reward - compute reward
- after_compute_reward - compute metrics (if needed)
- reward_modification - adjust rewards
- after_reward_modification - compute metrics (if needed)
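The sketch below is illustrative only, not MRL's actual fit loop; it shows how callbacks defining these five event hooks might be invoked in order for a single batch. The process_batch_rewards function and the batch_state argument are hypothetical names used for illustration.

REWARD_EVENTS = [
    'before_compute_reward',
    'compute_reward',
    'after_compute_reward',
    'reward_modification',
    'after_reward_modification',
]

def process_batch_rewards(callbacks, batch_state):
    # call each callback's hook (if defined) for every event, in order
    for event in REWARD_EVENTS:
        for cb in callbacks:
            hook = getattr(cb, event, None)
            if hook is not None:
                hook(batch_state)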
Rewards vs Reward Modifications
MRL breaks rewards up into two phases - rewards and reward modifications. The difference between the two phases is that reward values are saved in the batch log, while reward modifications are not.
In this framework, rewards are absolute scores for samples that are used to evaluate the sample relative to all other samples in the log. Reward modifications are transient scores that depend on the current training context.
A reward modification might be something like adding a score bonus to compounds the first time they are created during training to encourage diversity, or penalizing compounds if they appear more than 3 times in the last 5 batches. These types of reward modifications allow us to influence the behavior of the generative model without having these scores affect the true rewards we save in the log.
Reward Class
As mentioned above, rewards generally follow the format reward = reward_function(sample). The Reward class acts as a wrapper around the reward_function to provide some convenience functions. Reward maintains a lookup table of sample : reward values to avoid repeat computation. Reward handles batching novel samples (ie not in the lookup table), sending them to reward_function, and merging the outputs with the lookup table values.
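As a rough illustration of the caching behavior described above (this is not the actual Reward implementation; the class name and internals are hypothetical):

class SimpleRewardCache():
    def __init__(self, reward_function):
        self.reward_function = reward_function
        self.lookup = {}  # sample : reward lookup table

    def __call__(self, samples):
        # only novel samples are sent to the (potentially expensive) reward function
        novel = [s for s in samples if s not in self.lookup]
        if novel:
            new_rewards = self.reward_function(novel)
            self.lookup.update(dict(zip(novel, new_rewards)))
        # merge cached and newly computed values back into sample order
        return [self.lookup[s] for s in samples]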
Creating a custom reward involves writing a callable function or object that takes in a list of samples and returns a list of reward values. For example:
class MyRewardFunction():
    def __call__(self, samples):
        rewards = self.do_reward_calculation(samples)
        return rewards

    def do_reward_calculation(self, samples):
        # placeholder scoring logic - replace with your own reward computation
        return [float(len(sample)) for sample in samples]

reward_function = MyRewardFunction()
reward = Reward(reward_function, weight=0.5, log=True)
Reward Callback
RewardCallback handles fit loop integration and metric logging for a given Reward. For greater flexibility, GenericRewardCallback will pass the entire BatchState to the reward function.
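A hedged sketch of the kind of reward function you might pair with GenericRewardCallback; the batch_state.samples attribute used here is an assumption for illustration:

def batch_aware_reward(batch_state):
    # example: score each sample by how unique it is within the current batch
    samples = batch_state.samples  # assumed attribute name
    counts = {}
    for s in samples:
        counts[s] = counts.get(s, 0) + 1
    return [1.0 / counts[s] for s in samples]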
Reward Modification
As discussed above, reward modifications apply changes to rewards based on some sort of transient batch context. These are rewards that will influence a given batch, but not the logged rewards.
Reward modifications should update the value of BatchState.rewards_final.
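A minimal sketch of such a reward modification, following the repeat-penalty example above (this is not an actual MRL class; the batch_state.samples attribute and list-style indexing into rewards_final are assumptions):

from collections import Counter

class RepeatPenalty():
    def __init__(self, penalty=1.0, history_size=5):
        self.penalty = penalty
        self.history_size = history_size
        self.history = []  # sample lists from recent batches

    def __call__(self, batch_state):
        seen = Counter(s for batch in self.history for s in batch)
        for i, sample in enumerate(batch_state.samples):  # assumed attribute
            if seen[sample] > 0:
                # adjust the transient reward; logged rewards are unaffected
                batch_state.rewards_final[i] -= self.penalty
        self.history.append(list(batch_state.samples))
        self.history = self.history[-self.history_size:]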
Contrastive Reward
Similar to ContrastiveTemplate, ContrastiveReward provides a wrapper around a RewardCallback to adapt it for the task of contrastive generation.
For contrastive generation, we want the model to ingest a source sample and produce a target sample that receives a higher reward than the source sample. ContrastiveReward takes some base_reward, computes the values of that base reward for both source and target samples, and returns the difference between those rewards.
Optionally, the contrastive reward will scale the relative reward based on a given max_score (ie reward = (target_reward - source_reward)/(max_score - source_reward)). This scales the contrastive reward relative to the maximum possible reward.
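A minimal sketch of this computation (the function name is hypothetical and not the ContrastiveReward API):

def contrastive_reward_value(source_reward, target_reward, max_score=None):
    # unscaled: simple difference between target and source rewards
    if max_score is None:
        return target_reward - source_reward
    # scaled relative to the maximum possible improvement over the source
    return (target_reward - source_reward) / (max_score - source_reward)

For example, a source reward of 0.4 and a target reward of 0.7 with max_score=1.0 gives (0.7 - 0.4)/(1.0 - 0.4) = 0.5.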