Rewards - non-differentiable scores for samples

Overview

Rewards are non-differentiable score functions for evaluating samples. Rewards should generally follow the format reward = reward_function(sample)

Rewards in MRL occupy five events in the fit loop (a minimal callback sketch follows the list):

  • before_compute_reward - set up necessary values prior to reward calculation (if needed)
  • compute_reward - compute reward
  • after_compute_reward - compute metrics (if needed)
  • reward_modification - adjust rewards
  • after_reward_modification - compute metrics (if needed)
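
As a rough sketch, a custom callback would implement methods named after whichever of these events it needs. The method-per-event pattern and the Callback base class shown below are assumptions drawn from the callbacks documented later on this page; in practice, RewardCallback and the reward modification callbacks below handle these hooks for you.

# sketch only - not a verbatim MRL API; Callback is assumed to be the library's base callback class
class MyRewardHooks(Callback):
    def before_compute_reward(self):
        # set up values the reward calculation needs (if needed)
        pass

    def compute_reward(self):
        # compute a reward for each sample in the batch
        pass

    def after_compute_reward(self):
        # compute metrics from the raw rewards (if needed)
        pass

    def reward_modification(self):
        # adjust the rewards used for this batch (changes are not logged)
        pass

    def after_reward_modification(self):
        # compute metrics from the adjusted rewards (if needed)
        pass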

Rewards vs Reward Modifications

MRL breaks rewards up into two phases - rewards and reward modifications. The difference between the two is that reward values are saved in the batch log, while reward modification values are not.

In this framework, rewards are absolute scores for samples that are used to evaluate the sample relative to all other samples in the log. Reward modifications are transient scores that depend on the current training context.

A reward modification might be something like adding a score bonus to compounds the first time they are created during training to encourage diversity, or penalizing compounds that appear more than 3 times in the last 5 batches. These types of reward modifications allow us to influence the behavior of the generative model without having these scores affect the true rewards we save in the log.

Reward Class

As mentioned above, rewards generally follow the format reward = reward_function(sample). The Reward class acts as a wrapper around the reward_function to provide some convenience functions. Reward maintains a lookup table of sample : reward values to avoid repeat computation. Reward handles batching novel samples (ie not in the lookup table), sending them to reward_function, and merging the outputs with the lookup table values.

Creating a custom reward involves creating a callable function or object that can take in a list of samples and return a list of reward values. For example:

class MyRewardFunction():
    def __call__(self, samples):
        # do_reward_calculation is a placeholder for your scoring logic;
        # it should return one numeric reward per input sample
        rewards = self.do_reward_calculation(samples)
        return rewards

reward_function = MyRewardFunction()
reward = Reward(reward_function, weight=0.5, log=True)
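
In training, the wrapped Reward is driven by a RewardCallback (described below) during the compute_reward event rather than called by hand, but the caching behavior works roughly as follows. The direct calls and the SMILES samples here are purely illustrative:

scores_1 = reward(['CCO', 'CCN'])    # both samples are novel, so both go to reward_function
scores_2 = reward(['CCO', 'CCC'])    # 'CCO' is served from the lookup table; only 'CCC' is scored
# all returned values are scaled by weight=0.5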

class Reward[source]

Reward(reward_function, weight=1, bs=None, device=True, log=True)

Reward - wrapper for reward_function. Handles batching and value lookup

Inputs:

  • reward_function Callable: function with the format rewards = reward_function(samples)

  • weight float: weight to scale rewards

  • bs Optional[int]: if given, samples will be batched into chunks of size bs and sent to reward_function as batches (see the example after this list)

  • device Optional[bool]: if True, reward function output is mapped to device. see to_device

  • log bool: if True, keeps a lookup table of sample : reward values to avoid repeat computation
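
For example, using the documented signature (my_scoring_fn is a hypothetical callable following rewards = my_scoring_fn(samples)):

# novel samples are scored in chunks of 64; previously seen samples come from the lookup table
batched_reward = Reward(my_scoring_fn, weight=1.0, bs=64, log=True)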

Reward Callback

RewardCallback handles fit loop integration and metric logging for a given Reward.

class RewardCallback[source]

RewardCallback(reward, name, sample_name='samples', order=10, track=True) :: Callback

RewardCallback - callback wrapper for Reward used during compute_reward event

Inputs:

  • reward Reward: reward to use

  • name str: reward name

  • sample_name str: sample name to grab from BatchState to send to reward

  • order int: callback order

  • track bool: if metrics should be tracked from this callback
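
Based on the signature above, a typical setup wraps a Reward and gives it a name; how the name appears in tracked metrics is an assumption here.

# wrap the Reward defined earlier for use during the compute_reward event
reward_cb = RewardCallback(reward, name='my_reward')
# each batch, the callback grabs BatchState.samples (sample_name='samples'),
# scores them with the Reward, and tracks the resulting values under 'my_reward'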

For greater flexibility, GenericRewardCallback will pass the entire BatchState to reward

class GenericRewardCallback[source]

GenericRewardCallback(reward, name, order=10, track=True) :: RewardCallback

GenericRewardCallback - generic reward wrapper

Inputs:

  • reward Callable: reward function. Reward will be passed the entire batch state

  • name str: reward name

  • order int: callback order

  • track bool: if metrics should be tracked from this callback
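
A sketch of a reward that uses the full batch state rather than just the samples; the batch_state.samples attribute accessed here is an assumption about what BatchState exposes, for illustration only.

def state_aware_reward(batch_state):
    # receives the entire BatchState rather than a list of samples
    samples = batch_state.samples          # assumed attribute, for illustration
    return [float(len(s)) for s in samples]

generic_cb = GenericRewardCallback(state_aware_reward, name='state_reward')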

Reward Modification

As discussed above, reward modifications apply changes to rewards based on some sort of transient batch context. These modifications influence the rewards used for a given batch, but not the rewards saved in the log.

Reward modifications should update the value BatchState.rewards_final

class RewardModification[source]

RewardModification() :: Event

RewardModification

This event is used to modify rewards before they are used to compute the model's loss. Reward modifications encompass changes to rewards in the context of the current training cycle. These are things like "give a score bonus to new samples that haven't been seen before" or "penalize the score of samples that have occurred in the last 5 batches".

These types of modifications are kept separate from the core reward for logging purposes. Samples are logged with their respective rewards. These logged scores are referenced later when samples are drawn from the log. This means we need the logged score to be independent of "batch context" type scores.

All reward modifications should be applied to self.environment.batch_state.rewards
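
As a sketch, a modification penalizing heavily repeated samples might look like the following. It assumes a callback-style hook named after the reward_modification event, an indexable batch_state.rewards, and simple total-occurrence counting rather than the sliding 5-batch window described above; treat the details as illustrative rather than as the MRL API.

class RepeatPenalty(Callback):                 # Callback: MRL base callback class (assumed)
    def __init__(self, penalty=0.2):
        super().__init__()
        self.penalty = penalty
        self.counts = {}                       # bookkeeping of how often each sample has appeared

    def reward_modification(self):
        batch_state = self.environment.batch_state
        for i, sample in enumerate(batch_state.samples):
            self.counts[sample] = self.counts.get(sample, 0) + 1
            if self.counts[sample] > 3:
                # transient change: affects the rewards used for this batch,
                # but not the rewards saved in the log
                batch_state.rewards[i] -= self.penalty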

class NoveltyReward[source]

NoveltyReward(weight=1.0, track=True) :: Callback

NoveltyReward - gives a reward bonus for new samples. Rewards are given a bonus of weight

Inputs:

  • weight float: novelty score weight

  • track bool: if metrics should be tracked from this callback
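
For example, giving first-time samples a half-point bonus; the arithmetic below just restates the behavior described above.

novelty_cb = NoveltyReward(weight=0.5)
# a new sample scoring 0.8 from the main reward is treated as 0.8 + 0.5 = 1.3 for this batch;
# the logged reward stays 0.8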

Contrastive Reward

Similar to ContrastiveTemplate, ContrastiveReward provides a wrapper around a RewardCallback to adapt it for the task of contrastive generation.

For contrastive generation, we want the model to ingest a source sample and produce a target sample that receives a higher reward than the source sample. ContrastiveReward takes some base_reward and computes the values of that base reward for both source and target samples, and returns the difference between those rewards.

Optionally, the contrastive reward will scale the relative reward based on a given max_score (ie reward = (target_reward - source_reward)/(max_score - source_reward)). This scales the contrastive reward relative to the maximum possible reward.
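
For example, using the documented signature (reward_cb is a RewardCallback like the one shown earlier; the numbers are illustrative):

contrastive_cb = ContrastiveReward(reward_cb, max_score=1.0)
# if a source sample scores 0.4 and the generated target scores 0.7:
# reward = (0.7 - 0.4) / (1.0 - 0.4) = 0.5
# i.e. the target recovers half of the improvement still available from the source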

class ContrastiveReward[source]

ContrastiveReward(base_reward, max_score=None) :: RewardCallback

ContrastiveReward - contrastive wrapper for reward callbacks

Inputs:

  • base_reward RewardCallback: base reward callback

  • max_score Optional[float]: maximum possible score. If given, contrastive rewards are scaled following reward = (target_reward - source_reward)/(max_score - source_reward)