Policy gradient modules

Policy Gradients

The code in this module implements several policy gradient algorithms.

Current Limitations

The implementations below are designed for the scenario where the output of the model is a series of actions over time. Importantly, rewards are discounted going backwards, meaning the discounted reward at every timestep contains some of the future rewards. If your model does not predict a series of actions (i.e. it predicts a single graph in one step), you may need to revisit these assumptions.

class BasePolicy[source]

BasePolicy(gamma=1.0)

BasePolicy - base policy class

Inputs:

  • gamma float: discount factor

BasePolicy.discount_rewards[source]

BasePolicy.discount_rewards(rewards, mask, traj_rewards=None)

discount_rewards - discounts rewards

Inputs:

  • rewards torch.Tensor[bs]: reward tensor (one reward per batch item)

  • mask torch.BoolTensor[bs, sl]: mask (i.e. for padding). True indicates values that will be kept, False indicates values that will be masked

  • traj_rewards Optional[torch.Tensor[bs, sl]]: optional trajectory rewards, with one reward value per timestep
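
A minimal sketch of how this kind of backwards discounting can be implemented (assumed behavior; the exact placement of the terminal reward and the masking details in discount_rewards may differ):

```python
import torch

def discount_rewards_sketch(rewards, mask, gamma=0.97, traj_rewards=None):
    # rewards: [bs] terminal reward per sample
    # mask: [bs, sl] bool, True for real (non-padding) timesteps
    # traj_rewards: optional [bs, sl] per-timestep rewards
    bs, sl = mask.shape
    per_step = torch.zeros(bs, sl)
    # place each terminal reward at the last unmasked timestep of its trajectory
    last_idx = mask.long().sum(-1) - 1
    per_step[torch.arange(bs), last_idx] = rewards
    if traj_rewards is not None:
        per_step = per_step + traj_rewards * mask.float()
    # accumulate discounted returns backwards through time
    discounted = torch.zeros(bs, sl)
    running = torch.zeros(bs)
    for t in reversed(range(sl)):
        running = per_step[:, t] + gamma * running * mask[:, t].float()
        discounted[:, t] = running
    return discounted * mask.float()
```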

Policy Gradients

PolicyGradient implements standard Policy Gradients following:

$$ \nabla_\theta J(\theta) = \mathbb{E}_\pi [R(s,a) \nabla_\theta \ln \pi_\theta(a \vert s)] $$

When we generate a sample through autoregressive Monte Carlo sampling, we create a sequence of actions which we represent as a tensor of size (bs, sl).

For each step in this series, we have a probability distribution over all possible actions. This gives us a tensor of log probabilities of size (bs, sl, n_actions). We can then gather the log probabilities for the actions we actually took, giving us a tensor of gathered log probabilities of size (bs, sl).
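
For reference, a minimal sketch of this gathering step (tensor names are illustrative, not the library's API):

```python
import torch
import torch.nn.functional as F

bs, sl, n_actions = 4, 10, 32
logits = torch.randn(bs, sl, n_actions)        # raw model outputs at each step
actions = torch.randint(n_actions, (bs, sl))   # action actually sampled at each step

log_probs = F.log_softmax(logits, dim=-1)                                # [bs, sl, n_actions]
gathered_lps = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # [bs, sl]
```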

We also have a set of rewards associated with each sample. In the context of generating compounds, we most often have a single reward for each sampling trajectory that represents the final score of the whole molecule. This would be a tensor of size (bs). If applicable, we can also have a tensor of trajectory rewards which has a reward for each sampling timestep. This trajectory reward tensor would be of size (bs, sl).

These rewards are discounted over all timesteps using discount_rewards, then scaled using whiten. This gives us our final tensor of rewards of size (bs, sl).

Now we can compute the empirical expectation $\mathbb{E}_\pi [R(s,a) \ln \pi_\theta(a \vert s)]$ by multiplying the gathered log probabilities by the discounted rewards and taking the mean over the batch. Differentiating this quantity with respect to $\theta$ gives the policy gradient above.

Then of course we want to maximize this expectation, so we use gradient descent to minimize $-\mathbb{E}_\pi [R(s,a) \ln \pi_\theta(a \vert s)]$.

This basically tells the model to increase the probability of sample paths that had above-average rewards within the batch, and decrease the probability of sample paths with below-average rewards.
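
Putting these pieces together, a hedged sketch of the resulting loss, assuming whiten standardizes the discounted rewards to zero mean and unit variance within the batch:

```python
import torch

def pg_loss_sketch(gathered_lps, discounted_rewards, mask):
    # gathered_lps: [bs, sl] log pi(a|s) for the actions taken
    # discounted_rewards: [bs, sl] output of the discounting step (see earlier sketch)
    # mask: [bs, sl] bool, True for real (non-padding) timesteps
    flat = discounted_rewards[mask]
    # assumed behavior of `whiten`: standardize rewards to zero mean, unit variance
    whitened = torch.zeros_like(discounted_rewards)
    whitened[mask] = (flat - flat.mean()) / (flat.std() + 1e-8)
    # negative empirical expectation of R * log pi(a|s) over unmasked timesteps
    return -(whitened * gathered_lps)[mask].mean()
```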

class PolicyGradient[source]

PolicyGradient(discount=True, gamma=0.97, ratio=False, scale_rewards=True, ratio_clip=None) :: BasePolicy

PolicyGradient - Basic policy gradient implementation

papers.nips.cc/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html

Inputs:

  • discount bool: if True, rewards are discounted over all timesteps

  • gamma float: discount factor (ignored if discount=False)

  • ratio bool: if True, model log probabilities are replaced with the ratio between main model log probabilities and baseline model log probabilities, a technique used in more sophisticated policy gradient algorithms. This can improve stability (see the sketch after this list)

  • scale_rewards bool: if True, rewards are mean-scaled before discounting. This can lead to quicker convergence

  • ratio_clip Optional[float]: if not None, value passed will be used to clamp log probability ratios to [-ratio_clip, ratio_clip] to prevent extreme values
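
A minimal sketch of how the ratio and ratio_clip options could interact (assumed behavior, not the exact implementation):

```python
import torch

def ratio_sketch(lps, base_lps, ratio_clip=None):
    # lps, base_lps: [bs, sl] gathered log probabilities from the main and baseline models
    log_ratio = lps - base_lps
    if ratio_clip is not None:
        # clamp extreme log ratios to keep updates stable
        log_ratio = log_ratio.clamp(-ratio_clip, ratio_clip)
    # pi_theta(a|s) / pi_theta_old(a|s)
    return log_ratio.exp()
```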

PolicyGradient.__call__[source]

PolicyGradient.__call__(lps, mask, rewards, base_lps=None, traj_rewards=None)

Inputs:

  • lps torch.FloatTensor[bs, sl]: gathered log probabilities

  • mask torch.BoolTensor[bs, sl]: padding mask. True indicates values that will be kept, False indicates values that will be masked

  • rewards torch.FloatTensor[bs]: reward tensor (one reward per batch item)

  • base_lps Optional[torch.FloatTensor[bs, sl]]: optional base model gathered log probabilities

  • traj_rewards Optional[torch.FloatTensor[bs, sl]]: optional tensor of trajectory rewards with one reward value per timestep
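
A hedged usage sketch with toy tensors, assuming PolicyGradient is importable from this module and that __call__ returns a scalar loss suitable for backpropagation:

```python
import torch

# assumes PolicyGradient has been imported from this module
bs, sl = 8, 20
policy = PolicyGradient(discount=True, gamma=0.97, scale_rewards=True)

lps = torch.randn(bs, sl, requires_grad=True)   # stand-in for gathered log probabilities
mask = torch.ones(bs, sl, dtype=torch.bool)     # no padding in this toy example
rewards = torch.rand(bs)                        # one terminal reward per sampled compound

loss = policy(lps, mask, rewards)               # assumed to return a scalar loss
loss.backward()                                 # gradients flow back into lps
```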

Trust Region Policy Optimization

Trust Region Policy Optimization (TRPO) adapts the policy gradient algorithm by constraining the maximum update size based on how far the current agent has deviated from the baseline agent.

$$ J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_{\theta_\text{old}}}, a \sim \pi_{\theta_\text{old}}} \big[ \frac{\pi_\theta(a \vert s)}{\pi_{\theta_\text{old}}(a \vert s)} \hat{A}_{\theta_\text{old}}(s, a) \big] $$

Subject to a KL constraint between the current policy and the baseline policy:

$$ \mathbb{E}_{s \sim \rho^{\pi_{\theta_\text{old}}}} [D_\text{KL}(\pi_{\theta_\text{old}}(\cdot \vert s) \| \pi_\theta(\cdot \vert s))] \leq \delta $$
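
In practice, this KL term can be estimated from the full log probability tensors over actions. A minimal sketch (names are illustrative):

```python
import torch

def kl_sketch(base_lps, lps, mask):
    # base_lps, lps: [bs, sl, n_actions] full log probabilities
    # mask: [bs, sl] bool padding mask
    # KL(pi_old || pi_new) at each timestep: sum_a pi_old(a|s) * (log pi_old - log pi_new)
    kl = (base_lps.exp() * (base_lps - lps)).sum(-1)   # [bs, sl]
    # average over unmasked timesteps
    return kl[mask].mean()
```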

class TRPO[source]

TRPO(gamma, kl_target, beta=1.0, eta=50, lam=0.95, v_coef=0.5, scale_rewards=True, ratio_clip=None) :: BasePolicy

TRPO - Trust Region Policy Optimization

arxiv.org/pdf/1502.05477.pdf

Inputs:

  • gamma float: discount factor

  • kl_target float: target maximum KL divergence from baseline policy

  • beta float: coefficient for the KL loss

  • eta float: coefficient for penalizing KL higher than 2*kl_target

  • lam float: lambda coefficient for advantage calculation

  • v_coef float: value function loss coefficient

  • scale_rewards bool: if True, rewards are mean-scaled before discounting. This can lead to quicker convergence

  • ratio_clip Optional[float]: if not None, value passed will be used to clamp log probability ratios to [-ratio_clip, ratio_clip] to prevent extreme values
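
The beta and eta inputs suggest a penalty-style formulation of the KL constraint. A hedged sketch of one common way such a penalty is written (not necessarily the exact loss used here):

```python
import torch

def kl_penalty_sketch(kl, kl_target, beta=1.0, eta=50.0):
    # kl: mean KL divergence between the baseline and current policies (scalar tensor)
    # linear penalty weighted by beta, plus a squared hinge term that activates
    # once the KL exceeds twice the target
    hinge = torch.clamp(kl - 2.0 * kl_target, min=0.0)
    return beta * kl + eta * hinge ** 2
```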

TRPO.__call__[source]

TRPO.__call__(lps_g, base_lps_g, lps, base_lps, mask, rewards, values, traj_rewards=None)

Inputs:

  • lps_g torch.FloatTensor[bs, sl]: model gathered log probabilities

  • base_lps_g torch.FloatTensor[bs, sl]: baseline model gathered log probabilities

  • lps torch.FloatTensor[bs, sl, n_actions]: model full log probabilities

  • base_lps torch.FloatTensor[bs, sl, n_actions]: baseline model full log probabilities

  • mask torch.BoolTensor[bs, sl]: padding mask. True indicates values that will be kept, False indicates values that will be masked

  • rewards torch.FloatTensor[bs]: reward tensor (one reward per batch item)

  • values torch.FloatTensor[bs, sl]: state value predictions

  • traj_rewards Optional[torch.FloatTensor[bs, sl]]: optional tensor of trajectory rewards with one reward value per timestep
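
The lam input together with the values predictions points to a Generalized Advantage Estimation (GAE) style advantage. A hedged sketch of standard GAE with padding masked out (the library's exact advantage computation may differ):

```python
import torch

def gae_sketch(rewards, values, mask, gamma=0.97, lam=0.95):
    # rewards, values: [bs, sl]; mask: [bs, sl] bool, True for real timesteps
    bs, sl = rewards.shape
    advantages = torch.zeros(bs, sl)
    last_adv = torch.zeros(bs)
    next_value = torch.zeros(bs)   # value after the final step taken as 0
    for t in reversed(range(sl)):
        m = mask[:, t].float()
        # one-step TD error
        delta = rewards[:, t] + gamma * next_value - values[:, t]
        # exponentially weighted sum of TD errors; padded steps contribute nothing
        last_adv = m * (delta + gamma * lam * last_adv)
        advantages[:, t] = last_adv
        next_value = m * values[:, t]
    return advantages
```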

Proximal Policy Optimization

Proximal Policy Optimization (PPO) applies clipping to the surrogate objective along with the KL constraints.

$$ r(\theta) = \frac{\pi_\theta(a \vert s)}{\pi_{\theta_\text{old}}(a \vert s)} $$

$$ J(\theta) = \mathbb{E} [ r(\theta) \hat{A}_{\theta_\text{old}}(s, a) ] $$

$$ J^\text{CLIP} (\theta) = \mathbb{E} [ \min( r(\theta) \hat{A}_{\theta_\text{old}}(s, a), \text{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_{\theta_\text{old}}(s, a))] $$
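
A minimal sketch of the clipped surrogate objective above, written in terms of gathered log probabilities (names are illustrative):

```python
import torch

def clipped_surrogate_sketch(lps_g, base_lps_g, advantages, mask, cliprange=0.2):
    # probability ratio r(theta) = pi_theta(a|s) / pi_theta_old(a|s)
    ratio = (lps_g - base_lps_g).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - cliprange, 1 + cliprange) * advantages
    # the objective keeps the more pessimistic of the two; negate for gradient descent
    return -torch.min(unclipped, clipped)[mask].mean()
```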

class PPO[source]

PPO(gamma, kl_coef, lam=0.95, v_coef=0.5, cliprange=0.2, v_cliprange=0.2, ent_coef=0.01, kl_target=None, kl_horizon=None, scale_rewards=True) :: BasePolicy

PPO - Proximal policy optimization

arxiv.org/pdf/1707.06347.pdf

Inputs:

  • gamma float: discount factor

  • kl_coef float: KL reward coefficient

  • lam float: lambda coefficient for advantage calculation

  • v_coef float: value function loss coefficient

  • cliprange float: clip value for surrogate loss

  • v_cliprange float: clip value for value function predictions

  • ent_coef float: entropy regularization coefficient

  • kl_target Optional[float]: target value for adaptive KL penalty

  • kl_horizon Optional[float]: horizon for adaptive KL penalty

  • scale_rewards bool: if True, rewards are mean-scaled before discounting. This can lead to quicker convergence
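
The kl_target and kl_horizon inputs suggest an adaptive KL penalty in the style described in the PPO paper, where the KL coefficient is adjusted based on the measured KL. A hedged sketch of one common adaptive update rule (not necessarily this class's exact rule):

```python
def adaptive_kl_update_sketch(kl_coef, observed_kl, kl_target, kl_horizon, n_steps):
    # proportional controller: grow the coefficient when the observed KL overshoots
    # the target, shrink it when it undershoots, at a rate set by the horizon
    error = observed_kl / kl_target - 1.0
    error = max(min(error, 0.2), -0.2)   # clip the proportional term
    return kl_coef * (1.0 + error * n_steps / kl_horizon)
```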

PPO.__call__[source]

PPO.__call__(lps_g, base_lps_g, lps, mask, rewards, values, ref_values, traj_rewards=None)

Inputs:

  • lps_g torch.FloatTensor[bs, sl]: model gathered log probabilities

  • base_lps_g torch.FloatTensor[bs, sl]: baseline model gathered log probabilities

  • lps torch.FloatTensor[bs, sl, n_actions]: model full log probabilities

  • mask torch.BoolTensor[bs, sl]: padding mask. True indicates values that will be kept, False indicates values that will be masked

  • rewards torch.FloatTensor[bs]: reward tensor (one reward per batch item)

  • values torch.FloatTensor[bs, sl]: state value predictions

  • ref_values torch.FloatTensor[bs, sl]: baseline state value predictions

  • traj_rewards Optional[torch.FloatTensor[bs, sl]]: optional tensor of trajectory rewards with one reward value per timestep
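
A hedged sketch of how inputs like these are typically combined into a single PPO loss, with a clipped value term and an entropy bonus; advantages and returns are assumed precomputed, and the exact weighting and reductions used by PPO.__call__ may differ:

```python
import torch

def ppo_loss_sketch(lps_g, base_lps_g, lps, mask, advantages, returns,
                    values, ref_values, cliprange=0.2, v_cliprange=0.2,
                    v_coef=0.5, ent_coef=0.01):
    # advantages and returns are assumed precomputed (e.g. from discounted
    # rewards and a GAE-style advantage estimate)
    ratio = (lps_g - base_lps_g).exp()
    pg_loss = -torch.min(
        ratio * advantages,
        ratio.clamp(1 - cliprange, 1 + cliprange) * advantages
    )[mask].mean()
    # value loss, clipping new predictions around the reference predictions
    v_clipped = ref_values + (values - ref_values).clamp(-v_cliprange, v_cliprange)
    v_loss = torch.max((values - returns) ** 2,
                       (v_clipped - returns) ** 2)[mask].mean()
    # entropy of the full action distribution, encouraging exploration
    entropy = -(lps.exp() * lps).sum(-1)[mask].mean()
    return pg_loss + v_coef * v_loss - ent_coef * entropy
```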