Policy Gradients
The code in this module implements several policy gradient algorithms:

- `PolicyGradient` - implements Policy Gradients
- `TRPO` - implements Trust Region Policy Optimization
- `PPO` - implements Proximal Policy Optimization
Current Limitations
The implementations below are designed for the scenario where the output of the model is a series of actions over time. Importantly, rewards are discounted going backwards, meaning the discounted reward at every timestep contains some of the future rewards. If your model does not predict a series of actions (i.e. predicts a single graph), you may need to revisit these assumptions.
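To make that assumption explicit, the discounted reward assigned to timestep $t$ is the standard reward-to-go (our notation, not taken from the code), where $\gamma$ is the discount factor and $r_k$ is the raw reward at timestep $k$:

$$ R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k $$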
Policy Gradients
`PolicyGradient` implements standard Policy Gradients following:
$$ \nabla_\theta J(\theta) = \mathbb{E}_\pi [R(s,a) \nabla_\theta \ln \pi_\theta(a \vert s)] $$
When we generate a sample through autoregressive Monte Carlo sampling, we create a sequence of actions which we represent as a tensor of size `(bs, sl)`.

For each step in this series, we have a probability distribution over all possible actions. This gives us a tensor of log probabilities of size `(bs, sl, n_actions)`. We can then gather the log probabilities for the actions we actually took, giving us a tensor of gathered log probabilities of size `(bs, sl)`.
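A minimal PyTorch sketch of that gather step, with illustrative tensor names and shapes (not the module's actual variables):

```python
import torch

bs, sl, n_actions = 4, 16, 32                          # batch size, sequence length, action vocabulary

log_probs = torch.randn(bs, sl, n_actions).log_softmax(-1)   # (bs, sl, n_actions) log probabilities
actions = torch.randint(n_actions, (bs, sl))                  # (bs, sl) sampled action indices

# select the log probability of the action actually taken at each timestep
gathered_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (bs, sl)
```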
We also have a set of rewards associated with each sample. In the context of generating compounds, we most often have a single reward for each sampling trajectory that represents the final score of the whole molecule. This would be a tensor of size `(bs)`. If applicable, we can also have a tensor of trajectory rewards which has a reward for each sampling timestep. This trajectory reward tensor would be of size `(bs, sl)`.
These rewards are discounted over all timesteps using `discount_rewards`, then scaled using `whiten`. This gives us our final tensor of rewards of size `(bs, sl)`.
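A hypothetical sketch of both steps, assuming `discount_rewards` computes standard rewards-to-go and `whiten` standardizes to zero mean and unit variance; the library's actual signatures and defaults may differ:

```python
import torch

def discount_rewards(rewards, gamma=0.97):
    "Backward-discount per-timestep rewards of shape (bs, sl) into rewards-to-go."
    discounted = torch.zeros_like(rewards)
    running = torch.zeros(rewards.shape[0])
    for t in reversed(range(rewards.shape[1])):
        running = rewards[:, t] + gamma * running
        discounted[:, t] = running
    return discounted

def whiten(rewards, eps=1e-8):
    "Scale rewards to zero mean and unit variance."
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# one common convention (an assumption here): place a single terminal score per
# molecule at the final timestep so earlier actions receive a discounted share of it
bs, sl = 4, 16
final_scores = torch.rand(bs)                     # (bs,) one score per sampled compound
rewards = torch.zeros(bs, sl)
rewards[:, -1] = final_scores                     # (bs, sl) trajectory rewards
rewards = whiten(discount_rewards(rewards))       # (bs, sl) discounted, scaled rewards
```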
Now we can compute the empirical expectation $\mathbb{E}_\pi [R(s,a) \nabla_\theta \ln \pi_\theta(a \vert s)]$ by multiplying the gathered log probabilities by the discounted rewards and taking the mean over the batch.
Then of course we want to maximize this expectation, so we use gradient descent to minimize the surrogate loss $-\mathbb{E}_\pi [R(s,a) \ln \pi_\theta(a \vert s)]$, whose gradient is the negative of the policy gradient above.
This basically tells the model to increase the probability of sample paths that had above-average rewards within the batch, and decrease the probability of sample paths with below-average rewards.
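Putting the pieces together, a minimal sketch of the resulting loss (variables continue from the sketches above and are illustrative, not the module's code):

```python
# gathered_log_probs: (bs, sl) log probabilities of the actions taken
# rewards:            (bs, sl) discounted, whitened rewards
pg_loss = -(gathered_log_probs * rewards).mean()
# in training, pg_loss.backward() plus an optimizer step pushes up the probability
# of above-average trajectories and pushes down below-average ones
```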
Trust Region Policy Optimization
Trust Region Policy Optimization (TRPO) adapts the policy gradient algorithm by constraining the maximum update size based on how far the current agent has deviated from the baseline agent.
$$ J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_{\theta_\text{old}}}, a \sim \pi_{\theta_\text{old}}} \big[ \frac{\pi_\theta(a \vert s)}{\pi_{\theta_\text{old}}(a \vert s)} \hat{A}_{\theta_\text{old}}(s, a) \big] $$
Subject to a KL constraint between the current policy and the baseline policy:
$$ \mathbb{E}_{s \sim \rho^{\pi_{\theta_\text{old}}}} [D_\text{KL}(\pi_{\theta_\text{old}}(.\vert s) \| \pi_\theta(.\vert s))] \leq \delta $$
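TRPO's full update (conjugate gradient plus line search) is beyond a short sketch, but the two quantities above can be written down directly; variable names and shapes here are illustrative:

```python
import torch

# per-timestep log probabilities of the taken actions under the current policy
# and under the frozen baseline (old) policy, shapes (bs, sl)
log_probs_new = torch.randn(4, 16, requires_grad=True)
log_probs_old = log_probs_new.detach() + 0.01 * torch.randn(4, 16)
advantages = torch.randn(4, 16)                   # (bs, sl) advantage estimates

ratio = (log_probs_new - log_probs_old).exp()     # pi_theta / pi_theta_old
surrogate = (ratio * advantages).mean()           # objective J(theta) to maximize

# sampled estimate of KL(pi_old || pi_theta); an exact KL would use the
# full action distributions rather than only the sampled actions
approx_kl = (log_probs_old - log_probs_new).mean()
```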
Proximal Policy Optimization
Proximal Policy Optimization (PPO) applies clipping to the surrogate objective along with the KL constraint:
$$ r(\theta) = \frac{\pi_\theta(a \vert s)}{\pi_{\theta_\text{old}}(a \vert s)} $$
$$ J(\theta) = \mathbb{E} [ r(\theta) \hat{A}_{\theta_\text{old}}(s, a) ] $$
$$ J^\text{CLIP} (\theta) = \mathbb{E} [ \min( r(\theta) \hat{A}_{\theta_\text{old}}(s, a), \text{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_{\theta_\text{old}}(s, a))] $$
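A minimal sketch of the clipped loss, continuing the illustrative variables from the TRPO sketch above (the clipping value $\epsilon$ here is just a common default):

```python
eps = 0.2                                             # clipping range epsilon

ratio = (log_probs_new - log_probs_old).exp()         # r(theta) per timestep, shape (bs, sl)
unclipped = ratio * advantages
clipped = ratio.clamp(1 - eps, 1 + eps) * advantages

# elementwise minimum, negated so gradient descent maximizes J^CLIP
ppo_loss = -torch.min(unclipped, clipped).mean()
```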