Policy Gradients
The code in this module implements several policy gradient algorithms:

- `PolicyGradient` - implements Policy Gradients
- `TRPO` - implements Trust Region Policy Optimization
- `PPO` - implements Proximal Policy Optimization
Current Limitations
The implementations below are designed for the scenario where the output of the model is a series of actions over time. Importantly, rewards are discounted going backwards, meaning the discounted reward at every timestep contains some of the future rewards. If your model does not predict a series of actions (i.e. predicts a single graph), you may need to revisit these assumptions.
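To make that assumption explicit, the discounted reward assigned to timestep $t$ is the standard reward-to-go (our notation, not taken from the code), where $\gamma$ is the discount factor and $r_k$ is the raw reward at timestep $k$:

$$ R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k $$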
Policy Gradients
`PolicyGradient` implements standard Policy Gradients following:
$$ \nabla_\theta J(\theta) = \mathbb{E}_\pi [R(s,a) \nabla_\theta \ln \pi_\theta(a \vert s)] $$
When we generate a sample through autoregressive Monte Carlo sampling, we create a sequence of actions which we represent as a tensor of size `(bs, sl)`.

For each step in this series, we have a probability distribution over all possible actions. This gives us a tensor of log probabilities of size `(bs, sl, n_actions)`. We can then gather the log probabilities for the actions we actually took, giving us a tensor of gathered log probabilities of size `(bs, sl)`.
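A minimal PyTorch sketch of that gather step, with illustrative tensor names and shapes (not the module's actual variables):

```python
import torch

bs, sl, n_actions = 4, 16, 32                          # batch size, sequence length, action vocabulary

log_probs = torch.randn(bs, sl, n_actions).log_softmax(-1)   # (bs, sl, n_actions) log probabilities
actions = torch.randint(n_actions, (bs, sl))                  # (bs, sl) sampled action indices

# select the log probability of the action actually taken at each timestep
gathered_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (bs, sl)
```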
We also have a set of rewards associated with each sample. In the context of generating compounds, we most often have a single reward for each sampling trajectory that represents the final score of the whole molecule. This would be a tensor of size `(bs)`. If applicable, we can also have a tensor of trajectory rewards which has a reward for each sampling timestep. This trajectory reward tensor would be of size `(bs, sl)`.
These rewards are discounted over all timesteps using `discount_rewards`, then scaled using `whiten`. This gives us our final tensor of rewards of size `(bs, sl)`.
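A hypothetical sketch of both steps, assuming `discount_rewards` computes standard rewards-to-go and `whiten` standardizes to zero mean and unit variance; the library's actual signatures and defaults may differ:

```python
import torch

def discount_rewards(rewards, gamma=0.97):
    "Backward-discount per-timestep rewards of shape (bs, sl) into rewards-to-go."
    discounted = torch.zeros_like(rewards)
    running = torch.zeros(rewards.shape[0])
    for t in reversed(range(rewards.shape[1])):
        running = rewards[:, t] + gamma * running
        discounted[:, t] = running
    return discounted

def whiten(rewards, eps=1e-8):
    "Scale rewards to zero mean and unit variance."
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# one common convention (an assumption here): place a single terminal score per
# molecule at the final timestep so earlier actions receive a discounted share of it
bs, sl = 4, 16
final_scores = torch.rand(bs)                     # (bs,) one score per sampled compound
rewards = torch.zeros(bs, sl)
rewards[:, -1] = final_scores                     # (bs, sl) trajectory rewards
rewards = whiten(discount_rewards(rewards))       # (bs, sl) discounted, scaled rewards
```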
Now we can compute the empirical expectation $\mathbb{E}_\pi [R(s,a) \nabla_\theta \ln \pi_\theta(a \vert s)]$ by multiplying the gathered log probabilities by the discounted rewards and taking the mean over the batch.
Then of course we want to maximize this expectation, so we use gradient descent to minimize the surrogate loss $-\mathbb{E}_\pi [R(s,a) \ln \pi_\theta(a \vert s)]$, whose gradient is the negative of the policy gradient above.
This basically tells the model to increase the probability of sample paths that had above-average rewards within the batch, and decrease the probability of sample paths with below-average rewards.
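Putting the pieces together, a minimal sketch of the resulting loss (variables continue from the sketches above and are illustrative, not the module's code):

```python
# gathered_log_probs: (bs, sl) log probabilities of the actions taken
# rewards:            (bs, sl) discounted, whitened rewards
pg_loss = -(gathered_log_probs * rewards).mean()
# in training, pg_loss.backward() plus an optimizer step pushes up the probability
# of above-average trajectories and pushes down below-average ones
```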
Trust Region Policy Optimization
Trust Region Policy Optimization (TRPO) adapts the policy gradient algorithm by constraining the maximum update size based on how far the current agent has deviated from the baseline agent.
$$ J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_{\theta_\text{old}}}, a \sim \pi_{\theta_\text{old}}} \big[ \frac{\pi_\theta(a \vert s)}{\pi_{\theta_\text{old}}(a \vert s)} \hat{A}_{\theta_\text{old}}(s, a) \big] $$
Subject to a KL constraint between the current policy and the baseline policy:
$$ \mathbb{E}_{s \sim \rho^{\pi_{\theta_\text{old}}}} [D_\text{KL}(\pi_{\theta_\text{old}}(.\vert s) \| \pi_\theta(.\vert s))] \leq \delta $$
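TRPO's full update (conjugate gradient plus line search) is beyond a short sketch, but the two quantities above can be written down directly; variable names and shapes here are illustrative:

```python
import torch

# per-timestep log probabilities of the taken actions under the current policy
# and under the frozen baseline (old) policy, shapes (bs, sl)
log_probs_new = torch.randn(4, 16, requires_grad=True)
log_probs_old = log_probs_new.detach() + 0.01 * torch.randn(4, 16)
advantages = torch.randn(4, 16)                   # (bs, sl) advantage estimates

ratio = (log_probs_new - log_probs_old).exp()     # pi_theta / pi_theta_old
surrogate = (ratio * advantages).mean()           # objective J(theta) to maximize

# sampled estimate of KL(pi_old || pi_theta); an exact KL would use the
# full action distributions rather than only the sampled actions
approx_kl = (log_probs_old - log_probs_new).mean()
```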
Proximal Policy Optimization
Proximal Policy Optimization (PPO) applies clipping to the surrogate objective along with the KL constraint:
$$ r(\theta) = \frac{\pi_\theta(a \vert s)}{\pi_{\theta_\text{old}}(a \vert s)} $$
$$ J(\theta) = \mathbb{E} [ r(\theta) \hat{A}_{\theta_\text{old}}(s, a) ] $$
$$ J^\text{CLIP} (\theta) = \mathbb{E} [ \min( r(\theta) \hat{A}_{\theta_\text{old}}(s, a), \text{clip}(r(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_{\theta_\text{old}}(s, a))] $$
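A minimal sketch of the clipped loss, continuing the illustrative variables from the TRPO sketch above (the clipping value $\epsilon$ here is just a common default):

```python
eps = 0.2                                             # clipping range epsilon

ratio = (log_probs_new - log_probs_old).exp()         # r(theta) per timestep, shape (bs, sl)
unclipped = ratio * advantages
clipped = ratio.clamp(1 - eps, 1 + eps) * advantages

# elementwise minimum, negated so gradient descent maximizes J^CLIP
ppo_loss = -torch.min(unclipped, clipped).mean()
```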