Thanks for your comments!

1. Indeed, I deviated from the default ratio notation by using rho_t, mainly to avoid confusion with the typical reward notation r_t. That said, I agree the use of r_t in Algorithm 5 might be confusing. I decided to keep the rho_t notation, but added a note to the caption.

2. You're absolutely correct, thanks for spotting. Fixed it now.

3. No special type of estimator, in my experience. Typically, the advantage is just the observed rewards minus a value function estimate (obtained via a critic network). I don't think PPO really deviates from other actor-critic methods in this regard.
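
For readers who want to see point 3 concretely, here is a minimal sketch of that idea: the advantage as the discounted return minus the critic's value estimate. The function name and the numbers are my own hypothetical illustration; many PPO implementations use GAE instead, but the baseline-subtraction idea is the same.

```python
import numpy as np

def compute_advantages(rewards, values, gamma=0.99):
    """Monte Carlo advantage estimate: discounted reward-to-go minus the
    critic's value prediction for each state. (Illustrative helper, not
    a specific library's API.)"""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate discounted returns backwards through the episode.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Subtracting the learned baseline reduces gradient variance.
    return returns - values

rewards = np.array([1.0, 0.0, 1.0])      # observed rewards (made up)
values = np.array([0.5, 0.5, 0.5])       # critic outputs (made up)
adv = compute_advantages(rewards, values)
# returns are [1.9801, 0.99, 1.0], so adv = [1.4801, 0.49, 0.5]
```
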

--

Wouter van Heeswijk, PhD