Thanks for your comments!

1. Indeed, I deviated from the default ratio notation by using rho_t, mainly to avoid confusion with the typical reward notation r_t. That said, I agree the use of r_t in Algorithm 5 might be confusing. I decided to keep the rho_t notation, but added a note to the caption.

2. You're absolutely correct, thanks for spotting. Fixed it now.

3. No special type of estimator, in my experience. Typically, the advantage is just the observed rewards minus a value function estimate (obtained via a critic network). I don't think PPO really deviates from other actor-critic methods in this regard.
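
For readers who want to see point 3 concretely, here is a minimal sketch of that idea: the advantage as the discounted return minus the critic's value estimate. The function name and the numbers are my own hypothetical illustration; many PPO implementations use GAE instead, but the baseline-subtraction idea is the same.

```python
import numpy as np

def compute_advantages(rewards, values, gamma=0.99):
    """Monte Carlo advantage estimate: discounted reward-to-go minus the
    critic's value prediction for each state. (Illustrative helper, not
    a specific library's API.)"""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate discounted returns backwards through the episode.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Subtracting the learned baseline reduces gradient variance.
    return returns - values

rewards = np.array([1.0, 0.0, 1.0])      # observed rewards (made up)
values = np.array([0.5, 0.5, 0.5])       # critic outputs (made up)
adv = compute_advantages(rewards, values)
# returns are [1.9801, 0.99, 1.0], so adv = [1.4801, 0.49, 0.5]
```
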

--

Wouter van Heeswijk, PhD