Sep 14, 2022
Hi Daniel, thanks for your comment. Because the problem setting is a one-shot game, there is indeed no backpropagation of rewards over time (or, technically, only a single step). If you are looking for a code example that includes a time horizon and backpropagation of rewards, the following article might interest you: https://towardsdatascience.com/cliff-walking-problem-with-the-discrete-policy-gradient-algorithm-59d1900d80d8
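To illustrate the difference, here is a minimal sketch of how rewards are commonly propagated back through an episode as discounted returns, G_t = r_t + gamma * G_{t+1}. The helper `discounted_returns` and the discount factor `gamma` are hypothetical names chosen for illustration. This is not the code from the article above. With a single step, the return is just the immediate reward, so there is nothing to propagate.

```python
def discounted_returns(rewards, gamma=0.9):
    """Hypothetical helper: compute G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates the discounted future rewards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# One-shot game: a single step, so the return equals the immediate reward.
print(discounted_returns([1.0]))  # [1.0]

# Multi-step episode: the final reward is propagated back to earlier steps.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

In the one-shot case the loop runs once and `gamma` has no effect. This is why the policy gradient update can use the reward directly.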