Thoughts and Theory

A comprehensive classification of solution strategies for reinforcement learning

Image by Hans Braxmeier via Pixabay

Policies in Reinforcement Learning (RL) are shrouded in a certain mystique. Simply stated, a policy is any function that returns a feasible action for a problem. No less, no more. For instance, you could simply take the first action that comes to mind, select an action at random, or run a heuristic. However, what makes RL special is that we actively anticipate the downstream impact of decisions and learn from our observations; we therefore expect some intelligence in our policies. In his framework on sequential decision-making[1], Warren Powell argues there are four policy classes for RL. Techniques…
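To make the "a policy is just a function" point concrete, here is a minimal Python sketch; the action set and state fields are purely illustrative and not part of Powell's framework:

```python
import random

# Minimal sketch: a policy is any function mapping a state to a feasible action.
ACTIONS = ["left", "right", "up", "down"]

def random_policy(state):
    """Pick any feasible action at random."""
    return random.choice(ACTIONS)

def heuristic_policy(state):
    """A simple rule of thumb: move right until reaching the goal column (illustrative)."""
    return "right" if state["x"] < state["goal_x"] else "up"

print(random_policy({"x": 0, "goal_x": 3}))
print(heuristic_policy({"x": 0, "goal_x": 3}))
```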


Start using continuous integration services to test your new code every time you push to GitHub.

Photo by Science in HD on Unsplash

As every programmer knows, testing and debugging can be a nuisance. We want to spend our time building, creating, solving — not performing mind-numbing tests and staring at error messages. When adding teams and end-users to the mix, small mistakes can have catastrophic consequences. It is not enough for code to work correctly on your local machine; it should work in every environment it is supposed to run in. Automated testing is a must in such settings.

Unit testing

In an earlier article, I wrote about the concept of unit testing. If you don’t know what unit testing is and can’t be bothered…


A multi-armed bandit example to train a Q-network. The update procedure takes just a few lines of code using TensorFlow

Deep, just like Deep Q-learning. Photo by Kris Mikael Krister on Unsplash

Deep Q-learning is a staple in the arsenal of any Reinforcement Learning (RL) practitioner. It neatly circumvents some shortcomings of traditional Q-learning, and leverages the power of neural networks for complex value function approximation.

This article shows how to implement and train a deep Q-network in TensorFlow 2.0, illustrated with the multi-armed bandit problem (a terminating one-shot game). Some extensions towards temporal difference learning are provided as well. I take the ‘minimal’ in minimal working example quite literally though, so the focus is really on a first-ever implementation of deep Q-learning.
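As a rough impression of what such a minimal implementation can look like, here is a sketch of a tiny Q-network trained on a hypothetical two-armed bandit; the layer sizes, learning rate, and reward distributions are illustrative assumptions, not the article’s exact code:

```python
import numpy as np
import tensorflow as tf

# Hypothetical two-armed bandit: the second arm pays off slightly better on average
true_means = np.array([0.4, 0.6])

def pull(arm: int) -> float:
    return float(np.random.normal(loc=true_means[arm], scale=1.0))

# Tiny Q-network: maps a constant dummy state to one Q-value per arm
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(2, activation="linear"),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
state = tf.ones((1, 1))  # single dummy state for the one-shot bandit

for episode in range(500):
    # Epsilon-greedy action selection
    q_values = q_net(state)
    arm = np.random.randint(2) if np.random.rand() < 0.1 else int(tf.argmax(q_values[0]))
    reward = pull(arm)

    # Update: move Q(state, arm) toward the observed reward (no bootstrapping in a one-shot game)
    with tf.GradientTape() as tape:
        q_pred = q_net(state)[0, arm]
        loss = tf.reduce_mean(tf.square(reward - q_pred))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
```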

Some background

Before diving into deep learning, I…


Google’s top headlines for every day of the year

Photo by Jorge Vasconez on Unsplash

Most people will have heard of the Florida Man meme. You type in your birthday on Google, add the term ‘Florida Man’, and a whole list of bizarre headlines pops up. Wouldn’t it be great to get an overview of the best headlines without the manual effort?

In another article, I detailed the procedure to acquire the list using Google’s Custom Search API and a series of Python operations. Such operations include selecting only headlines that start with ‘Florida Man’, removing duplicates, filtering overview articles, etc. Note that the list is not perfect in this regard (with headlines occasionally covering…
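A minimal sketch of such post-processing steps in pandas; the column name and sample headlines are made up for illustration:

```python
import pandas as pd

# Hypothetical raw search results
headlines = pd.DataFrame({"title": [
    "Florida Man arrested after riding alligator to work",
    "Florida Man arrested after riding alligator to work",   # duplicate
    "Weather update for Miami",                              # does not start with the prefix
    "Florida Man tries to pay taxes with live chickens",
]})

filtered = (
    headlines[headlines["title"].str.startswith("Florida Man")]  # keep matching headlines
    .drop_duplicates(subset="title")                              # remove duplicates
    .reset_index(drop=True)
)
print(filtered)
```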


A convenient JetBrains plug-in to program remotely with your team

Photo by Duy Pham on Unsplash

If you are a programmer, chances are you’ve been working from home for a long time. Chances also are that — once the pandemic is finally under control — remote work will remain a considerable part of your work life. With virtual solutions such as Zoom calls, Discord, GitHub, etc., most of us managed just fine.

Sometimes, however, you just need to sit behind the same screen with one or more team members, discuss changes, find bugs, make changes on the spot. You simply cannot replace that real-life interaction, right?

Code With Me

Until now. Enter the JetBrains Code With Me (CwM) plug-in…


Thoughts and Theory

On the (not so) subtle differences between state-action pairs and post-decision states in Reinforcement Learning.

Photo by Ray Hennessy on Unsplash

My previous article on post-decision states didn’t receive a lot of traction, so I decided to write another one (something with foxes and snares, I suppose).

In that previous article, I made a case for the similarities between state-action pairs and post-decision states in Reinforcement Learning. Very briefly, the post-decision state is the concatenation of state and action (e.g., a Tic-Tac-Toe board right after you place your mark, but before your opponent does), a sort of limbo state before the world starts moving again.
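A minimal sketch of the distinction, using a Tic-Tac-Toe board as above; the board representation and helper function are illustrative assumptions:

```python
def apply_action(board: tuple, cell: int, mark: str = "X") -> tuple:
    """Return the post-decision state: our mark is placed, but the opponent has not moved yet."""
    new_board = list(board)
    new_board[cell] = mark
    return tuple(new_board)

state = tuple(" " * 9)   # empty 3x3 board, flattened to 9 cells
action = 4               # place a mark in the centre cell

state_action_pair = (state, action)                 # the state plus the chosen action
post_decision_state = apply_action(state, action)   # the board after our mark, before the opponent's
```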

While focusing on the similarities, however, I also buried some important differences between both concepts. Leaving them unmentioned…


A quick demo to compute the correlation matrix for your portfolio assets, visualize the dependencies, and compute the overall variance.

Photo by Sophie Backes on Unsplash

As we all know, portfolio management ultimately boils down to balancing risk and expected return. If you want to earn more, you have to take more risks. Unfortunately, the market does not reward us for simply taking risk. Placing all your money in a single stock is risky, no doubt, but not the sort of risk that earns you money. The market only rewards systematic risk — the risk that remains after properly diversifying your portfolio.

How do we diversify our portfolio? The financial engineering community wrote many books about that, but the ‘no transaction costs, perfect information’…
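A minimal sketch of the computations mentioned in the demo, assuming daily prices in a pandas DataFrame and an equal-weighted portfolio; the data and weights are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices for three assets (in practice, load your own data)
prices = pd.DataFrame({
    "AAA": [100.0, 101.0, 103.0, 102.0, 105.0],
    "BBB": [50.0, 50.5, 49.8, 51.0, 51.2],
    "CCC": [200.0, 198.0, 199.0, 202.0, 204.0],
})

returns = prices.pct_change().dropna()   # simple daily returns
corr = returns.corr()                    # correlation matrix between the assets
cov = returns.cov()                      # covariance matrix

weights = np.array([1/3, 1/3, 1/3])      # equal-weighted portfolio (assumption)
portfolio_variance = weights @ cov.values @ weights

print(corr)
print(f"Portfolio variance: {portfolio_variance:.6f}")
```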


Take your debugging skills to the next level.

Photo by Nina Mercado on Unsplash

For the individual coder, unit tests might be a foreign concept. In development teams, you can’t live without them. This article will give a brief introduction to the concept of unit testing. Before you know it, you’ll be able to do it by yourself!

Manual and automated testing

At some point every programmer experiences them — bugs. Your script crashes or simply does not give the desired output: it’s time to debug. If you are fortunate, you either get a clear error message or know by heart where to look. …
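To give a first impression, here is a minimal pytest-style unit test; the function under test is a made-up example, and in a real project the tests would live in their own test_*.py file:

```python
# A tiny function under test (hypothetical example)
def add(a: int, b: int) -> int:
    return a + b

# pytest discovers functions named test_* and runs them automatically
def test_add_handles_negative_numbers():
    assert add(-2, 3) == 1

def test_add_is_commutative():
    assert add(2, 5) == add(5, 2)
```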


An in-depth comparison of off-policy and on-policy reinforcement learning

A cliff, to set the scene. Photo by Alan Carrillo on Unsplash

To learn the basics of Reinforcement Learning (RL), Sutton & Barto’s [1] textbook example of cliff walking is an excellent start, providing an illustration of both Q-learning and SARSA. As you might know, SARSA is an acronym for State, Action, Reward, State, Action — the trajectory used to update the value functions. To preserve the analogy, Q-learning may be summarized as State, Action, Reward, State or SARS (note the second action doesn’t matter!). The difference is that SARSA is on-policy and Q-learning is off-policy. What on earth does that mean?
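A minimal sketch of the two tabular update rules side by side, with illustrative state/action counts and learning parameters:

```python
import numpy as np

n_states, n_actions = 48, 4           # e.g., a 4x12 cliff-walking grid (assumption)
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 1.0               # learning rate and discount factor (illustrative)

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap with the action the behaviour policy actually takes next
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(s, a, r, s_next):
    # Off-policy: bootstrap with the greedy action, regardless of what is taken next
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```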

Maneuvering the cliff

Before going into that question, let’s first define our cliff…


A Data Science microproject in Python to compile the top headlines per day

Miami Beach, potential spotting location of the Florida Man. Image by tammon via Pixabay

You probably know the meme. Google ‘Florida Man’ together with a date, and the strangest search results pop up. Every day of the year, the enigmatic Florida Man seems to try his hand at something even more peculiar, violent, tragic or downright bizarre than the day before. Often, the headlines are so gaudy they can’t help but bring a smile to your face.

Googling what the Florida Man did on your birthday is fun, but you must be curious what he does the rest of the year as well (at least I was). Googling 366 days is a bit too much…

Wouter van Heeswijk, PhD

Assistant professor in Financial Engineering and Operations Research. Writing about reinforcement learning, optimization problems, and data science.
