[Link to album](http://imgur.com/a/mlLbb).
After reading the original post I went ahead and bodged a version of the TD-lambda learning rule into the same javascript framework (technically an on-greedy-policy version of Watkins's Q-lambda rule, so not strictly TD-lambda), and compared its results against those of the original model.
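For concreteness, this is roughly the update I mean; it's a minimal sketch rather than the actual framework code, and the table layout, the `key`/`greedyValue` helpers, and the alpha/gamma/lambda values are placeholders for illustration:

```javascript
var alpha = 0.7, gamma = 1.0, lambda = 0.5;  // placeholder values, not the framework's

function key(state, action) { return state + '|' + action; }

function greedyValue(Q, state, actions) {
  return Math.max.apply(null, actions.map(function (a) {
    return Q[key(state, a)] || 0;
  }));
}

// One step of the trace-based update. Because the policy here is always
// greedy (no exploration), the traces never get cut, which is what makes
// this an "on-greedy-policy" version of Watkins's rule rather than the
// full thing.
function qLambdaStep(Q, E, state, action, reward, nextState, actions) {
  // TD error (on a terminal death step the greedyValue term would be dropped).
  var delta = reward + gamma * greedyValue(Q, nextState, actions)
            - (Q[key(state, action)] || 0);

  // Replacing trace on the state-action pair just visited.
  E[key(state, action)] = 1;

  // Push the TD error into every pair with a live trace, then decay the traces.
  Object.keys(E).forEach(function (k) {
    Q[k] = (Q[k] || 0) + alpha * delta * E[k];
    E[k] *= gamma * lambda;
  });
}
```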
Short version is that the Q-lambda rule learns a lot faster with the right parameters (a lambda of about 0.5 seems to do really well, and a bit under that still does okay) and performs abysmally with the wrong parameters (much over 0.5). The scoring is the same as in the original implementation (+1 for every frame of life, -1000 for death), so a score of -945 corresponds to just falling to your death, -730 corresponds to passing the first pipe, and then it's another 200 or so per pipe after that. I've only got 1 trial for each rule in the top plot, but a lambda of 0.5 reliably gets to well over 1000 pipes passed in under 150 games.
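For anyone wanting to read scores as pipe counts, here's the rough back-of-envelope I'm using (the frame counts are inferred from the numbers above, not pulled out of the framework):

```javascript
// score = frames survived - 1000 on death, so:
//   -945 => ~55 frames  => fell straight to the ground
//   -730 => ~270 frames => just cleared the first pipe
//   each further pipe is worth roughly another 200 frames
function approxPipesPassed(score) {
  var framesSurvived = score + 1000;  // undo the death penalty
  return Math.max(0, Math.floor((framesSurvived - 270) / 200) + 1);
}
```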
Below that are three plots showing a few different views of the learned Q function for different learning rules; top to bottom they are: the original Q-learning with alpha=0.7, Q-lambda with lambda=0.5, and Q-lambda with lambda=0.7. The leftmost panel in each is the greedy policy; red means do nothing, green means click/jump, and black is 'nothing learned', which defaults to 'do nothing'. Next is the difference in score between clicking and not clicking. Third is the score for 'click' in each state, and last is the score for 'do nothing' in each state. Important note: because of the way coordinates get stored, the bird actually moves from right to left across these plots (see the outline of the pipe on the left).
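My guess at why the plots read right to left, sketched as code: if the state is keyed on the horizontal gap to the next pipe (that's an assumption about the framework's state encoding, not something I've checked), that gap shrinks as the bird flies forward, so plotting it on the x-axis makes the bird sweep from right to left:

```javascript
// Hypothetical state encoding (assumed, not taken from the framework):
// dx shrinks as the bird approaches the pipe, so a plot with dx on the
// x-axis is traversed right-to-left as the game advances.
function stateKey(bird, pipe) {
  var dx = Math.round(pipe.x - bird.x);     // horizontal gap to the next pipe
  var dy = Math.round(pipe.gapY - bird.y);  // vertical offset from the gap
  return dx + ',' + dy;
}
```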
It's interesting to compare the Q-learning with Q-lambda; things that jumped out at me:
- they both sort of settle on the same general strategy (cruise just below the pipe edge, and then hop over it when you get close)
- the eligibility traces in the Q-lambda rule let it update values in a much broader space; it's actually managed to assign positive scores to 'do nothing' in a noticeably larger range above the lip of the pipe. The standard Q-learning can only back out one step at a time, so it ends up taking longer to fill out the space.
- The Q-learning has learned a lot more negative values for clicking than the Q-lambda (at least the 'good' Q-lambda one), especially right up against the lower pipe. The slow backout means it gets itself into unwinnable positions much more readily, so it builds up a lot of experience about what not to do, but it then takes a while to learn from that experience, since each update only backs out one step from the states it most recently visited (see the sketch after this list for the contrast).
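To make that one-step-at-a-time point concrete, here's the plain Q-learning backup for contrast with the trace version sketched above (it reuses the same `key`/`greedyValue` helpers and `alpha`/`gamma` placeholders, so the same caveats apply):

```javascript
// One-step Q-learning backup: only the single state-action pair just left
// gets updated, so value information has to creep backwards one visit at a
// time, whereas the traces above push each TD error along the whole recent
// trajectory at once.
function qLearningStep(Q, state, action, reward, nextState, actions) {
  var k = key(state, action);
  var delta = reward + gamma * greedyValue(Q, nextState, actions) - (Q[k] || 0);
  Q[k] = (Q[k] || 0) + alpha * delta;
}
```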
The bottom-most plot I find curious; it shows the Q-function for lambda=0.7 in Q-lambda learning. When you run it, the bird crashes a few times, then around the 10th or 11th game suddenly decides that the top of the board is the best place to be, and never seems to learn to come down from there (I've run it for ~1000 games and never had it get past the first pipe). Once in a long while it will decide to drop to its death, but it never seems to try clicking mid-drop. Looking at the plot, it has clearly learned that what it's doing isn't working, since the scores are all red in the two rightmost visualisations. Looking at the difference plot, the scores seem roughly balanced until it gets close to the pipe, but then for some reason it decides that clicking is, on balance, less negative than not clicking.
Perhaps the high lambda is making it "overthink" cause and effect, spreading updates along traces that aren't actually relevant? I'm still not satisfied with that answer.
At any rate, hope someone finds it interesting.