Has anyone ever implemented NFQCA reinforcement learning?

Hi there,

I'm trying to implement NFQCA as described in Reinforcement Learning in Feedback Control by Hafner and Riedmiller. For anyone without Springer access: NFQCA stands for Neurally Fitted Q-Learning with Continuous Actions, and it's an adaptation of the actor-critic architecture that uses a neural network for both the actor and the critic (two networks in total). Some lecture slides that describe it are here; slide 11 gives an overview of the technique.

In a nutshell, you have a feed-forward neural network (FFNN) that serves as the policy Pi(x), where x is a state vector. The output of Pi(x) is u, an action. Another FFNN serves as the critic, or Q function Q(x,u), and it produces a value v. The output of Pi is part of the input of Q. The state vector x is an input to both.
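To make the wiring concrete, here's a minimal sketch of the two networks (I'm using PyTorch and a 4-dimensional cart-pole state purely for illustration; the layer sizes and activations are my own choices, not the paper's):

    import torch
    import torch.nn as nn

    state_dim, action_dim = 4, 1   # cart-pole state, scalar force (illustrative)

    # Actor Pi(x): maps a state vector x to an action u.
    pi = nn.Sequential(
        nn.Linear(state_dim, 32), nn.Tanh(),
        nn.Linear(32, action_dim), nn.Tanh(),   # bounded action output
    )

    # Critic Q(x, u): maps a state-action pair to a scalar value v.
    q = nn.Sequential(
        nn.Linear(state_dim + action_dim, 32), nn.Tanh(),
        nn.Linear(32, 1),
    )

    x = torch.randn(1, state_dim)        # some state
    u = pi(x)                            # Pi's output is part of Q's input
    v = q(torch.cat([x, u], dim=1))      # value of the (state, action) pair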

From the paper,

In iteration step k of the NFQCA algorithm we assume that the recent policy π_k represents the greedy evaluation of the Q-function: π_k (x) ≈ argmin_u (Q_k (x,u)). With this assumption we can formulate the Q-update without a minimization step over all actions as Q-update(x, u) = c(x, u) + Q_k (x′, π_k (x′)) 

...where c(x, u) is a special cost function that smoothly approaches 0 from a constant C as the desired set-point (defined as some desired state) is reached, and x′ is the state at time t+1. The update for Q is regular backpropagation (the authors use RProp), but the update for Pi is a little different: you simply carry the backpropagation from Q back through the linked input/output nodes, calculating a delta for Q's input layer as if it were a hidden layer, and using that to continue the backpropagation through Pi.
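Continuing the sketch above, I read that step as pushing the gradient of Q with respect to its action inputs back into Pi. Here it is with autograd doing the chain rule instead of hand-written deltas (the optimizer and learning rate are placeholders; the paper itself uses RProp and explicit backpropagation):

    # Actor update: backpropagate from Q's output, through its action inputs,
    # and on through Pi. Only Pi's weights are changed in this step.
    pi_optimizer = torch.optim.Adam(pi.parameters(), lr=1e-3)  # placeholder optimizer

    def update_actor(states):
        u = pi(states)                                # forward pass through the actor
        q_values = q(torch.cat([states, u], dim=1))   # forward pass through the critic
        actor_loss = q_values.mean()                  # costs, not rewards, so Pi minimizes Q
        pi_optimizer.zero_grad()
        q.zero_grad()            # the critic accumulates gradients but is never stepped here
        actor_loss.backward()    # the "delta for Q's input layer" is handled by autograd
        pi_optimizer.step()
        return actor_loss.item()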

That seemed like a really neat trick, and I wanted to try it. I have a simple cart-pole simulator set up and I'm using batch RMSProp (very similar to RProp) to update my Pi and Q networks. Now that I have my learner set up, I'm running into trouble that I don't think is a simple programming error.

First of all, doesn't the Q update rule look odd for actor-critic? Why are we summing costs instead of rewards? In my setup, if learning starts with the pole hanging downward (furthest from the reward state), the cost for one action is 1/epoch_size (my value for C). Q spits out some arbitrary, untrained value for the duration of the epoch, and the target values used to update the Q network all end up as more-or-less that same arbitrary value plus 1/epoch_size. This results in the Pi network being encouraged to apply force in some direction, but iterating for another epoch never changes the situation: as the cart slowly accelerates, the pole stays un-inverted and the Q and Pi networks get the same update over and over again, encouraging Pi to slowly apply more force in one direction or the other. The Q network's value estimates are pushed toward 1.0 + (1/epoch_size), since each epoch encourages Q's output to increase.

Something is wrong here. There's no mechanism for exploration as there is with discrete-action Q-learning. When (near) maximum cost is incurred for an entire epoch, what about this algorithm is supposed to make the behavior change? Even occasionally resetting the simulation (but not the networks) and setting the pole to a random angle doesn't seem to help. Is something missing from this paper, or am I interpreting the update rule incorrectly?
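For reference, here's roughly how I've been computing the targets and fitting Q, in case I'm misreading the update rule (continuing the sketch above; Adam and a squared-error loss stand in for the paper's RProp fit, and the batch of transitions and cost values come from my cart-pole simulator):

    # One fitted-Q step: targets come from the *frozen* current networks,
    # then Q is regressed onto them.
    q_optimizer = torch.optim.Adam(q.parameters(), lr=1e-3)  # placeholder optimizer

    def update_critic(states, actions, next_states, costs):
        # costs ~ c(x, u): near C far from the set-point, approaching 0 at it
        with torch.no_grad():                                 # freeze Pi_k and Q_k for the targets
            next_actions = pi(next_states)                    # pi_k(x')
            next_q = q(torch.cat([next_states, next_actions], dim=1))
            targets = costs + next_q                          # c(x, u) + Q_k(x', pi_k(x'))
        predictions = q(torch.cat([states, actions], dim=1))
        critic_loss = ((predictions - targets) ** 2).mean()
        q_optimizer.zero_grad()
        critic_loss.backward()
        q_optimizer.step()
        return critic_loss.item()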

submitted by eubarch
