Hey all, I've been struggling to learn how to apply Q-learning to ANNs. I understand that this is mostly done with feed-forward MLPs trained by gradient-descent backpropagation. My problem is understanding the right way to use the Q-values I get to update the neural network.
Take for instance the mountain car problem: it has a continuous state space and 3 discrete actions.
Car_position = [-1.2, 0.6], Car_velocity = [-0.07, 0.07], Possible actions = [Rev, Neutral (do nothing), Fwd]. The car starts every episode at position -0.5 with velocity 0.0.
Now the idea is to create a neural network to replace the Q-table that I would normally have, right? Therefore a neural network with 2 inputs (real numbers for position and velocity), one hidden layer (5-25 nodes or so) and 3 output nodes corresponding to the actions seems like a good idea.
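For reference, this is roughly how I'm creating it (10 hidden nodes is just a value I picked from that range, and I'm not 100% sure I have the newff arguments right, so correct me if the syntax is off):

    % Input ranges: position in [-1.2, 0.6], velocity in [-0.07, 0.07]
    inputRanges = [-1.2 0.6; -0.07 0.07];
    % 10 sigmoid hidden nodes, 3 linear outputs (one Q-value per action)
    net = newff(inputRanges, [10 3], {'logsig', 'purelin'}, 'traingdm');
    net = init(net);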
Is this the right process now:
Run the network (feed forward the state -0.5, 0.0) to get 3 Q-values, one for each action. These are the Q-values for state s (the current state)
Choose an action a using epsilon-greedy: either pick the action with the highest Q-value or a random one
Simulate the Mountain Car one step and obtain a reward and new state s' from the executed action
Run the network with state s' to get 3 new Q-values for s'
Calculate QTarget = reward + gamma * (max Q-value for s')
The target pattern for the weight update is then either [0; 0; QTarget], [0; QTarget; 0] or [QTarget; 0; 0], since we don't know how good the Q-values of the actions we did not take are, and we only want to move the Q-value of s corresponding to the action taken
Set s = s' and repeat the process until the number of learning episodes has elapsed (rough code sketch of the whole loop below)
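In code, this is roughly what I mean (gamma, epsilon and numEpisodes are just example values, mountainCarStep is only a placeholder for my own simulator step, not a toolbox function, and setting epochs to 1 so train does a single update per step is my assumption):

    gamma = 0.99;
    epsilon = 0.1;
    numEpisodes = 1000;
    net.trainParam.epochs = 1;          % one gradient step per Q update

    for ep = 1:numEpisodes
        s = [-0.5; 0.0];                % start state: position, velocity
        done = false;
        while ~done
            Q = sim(net, s);            % 3 Q-values for current state s
            if rand < epsilon
                a = randi(3);           % explore: random action
            else
                [~, a] = max(Q);        % exploit: best action
            end
            [sPrime, reward, done] = mountainCarStep(s, a);   % placeholder
            Qprime = sim(net, sPrime);  % 3 Q-values for s'
            QTarget = reward + gamma * max(Qprime);
            target = zeros(3,1);        % other actions get 0 (the part I'm unsure about)
            target(a) = QTarget;
            net = train(net, s, target);
            s = sPrime;
        end
    end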
I'm using Matlab with the NN toolbox to create, init and update the weights. So I use newff with sigmoid in the hidden layer and linear in the output layer.
Is updating done with the net = train(net, s, Targets) function? The parameter s is a matrix like [-0.5; 0.0]. I selected traingdm as the training function.
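In case it matters, these are the traingdm settings I'm planning to use on that net (the lr and mc values are just examples I picked, not anything tuned):

    net.trainFcn = 'traingdm';
    net.trainParam.epochs = 1;   % one gradient step per Q update
    net.trainParam.lr = 0.05;    % learning rate (example value)
    net.trainParam.mc = 0.9;     % momentum (example value)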
Thanks