I have implemented Q-Learning as described in:
http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf
To approximate Q(S,A) I use a neural network with the following structure (a rough sketch of the setup is shown after the list):
- Activation: sigmoid
- Inputs: the state inputs plus 1 extra input for the action (all inputs scaled 0-1)
- Output: a single output, the Q-value
- N hidden layers of M neurons each
- Exploration method: random, explore when 0 < rand() < propExplore
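For concreteness, here is a minimal sketch of that setup in Python, under some assumptions: the state size, hidden layer size, learning rate, and the numpy MLP itself are illustrative placeholders, not my actual implementation; propExplore is the exploration probability mentioned above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Minimal sigmoid MLP: one hidden layer of M units, one Q-value output.
    n_state_inputs = 4                      # example state size (assumption)
    n_inputs = n_state_inputs + 1           # +1 input for the action
    M = 8                                   # hidden units (assumption)
    W1 = rng.normal(0, 0.5, (n_inputs, M))
    W2 = rng.normal(0, 0.5, (M, 1))
    lr = 0.1                                # learning rate (assumption)
    propExplore = 0.1                       # exploration probability

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def q_value(state, action):
        # State features and the action are scaled to 0-1 before feeding the net.
        x = np.append(state, action)
        return float(sigmoid(sigmoid(x @ W1) @ W2))

    def choose_action(state, actions):
        # With probability propExplore pick a random action,
        # otherwise pick the action with the highest predicted Q-value.
        if rng.random() < propExplore:
            return actions[rng.integers(len(actions))]
        return max(actions, key=lambda a: q_value(state, a))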
At each learning iteration I run all state/action pairs through the neural network and either pick an action at random or choose the one with the highest Q-value. Then, using the following formula,
http://i.stack.imgur.com/e3hgc.png
I calculate a Q-target value (QTarget = reward + gamma * max_a' Q(s', a')), then calculate an error using
error = QTarget - LastQValueReturnedFromNN
and train the neural network using this error.
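Continuing the sketch above, one learning step might look like the following; gamma and the hand-rolled backprop update are assumptions about my setup, not a definitive implementation.

    gamma = 0.9   # discount factor (assumption)

    def train_step(state, action, reward, next_state, actions):
        # Q-target from the one-step Q-learning update:
        #   QTarget = reward + gamma * max_a' Q(next_state, a')
        q_next = max(q_value(next_state, a) for a in actions)
        q_target = reward + gamma * q_next

        # Forward pass, keeping intermediate values for backprop.
        x = np.append(state, action)
        h = sigmoid(x @ W1)
        q = float(sigmoid(h @ W2))

        # error = QTarget - LastQValueReturnedFromNN, then backpropagate it.
        error = q_target - q
        delta_out = error * q * (1 - q)                  # sigmoid derivative at the output
        delta_hidden = (W2[:, 0] * delta_out) * h * (1 - h)
        W2[:, 0] += lr * delta_out * h
        W1[:, :] += lr * np.outer(x, delta_hidden)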
Q1: Am I on the right track? I have seen some papers that implement an NN with one output neuron for each action.
Q2: My reward function returns a number between -1 and 1. Is it OK to return a reward in that range when the output activation function is sigmoid, whose output range is (0, 1)?
Q3: From my understanding of this method, given enough training instances it should be guaranteed to find an optimal policy, right? When training on XOR it sometimes learns after 2k iterations, and sometimes it won't learn even after 40k-50k iterations. Is this just randomness, or am I missing something? (I've tried Boltzmann exploration instead of pure random, but the iteration count needed for XOR still fluctuates.)