My professor gave me a copy of a journal paper to help me with a personal project requiring Reinforcement Learning, and I'm having trouble understanding a small part of one of the algorithms. While he is very knowledgeable about supervised learning, he has told me he has never worked with Reinforcement Learning, so the extent of his help would mostly be theoretical explanations. Unfortunately the journal in question is behind a paywall, so it's difficult to share, but I can point to an alternative (free) paper that deals with essentially the same equation.
(From now on, anything in {} is subscript.) The part I'm having trouble understanding is the term (P{t+1} - P{t}). How exactly is P{t+1} calculated? My assumption is that P{t} is the network's output for the current state (i.e. before the action is taken), and P{t+1} is the output for the next time step if the weights were to stay the same (i.e. the action is taken, the resulting state is fed as input to the network, and P{t+1} is the resulting output). Is my assumption correct?
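To make my assumption concrete, here is a rough sketch of what I think is happening. None of these names or functions are from the paper; predict() is just a stand-in linear approximator, and the update is a plain TD(0)-style step of the kind described in Sutton & Barto:

```python
import numpy as np

# Sketch of my assumption only, not the paper's actual algorithm.
# predict(w, s) stands in for whatever the network computes for state s
# with weights w; a trivial linear form is used just as a placeholder.
def predict(w, s):
    return np.dot(w, s)

def td_error(w, s_t, s_next):
    """Compute (P{t+1} - P{t}) using the SAME weights w for both predictions."""
    P_t = predict(w, s_t)        # output for the current state, before the action
    P_next = predict(w, s_next)  # output for the state that results from the action
    return P_next - P_t

def td_update(w, s_t, s_next, alpha=0.1):
    """Nudge the weights along the gradient of P{t}, scaled by the TD error."""
    delta = td_error(w, s_t, s_next)
    grad_P_t = s_t               # gradient of the linear predict() w.r.t. w
    return w + alpha * delta * grad_P_t
```

Is that roughly the right picture, or does computing P{t+1} involve something more than re-running the unchanged network on the next state?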
I've read both papers (the actual journal article I'm talking about is http://dl.acm.org/citation.cfm?id=2298811, and the formulas in question are (5) and (6), in case some of you have access to journals through academic institutions), and I've read the TD (and several other) sections in http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html. However, it seems none of them actually explain how the prediction for {t+1} is calculated (I assume it's common knowledge in RL circles, hence the lack of explanation).
Any help would be greatly appreciated, and if this is the wrong sub, I apologize; I don't visit this sub often.