Hi guys, I'm struggling with implementing a long short-term memory (LSTM) network. I have the forward pass done, but I'm having trouble differentiating the activation functions to get the error terms for the backward pass, because I suck at math. The original LSTM paper uses a combination of truncated BPTT and RTRL, but the paper I'm trying to follow claims to use BPTT only (side note: does calculating the full gradient imply not updating the weights at every timestep?). If someone could walk me through how to calculate the derivatives for the cell, I'd greatly appreciate it.
TLDR: How do I calculate an LSTM cell's derivatives?