I'm trying to create a network that predicts handwriting strokes. I've got a network consisting of a 3-d input vector (x coordinate, y coordinate, and the probability that the point is the last point in a stroke), a hidden layer of 900 long short-term memory (LSTM) cells, and an output layer of size 121 (20 bivariate Gaussian mixture components at 6 parameters each, plus 1 for the end-of-stroke probability).
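For reference, this is roughly how I understand the output-layer split (a minimal NumPy sketch; the layout and names are my own choice, but the squashing functions follow equations 18-22 of the paper):

    import numpy as np

    def split_outputs(y_hat):
        """Map the raw 121-dim output vector to mixture-density parameters.
        Layout: [e, pi (20), mu1 (20), mu2 (20), sigma1 (20), sigma2 (20), rho (20)].
        The ordering is arbitrary; the squashing functions follow eqs. 18-22."""
        e_hat = y_hat[0]
        pi_hat, mu1, mu2, s1_hat, s2_hat, rho_hat = np.split(y_hat[1:], 6)

        e = 1.0 / (1.0 + np.exp(e_hat))           # end-of-stroke probability
        pi = np.exp(pi_hat)                       # mixture weights via softmax
        pi = pi / np.sum(pi)                      # (this is the exp that blows up for me)
        s1, s2 = np.exp(s1_hat), np.exp(s2_hat)   # standard deviations
        rho = np.tanh(rho_hat)                    # correlations
        return e, pi, mu1, mu2, s1, s2, rho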
I think my code accurately reflects the algorithm outlined in the paper I'm reading, Graves' "Generating Sequences With Recurrent Neural Networks" (http://arxiv.org/pdf/1308.0850v3.pdf). Assuming my code is correct, I'm still having an extremely difficult time keeping the numbers in a reasonable range.
One problem I've been having is that several of the output equations raise e to the power of the summed inputs of an output node. With 900 incoming connections, that sum is often large enough for the exponential to overflow. I sidestepped this in a previous architecture by dividing the input by 1000 or by taking the log of the whole expression instead.
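Is the usual max-subtraction trick the right fix for the softmax here, or does it distort the mixture weights somehow? A sketch of what I mean (toy numbers, names are mine):

    import numpy as np

    def stable_softmax(z):
        """Softmax with the maximum subtracted first, so np.exp never sees a
        huge positive argument. The result is unchanged in exact arithmetic
        because the shift cancels in the ratio."""
        z = z - np.max(z)
        w = np.exp(z)
        return w / np.sum(w)

    # summed inputs around 1000 would overflow a naive np.exp(z)
    z = np.array([1000.0, 998.0, 995.0])
    print(stable_softmax(z))  # finite, sums to 1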
Another problem I'm having is that the correlation (between x1 and x2 of a given mixture component; I'm honestly not too sure, since I'm not good at stats) keeps approaching -1 or 1, which causes a division-by-zero error when calculating the probability density. I've tried looking at other papers on mixture density networks, but their notation is confusing and their equations differ noticeably from the ones in the paper I'm using.
Upon further review, the converging correlations look like a symptom of the same problem: the increasingly large summed inputs to the output nodes. The correlation is tanh(summed input), so as the sums keep growing, the correlations saturate at +/-1. I don't know how to keep the sums from growing so much. Should I take the log of the sum, or does that just push the problem further down the line?
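The only workaround I can think of is clipping the correlation slightly away from +/-1 before evaluating the bivariate density (equations 24-25 in the paper). A sketch of that, with an epsilon I picked arbitrarily:

    import numpy as np

    def bivariate_density(x1, x2, mu1, mu2, s1, s2, rho, eps=1e-6):
        """Bivariate Gaussian density (eqs. 24-25 in the paper), with the
        correlation clipped so that 1 - rho**2 can never reach zero."""
        rho = np.clip(rho, -1.0 + eps, 1.0 - eps)
        z = ((x1 - mu1) ** 2 / s1 ** 2
             + (x2 - mu2) ** 2 / s2 ** 2
             - 2.0 * rho * (x1 - mu1) * (x2 - mu2) / (s1 * s2))
        denom = 2.0 * np.pi * s1 * s2 * np.sqrt(1.0 - rho ** 2)
        return np.exp(-z / (2.0 * (1.0 - rho ** 2))) / denom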
A somewhat unrelated problem (I think at least) that I've experienced with pretty much any project I've worked on lately is how much time it takes to train the network. With this current network there are about 3 million weights that need to be updated every timestep (as per the network's design; I don't know what would happen if I did updates every 1000 steps etc). A single pen stroke corresponds mostly to a single letter or at most a small word written in cursive. This single pen stroke looks like it'll take my network 25000 seconds to train on. Is it impossible for me to somehow expedite this process?
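Would replacing per-weight Python loops with whole-matrix updates be the expected fix, or is the bottleneck inherent to the architecture? A toy comparison of what I mean:

    import time
    import numpy as np

    # Toy comparison: updating a 900x900 weight matrix one element at a time
    # in Python versus as a single vectorized NumPy operation.
    W = np.random.randn(900, 900)
    grad = np.random.randn(900, 900)
    lr = 1e-4

    start = time.time()
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W[i, j] -= lr * grad[i, j]
    print("looped update:     %.3fs" % (time.time() - start))

    start = time.time()
    W -= lr * grad  # same update, done as one array operation
    print("vectorized update: %.3fs" % (time.time() - start))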
TL;DR: How can I keep my weights and outputs within a reasonable numerical range without compromising the network? And how can I best work around the time bottlenecks caused by both the network size and the data size?
I'm sorry if my problem is poorly worded or if I didn't provide enough information. Please let me know if you want me to clarify anything.