Hello,
For the last few weeks I've been working on a backprop network and posting a few questions to this forum; I thank you for all the help so far. I've gone from concept, to buggy implementation, to something that works.
As a quick recap: my network takes input/feature vectors of length 43, has 25 nodes in the hidden layer (an arbitrary choice I can change), and a single output node. I want to train it to take the 43 features and output a single value between 0 and 100.
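For concreteness, here's roughly the shape of the thing in NumPy (a sketch, not my actual code - the sigmoid activations and the weight init are just placeholder assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # fully connected 43 -> 25 -> 1, matching the recap above
    W1 = rng.normal(0.0, 0.1, (25, 43)); b1 = np.zeros(25)
    W2 = rng.normal(0.0, 0.1, (1, 25));  b2 = np.zeros(1)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x):
        h = sigmoid(W1 @ x + b1)   # hidden layer, 25 units
        y = sigmoid(W2 @ h + b2)   # output in (0, 1); x100 gives the 0-100 score
        return h, y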
Unfortunately, I currently only have a very small pool of training data - 162 feature vectors with corresponding scores out of 100 (I have to label these manually, lol - working on creating more data, obviously). So I train on this limited set, and here's a snapshot of how well my network fits it:
Output value:0.90406 | Test value:0.9 (multiply all values by 100 to get back to the 0-100 scale)
Output value:0.21558 | Test value:0.2
Output value:0.60394 | Test value:0.6
Output value:0.79604 | Test value:0.8
Output value:0.99846 | Test value:0.85
Output value:0.23444 | Test value:0.2
Output value:0.19609 | Test value:0.2
Output value:0.88889 | Test value:0.9
Output value:0.19178 | Test value:0.2
Output value:0.20549 | Test value:0.2
Output value:0.63248 | Test value:0.64
Output value:0.74367 | Test value:0.74
Output value:0.15477 | Test value:0.17
Output value:0.17084 | Test value:0.18
Output value:0.21143 | Test value:0.19
Output value:0.16179 | Test value:0.17
Output value:0.081413 | Test value:0.18
Output value:0.18287 | Test value:0.19
Output value:0.19118 | Test value:0.17
Output value:0.20018 | Test value:0.18
Output value:0.19222 | Test value:0.19
Output value:0.20719 | Test value:0.2
Output value:0.18718 | Test value:0.2
Output value:0.18064 | Test value:0.2
Output value:0.20925 | Test value:0.2
Output value:0.20731 | Test value:0.2
Output value:0.19914 | Test value:0.2
Output value:0.6033 | Test value:0.6
Output value:0.63723 | Test value:0.64
Output value:0.77831 | Test value:0.78
Output value:0.23468 | Test value:0.2
Output value:0.87713 | Test value:0.9
Output value:0.23822 | Test value:0.2
Output value:0.18954 | Test value:0.15
Output value:0.19912 | Test value:0.2
At first I'm like, "wow, this is sick!" The results are much, much better than when I originally tried gradient descent on its own. Like, this is too good to be true. Hmm, maybe it is. So I decide to test it: keep the same test/target values, but feed in 162 completely random feature vectors.
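Concretely, the sanity check looked something like this (sketch, continuing from the snippet above - train() is a hypothetical stand-in for my backprop training loop, and targets holds the same 162 scores scaled to [0, 1]):

    # same 162 targets as before, but pure-noise inputs: if the net
    # still fits, it's memorizing rather than learning from the features
    X_random = rng.random((162, 43))   # uniform noise in [0, 1)
    train(X_random, targets)           # hypothetical: my backprop training loop
    for x, t in zip(X_random, targets):
        _, y = forward(x)
        print(f"Output value:{y[0]:.5f} | Test value:{t}")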
Uh oh - my network was able to fit the random training data even better than my actual training data! In fact, it fit the random data perfectly. Shit:
Output value:0.92 | Test value:0.92
Output value:0.2 | Test value:0.2
Output value:0.2 | Test value:0.2
Output value:0.2 | Test value:0.2
Output value:0.2 | Test value:0.2
Output value:0.2 | Test value:0.2
Output value:0.2 | Test value:0.2
Output value:0.2 | Test value:0.2
Output value:0.2 | Test value:0.2
Output value:0.2 | Test value:0.2
Output value:0.62 | Test value:0.62
Output value:0.7 | Test value:0.7
Output value:0.77 | Test value:0.77
Now I'm thinking it's one of two possibilities:
1) Because I have so few training samples (only 162), my 3-layer 43->25->1 network can over-fit the data with all its weights. By my count that's 43*25 + 25*1 = 1100 weights (1126 parameters including biases) - roughly seven free parameters per training sample, so it has plenty of capacity to just memorize the set.
2) My original feature vectors are absolutely worthless - no better than plain garbage. I hand-crafted these feature vectors based on what my research suggested would be appropriate for my problem domain. (I sketch a quick way to try to tell these two apart below.)
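In the meantime, here's roughly how I figure I could start to distinguish them without new data: hold a chunk of samples out, train on the rest, and compare the held-out error for my real features against the random ones (another sketch - train(), X, and targets are the same hypothetical stand-ins as above):

    # hypothetical check: train on 130 samples, score the 32 held out
    idx = rng.permutation(162)
    train_idx, val_idx = idx[:130], idx[130:]

    train(X[train_idx], targets[train_idx])

    def mean_abs_error(X_eval, t_eval):
        return np.mean([abs(forward(x)[1][0] - t) for x, t in zip(X_eval, t_eval)])

    print("train MAE:", mean_abs_error(X[train_idx], targets[train_idx]))
    print("val   MAE:", mean_abs_error(X[val_idx], targets[val_idx]))
    # random features should do no better than chance on the held-out split;
    # if the real features beat that, they're carrying at least some signal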
What do you guys think is going on, and will I only know once I have more training data? Given the topology of my network, any idea how much data I'll actually need?
Cheers.