I have a data set that I'm playing with, but I'm not sure whether there is a pattern to be found, so I'm being really hesitant about my results. Since I also don't have a good feel for tweaking the parameters, I decided to generate the product of a set of parameter values, calculate an error rate against the test data along with the sum of absolute values across the training set, and see what kind of parameters give me a good fit on my error rate.
So far it appears that as the number of iterations goes up the overfitting gets really bad, and the best configurations tend to land pretty close to the defaults.
Here are all the parameters I'm taking the product over:
    l_l1_ratio = (0, 0.05, 0.1, 0.15, 0.25, 0.5, 0.75, 1)
    l_penalty = ('l1', 'l2', 'elasticnet')
    l_alpha = (0.00001, 0.0001, 0.001, 0.01, 0.1)
    l_loss = ('squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive')
    l_n_iter = (5, 50, 500, 5000)
    l_eta0 = (0.01, 0.001, 0.0001)  # 0.1 crashes the fit!
I'm using sklearn.linear_model.SGDRegressor.
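Roughly, the whole sweep looks like the sketch below. The data arrays here are just dummy stand-ins for my real (cleaned) train/test sets, and the parameter names match the older scikit-learn API I'm on (n_iter and 'squared_loss'; newer releases call these max_iter and 'squared_error').

    # Minimal sketch of the parameter sweep; dummy data in place of my real arrays.
    from itertools import product

    import numpy as np
    from sklearn.linear_model import SGDRegressor
    from sklearn.metrics import mean_squared_error

    rng = np.random.RandomState(0)
    X_train, y_train = rng.rand(1000, 30), rng.rand(1000)   # stand-ins for the cleaned training set
    X_test, y_test = rng.rand(250, 30), rng.rand(250)        # stand-ins for the 20% holdout

    l_l1_ratio = (0, 0.05, 0.1, 0.15, 0.25, 0.5, 0.75, 1)
    l_penalty = ('l1', 'l2', 'elasticnet')
    l_alpha = (0.00001, 0.0001, 0.001, 0.01, 0.1)
    l_loss = ('squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive')
    l_n_iter = (5, 50, 500, 5000)
    l_eta0 = (0.01, 0.001, 0.0001)

    results = []
    # full product is 8 * 3 * 5 * 4 * 4 * 3 = 5760 fits
    for l1_ratio, penalty, alpha, loss, n_iter, eta0 in product(
            l_l1_ratio, l_penalty, l_alpha, l_loss, l_n_iter, l_eta0):
        reg = SGDRegressor(loss=loss, penalty=penalty, alpha=alpha,
                           l1_ratio=l1_ratio, n_iter=n_iter, eta0=eta0)
        reg.fit(X_train, y_train)
        test_err = mean_squared_error(y_test, reg.predict(X_test))   # error rate against the test data
        train_abs_sum = np.abs(reg.predict(X_train)).sum()           # abs sum of values across the training set
        results.append((test_err, train_abs_sum, l1_ratio, penalty, alpha, loss, n_iter, eta0))

    results.sort()   # lowest test error first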
I have also plotted a scatter of expected values (X) against actual values (Y), and in a second graph a fill of expected against actual, comparing a run with 1752 error against one with 15360 error.
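The plots are nothing fancy, roughly the sketch below (dummy arrays standing in for the real predictions and targets):

    # Sketch of the two plots: expected-vs-actual scatter, then a fill between them.
    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.RandomState(0)
    y_actual = rng.rand(200) * 100
    y_expected = y_actual + rng.randn(200) * 10    # in the real run this is reg.predict(X_test)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    ax1.scatter(y_expected, y_actual, s=5)         # expected on X, actual on Y
    ax1.plot([0, 100], [0, 100], 'r--')            # perfect-prediction reference line
    ax1.set_xlabel('expected')
    ax1.set_ylabel('actual')

    order = np.argsort(y_actual)                   # second graph: fill between expected and actual
    ax2.fill_between(np.arange(len(order)), y_actual[order], y_expected[order], alpha=0.5)
    ax2.set_xlabel('sample (sorted by actual)')
    ax2.set_ylabel('value')

    plt.tight_layout()
    plt.show()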
There are 30 attributes and ~400k samples, split 80/20, but after rejecting bad data the usable rows end up being around ~40k.
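The cleaning and split are roughly the following (dummy data again, just to show the shape of it; in my real set the bad-row rejection is what drops ~400k down to ~40k):

    # Sketch of the cleaning + split: reject rows containing NaNs, then split 80/20.
    import numpy as np
    from sklearn.model_selection import train_test_split   # on older sklearn this lives in sklearn.cross_validation

    rng = np.random.RandomState(0)
    X = rng.rand(400000, 30)                    # dummy stand-in: 30 attributes, ~400k samples
    y = rng.rand(400000)
    X[rng.rand(400000) < 0.9, 0] = np.nan       # simulate the "bad data" that gets rejected

    mask = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    X_clean, y_clean = X[mask], y[mask]         # ends up around ~40k usable rows

    X_train, X_test, y_train, y_test = train_test_split(
        X_clean, y_clean, test_size=0.2, random_state=0)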
A low score seems to be highly accurate, but higher scores tend to jump around quite a bit. My main question, though, is about the strange gap centered around (0, 0): is it caused by NaN rows, or is it an artifact of the SGDRegressor? And am I correct in assuming that high error rates with large n_iter counts are a sign of strong overfitting?