I have a true underlying model of ax + by + some error = z, where a, b, x, y, z are real numbers (x and y are constrained to be between 0 and 1000). My data set is a set of ([x, y], z) pairs: x and y are my features, z is my label. My data is randomly distributed.
Now I want to estimate a and b. So I start with a random a and b and do SGD. My error is the average of the squared errors (MSE).
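Roughly, my setup looks like this (a minimal Python sketch; the true coefficients, the noise scale, and names like TRUE_A and mse are just placeholders, not my actual values):

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_A, TRUE_B = 3.0, -2.0              # the unknown coefficients I want to recover
N = 10_000

X = rng.uniform(0, 1000, size=(N, 2))   # features x, y in [0, 1000]
noise = rng.normal(0, 5.0, size=N)      # "some error"
z = TRUE_A * X[:, 0] + TRUE_B * X[:, 1] + noise

def mse(a, b, X, z):
    """Average of the squared errors for candidate coefficients (a, b)."""
    pred = a * X[:, 0] + b * X[:, 1]
    return np.mean((pred - z) ** 2)
```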
The algorithm looks like this (rough code sketch below the list):
- randomly pick a or b.
- randomly choose to increment or decrement by step size
- calculate new error
- if new error > old error, undo
- otherwise, keep the step and set old error = new error
- do it again
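In code, the loop above looks roughly like this (continuing the sketch; the step size and iteration count are arbitrary):

```python
STEP = 0.1
N_ITERS = 10_000

a, b = rng.uniform(-10, 10, size=2)     # random starting guess
old_error = mse(a, b, X, z)             # error over the whole dataset

for _ in range(N_ITERS):
    which = rng.integers(2)                      # randomly pick a or b
    delta = STEP if rng.integers(2) else -STEP   # increment or decrement
    if which == 0:
        a += delta
    else:
        b += delta
    new_error = mse(a, b, X, z)
    if new_error > old_error:
        # worse: undo the step
        if which == 0:
            a -= delta
        else:
            b -= delta
    else:
        # better (or equal): keep the step
        old_error = new_error
```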
So, if I calculate the error over the whole dataset, no problem. But if I try to compute the error over a subset of the data set, it doesn't work, and here is why: as x and y grow, so does the error, since for my current estimates a' and b' the error on a single point is:
((a - a')x + (b - b')y)^2
So if on iteration 1 I calculate the error over a subset with small xs and ys, and on iteration 2 the xs and ys in the subset I pick are big, the error of the second iteration will be higher regardless of whether I stepped in the right direction or not.
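To make that concrete (still continuing the sketch; the batch size and the wrong guess are arbitrary), the exact same (a, b) can get a very different error depending on which rows land in the batch:

```python
BATCH_SIZE = 32
a_guess, b_guess = 2.0, -1.0            # some fixed (wrong) guess

idx1 = rng.choice(N, BATCH_SIZE, replace=False)
idx2 = rng.choice(N, BATCH_SIZE, replace=False)
err1 = mse(a_guess, b_guess, X[idx1], z[idx1])
err2 = mse(a_guess, b_guess, X[idx2], z[idx2])
print(err1, err2)   # these differ a lot even though (a, b) didn't change,
                    # so "new error > old error" no longer says whether the step helped
```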
So how do I solve this? Is square the wrong loss function? Won't this problem come up with any loss function? Should I pick a single subset of the dataset to calculate the error on instead of picking a new one each time? Isn't there a risk of bias? Any other comments?
(forgive typos, 1 arm out of order makes typing hard)