Say we have 1 million rows of data and want to fit a lasso regression model, so we need to choose the regularization parameter lambda. We run k-fold cross-validation with, say, 10 folds (?) to pick lambda. Then we refit on the entire (?) dataset with that lambda to get the coefficients.
But then we want to report an MSE for this model. Should we have set aside 10% of the data at the start and kept it only for reporting? What's the name for that part of the dataset?
Or should I just stick to cross-validation with a 60/30/10 split on training/validation/reporting?
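To make the first option concrete, here is a minimal sketch of the workflow I'm describing, assuming Python with scikit-learn. The synthetic data, the 2,000-row size, and the variable names are placeholders rather than my real setup, and note that scikit-learn calls lambda `alpha`. It sets aside 10% up front, picks lambda by 10-fold CV on the remaining 90%, refits on that 90%, and computes MSE on the held-out rows. Whether this is the right procedure is exactly what I'm asking.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: far smaller than 1 million rows).
X, y = make_regression(
    n_samples=2000, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Set aside 10% at the start; these rows are never used while choosing lambda.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.10, random_state=0
)

# 10-fold CV over a grid of penalties (scikit-learn names lambda "alpha").
# Once the best alpha is found, LassoCV refits on all of X_dev with it,
# so the final coefficients come from one fit, not an average over folds.
model = LassoCV(cv=10, random_state=0).fit(X_dev, y_dev)

chosen_lambda = model.alpha_
holdout_mse = mean_squared_error(y_holdout, model.predict(X_holdout))

print(f"chosen lambda: {chosen_lambda:.4f}")
print(f"hold-out MSE:  {holdout_mse:.2f}")
```

In this sketch the coefficients are fit on the 90%, not the full data, which is part of what I'm unsure about in question 2.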
TL;DR: 4 questions
1) How do you choose the number of folds (k)?
2) Do you fit the coefficients (the non-regularization parameters) on the full data, or do you average the coefficients from your k folds?
3) How do you report the MSE for your entire model? What do we call the data set aside for that part?
4) Should I just stick with cross-validation?