I want to test whether Alzheimer's disease causes a change in brain aging compared to healthy patients.
To that end, I have built a linear regression model relating spectral parameters of brain recordings to age (with age as the independent variable).
Now I want to fit the model on the healthy patients, then use its coefficients to calculate the expected age of the Alzheimer's patients. Comparing the mean squared error (MSE) on the healthy dataset against the MSE on the Alzheimer's dataset should help show whether the disease changes aging: if a model that works well for healthy patients fails miserably for Alzheimer's patients, there is probably a difference.
I guess I will fit the linear regression model on 80% of the healthy patients (a training set) and hold back 20% (a test set) for calculating the healthy MSE, because it seems unfair to compute the MSE on data the model was trained on and then compare it to a dataset the model never saw.
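The split-fit-compare procedure described above can be sketched as follows. This is a minimal illustration using synthetic stand-in data (the feature matrix, ages, group sizes, and the built-in offset for the Alzheimer's group are all hypothetical, just to make the example runnable); the real spectral parameters would take the place of `X_healthy` and `X_ad`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data standing in for the real recordings:
# rows = subjects, columns = spectral parameters; target = chronological age.
n_healthy, n_features = 100, 5
w = rng.normal(size=n_features)                      # made-up "true" weights
X_healthy = rng.normal(size=(n_healthy, n_features))
age_healthy = 40 + 3 * (X_healthy @ w) + rng.normal(scale=2, size=n_healthy)

# Hypothetical AD group whose spectra-age relationship is shifted by 15 years.
n_ad = 40
X_ad = rng.normal(size=(n_ad, n_features))
age_ad = 40 + 3 * (X_ad @ w) + 15 + rng.normal(scale=2, size=n_ad)

# 80/20 split of the healthy subjects.
idx = rng.permutation(n_healthy)
train, test = idx[:80], idx[80:]

# Ordinary least squares with an intercept, fit on the training set only.
A_train = np.column_stack([np.ones(len(train)), X_healthy[train]])
coef, *_ = np.linalg.lstsq(A_train, age_healthy[train], rcond=None)

def predict(X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# MSE on held-out healthy subjects vs. MSE on the (never-seen) AD group.
mse_healthy = np.mean((predict(X_healthy[test]) - age_healthy[test]) ** 2)
mse_ad = np.mean((predict(X_ad) - age_ad) ** 2)
```

If the AD relationship really differs, `mse_ad` should come out well above `mse_healthy`; with only a point estimate for each, though, you can't yet say whether the gap is more than noise, which is where the variance estimates discussed below come in.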
I would use cross-validation, but then I would end up with as many sets of coefficients as I have folds, and which would I use to score the Alzheimer's patients? The mean of the coefficients, perhaps? One advantage of cross-validation, though, is that it would give me a mean and standard deviation of the MSE estimates across the healthy folds, which I could use to judge whether the deviation between the healthy and diseased MSEs is significant. That would be handy.
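One common way around the "which coefficients?" problem is to use cross-validation only for estimating the MSE distribution, and then refit once on all the healthy data to get the single set of final coefficients (rather than averaging per-fold coefficients). A sketch of that idea, again on hypothetical synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in data: rows = healthy subjects, columns = spectral features.
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 40 + X @ np.arange(1.0, p + 1) + rng.normal(scale=2, size=n)

def fit(X_tr, y_tr):
    """OLS with intercept."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return coef

def mse(coef, X_te, y_te):
    pred = np.column_stack([np.ones(len(X_te)), X_te]) @ coef
    return np.mean((pred - y_te) ** 2)

# K-fold CV: each fold's MSE is computed on data its model never saw.
k = 5
folds = np.array_split(rng.permutation(n), k)
fold_mses = []
for i in range(k):
    te = folds[i]
    tr = np.concatenate([folds[j] for j in range(k) if j != i])
    fold_mses.append(mse(fit(X[tr], y[tr]), X[te], y[te]))

cv_mean = np.mean(fold_mses)
cv_std = np.std(fold_mses, ddof=1)

# Final coefficients for scoring the AD group: refit ONCE on all healthy data.
final_coef = fit(X, y)
```

The per-fold MSEs give the mean and spread you wanted for judging the healthy-vs-AD gap, while the final model uses every healthy subject, so you never have to pick among fold-specific coefficient sets.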
I guess I could also repeatedly sample subsets of the Alzheimer's patients to build a set of MSE estimates, then take their mean and standard deviation to get some idea of the variance there as well, so I know how sensitive the result is to that particular dataset. (Should I sample with replacement, i.e. bootstrap? Why or why not?)
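On the with/without-replacement question: the usual argument for the bootstrap (with replacement) is that each resample keeps the full sample size, so the spread of bootstrapped MSEs approximates the sampling variability of the MSE at your actual n, whereas subsampling without replacement estimates variability at a smaller effective sample size. A minimal sketch, assuming the healthy-fit coefficients are already in hand (the coefficient values and AD data below are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical: coefficients from the healthy fit (intercept first).
coef = np.array([40.0, 1.0, 2.0])

# Hypothetical AD data: 60 subjects, 2 spectral features, shifted relationship.
n_ad = 60
X_ad = rng.normal(size=(n_ad, 2))
age_ad = 35 + X_ad @ np.array([1.0, 2.0]) + rng.normal(scale=3, size=n_ad)

def group_mse(X, y):
    pred = np.column_stack([np.ones(len(X)), X]) @ coef
    return np.mean((pred - y) ** 2)

# Bootstrap: resample subjects WITH replacement, same size as the original
# sample, and recompute the MSE for each resample.
n_boot = 1000
boot = np.array([
    group_mse(X_ad[s], age_ad[s])
    for s in (rng.integers(0, n_ad, size=n_ad) for _ in range(n_boot))
])

boot_mean = boot.mean()
boot_se = boot.std(ddof=1)   # bootstrap standard error of the AD MSE
```

Note the model is held fixed here; only the AD subjects are resampled, so `boot_se` reflects sensitivity to which Alzheimer's patients happened to be in the sample, which is the quantity you described.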
Any advice is greatly appreciated.