[Help] Dealing with high-variable, (relatively) low-observation data

Apologies if this is inappropriate, but I'm fairly new to ML and I'm having a bit of trouble finding resources for this particular problem.

I have 30 observations in 2 classes (15 in each). Each observation has several thousand variables (this could be reduced in a somewhat hand-wavy way, but I'd rather not). All variables are continuous; some are normally distributed and some aren't; some are most likely redundant; some are highly informative and others aren't, or are actively misleading. I'm using an SVM with the RBF kernel (libSVM in Matlab) to build classifiers, with leave-one-out cross-validation (or leave-pair-out, removing one observation from each class, for tests with fewer iterations) to evaluate the feature selection algorithm, but I'm having real trouble finding a feature selection algorithm that is at all stable across the different iterations of the LOOCV.
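
For concreteness, the outer loop looks roughly like this (libSVM's Matlab interface, labels coded as +1/-1; the -c and -g values are placeholders rather than tuned choices):

    % Outer leave-one-out CV around an RBF-kernel SVM (libSVM's Matlab interface).
    % X is the 30-by-p feature matrix, y the +1/-1 label column vector.
    n = size(X, 1);
    pred = zeros(n, 1);
    for i = 1:n
        train = setdiff(1:n, i);

        % Scale with training-fold statistics only, so the held-out sample
        % never leaks into the standardisation
        mu = mean(X(train, :));
        sigma = std(X(train, :));
        sigma(sigma == 0) = 1;
        Xtr = (X(train, :) - repmat(mu, numel(train), 1)) ./ repmat(sigma, numel(train), 1);
        Xte = (X(i, :) - mu) ./ sigma;

        % -t 2 selects the RBF kernel; -c and -g would ideally be tuned
        % inside the training fold as well
        model = svmtrain(y(train), Xtr, '-s 0 -t 2 -c 1 -g 0.01 -q');
        pred(i) = svmpredict(y(i), Xte, model);
    end
    acc = mean(pred == y);

Any feature selection happens on X(train, :) only, inside the loop, which is where the instability shows up.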

At first I tried ranking features by their contrast-to-noise ratio (CNR) and building a classifier from the top one, then the top two, then the top n, and picking the best classifier out of those, but that meant a lot of redundant information was included (possibly weighting the classifier in an unhelpful manner), the results were poor, and the choice of features was very unstable, I think because small variations in CNR cause quite large changes in CNR rank. Then I tried greedy forward selection, which was better (80% sensitivity, 87% specificity), but each classifier was still picking up different features (although some were picked more frequently than others). The greedy algorithm used LOOCV within the remaining 29 samples to choose which feature should be added, so it was a sort of nested LOO (LOO²). It would be interesting to use the probability of a feature being selected to weight the final classifier, but to test that I'd need to go to a third level of LOO, which is getting absurd.
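
If it helps to see the shape of it, the nested search can be written with the Statistics Toolbox's sequentialfs, run inside each outer fold on the 29 training samples; this is a sketch of the idea rather than my exact implementation, and the SVM parameters are again placeholders:

    % Criterion for sequentialfs: number of misclassified held-out samples
    % for an RBF SVM trained on the candidate feature subset
    crit = @(Xtr, ytr, Xte, yte) ...
        sum(yte ~= svmpredict(yte, Xte, svmtrain(ytr, Xtr, '-s 0 -t 2 -c 1 -g 0.01 -q')));

    % Inner leave-one-out partition over the 29 outer-fold training samples
    innercv = cvpartition(numel(train), 'LeaveOut');

    % Greedy forward selection: keep adding the feature that most improves
    % the inner-LOO criterion, stopping when no feature helps
    selected = sequentialfs(crit, X(train, :), y(train), ...
        'cv', innercv, 'direction', 'forward');

With several thousand candidate features and an inner LOO, this trains thousands of SVMs per added feature, which is part of why a third level of LOO on top feels absurd.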

At the moment I'm trying to reduce the number of features by combining covarying variables using PCA. However, it's my understanding that PCA doesn't really work very well with such rectangular data, so I'd need to heuristically reduce the number of features first for it to be effective. In particular, I found that the contrast-to-noise ratio of each principal-component score had no correlation with that component's latent (its variance), even when the latent was 0. This means that PCA doesn't actually reduce the search space at all. Edit: Also, none of the features actually seem to covary very strongly.
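
For reference, the check I described looks roughly like this (taking contrast-to-noise to mean the absolute difference of class means divided by the pooled within-class standard deviation, which is my working definition):

    % PCA on the raw feature matrix; with 30 samples there are at most 29
    % non-degenerate components, which is the "rectangularity" problem.
    % (pca is the newer name; princomp in older Matlab versions.)
    [coeff, score, latent] = pca(X);
    k = size(score, 2);

    % CNR of each principal-component score, assuming +1/-1 labels in y
    cnr = zeros(k, 1);
    for j = 1:k
        a = score(y == 1, j);
        b = score(y == -1, j);
        pooled = sqrt((var(a) + var(b)) / 2);
        cnr(j) = abs(mean(a) - mean(b)) / pooled;
    end

    % Correlation between how discriminative a component is (CNR) and how
    % much variance it explains (latent) - in my data this is basically zero
    rho = corr(cnr, latent(1:k));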

Have I missed some handy redundancy-reduction, dimensionality-reduction or other feature selection algorithm that is useful for this sort of data? Or is it crazy to even be looking at a data set this rectangular, and should I be trying to massively cut down the number of variables that I'm feeding into any algorithm?

EDIT: Thanks for the help, guys! In the unlikely event that I can squeeze a publication out of this in the next couple of months I'll do my best to big up /r/machinelearning.

submitted by blackrat47
