Channel: Machine Learning

[Question] Regression with many irrelevant variables?


Hello everybody. I have a biological dataset of ~20K vectors (samples from tissues) with ~300 features (genes), most of them binary or categorical (0..n), except for a few numeric ones (phenotypes of the organism from which the tissue was extracted). I wanted to test the hypothesis that the binary/categorical features are enough to predict the numeric phenotypes (for instance, the lifespan of the organism), but after some exploratory data analysis with Weka I could not find any clear signal.

Most of the binary or categorical features are mutations in some genomic regions, but in many cases (>90%) the value is equal to 0 (no mutation). Besides, most of these mutations are known not to produce any particular effect (i.e. they are irrelevant with respect to the phenotype).
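To make the sparsity concrete, here is a minimal sketch of how the per-feature mutation rate could be measured, and how features that are almost always 0 could be filtered out before fitting anything. The function names, the toy data and the `min_rate` threshold are all hypothetical; this is not something I have run on the real dataset.

```python
from typing import List


def mutation_rates(rows: List[List[int]]) -> List[float]:
    """Fraction of samples with a nonzero (mutated) value, per feature column."""
    n_samples = len(rows)
    n_features = len(rows[0])
    return [
        sum(1 for row in rows if row[j] != 0) / n_samples
        for j in range(n_features)
    ]


def keep_features(rows: List[List[int]], min_rate: float = 0.01) -> List[int]:
    """Indices of features mutated in at least `min_rate` of the samples.

    `min_rate` is an arbitrary illustrative threshold, not a recommended value.
    """
    return [j for j, rate in enumerate(mutation_rates(rows)) if rate >= min_rate]


# Toy data: 4 samples x 3 features (0 = no mutation).
rows = [
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 2],
    [0, 0, 0],
]
rates = mutation_rates(rows)
kept = keep_features(rows, min_rate=0.3)
```

In the toy data the first feature is never mutated, so it cannot carry any information about the phenotype and is dropped.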

I have tried linear regression and lasso (the implementation in Matlab) without much success. I have also tried discretizing the numeric attribute into 3 or 4 classes and running several classifiers on it, all of which gave very poor results.
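For reference, the discretization step can be sketched roughly as follows. It assumes equal-frequency (quantile) bins over the numeric phenotype; the function name and the example lifespans are made up for illustration.

```python
from typing import List


def quantile_bins(values: List[float], n_classes: int = 3) -> List[int]:
    """Assign each value a class label 0..n_classes-1 using equal-frequency cuts."""
    ranked = sorted(values)
    n = len(ranked)
    # Cut points taken at the 1/n_classes, 2/n_classes, ... positions of the sorted data.
    cuts = [ranked[(k * n) // n_classes] for k in range(1, n_classes)]
    # A value's label is the number of cut points it reaches or exceeds.
    return [sum(1 for c in cuts if v >= c) for v in values]


# Hypothetical lifespans, one per sample.
lifespans = [2.0, 9.5, 4.1, 7.3, 3.2, 8.8]
labels = quantile_bins(lifespans, n_classes=3)
```

Equal-frequency bins keep the classes balanced, which at least rules out class imbalance as the reason the classifiers do badly.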

Should I try a specific technique apart from the classical ones? Are there any preprocessing techniques you would recommend, or that I am probably missing? Could I conclude that the phenotypes have nothing to do with the genomic mutations (and that I am therefore just wasting my time [and yours])? Can you recommend any bibliography dealing with similar problems?

Thanks for your time, redditors!

submitted by alfonsoeromero
