Hi,
I have data from two types (classes) of networks: net1, net2. Each network has 100 data instances. Each instance has twenty features and one output label.
My aims are two-fold:
- build a classification model and
- rank the features using the Fisher test and AUC
The labels for the data are floating-point numbers: 47.23, 67.5 etc. They range from 0.0 to 100.0. If I use the labels as is, the prediction accuracy is, predictably, too low. I want to bin these labels into ranges using the k-means clustering algorithm. The number of distinct labels will then equal the number of clusters I specify for k-means. Once I build the model using k-fold cross-validation, I will compare the performance of two different kernels to start with: linear and RBF.
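A minimal sketch of that binning and kernel comparison in scikit-learn. The random data, the choice of 3 clusters, the 5 folds and the feature scaling are all placeholders and assumptions, not values from my setup:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data standing in for one network: 100 instances, 20 features,
# and a continuous label in [0, 100].
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y_cont = rng.uniform(0.0, 100.0, size=100)

# Bin the continuous labels with 1-D k-means; n_bins is a hypothetical choice.
n_bins = 3
km = KMeans(n_clusters=n_bins, n_init=10, random_state=0)
y_bins = km.fit_predict(y_cont.reshape(-1, 1))

# Compare a linear and an RBF kernel with stratified k-fold cross-validation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores[kernel] = cross_val_score(model, X, y_bins, cv=cv).mean()
    print(kernel, scores[kernel])
```

Note that k-means cluster ids are arbitrary, so bin 0 is not necessarily the lowest label range.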
I want to repeat both steps (classification and feature ranking) for three cases:
- data from only net1,
- data from only net2 and
- data from net1 and net2 (I will use +ve and -ve suffixes to separate the data from the two classes)
and observe which features rank high. The paper "Feature Ranking Using Linear SVM" [1] lists four ranking methods, including the Fisher test and AUC.
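For the combined net1/net2 case, a per-feature ranking could be sketched as below. This assumes a two-class Fisher-style score (squared difference of class means over the sum of class variances) and a single-feature AUC. Those are common definitions, but they may not match the exact ones in [1]. The helper names and the synthetic data are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fisher_scores(X, y):
    """Two-class Fisher-style score per feature (assumed definition)."""
    pos, neg = X[y == 1], X[y == -1]
    num = (pos.mean(axis=0) - neg.mean(axis=0)) ** 2
    den = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return num / den

def auc_scores(X, y):
    """AUC obtained by using each feature alone as the decision value."""
    aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    # A feature that separates the classes in reverse is just as useful.
    return np.maximum(aucs, 1.0 - aucs)

# Synthetic stand-in: 100 instances per network, +1 for net1 and -1 for net2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = np.repeat([1, -1], 100)
X[y == 1, 3] += 2.0  # make feature 3 informative on purpose

# Feature indices sorted from highest to lowest score.
fisher_rank = np.argsort(fisher_scores(X, y))[::-1]
auc_rank = np.argsort(auc_scores(X, y))[::-1]
print(fisher_rank[:5], auc_rank[:5])
```

Both scores need two classes, so for the net1-only and net2-only cases they would have to be applied to a binary split of the binned labels (for example one bin against the rest).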
ML isn't my area of expertise, so I would like to hear from the Reddit ML community whether this is the correct approach. I would love to hear suggestions and comments.
I started out with LibSVM, but I didn't find it flexible enough for changing the C and gamma parameters while using cross-validation. I'm now using the scikit-learn package in Python.
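For reference, this is roughly how C and gamma can be varied under cross-validation in scikit-learn with `GridSearchCV`. The parameter grids, fold count and random data below are illustrative assumptions only:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: 100 instances, 20 features, labels already binned into 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 3, size=100)

# Scaling sits inside the pipeline so it is refit on each training fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# gamma only applies to the RBF kernel, so each kernel gets its own grid.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10],
     "svc__gamma": [0.001, 0.01, 0.1]},
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

`search.cv_results_` holds the mean score for every parameter combination, which makes it easy to compare the two kernels side by side.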
Thanks a lot!
[1] Feature Ranking Using Linear SVM - http://core.kmi.open.ac.uk/download/pdf/16008.pdf#page=61