
Classification problem where each training observation is inherently a separate class.


Given a user-input vector of 65 variables/features (some of which may be correlated), I am trying to find the nearest training observation.

On the surface this is a classic k-nearest-neighbor problem, and that is indeed what we started with. The problem is scaling the axes so that the Euclidean distance between observations reflects their actual (dis)similarity; with 65 dimensions, the curse of dimensionality is also a concern. We've tried scaling each variable and applying axis weights based on "expert opinion", but this is highly arbitrary, does not account for collinearity, and is hard to justify.
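For concreteness, here is a minimal sketch of that approach with scikit-learn. The training data, the query, and the weight vector are placeholders (the weights stand in for the expert-assigned values), not anything from the actual problem:

```python
# Sketch: standardize each feature, apply hand-tuned axis weights, then find
# the nearest training observation by Euclidean distance.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 65))   # hypothetical training observations
weights = np.ones(65)                  # stand-in for the "expert opinion" axis weights

scaler = StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train) * weights

nn = NearestNeighbors(n_neighbors=1).fit(X_scaled)

query = rng.normal(size=(1, 65))       # the user-input vector
dist, idx = nn.kneighbors(scaler.transform(query) * weights)
print(idx[0, 0], dist[0, 0])           # index and distance of the most similar training row
```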

What are some other options?

I've explored PCA to reduce the dimensionality and produce more appropriate axes. This helps reduce collinearity, but I'm still left with the arbitrary axis-weighting problem.
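Continuing the sketch above, the PCA variant might look like the following. The 95% variance threshold and the whitening step are assumptions on my part; whitening just puts every retained component on unit variance instead of requiring hand-chosen weights:

```python
# Sketch: decorrelate with PCA (after standardizing), keep components covering
# ~95% of the variance, and whiten so each retained axis has unit variance.
# Reuses X_train and query from the previous sketch.
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

proj = make_pipeline(StandardScaler(), PCA(n_components=0.95, whiten=True))
Z_train = proj.fit_transform(X_train)

nn = NearestNeighbors(n_neighbors=1).fit(Z_train)
dist, idx = nn.kneighbors(proj.transform(query))
```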

I've looked at canonical correlation and correspondence analysis, and they seem appropriate only when you have some sense of dependent vs. independent variables (which my model does not). I'm merely trying to predict which training observation is most similar to each test observation.

I hesitate to call it a classification problem, as each training observation is inherently a class by itself. In other words, by definition there will be only a single observation per class in any training set. Are there any negative implications to handling datasets with "singleton" classes in typical machine-learning classifiers? Which classifiers might be most appropriate for this case?
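One way to see why the singleton-class framing mostly collapses back to nearest-neighbor search: fit a 1-NN classifier whose labels are simply the training row indices. This is only an illustration, not a recommendation; most other classifiers (anything that estimates per-class statistics) will struggle with one example per class:

```python
# Sketch: treating every training row as its own class and classifying with
# 1-NN is equivalent to the nearest-observation lookup above.
# Reuses X_scaled, scaler, weights, and query from the first sketch.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train = np.arange(len(X_scaled))            # one singleton class per training observation
clf = KNeighborsClassifier(n_neighbors=1).fit(X_scaled, y_train)
print(clf.predict(scaler.transform(query) * weights))  # predicted "class" = index of nearest row
```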

Any suggestions welcome.

submitted by perrygeo