Hi,
I realise this is a very broad question, and the answer likely depends on the specific technique being used. To start off, though: I am currently using random forests to classify network activity. When you can provide labelled training data for every category, everything is fine. However, problems arise when we move from the model to the real world, where it is impossible to provide data for all types of activity because of the huge variety that exists.
At the moment, our approach is to label large amounts of network activity we're not interested in as "other" in the training set, which works to some degree. This seems rather hacky, though. Is there a better way to do this?
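For concreteness, here is a minimal sketch of the relabelling step described above, in plain Python. The category names and the `collapse_to_other` helper are made up purely for illustration and are not our real labels or code; the resulting list is what would be handed to the random forest as training targets.

```python
from collections import Counter

# Hypothetical categories we care about (illustrative names only).
CLASSES_OF_INTEREST = {"web", "dns", "ssh"}


def collapse_to_other(labels, keep=CLASSES_OF_INTEREST, other="other"):
    """Map every label outside `keep` to a single catch-all class."""
    return [label if label in keep else other for label in labels]


# Made-up raw labels for a handful of training flows.
raw_labels = ["web", "dns", "ntp", "ssh", "smtp", "web", "bittorrent"]
train_labels = collapse_to_other(raw_labels)

# Inspect the class balance before training; the catch-all class
# can easily end up dominating the training set.
print(Counter(train_labels))
```

The weak point is that "other" only covers the kinds of traffic that happened to be captured for training, so activity unlike anything in that bucket can still be assigned to one of the classes of interest.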