
Weighting labeled points over unlabeled points in semi-supervised learning.


I am currently working on a semi-supervised generative model with about 330 labeled points and 18,000 unlabeled points. I'm using about 300 variables based on n-grams of varying sizes, each taking a non-negative integer value. There are 5 classes I'm trying to sort points into, with prior probabilities ranging from 0.08 to 0.5. (If more details about the dataset are needed to answer the questions below, let me know.)

However, when I run the algorithm, it shrinks some of the smaller classes down to far fewer members than one would expect given the standard error of the prior estimates, unless I weight the labeled data more heavily on every iteration (not just the first).

Is there any basis for maintaining a higher weight on the originally labeled data throughout training? Is it likely that part of the problem is the sparsity of each class's labeled training set relative to the number of variables?
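For context, one established way to formalize a persistent labeled-data weight is the lambda scheme of Nigam et al. (2000), where unlabeled examples are down-weighted by a factor 0 < lambda <= 1 in every M-step of EM rather than only at initialization. Below is a minimal sketch of that idea, assuming the generative model is a multinomial Naive Bayes over the n-gram counts (consistent with the non-negative integer features described above); the function name, `lam`, and the smoothing parameter `alpha` are illustrative, not taken from the original post.

```
import numpy as np

def weighted_ssl_nb(X_l, y_l, X_u, n_classes, lam=0.1, n_iter=50, alpha=1.0):
    """Semi-supervised multinomial Naive Bayes via EM, down-weighting
    unlabeled data by `lam` in every M-step (Nigam et al., 2000 style).
    This is a sketch; `lam` and `alpha` are illustrative defaults."""
    n_l, n_feat = X_l.shape
    # Labeled points get fixed one-hot responsibilities.
    R_l = np.zeros((n_l, n_classes))
    R_l[np.arange(n_l), y_l] = 1.0
    # Unlabeled points start with uniform soft labels.
    R_u = np.full((X_u.shape[0], n_classes), 1.0 / n_classes)

    for _ in range(n_iter):
        # M-step: weighted counts; each unlabeled row counts lam times.
        W = np.vstack([R_l, lam * R_u])
        X = np.vstack([X_l, X_u])
        class_mass = W.sum(axis=0)
        priors = (class_mass + alpha) / (class_mass.sum() + alpha * n_classes)
        word_counts = W.T @ X  # (n_classes, n_feat)
        theta = (word_counts + alpha) / (
            word_counts.sum(axis=1, keepdims=True) + alpha * n_feat)

        # E-step: recompute posteriors for the unlabeled points only.
        log_post = X_u @ np.log(theta).T + np.log(priors)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        R_u = np.exp(log_post)
        R_u /= R_u.sum(axis=1, keepdims=True)

    return priors, theta, R_u
```

With lam = 1 every point counts equally, so the 18,000 unlabeled points swamp the 330 labeled ones and EM is free to drain mass out of the small classes; keeping lam well below 1 (on the order of 330/18,000, say) preserves the influence of the labeled priors at every iteration, which is the behavior the weighting is meant to buy.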

submitted by SymphMeta