I was having trouble choosing a title that would fit the character limit and adequately describe the problem. My NLP vocabulary is a little limited, so I may be using more words than I need to describe what I'm doing/what I want.
What I'm doing: I have a good-sized dataset (about 30,000 entries), each consisting of a string of raw text (more specifically, an affiliation string from PubMed search results) plus several normalized, manually entered classifiers for that string (e.g. country, state/province where applicable, organization name). I'm using this set to automatically determine which n-grams (or combinations of present and/or absent n-grams) are likely to do a good job of predicting the country/state/org in future affiliation strings.
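To make that concrete, here's a rough sketch of the kind of scoring I mean, in Python with toy data standing in for the real corpus (the variable names and the 0.95 purity cutoff are made up for illustration, not my actual pipeline):

    from collections import defaultdict

    def ngrams(tokens, n):
        """All contiguous n-grams (as tuples) from a token list."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    # Toy stand-in for the real corpus: (affiliation string, country) pairs.
    corpus = [
        ("Dept. of Chemistry, University of Toronto, Toronto, Ontario, Canada", "Canada"),
        ("School of Medicine, Johns Hopkins University, Baltimore, MD, USA", "USA"),
        ("Institute of Physics, University of Tokyo, Tokyo, Japan", "Japan"),
    ]

    # Count, for each n-gram, how often it appears in lines with each label.
    label_counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)

    for text, country in corpus:
        tokens = text.lower().replace(",", "").split()
        for n in (1, 2):                      # unigrams and bigrams
            for g in set(ngrams(tokens, n)):  # set(): presence per line, not frequency
                label_counts[g][country] += 1
                totals[g] += 1

    # An n-gram looks like a good predictor when a single label accounts
    # for (nearly) all of the lines it appears in.
    for g, counts in sorted(label_counts.items(), key=lambda kv: -totals[kv[0]]):
        label, hits = max(counts.items(), key=lambda kv: kv[1])
        purity = hits / totals[g]
        if purity >= 0.95:
            print(g, "->", label, f"(purity {purity:.2f}, support {totals[g]})")

On a toy corpus this small every n-gram is trivially "pure", but on the real 30,000 entries the purity/support trade-off is where the interesting decisions live.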
What I'm trying to improve right now is the selection process for those predictive n-grams. My criteria for selecting sets of these predictors are, in order of priority, (1) covering all the affiliation lines in the corpus, (2) containing as few n-grams as possible, and (3) containing shorter n-grams (i.e. unigrams preferred over bigrams, etc.).
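If I'm framing my own criteria right, (1) plus (2) is essentially the minimum set cover problem, which is NP-hard but has a well-known greedy approximation. A minimal sketch of that greedy selection (the `lines_by_ngram` input is a hypothetical helper mapping each candidate n-gram to the set of corpus line indices it appears in, built much like `label_counts` above):

    def greedy_cover(lines_by_ngram):
        """Greedy set cover: repeatedly pick the n-gram that covers the
        most still-uncovered lines, until every line is covered."""
        uncovered = set().union(*lines_by_ngram.values())
        selected = []
        while uncovered:
            # Most new lines covered first; shorter n-grams break ties.
            best = max(
                lines_by_ngram,
                key=lambda g: (len(lines_by_ngram[g] & uncovered), -len(g)),
            )
            gained = lines_by_ngram[best] & uncovered
            if not gained:
                break  # some lines contain none of the candidate n-grams
            selected.append(best)
            uncovered -= gained
        return selected

    # Toy usage: three corpus lines (indices 0-2) and three candidates.
    print(greedy_cover({
        ("usa",): {0, 1},
        ("baltimore", "md"): {1},
        ("japan",): {2},
    }))  # -> [('usa',), ('japan',)]

The `-len(g)` tie-break is how I'd encode criterion (3): among candidates that cover equally many new lines, prefer the shorter n-gram.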
The problem is that even with those criteria, I still have quite a few options for how to actually select the predictive n-grams. Everything I've done so far works fairly well, but I've definitely hit a plateau in accuracy and efficiency. Is there any current "gold standard" for addressing this kind of problem? Are there any resources/guides/blogs that would be worth my time to peruse? Lastly, what sort of jargon should I acquaint myself with so that I can state my situation and ask my questions more precisely?