TL;DR - If you have operational experience implementing a distributed, online document classification system on a potentially unbounded and constantly growing dataset, care to validate my approach and/or suggest improvements?
I'd like to sanity-check my approach to solving a fairly large-scale, supervised online document classification problem. The app I've been working on has the following characteristics and goals:
- App crawls websites, stores HTML data
- A document model is produced from each page crawled, based on the non-markup text content of the page
- A classifier predicts whether the crawled HTML content is relevant to our client's core business, along with some kind of confidence score indicating the degree of relevance
- A subset of pages crawled from unrecognized domains is presented to users for eventual review, wherein they manually extract relevant features and record the supervised relevancy classification
- Review data is fed back into the ML algorithm, which should learn from the supervised classifications and hopefully improve accuracy over time
Constraints:
- Potentially large set of documents to classify (we're still in testing and already have over 2 million pages; this will explode once we turn the whole pipeline on full-time)
- ML process needs to be parallelizable, i.e. multiple machines will potentially be handling the learning and prediction steps of the pipeline
- Updating the prediction model shouldn't require access to the entire corpus at once; it's possible to have the whole thing (or a subset) available for initial training, but over time additions to the corpus will be streamed into the pipeline
- Latency is a concern, though we have some wiggle-room here; a few seconds for prediction is acceptable, and that has to include transforming raw HTML content into an appropriate document model
- Most importantly, updating the predictive model should require as little specialized developer interaction as possible; the app is intended to be a turn-key solution for our client, who has very little in-house development capability, and we don't want to be on the hook for completely retraining their predictive models every month or two
So far, I've considered:
- Solr running on a dedicated node + the Lucene classification API
- VW running on a dedicated node in daemon mode, performing online updates to the predictor model
- A cloud-based classification service like Alchemy API, et al.
The cloud-based stuff is nice because we're a very small shop developing this app, and having a "magic classification box" would cut down on engineering and deployment overhead. However, these services don't provide enough control over or visibility into the classification process, and they're too expensive.
Solr seems like a good approach, but we're wary of the complexity involved in setting up, administering, scaling, and interacting with the Solr stack. We don't currently want or need the ability to search the crawled data, so it seems like a lot of unnecessary overhead just for a binary text classification task. Assuming it scales, though, this approach does neatly solve the parallelization issue (each pipeline node can query the Solr server for both updates and predictions).
VW seems like it would be ideal for this application, but I'm not sure I'm "doing it right". Official documentation seems pretty sparse, and I've read lots of blog/forum/mailing-list posts which offer wildly varying approaches.
To wit:
- If I run VW in daemon mode, will it handle concurrent streaming updates to the predictor? Can I use the same running daemon for both update and prediction operations?
- If I use daemon mode, do I just need to write plain ASCII-formatted examples to the daemon socket? Will unicode data work? How do I read response data back?
- Because I don't have a full corpus available, I can't do TF-IDF weighting. Some VW examples I've seen suggest that this isn't necessary. Stemming and stopword removal seem like useful pre-processing steps for producing the document model, but apparently VW can generate n-grams automatically. What would a likely effective document model look like for an arbitrary website, based solely on its content and URL?
- If I want VW to update the final predictor in a streaming, online fashion, what's the proper VW command-line invocation? They just added the `--save_resume` feature; does that do what I want? (I've put a rough sketch of what I'm imagining just after this list.)
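For concreteness, here's the rough, untested shape of what I'm imagining for the VW route. Everything in it is a guess on my part: the daemon invocation (something like `vw --daemon --port 26542 --loss_function logistic --ngram 2 --save_resume --quiet`), the tokenizing/stopword choices, the two-namespace document model, and the one-response-line-per-example socket protocol. The function names (`to_vw_example`, `vw_daemon_request`) are just my own placeholders.

```python
# Rough, untested sketch; assumes a VW daemon was started separately with
# something like:
#   vw --daemon --port 26542 --loss_function logistic --ngram 2 --save_resume --quiet
# (those flags exist per the docs, but I haven't verified how --save_resume
# interacts with daemon mode)
import re
import socket

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Tiny illustrative stopword list; a real one would come from NLTK or similar.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}


def to_vw_example(html, url, label=None, tag="doc"):
    """Turn raw HTML into a single-line VW example.

    VW's text format is roughly:  [label] ['tag] |namespace feature feature ...
    The alphanumeric tokenizer below conveniently sidesteps VW's special
    characters (spaces, ':' and '|'), so no extra escaping is needed.
    """
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    tokens = [t.lower() for t in re.findall(r"[a-z0-9]+", text, re.IGNORECASE)]
    tokens = [t for t in tokens if t not in STOPWORDS]

    # URL pieces go in their own namespace so the model can weight them separately.
    url_tokens = [t.lower() for t in re.findall(r"[a-z0-9]+", url, re.IGNORECASE)]

    # With --loss_function logistic the labels need to be -1 / +1.
    prefix = "" if label is None else ("1 " if label else "-1 ")
    return "{}'{} |text {} |url {}".format(prefix, tag, " ".join(tokens), " ".join(url_tokens))


def vw_daemon_request(example_line, host="localhost", port=26542):
    """Send one example to the VW daemon and return its one-line response.

    My understanding is that a labeled example also updates the model, while an
    unlabeled one just returns a prediction. Sending UTF-8 bytes should be fine
    since VW hashes feature names as raw byte strings, but I haven't confirmed
    that for non-ASCII content.
    """
    with socket.create_connection((host, port)) as sock:
        sock.sendall((example_line + "\n").encode("utf-8"))
        response = b""
        while not response.endswith(b"\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            response += chunk
    return response.decode("utf-8").strip()


if __name__ == "__main__":
    html = "<html><body><h1>Industrial pumps</h1><p>We sell centrifugal pumps.</p></body></html>"
    url = "http://example.com/products/pumps"

    # Prediction only (no label): crawler nodes would do this.
    print(vw_daemon_request(to_vw_example(html, url, tag="crawl1")))

    # Streaming update: the review UI feeds the supervised label back in.
    print(vw_daemon_request(to_vw_example(html, url, label=True, tag="review1")))
```

The idea would be that every pipeline node talks to the same daemon for both predictions and updates, and the raw logistic score could (I think) be squashed through a sigmoid to get the "degree of relevance" number the client wants. If that's roughly the right way to drive the daemon, and roughly the right kind of document model, great; if not, that's exactly the kind of correction I'm after.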
Ultimately, the predictions don't have to be extremely precise (a degree of false positives and negatives is fine), but we need the system to scale as the corpus grows and be pretty low-maintenance in the long run.
If anyone out there has insight into how best to implement a system like this, I'd very much appreciate it. Does Solr or VW make more sense here? Or something else? If VW is the way to go, can anyone confirm (or correct) the sketch above, i.e. how to best process and format an HTML document into a valid VW example, train the predictor using the VW daemon, and then perform a classification prediction (returning the classification along with some kind of accuracy metric)? That would be incredibly helpful.