TL;DR - If you have operational experience implementing a distributed, online document classification system on a potentially unbounded and constantly growing dataset, care to validate my approach and/or suggest improvements?
I'd like to sanity-check my approach to solving a fairly large-scale, supervised online document classification problem. The app I've been working on has the following characteristics and goals:
- App crawls websites, stores HTML data
- A document model is produced from each page crawled, based on the non-markup text content of the page
- A classifier predicts whether the crawled HTML content is relevant to our client's core business, along with some kind of confidence score indicating the degree of relevance
- A subset of pages crawled from unrecognized domains is presented to users for eventual review, wherein they manually extract relevant features and record the supervised relevancy classification
- Review data is fed back into the ML algorithm, which should learn from the supervised classifications and hopefully improve accuracy over time
Constraints:
- Potentially large set of documents to classify (we're still in testing and already have over 2 million pages; this will explode once we turn the whole pipeline on full-time)
- ML process needs to be parallelizable, i.e. multiple machines will potentially be handling the learning and prediction steps of the pipeline
- Updating the prediction model shouldn't require access to the entire corpus at once; it's possible to have the whole thing (or a subset) available for initial training, but over time additions to the corpus will be streamed into the pipeline
- Latency is a concern, though we have some wiggle-room here; a few seconds for prediction is acceptable, and that has to include transforming raw HTML content into an appropriate document model
- Most importantly, updating the predictive model should require as little specialized developer interaction as possible; the app is intended to be a turn-key solution for our client, who has very little in-house development capability, and we don't want to be on the hook for completely retraining their predictive models every month or two
So far, I've considered:
- Solr running on a dedicated node + the Lucene classification API
- VW running on a dedicated node in daemon mode, performing online updates to the predictor model
- A cloud-based classification service like Alchemy API, et al.
The cloud-based stuff is nice because we're a very small shop developing this app, and having a "magic classification box" would cut down on engineering and deployment overhead. However, these services don't provide enough control over or visibility into the classification process, and they're too expensive.
Solr seems like a good approach, but we're wary of the complexity involved in setting up, administering, scaling, and interacting with the Solr stack. We don't currently want or need the ability to search the crawled data, so it seems like a lot of unnecessary overhead just for a binary text classification task. Assuming it scales, though, this approach does neatly solve the parallelization issue (each pipeline node can query the Solr server for both updates and predictions).
VW seems like it would be ideal for this application, but I'm not sure I'm "doing it right". Official documentation seems pretty sparse, and I've read lots of blog/forum/mailing-list posts which offer wildly varying approaches.
To wit:
- If I run VW in daemon mode, will it handle concurrent streaming updates to the predictor? Can I use the same running daemon for both update and prediction operations?
- If I use daemon mode, do I just need to write plain ASCII-formatted examples to the daemon socket? Will unicode data work? How do I read response data back?
- Because I don't have a full corpus available, I can't do TF-IDF weighting. Some VW examples I've seen suggest that this isn't necessary. Stemming and stopword removal seem like useful pre-processing steps for producing the document model, but apparently VW can generate n-grams automatically. What would a likely effective document model look like for an arbitrary website, based solely on its content and URL?
- If I want VW to update the final predictor in a streaming, online fashion, what's the proper VW command-line invocation? They just added the `--save_resume` feature; does that do what I want? (I've put a rough sketch of what I'm imagining just after this list.)
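For concreteness, here's the rough, untested shape of what I'm imagining for the VW route. Everything in it is a guess on my part: the daemon invocation (something like `vw --daemon --port 26542 --loss_function logistic --ngram 2 --save_resume --quiet`), the tokenizing/stopword choices, the two-namespace document model, and the one-response-line-per-example socket protocol. The function names (`to_vw_example`, `vw_daemon_request`) are just my own placeholders.

```python
# Rough, untested sketch; assumes a VW daemon was started separately with
# something like:
#   vw --daemon --port 26542 --loss_function logistic --ngram 2 --save_resume --quiet
# (those flags exist per the docs, but I haven't verified how --save_resume
# interacts with daemon mode)
import re
import socket

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Tiny illustrative stopword list; a real one would come from NLTK or similar.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}


def to_vw_example(html, url, label=None, tag="doc"):
    """Turn raw HTML into a single-line VW example.

    VW's text format is roughly:  [label] ['tag] |namespace feature feature ...
    The alphanumeric tokenizer below conveniently sidesteps VW's special
    characters (spaces, ':' and '|'), so no extra escaping is needed.
    """
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    tokens = [t.lower() for t in re.findall(r"[a-z0-9]+", text, re.IGNORECASE)]
    tokens = [t for t in tokens if t not in STOPWORDS]

    # URL pieces go in their own namespace so the model can weight them separately.
    url_tokens = [t.lower() for t in re.findall(r"[a-z0-9]+", url, re.IGNORECASE)]

    # With --loss_function logistic the labels need to be -1 / +1.
    prefix = "" if label is None else ("1 " if label else "-1 ")
    return "{}'{} |text {} |url {}".format(prefix, tag, " ".join(tokens), " ".join(url_tokens))


def vw_daemon_request(example_line, host="localhost", port=26542):
    """Send one example to the VW daemon and return its one-line response.

    My understanding is that a labeled example also updates the model, while an
    unlabeled one just returns a prediction. Sending UTF-8 bytes should be fine
    since VW hashes feature names as raw byte strings, but I haven't confirmed
    that for non-ASCII content.
    """
    with socket.create_connection((host, port)) as sock:
        sock.sendall((example_line + "\n").encode("utf-8"))
        response = b""
        while not response.endswith(b"\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            response += chunk
    return response.decode("utf-8").strip()


if __name__ == "__main__":
    html = "<html><body><h1>Industrial pumps</h1><p>We sell centrifugal pumps.</p></body></html>"
    url = "http://example.com/products/pumps"

    # Prediction only (no label): crawler nodes would do this.
    print(vw_daemon_request(to_vw_example(html, url, tag="crawl1")))

    # Streaming update: the review UI feeds the supervised label back in.
    print(vw_daemon_request(to_vw_example(html, url, label=True, tag="review1")))
```

The idea would be that every pipeline node talks to the same daemon for both predictions and updates, and the raw logistic score could (I think) be squashed through a sigmoid to get the "degree of relevance" number the client wants. If that's roughly the right way to drive the daemon, and roughly the right kind of document model, great; if not, that's exactly the kind of correction I'm after.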
Ultimately, the predictions don't have to be extremely precise (a degree of false positives and negatives is fine), but we need the system to scale as the corpus grows and be pretty low-maintenance in the long run.
If anyone out there has insight into how best to implement a system like this, I'd very much appreciate it. Does Solr or VW make more sense here? Or something else? If VW is the way to go, can anyone confirm (or correct) the sketch above, i.e. how to best process and format an HTML document into a valid VW example, train the predictor using the VW daemon, and then perform a classification prediction (returning the classification along with some kind of accuracy metric)? That would be incredibly helpful.