Hello, I'm an MIT researcher, and machine learning methods are new to the kind of work I do. Here's what I'm looking to do: I have scraped text data from a large sample of websites, essentially domain names like abc.com, pqr.org, rsv.net, etc. I expect to have text data from several million such websites, and in the future I hope to expand this dataset to include data from the Common Crawl project.
My task is to identify websites with "high growth" potential from this data, i.e. I want to be able to pick out potential startups in this sample and filter out personal websites, SME websites, news websites, blogs, etc.
I have a few questions: 1. Am I right in thinking that this is a standard ML classification problem? What algorithms should I be looking at? Will I be alright using something like Python's scikit-learn? (A rough sketch of what I have in mind is below, after my questions.)
2. How would this task change if I wanted to classify websites into more types, say by industry (travel, holiday, finance, etc.)? (See the second sketch below.)
3. Do you know of existing papers or researchers who try to classify website text data? Any references would be very helpful!
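For question 1, here is a minimal sketch of the scikit-learn pipeline I'm imagining (TF-IDF features plus a linear classifier). I'm assuming I would first hand-label a few thousand sites; the texts, labels, and parameter choices below are made-up placeholders, not something I've validated:

```python
# Minimal sketch: TF-IDF features + logistic regression, assuming a
# hand-labeled training set (1 = likely startup, 0 = everything else).
# The texts and labels here are made-up placeholders.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

texts = [
    "We are building an AI platform to disrupt logistics. Join our team.",
    "Welcome to my personal blog about hiking and photography.",
    "Family-run bakery serving the local community since 1985.",
    "Seed-funded fintech startup reinventing payments for small businesses.",
    "Breaking news, weather and sport from around the world.",
    "Our SaaS product helps fast-growing companies scale customer support.",
]
labels = [1, 0, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42
)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```

My assumption is that a linear model over TF-IDF features is a sensible baseline before trying anything fancier; please tell me if that's the wrong way to think about it.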
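For question 2, my (possibly naive) understanding is that the same kind of pipeline extends to multiple classes without much change, since scikit-learn's classifiers accept multi-class labels directly; again, the industry labels and texts below are invented examples:

```python
# Same pipeline shape, but with one label per industry instead of a
# binary startup / not-startup target. Labels and texts are invented.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

industry_texts = [
    "Book cheap flights and hotel packages for your next holiday.",
    "Compare savings accounts, mortgages and credit cards online.",
    "Guided tours and adventure travel across South America.",
    "Zero-commission stock trading and retirement accounts.",
    "Breaking political news and in-depth analysis every morning.",
]
industry_labels = ["travel", "finance", "travel", "finance", "news"]

industry_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
])
industry_clf.fit(industry_texts, industry_labels)
print(industry_clf.predict(["Luxury beach resorts and last-minute city breaks."]))
```

If a site can belong to more than one industry at once, I assume this becomes a multi-label problem instead (which I gather scikit-learn supports via wrappers like OneVsRestClassifier), but I haven't tried that.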