Channel: Machine Learning

Guidance needed on a classification task using website data


Hello, I'm an MIT researcher, and machine learning methods are new to the kind of work that I do. Here's what I'm looking to do: I have scraped text data from a large sample of websites, each identified by its domain name (abc.com, pqr.org, rsv.net, etc.). I expect to have text data from many millions of such websites, and in the future I hope to expand this dataset to include data from the Common Crawl project.

My task is to identify websites with "high growth" potential in this data. That is, I want to be able to identify potential startups in this sample and throw out personal websites, websites for SMEs, news websites, blogs, etc.

I have a few questions:

  1. Am I right in thinking that this is a standard classification ML problem? What algorithms should I be looking at? Will I be all right using something like Python's scikit-learn?

  2. How would this task change if I wanted to classify websites into more types, say by industry (travel, holiday, finance, etc.)?

  3. Do you know of existing papers or researchers that try to classify website text data? Any references would be very helpful!
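
To make the first two questions concrete, here is a minimal sketch of the kind of scikit-learn setup I have in mind, assuming I can hand-label a small sample of sites. The example texts, the `startup`/`other` labels, and the `build_classifier` helper are all made up for illustration, and TF-IDF features with logistic regression are just one common baseline, not necessarily the right choice for this data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labelled sample: one blob of scraped text per domain.
texts = [
    "we raised a seed round and are hiring engineers to scale our platform",
    "our startup is backed by investors and building a cloud product",
    "welcome to my personal blog about hiking and family photos",
    "local bakery open monday to friday, fresh bread baked daily",
    "breaking news and daily headlines from around the world",
    "join our fast growing team at a venture funded software company",
]
labels = ["startup", "startup", "other", "other", "other", "startup"]


def build_classifier():
    """Bag-of-words baseline: TF-IDF features fed to logistic regression."""
    return make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words="english"),
        LogisticRegression(max_iter=1000),
    )


model = build_classifier()
model.fit(texts, labels)

# Unlabelled text from new domains (also made up).
new_sites = [
    "seed funded company hiring engineers for our software platform",
    "family photos and hiking stories from my personal blog",
]
predictions = model.predict(new_sites)
for text, label in zip(new_sites, predictions):
    print(label, "<-", text)
```

If I understand correctly, moving to industry categories would only mean replacing the two labels with more classes (for example `travel`, `finance`), since the same pipeline accepts any number of distinct labels, though I assume it would need many more labelled examples per class.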

submitted by dalek2point3
