Hello, I'm an MIT researcher, and machine learning methods are new to the kind of work I do. Here's what I'm looking to do: I have scraped text data from a large sample of websites, essentially domain names like abc.com, pqr.org, rsv.net, etc. I expect to have text data from several million such websites, and in the future I hope to expand this dataset to include data from the Common Crawl project.
My task is to identify websites with "high growth" potential from this data, i.e. I want to be able to pick out potential startups in this sample and filter out personal websites, SME websites, news websites, blogs, etc.
I have a few questions: 1. Am I right in thinking that this is a standard ML classification problem? What algorithms should I be looking at? Will I be alright using something like Python's scikit-learn? (A rough sketch of what I have in mind is below, after my questions.)
2. How would this task change if I wanted to classify websites into more types, say by industry (travel, holiday, finance, etc.)? (See the second sketch below.)
3. Do you know of existing papers or researchers who try to classify website text data? Any references would be very helpful!
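For question 1, here is a minimal sketch of the scikit-learn pipeline I'm imagining (TF-IDF features plus a linear classifier). I'm assuming I would first hand-label a few thousand sites; the texts, labels, and parameter choices below are made-up placeholders, not something I've validated:

```python
# Minimal sketch: TF-IDF features + logistic regression, assuming a
# hand-labeled training set (1 = likely startup, 0 = everything else).
# The texts and labels here are made-up placeholders.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

texts = [
    "We are building an AI platform to disrupt logistics. Join our team.",
    "Welcome to my personal blog about hiking and photography.",
    "Family-run bakery serving the local community since 1985.",
    "Seed-funded fintech startup reinventing payments for small businesses.",
    "Breaking news, weather and sport from around the world.",
    "Our SaaS product helps fast-growing companies scale customer support.",
]
labels = [1, 0, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42
)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```

My assumption is that a linear model over TF-IDF features is a sensible baseline before trying anything fancier; please tell me if that's the wrong way to think about it.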
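For question 2, my (possibly naive) understanding is that the same kind of pipeline extends to multiple classes without much change, since scikit-learn's classifiers accept multi-class labels directly; again, the industry labels and texts below are invented examples:

```python
# Same pipeline shape, but with one label per industry instead of a
# binary startup / not-startup target. Labels and texts are invented.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

industry_texts = [
    "Book cheap flights and hotel packages for your next holiday.",
    "Compare savings accounts, mortgages and credit cards online.",
    "Guided tours and adventure travel across South America.",
    "Zero-commission stock trading and retirement accounts.",
    "Breaking political news and in-depth analysis every morning.",
]
industry_labels = ["travel", "finance", "travel", "finance", "news"]

industry_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
])
industry_clf.fit(industry_texts, industry_labels)
print(industry_clf.predict(["Luxury beach resorts and last-minute city breaks."]))
```

If a site can belong to more than one industry at once, I assume this becomes a multi-label problem instead (which I gather scikit-learn supports via wrappers like OneVsRestClassifier), but I haven't tried that.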