Channel: Machine Learning

hey guys, what would you do with The New Yorker corpus?


Intuitively, it seems like it would be an interesting object of study from an NLP perspective, because there's so much Culture encoded in it.

It exists somewhere on Condé Nast's servers as messy OCR'd text. Judging by some searches on archives.newyorker.com and some poking around in the Chrome console, it's structured something like

'issue1':{ 'page1':'electronics. We began to build vac- uum-tube circuits that did all sorts of things. " As an undergraduate, Minsky had begun to imagine building an elec- tronic machine that could learn. He had become fascinated', 'page2': ... } 

so there aren't clear delineations between articles off the bat.
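
A first cleanup pass might look something like the sketch below: stitch an issue's pages together and rejoin words that the OCR split at line breaks. The sample dict just mirrors the structure guessed at above (the real key names and nesting are not confirmed), and `dehyphenate` and `issue_text` are made-up helper names. The regex is naive: it will also merge genuine compounds that happened to break at a line end (e.g. "well- known" becomes "wellknown"), so a dictionary check would probably be needed on real data.

```python
import re

# Hypothetical sample mirroring the guessed structure; not real server output.
corpus = {
    "issue1": {
        "page1": "We began to build vac- uum-tube circuits that did",
        "page2": "all sorts of things. He imagined an elec- tronic machine.",
    }
}


def dehyphenate(text):
    """Join words split by an OCR line-break hyphen ('vac- uum' -> 'vacuum')."""
    # A hyphen followed by a space between two word characters is treated
    # as a line-break artifact; ordinary hyphens ('vacuum-tube') are kept.
    return re.sub(r"(\w)- (\w)", r"\1\2", text)


def issue_text(issue):
    """Concatenate pages in stored order, then repair split words.

    Assumes the pages are already stored in reading order.
    """
    return dehyphenate(" ".join(issue.values()))


clean = issue_text(corpus["issue1"])
print(clean)
```

This still leaves the article-boundary problem untouched; it only gives one continuous text per issue to work with.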

The best I've got so far (thanks /u/agconway for the ideation!) is to use NER to extract place names and plot them on a heatmap over time by mention frequency. Then one could (somewhat playfully) address the question: is The New Yorker really as NY-biased as it's reputed to be?
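
To make the counting side of that concrete, here is a rough sketch that tallies place mentions per year and computes a "New York share". To keep it dependency-free it uses a tiny hard-coded gazetteer as a stand-in for a real NER model, so it is not actual NER; the place list, the year-keyed input, and the function names are all assumptions for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical gazetteer standing in for NER output; a real run would
# take place names from an NER model rather than a fixed list.
PLACES = ["New York", "Brooklyn", "Paris"]


def place_counts_by_year(pages_by_year, places=PLACES):
    """Return {year: Counter({place: mentions})} for the given page texts."""
    # Longest names first so multi-word places win over shorter overlaps.
    ordered = sorted(places, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
    counts = defaultdict(Counter)
    for year, pages in pages_by_year.items():
        for text in pages:
            counts[year].update(pattern.findall(text))
    return counts


def share(counts, year, place):
    """Fraction of that year's place mentions that refer to `place`."""
    total = sum(counts[year].values())
    return counts[year][place] / total if total else 0.0


# Made-up sample pages, keyed by an assumed publication year.
sample = {
    1981: ["He left New York for Paris.", "Back in New York again."],
    1982: ["Paris in the spring."],
}
counts = place_counts_by_year(sample)
print(share(counts, 1981, "New York"))
```

The per-year counters are the rows a heatmap would need; the plotting itself is left out.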

Also cool would be "zooming in" and heatmapping mentions of neighborhoods in cities over time, "hipmapping" if you will ;)

Could also compare stuff like word usages against the Wikipedia corpus...
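
One simple way to do the word-usage comparison is a smoothed log-ratio of relative frequencies: positive scores mean a word is over-represented in the target text relative to the reference. This is only a sketch with toy strings in place of the two corpora; the tokenizer, the add-one smoothing, and the function names are my own choices, not anything established for this dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def log_ratio(target_text, reference_text):
    """Add-one smoothed log ratio of each word's relative frequency.

    > 0: more typical of the target; < 0: more typical of the reference.
    """
    t = Counter(tokenize(target_text))
    r = Counter(tokenize(reference_text))
    vocab = set(t) | set(r)
    # Add-one smoothing: every vocabulary word gets one pseudo-count
    # in each corpus, so unseen words do not cause division by zero.
    t_total = sum(t.values()) + len(vocab)
    r_total = sum(r.values()) + len(vocab)
    return {
        w: math.log(((t[w] + 1) / t_total) / ((r[w] + 1) / r_total))
        for w in vocab
    }


# Toy strings standing in for the two corpora.
scores = log_ratio("the talk of the town", "the encyclopedia of the world")
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:14s} {score:+.2f}")
```

On real corpora of very different sizes, something like a log-likelihood test would probably be more robust than the raw ratio.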

LDA topics vs. the Wikipedia corpus....

thoughts, anyone?

submitted by charliehack
