What are some good resources for extracting the main text of a web page? What I mean is: given a web page in HTML format, extract the main body of the text and leave out irrelevant stuff like sidebars, ads, etc.
I know that this is an active research topic, but I am curious if anyone has found a library that works well.
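To make the task concrete, here is a naive baseline sketch of the kind of thing I mean, using only Python's standard-library `html.parser`. It is not taken from any existing library. The names `BlockCollector` and `extract_main_text`, the tag lists, and the `min_words` / `max_link_density` thresholds are all assumptions picked for illustration, and a heuristic this crude will fail on plenty of real pages.

```python
from html.parser import HTMLParser

# Tags whose contents are assumed never to be main text (illustrative list).
SKIP_TAGS = {"script", "style", "nav", "aside", "header", "footer", "form"}
# Tags treated as block boundaries (illustrative list).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count how much of each is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished (text, link_chars) pairs
        self._text = []        # pieces of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._skip_depth = 0   # > 0 while inside a SKIP_TAGS element
        self._in_link = 0      # > 0 while inside an <a> element

    def flush(self):
        # Collapse whitespace and store the block if it is non-empty.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_text(html, min_words=10, max_link_density=0.3):
    """Keep blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_chars in parser.blocks:
        if len(text.split()) < min_words:
            continue  # too short to be body text (assumed threshold)
        if link_chars / len(text) > max_link_density:
            continue  # mostly links, likely navigation or a sidebar
        kept.append(text)
    return "\n\n".join(kept)
```

Calling `extract_main_text(html_string)` returns the surviving blocks joined by blank lines. I am hoping a proper library does much better than a word-count and link-density filter like this.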