Channel: Machine Learning

Library to Strip Wiki Markup from Wikipedia?


I was wondering if there's a better resource for extracting the natural-language text from Wikipedia markup: one that makes appropriate substitutions for things like {{convert|3|lbs|kg}} but drops things like {{cite book|bla|bla|bla}}, and that replaces [[Cat|Cat]] style links with their text but removes [[fr:Chat]] style links. Preferably in Ruby, Python, or even C.

All I can find is the PHP that is part of MediaWiki itself, plus a number of tools that pull the still-marked-up text out of the XML, which is trivial.

If not, would anyone be interested in collaborating on writing (a solid beginning to) such a parser over the weekend? I have been experimenting, and the real difficulty is just recognizing which templates are syntactically relevant (convert, et al.) and which carry information that doesn't belong in the natural-language portion, or that I don't want there (see, cite, et al.).
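
To make the idea concrete, here is a minimal Python sketch of that kind of parser, not a complete wikitext implementation. It rests on several assumptions: the only template worth keeping is convert, rendered simply as its value and source unit with no actual unit conversion; every other template is dropped; and any link whose target starts with a two- or three-letter lowercase prefix followed by a colon is treated as an interlanguage link. The names `render_template`, `render_link`, and `strip_wiki_markup` are made up for illustration.

```python
import re

# Matches only innermost templates, i.e. {{...}} with no braces inside.
TEMPLATE = re.compile(r"\{\{([^{}]*)\}\}")
# Matches [[target]] and [[target|label]] links.
LINK = re.compile(r"\[\[([^\[\]]*)\]\]")
# Assumption: interlanguage links look like fr:Chat, de:Katze, and so on.
INTERLANG = re.compile(r"^[a-z]{2,3}:")


def render_template(match):
    """Keep the readable part of 'convert'; drop every other template."""
    parts = [p.strip() for p in match.group(1).split("|")]
    name = parts[0].lower()
    if name == "convert" and len(parts) >= 3:
        # Simplification: show value and source unit, without converting.
        return f"{parts[1]} {parts[2]}"
    return ""


def render_link(match):
    """Replace a link with its label, or remove it if it is interlanguage."""
    target, _, label = match.group(1).partition("|")
    if INTERLANG.match(target.strip()):
        return ""
    return (label or target).strip()


def strip_wiki_markup(text):
    # Resolve templates from the inside out until nothing changes,
    # so that nested templates are handled too.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE.sub(render_template, text)
    text = LINK.sub(render_link, text)
    # Collapse the runs of spaces left behind by removed markup.
    return re.sub(r"[ \t]{2,}", " ", text).strip()


if __name__ == "__main__":
    sample = ("It weighs {{convert|3|lbs|kg}}.{{cite book|bla|bla|bla}} "
              "A [[Cat|Cat]] sat. [[fr:Chat]]")
    print(strip_wiki_markup(sample))
```

The keep-or-drop decision lives entirely in `render_template`, so extending the sketch mostly means growing the list of templates considered syntactically relevant. Real wikitext has many more constructs (tables, refs, file links, magic words) that this does not touch.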

submitted by ltltltlt
