I was wondering whether there's a better resource for extracting the natural-language text from wiki markup: making appropriate substitutions for things like {{convert|3|lbs|kg}}, ignoring things like {{cite book|bla|bla|bla}}, replacing [[Cat|Cat]]-style links with their text, and removing [[fr:Chat]]-style links. Preferably in Ruby, Python, or even C.
All I can find is the PHP that is part of MediaWiki itself, plus a number of sources that only strip the marked-up text out of the XML, which is trivial.
If not, would anyone be interested in collaborating on writing (a solid beginning to) such a parser over the weekend? I have been experimenting, and the difficulty is really just recognizing which templates are syntactically relevant (convert, et al.) and which carry information that doesn't belong in the natural-language portion (see, cite, et al.).
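
To make the idea concrete, here is a rough Python sketch of the whitelist approach I have in mind. Everything in it is an assumption for illustration: the names (`KEEP`, `render_convert`, `strip_markup`) are made up, the "convert" handler just echoes the value and unit rather than converting anything, and the interlanguage-link pattern is only a guess at what those prefixes look like. It is regex-based, so it is a starting point and not a real parser.

```python
import re

# Hypothetical handlers: templates listed in KEEP are rendered into text;
# every other template (cite, see, ...) is dropped entirely.
def render_convert(args):
    # Only echoes the value and source unit; no real conversion is attempted.
    return " ".join(args[:2])

KEEP = {"convert": render_convert}

TEMPLATE = re.compile(r"\{\{([^{}]*)\}\}")          # innermost {{...}}
LINK = re.compile(r"\[\[([^\[\]]*)\]\]")            # [[target]] or [[target|label]]
INTERLANG = re.compile(r"^[a-z]{2,3}(-[a-z]+)?:")   # assumed shape of language prefixes

def replace_template(match):
    name, *args = [part.strip() for part in match.group(1).split("|")]
    handler = KEEP.get(name.lower())
    return handler(args) if handler else ""

def replace_link(match):
    target, _, label = match.group(1).partition("|")
    if INTERLANG.match(target):
        return ""                # drop [[fr:Chat]]-style links
    return label or target       # keep the visible text of ordinary links

def strip_markup(text):
    # Repeat until stable so nested templates resolve from the inside out.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE.sub(replace_template, text)
    text = LINK.sub(replace_link, text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()

if __name__ == "__main__":
    sample = "A [[Cat|cat]] weighs {{convert|3|lbs|kg}}.{{cite book|bla|bla}} [[fr:Chat]]"
    print(strip_markup(sample))
```

The hard part I mentioned is exactly the contents of `KEEP`: deciding which templates get a handler and what each handler should emit.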