Newbie question. I just started my master's in computer science and I need to propose a research project for my thesis. One thing I'm interested in is applying NLP techniques to endangered languages, specifically pre-Hispanic languages of Central America.
Originally, I proposed a semantic search engine (one that searches for concepts rather than just words). However, my proposal was rejected because my thesis adviser suggested that I needed a linguistics grad student who specializes in Amerindian languages on board (since I need to understand the semantics, morphology, and grammar of the language). This last point seems a bit odd to me, since I would've guessed that NLP is based on statistical methods, and I'm not sure deep knowledge of a particular language is required to use them. I do understand that at least some familiarity with a language's general structure is needed, and the more support the better, but would relying on published references about a language's structure really be such a step down? (Put differently: what's the minimum amount of Basque that a native English speaker with zero knowledge of the language would need in order to build a semantic search engine using NLP techniques?)
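For concreteness, here's roughly the kind of thing I have in mind: rank documents by embedding similarity to the query instead of exact keyword overlap. This is just a toy sketch with made-up three-dimensional vectors; real ones would come from a model trained on the target language.

```python
import math

# Toy word vectors (hypothetical; real ones would be learned from a corpus)
VECTORS = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.85, 0.2, 0.05],
    "car":   [0.0, 0.1, 0.95],
}

def embed(text):
    """Average the vectors of the known words in the text."""
    words = [w for w in text.lower().split() if w in VECTORS]
    if not words:
        return [0.0] * 3
    dims = zip(*(VECTORS[w] for w in words))
    return [sum(d) / len(words) for d in dims]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def search(query, docs):
    """Return docs ranked by semantic similarity to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)

# "puppy" never appears in either doc, but the dog doc ranks first
print(search("puppy", ["the dog barked", "the car stalled"]))
```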
I do know a lot of people who speak one of these languages but don't have a linguistics background. They could validate results as users, but couldn't help me much beyond that.
Since a student with the profile I'm looking for doesn't exist (at my school), I need to work around that. So my question is: what kind of projects involving NLP (and possibly deep learning) could I realistically work on over the next 1.5-2 years?
Here's some other info:
I'm fairly new to language technologies and just built my first language-processing application, which generates a word cloud from different Twitter accounts. I'll be taking my first NLP course next semester, though.
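For context, the core of that app is essentially the following (simplified; the sample tweets stand in for text fetched from the Twitter API, and the third-party wordcloud package does the rendering):

```python
from wordcloud import WordCloud

# Placeholder tweets; in the real app these come from the Twitter API
tweets = [
    "Learning NLP one step at a time",
    "NLP for endangered languages matters",
    "Another day, another word cloud",
]

# Join all tweets into one blob; WordCloud sizes words by frequency
text = " ".join(tweets)
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
cloud.to_file("cloud.png")  # write the rendered image to disk
```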
I do not intend to push the state of the art in language technology (I think that's unrealistic); rather, I want to use existing techniques in a novel way that could be socially relevant.
I was thinking of building n-gram models for different indigenous languages, but I'm not sure that would be enough for a master's research project. They could be useful for something else, though.
If building n-grams is something interesting, how large should my corpus be? For starters, I know the Nahuatl Wikipedia has around 10,000 articles. I know the more the better, but roughly how many articles would it take to start yielding interesting results?
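To be clear about what I mean by "building n-grams", here's a minimal sketch of counting n-gram frequencies over tokenized text. The sample string is just a placeholder for text extracted from, say, the Nahuatl Wikipedia dump:

```python
from collections import Counter
import re

def ngrams(tokens, n):
    """Yield all n-grams (as tuples) from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))

corpus = "in the beginning the word was the word"  # placeholder text
tokens = re.findall(r"\w+", corpus.lower())

# Count bigram frequencies across the corpus
bigrams = Counter(ngrams(tokens, 2))
print(bigrams.most_common(3))  # e.g. [(('the', 'word'), 2), ...]
```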
Any help, advice, or pointers to other resources would be greatly appreciated!