Applications of deep learning and NLP to endangered languages, advice needed


Newbie question. I just started my master's in computer science and I need to propose a research project for my thesis. One thing I'm interested in is using NLP techniques on endangered languages, specifically the pre-Hispanic languages of Central America.

Originally, I proposed a semantic search engine (one that searches for concepts rather than just words). However, my proposal was rejected because my thesis adviser suggested that I needed a linguistics grad student who specializes in Amerindian languages on board (since I need to understand the semantics, morphology, and grammar of the language). This last point seems a bit odd, since I would have guessed that NLP is based on statistical methods, and I'm not sure a deep knowledge of a particular language is required to use them. I do understand that at least some familiarity with the general structure of a language is needed, and the more support the better, but would relying on published references about a language's structure really be a step down? (How much Basque does a native English speaker, with zero prior knowledge of that language, need to know to build a semantic search engine using NLP techniques?)
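
To illustrate what I mean by "statistical methods": below is a minimal Python sketch of concept-level search using word embeddings trained on nothing but raw text. The corpus.txt file and the whitespace tokenization are placeholders, and a real system for an agglutinative language like Nahuatl would presumably need proper morphological segmentation, which is exactly where a linguist would help.

    # Minimal sketch of embedding-based "concept" search; corpus.txt is a
    # hypothetical plain-text file with one document per line.
    import numpy as np
    from gensim.models import Word2Vec

    # Load and tokenize the corpus (naive whitespace tokenization, for illustration only).
    with open("corpus.txt", encoding="utf-8") as f:
        docs = [line.lower().split() for line in f if line.strip()]

    # Train small word embeddings on the corpus itself; no hand-written grammar needed.
    model = Word2Vec(docs, vector_size=100, window=5, min_count=2, epochs=20)

    def doc_vector(tokens):
        """Average the vectors of in-vocabulary tokens; zeros if none are known."""
        vecs = [model.wv[t] for t in tokens if t in model.wv.key_to_index]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    def search(query, top_k=5):
        """Rank documents by cosine similarity to the query in embedding space."""
        q = doc_vector(query.lower().split())
        sims = []
        for d in docs:
            v = doc_vector(d)
            sims.append(float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-9)))
        ranked = sorted(zip(sims, docs), key=lambda x: x[0], reverse=True)
        return ranked[:top_k]

    for score, tokens in search("your query here"):
        print(round(score, 3), " ".join(tokens[:12]))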

I do know a lot of people who speak one of those languages, but they don't have a linguistics background. They could validate results as users, but couldn't help me much beyond that.

Since a student with the profile I'm looking for doesn't exist at my school, I need to work around that. So, my question is: what kind of projects involving NLP (and possibly deep learning) could I realistically work on over the next 1.5-2 years?

Here's some other info:

  • I'm fairly new to language technologies and just built my first language-processing application, which builds a word cloud from different Twitter accounts. Next semester I'll take my first NLP course, though.

  • I do not intend to push the state of the art in language technology (I think that's unrealistic); rather, I want to use existing techniques in a novel way that could be socially relevant.

  • I was thinking of building n-gram models for different indigenous languages, but I'm not sure that alone could be a research project for a master's degree. They could be useful for something else, though (there's a rough sketch of what I mean after this list).

  • If building n-grams is interesting, how large should my corpus be? For starters, I know the Nahuatl Wikipedia has around 10,000 entries. I know the more, the better, but how many entries could start to yield interesting results?
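
To be concrete about the n-gram idea, here is a minimal counting sketch, assuming the Nahuatl Wikipedia articles have already been dumped to a plain-text file (nahuatl_wiki.txt is just a placeholder name, and the whitespace tokenization is again a simplification):

    # Minimal n-gram counting sketch; nahuatl_wiki.txt is a placeholder for a
    # plain-text dump of the Nahuatl Wikipedia, one article per line.
    from collections import Counter

    def ngrams(tokens, n):
        """Yield consecutive n-token tuples from a token list."""
        return zip(*(tokens[i:] for i in range(n)))

    unigrams, bigrams, trigrams = Counter(), Counter(), Counter()

    with open("nahuatl_wiki.txt", encoding="utf-8") as f:
        for line in f:
            tokens = line.lower().split()  # naive tokenization, for illustration only
            unigrams.update(ngrams(tokens, 1))
            bigrams.update(ngrams(tokens, 2))
            trigrams.update(ngrams(tokens, 3))

    print("Most common bigrams:")
    for gram, count in bigrams.most_common(10):
        print(count, " ".join(gram))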

Any help, advice, or pointers to other resources would be greatly appreciated!

submitted by astral_cowboy
