where to start in natural language processing for a language

My native language is a regional language spoken by few people. I have some assignments in a machine learning course and I was thinking about doing some natural language processing on my native language, but I don't know where to start since there is almost no research about this language (no corpus, no research papers, ...) and I'm new to machine learning.

I want to build everything from the ground up and I want to do things right. Can you please guide me through the steps I should follow?

I also want to build my own corpus. Since that is very tiring work, is there a way to build a single corpus that can be used for several NLP applications (at least machine translation and speech recognition)?

Topic speech-to-text machine-translation nlp

Category Data Science


The key term for your problem is low-resource languages. I'm not sure whether there is a standard approach, but you can find papers about what people have done in similar cases, as well as potentially useful software tools, data repositories, etc.

You might also be interested in the Universal Dependencies project: https://universaldependencies.org/


There are many things that can be done related to Universal Dependencies (UD):

  • use the existing resources to analyze/parse some text data. As far as I know the standard tool is UDPipe (Python bindings such as `ufal.udpipe` exist, maybe others...)
  • train a new dependency parser; there are open-source repositories for this.
  • start a treebank for a new language: https://universaldependencies.org/how_to_start.html. Warning: this is probably a lot of work if you start from scratch, and there is little or no ML involved in creating the data itself.
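To make the treebank option concrete: UD treebanks are stored in the CoNLL-U format, ten tab-separated columns per token, with `#` comment lines and a blank line between sentences. A minimal sketch of reading that format with only the standard library (the two-token sample sentence is invented for illustration):

```python
# Minimal sketch: parsing CoNLL-U, the 10-column tab-separated format
# used by Universal Dependencies treebanks. Standard library only;
# the sample sentence below is invented for illustration.

CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def parse_conllu(text):
    """Yield sentences as lists of token dicts, skipping comments."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):    # '#' lines are metadata
            cols = line.split("\t")
            sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence

sample = (
    "# text = She sings\n"
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsings\tsing\tVERB\t_\t_\t0\troot\t_\t_\n"
)
sentences = list(parse_conllu(sample))
print(sentences[0][0]["form"], sentences[0][0]["upos"])  # She PRON
```

Even if you never train a parser, annotating a few hundred sentences in this format gives you segmentation, POS and lemma data in one file.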

The focus of UD is on dependency syntax, but imho its main value is that it provides resources which can be used for all the standard NLP tasks: sentence and word segmentation, POS tagging, lemmatization, etc.


@Amar, you can start with NLTK, spaCy, Flair and Polyglot for your NLP learning; they are easy to pick up and understand. If you have a corpus available, you can create your own model for your language and contribute it. Nouns and verbs are easier to capture in English than in Chinese, Korean or Arabic; every language has its own representation of context. You will need a good library that can read the language you are working on. You can also try IBM Watson and AWS SageMaker on a trial basis and check their language-detection features.
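Before reaching for any of those libraries, a first corpus pass can be done with plain Python: naive sentence splitting, Unicode-aware tokenization, and a frequency lexicon. This is only a baseline sketch assuming whitespace-delimited words (the sample documents are invented); real pipelines would swap in language-aware tools once they exist for your language:

```python
# Minimal sketch of bootstrapping a text corpus for a new language:
# naive sentence splitting, regex tokenization, and a frequency
# lexicon. Baseline only -- assumes whitespace-delimited words;
# the sample documents are invented.
import re
from collections import Counter

def split_sentences(text):
    """Very naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase and keep word-like tokens (Unicode-aware)."""
    return re.findall(r"\w+", sentence.lower(), flags=re.UNICODE)

def build_lexicon(raw_texts):
    """Return (tokenized sentences, word-frequency Counter)."""
    sentences, lexicon = [], Counter()
    for text in raw_texts:
        for sent in split_sentences(text):
            tokens = tokenize(sent)
            sentences.append(tokens)
            lexicon.update(tokens)
    return sentences, lexicon

docs = ["The cat sleeps. The dog barks!", "A cat runs?"]
sents, lex = build_lexicon(docs)
print(len(sents), lex.most_common(2))  # 3 [('the', 2), ('cat', 2)]
```

A plain sentence-per-line collection like this is also the most reusable starting point for the question above: the same sentences can later be paired with translations (for machine translation) or with recordings (for speech recognition).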
