What is the common practice for NLP or text mining on non-English text?

A lot of natural language processing tools are pre-trained on English corpora. What if one needs to analyze, say, Dutch text? The blog posts I find online mostly suggest translating the text into English as a pre-processing step. Is this the common practice? If not, what is? Also, does how similar a language is to English affect model performance?

For other widely spoken languages (e.g. French, Spanish), do people construct corpora in their own language and train models on them? Forgive my ignorance; I'm not able to read papers in many languages.

Topic: pretraining, bert, text-mining, nlp

Category: Data Science


Translation as a pre-processing step is usually sufficient for many tasks (e.g. sentiment classification), but it is naturally undesirable for others, e.g. grading someone's written Dutch fluency.
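To make the translation-as-preprocessing idea concrete, here is a minimal sketch using the Hugging Face `transformers` library, assuming the `Helsinki-NLP/opus-mt-nl-en` translation checkpoint and the default English sentiment pipeline; the Dutch example sentence is made up for illustration.

```python
# Sketch: translate Dutch text to English, then apply an English sentiment classifier.
# Assumes the "transformers" library and the listed checkpoints are available.
from transformers import pipeline

# Step 1: Dutch -> English translation model.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-nl-en")

# Step 2: off-the-shelf English sentiment classifier.
sentiment = pipeline("sentiment-analysis")

dutch_review = "De service was uitstekend en het eten was heerlijk."
english_review = translator(dutch_review)[0]["translation_text"]

print(english_review)
print(sentiment(english_review))  # e.g. [{'label': 'POSITIVE', 'score': ...}]
```

This works because sentiment largely survives translation; it would not work for a task like fluency grading, where the Dutch surface form is exactly what you need to assess.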

Hence, for these tasks, the objective is:

  • Train a language model for your specific language
  • Do so with minimal resources (data and compute)

Thus, researchers treat this as a few-shot learning problem. One of the most common approaches is meta-learning: use models that have been pre-trained on many tasks (languages, in this case) in such a way that only a few gradient steps and little data are needed to adapt them to a new task (Dutch, in your example). An example of such a model you could fine-tune (transfer learning) is multilingual BERT. That was back in 2019, and many other multilingual language models have emerged since; these can be good initialisation points for any bespoke model you want to train for a new language. A sketch of such a fine-tuning setup follows.
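Below is a minimal sketch of fine-tuning a multilingual checkpoint (`bert-base-multilingual-cased`) on a Dutch classification task with Hugging Face `transformers` and `datasets`. The CSV file names, column names, and label count are placeholders, not a specific benchmark; adjust them to your data.

```python
# Sketch: fine-tune multilingual BERT on a (hypothetical) Dutch text-classification dataset.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset

model_name = "bert-base-multilingual-cased"  # pre-trained on ~100 languages, including Dutch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical CSV files with "text" and "label" columns.
raw = load_dataset("csv", data_files={"train": "dutch_train.csv",
                                      "validation": "dutch_val.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

encoded = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="mbert-dutch",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,  # small learning rate: we only nudge the multilingual initialisation
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
```

The key point is that the multilingual pre-training does the heavy lifting: you supply a comparatively small labelled Dutch set and a few epochs of training rather than building a Dutch corpus and model from scratch.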
