What is the common practice for NLP or text mining on non-English text?
A lot of natural language processing tools are pre-trained on English corpora. What if one needs to analyze, say, Dutch text? The blogs I find online mostly suggest translating the text into English as a pre-processing step. Is this the common practice? If not, then what is? Also, does how similar a language is to English affect model performance?
For other widely spoken languages (e.g. French, Spanish), do people construct corpora in their own language and train models on them? Forgive my ignorance; I'm not able to read papers in many languages.
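For reference, here is a minimal sketch of the alternative I have seen mentioned, i.e. working with the target language directly instead of translating first, using a Dutch spaCy pipeline and a multilingual BERT checkpoint. The specific model names (`nl_core_news_sm`, `bert-base-multilingual-cased`) are just illustrative choices, not a claim about what is standard:

```python
# Illustrative sketch only: load a Dutch pipeline and a multilingual BERT
# checkpoint rather than translating the text to English first.
# Assumes: pip install spacy transformers torch
#          python -m spacy download nl_core_news_sm
import spacy
from transformers import AutoModel, AutoTokenizer

# Language-specific pipeline trained on Dutch corpora
nlp = spacy.load("nl_core_news_sm")
doc = nlp("Het is een mooie dag in Amsterdam.")
print([(token.text, token.pos_) for token in doc])

# Multilingual BERT, pretrained on Wikipedia in ~100 languages, including Dutch
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
inputs = tokenizer("Het is een mooie dag in Amsterdam.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # contextual embeddings per token
```

Is something along these lines what people actually do in practice, or is translation still the more common route?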
Topic pretraining bert text-mining nlp
Category Data Science