Build a corpus for machine translation

Question


Meomeoowww

December 29, 2020, 23:25

I want to train an LSTM with attention to translate between French and a rare language. I say rare because it is an African language with little digital content, and in particular few datasets in a seq2seq-style (parallel sentence) format. I found a dataset somewhere, but its quality was poor: both the French and the native-language sentences were awfully wrong. When I used this dataset, my translations were, of course, damn funny ...

So I decided to do some web scraping to build my own parallel corpus, which might also be useful for research in the future.

It worked well, and I managed to collect some good articles from a website that has published them monthly since 2016 in both languages. Now the tricky part is putting everything into a sentence-to-sentence format. I did a trial with one text and its translation, simply splitting each into sentences, and I noticed that, for example, I had 23 sentences for French and 24 for the native language.

Further checking showed some small differences between the two languages, such as a sentence where a comma in one language was replaced by a period in the other.

So my question is:

Is it mandatory to put my articles into a French-sentence to native-language-sentence format? Or can I leave them as texts / paragraphs?

Topic corpus sequence-to-sequence lstm nlp

Category Data Science

noe · Accepted Answer · December 29, 2020, 23:25

What you would typically do in your case is to apply a sentence alignment tool. Some popular options for that are:

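hunalign: a classical tool that relies on sentence lengths and a bilingual dictionary.
bleualign: it aligns sentences based on BLEU score similarity, comparing a machine translation of the source text with the target text.
vecalign: it is based on sentence embeddings, like LASER's.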
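
To give a feel for what such a tool does, here is a minimal sketch of length-based sentence alignment with dynamic programming. It is not the algorithm of hunalign, bleualign, or vecalign; it only compares character lengths, and the function names (`align`, `bead_cost`) and the penalty values are assumptions made up for illustration. English stands in for the native language in the toy data.

```python
from typing import List, Tuple

# Allowed alignment "beads": (source sentences consumed, target sentences consumed).
BEADS = [(1, 1), (1, 2), (2, 1), (1, 0), (0, 1)]
# Hypothetical fixed penalties per bead type (assumed values, not tuned).
PENALTY = {(1, 1): 0.0, (1, 2): 1.0, (2, 1): 1.0, (1, 0): 3.0, (0, 1): 3.0}


def bead_cost(src_len: int, tgt_len: int, bead: Tuple[int, int]) -> float:
    """Relative character-length mismatch plus the penalty for this bead type."""
    longest = max(src_len, tgt_len, 1)
    return abs(src_len - tgt_len) / longest + PENALTY[bead]


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return the lowest-cost monotonic alignment as (source group, target group) pairs."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # cost[i][j] = best cost of aligning the first i source and first j target sentences.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + bead_cost(s_len, t_len, (di, dj))
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (di, dj)
    # Follow the back-pointers from the end to recover the chosen beads.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        pairs.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    french = ["Il pleut.", "Je reste à la maison, je lis un livre."]
    other = ["It is raining.", "I stay at home.", "I read a book."]
    for src_group, tgt_group in align(french, other):
        print(src_group, "<->", tgt_group)
```

On this toy input, the French sentence containing a comma is paired with the two sentences that the other side split with a period, which is exactly the kind of 23-versus-24 mismatch described in the question. Real tools add lexical or embedding evidence on top of length, so they are far more robust than this sketch.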

I suggest you take a look at the preprocessing applied for the ParaCrawl corpus. In the article you can find an overview of the most popular methods for each processing step.

A different option altogether, as you suggest, is to translate at the document level. However, most NMT models limit the length of the input text they accept, so if you go for document-level translation, you must ensure that your NMT system can handle very long inputs. One example of an NMT system that can be used for document-level NMT out of the box is Marian NMT with its gradient-checkpointing feature.
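
Before committing to document-level training, it may help to check how many of your articles would exceed the input length your system accepts. The following is a rough sketch under stated assumptions: `MAX_TOKENS` is a hypothetical limit (check the real one for your model), and whitespace tokens are only a proxy, since subword tokenization will produce higher counts.

```python
from typing import Dict, List

MAX_TOKENS = 512  # hypothetical limit; replace with the actual limit of your model


def count_tokens(text: str) -> int:
    """Rough length in whitespace-separated tokens (subword counts will be higher)."""
    return len(text.split())


def too_long(documents: Dict[str, str], max_tokens: int = MAX_TOKENS) -> List[str]:
    """Return the names of documents whose rough token count exceeds max_tokens."""
    return [name for name, text in documents.items() if count_tokens(text) > max_tokens]


if __name__ == "__main__":
    # Made-up documents, for illustration only.
    articles = {"short_article": "Il pleut. Je reste à la maison.", "long_article": "mot " * 600}
    print(too_long(articles))
```

If most articles land in the "too long" list, sentence alignment is probably the more practical route.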
