Build a corpus for machine translation
I want to train an LSTM with attention for translation between French and a rare language. I say rare because it is an African language with little digital content, and especially few databases in a seq-to-seq-like format. I did find a dataset somewhere, but in terms of quality, both the French and the native-language sentences were awfully wrong. When I used this dataset, of course my translations came out absurd ...
So I decided to do some web scraping to build my own parallel corpus, which might also be useful for research in the future.
It worked well and I managed to collect some good articles from a website that publishes them monthly, since 2016, in both languages. Now the tricky part is putting everything into sentence-to-sentence format. I did a trial with one text and its translation just by tokenizing into sentences, and I noticed that, for example, I had 23 sentences for French and 24 for the native language.
Further checking showed some small differences between the two languages, like a sentence where a comma in one language was replaced by a full stop in the other, so the sentence counts no longer match.
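To make the mismatch concrete, here is a minimal sketch of the check I ran. It uses a naive regex-based sentence splitter (a real pipeline would use a tokenizer such as NLTK's `sent_tokenize`, configured per language), and the two example texts are invented placeholders, not from my corpus:

```python
import re

def split_sentences(text):
    # Naive splitter: break after ., !, or ? followed by whitespace.
    # A trained, language-specific tokenizer would handle abbreviations etc.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]

# Hypothetical article pair: one side uses a full stop where the
# other uses a comma, so the sentence counts diverge.
fr = "Il pleut. Nous restons. Demain, il fera beau."
native = "It rains. We stay, tomorrow it will be fine."  # placeholder stand-in

fr_sents = split_sentences(fr)
nat_sents = split_sentences(native)

print(len(fr_sents), len(nat_sents))  # 3 vs 2: cannot zip the lists 1:1
if len(fr_sents) != len(nat_sents):
    print("Alignment mismatch: manual or tool-based sentence alignment needed")
```

Whenever the counts differ like this, simply pairing sentence i with sentence i silently misaligns everything after the first divergence, which is why the question of format matters.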
So my question is:
Is it mandatory to put my articles into sentence-French to sentence-native-language format? Or can I leave them as texts / paragraphs?
Topic corpus sequence-to-sequence lstm nlp
Category Data Science