A good way to organize/store a lot of datasets

In machine translation, we often have bilingual datasets, e.g. for German-English and French-English we will have something that looks like this:

/en-de
   train.de
   train.en
   dev.de
   dev.en
   test.de
   test.en

/en-fr
   train.fr
   train.en
   dev.fr
   dev.en
   test.fr
   test.en

And then when we have a third language pair, German-French, we'll have:

/de-fr
   train.fr
   train.de
   dev.fr
   dev.de
   test.fr
   test.de

But let's say we add Spanish-English, and we'll get:

/en-es
   train.es
   train.en
   dev.es
   dev.en
   test.es
   test.en

/de-es
   train.es
   train.de
   dev.es
   dev.de
   test.es
   test.de

/fr-es
   train.es
   train.fr
   dev.es
   dev.fr
   test.es
   test.fr

And if we add even more languages, managing these language pairs becomes even more tedious.
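To put a number on it: with n languages there are n*(n-1)/2 pair directories, each holding six files. A small sketch in Python (the language lists are just for illustration):

    from itertools import combinations

    def pair_files(langs):
        # every unordered language pair gets its own directory,
        # with train/dev/test files for both languages
        return [f"{a}-{b}/{split}.{lang}"
                for a, b in combinations(langs, 2)
                for split in ("train", "dev", "test")
                for lang in (a, b)]

    print(len(pair_files(["en", "de", "fr", "es"])))  # 6 pairs -> 36 files
    print(len(pair_files(["en", "de", "fr", "es", "zh", "ru", "ja", "ko", "pt", "it"])))  # 45 pairs -> 270 files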

What would be a good data structure / directory organization to store the train.*, dev.* and test.* files?

Data structures for multilingual data can be a bit tiresome and repetitive, especially if the data is not structured properly.

This assumes the content of the en data is the same across the other language pairs en-es, en-fr, and en-de. In other words, train.en is the same in en-es, en-fr, and en-de.

Consider the word hello:

  • English (en): Hello
  • Spanish (es): Hola
  • French (fr): Bonjour
  • German (de): Guten Tag

The simplest data structure across the different splits (train.*, dev.*, test.*), and one that can accommodate new language translations, is:

    /train
       en
       es
       fr
       de

    /dev
       en
       es
       fr
       de

    /test
       en
       es
       fr
       de

A new Chinese-Spanish translation direction (zh-es) would then be handled by simply adding the Chinese data to this structure:

    /train
       en
       es
       fr
       de
       zh
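Under this layout, reading one language pair could look like the following minimal sketch, assuming every file in a split directory is line-aligned with the others (the function name is just an illustration):

    from pathlib import Path

    def read_pair(split, src_lang, tgt_lang):
        # yields aligned (source, target) sentence pairs,
        # relying on the per-language files being line-aligned
        src_file = Path(split) / src_lang
        tgt_file = Path(split) / tgt_lang
        with open(src_file, encoding="utf-8") as src, open(tgt_file, encoding="utf-8") as tgt:
            for s, t in zip(src, tgt):
                yield s.rstrip("\n"), t.rstrip("\n")

    # e.g. iterate over the English-German training pairs
    for en, de in read_pair("train", "en", "de"):
        print(en, de, sep="\t")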

File systems are good enough for data storage up to a point. As data becomes more complex, it often makes sense to move to a database.

A relational database allows for database normalization, storing a single, canonical value for each entity. This greatly reduces the size of the data and allows for different combinations of train, dev, and test datasets.
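A minimal sketch of what such a normalized schema could look like, here using sqlite3 (table and column names are assumptions, not a standard): each sentence is stored once per language, and a link table records which sentences form a pair and which split they belong to.

    import sqlite3

    conn = sqlite3.connect("parallel.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS sentence (
            id   INTEGER PRIMARY KEY,
            lang TEXT NOT NULL,      -- e.g. 'en', 'de'
            text TEXT NOT NULL,
            UNIQUE (lang, text)      -- one canonical row per sentence
        );
        CREATE TABLE IF NOT EXISTS sentence_pair (
            src_id INTEGER NOT NULL REFERENCES sentence(id),
            tgt_id INTEGER NOT NULL REFERENCES sentence(id),
            split  TEXT NOT NULL     -- 'train', 'dev' or 'test'
        );
    """)
    conn.commit()

With this, the English side of en-de and en-fr is the same set of rows, stored once.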


At this scale, grep is fast enough, so it's just a matter of having the files in a shape that's convenient for the most common types of searches.

At ModelFront, we use and recommend Linear TSV, because TSV has first-class support in bash.

https://modelfront.com/docs/eval/#linear-tsv explains the standard and why it's such a natural fit for machine translation data:

For natural language text data, especially parallel data, TSV has key advantages over CSV and XML.

Human-readability

...

Standardization

...

Scalability

ModelFront is built to handle very large files. TSV can be read line by line - without reading the whole file into memory. TSV is also more compact than CSV or XML.

Convenience

The built-in command-line tools like cut and paste read and write TSV by default and fundamentally operate at the line level.

So we do not split sentence pairs (or their metadata) across multiple aligned files - that's unwieldy and asking for trouble.

But we only have one language pair and dataset per file - and we note those in the file name - instead of using an additional column, because usually that's the level at which you want to search.

And it's always possible to do multiple searches, or to use find or cat, if you want to search across language pairs and datasets.

(And we obviously don't use XML, and we are disappointed that WMT chose not just XML but yet another non-standardised XML format.)

So your example would end up something like this:

en.de.train.tsv
en.de.dev.tsv
en.de.test.tsv
en.fr.train.tsv
...
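A minimal sketch of producing those per-pair TSV files from the line-aligned files in the question (note that real Linear TSV also escapes embedded tabs and newlines, which is omitted here):

    # en-de/train.en + en-de/train.de -> en.de.train.tsv
    def to_tsv(src_file, tgt_file, out_file):
        with open(src_file, encoding="utf-8") as src, \
             open(tgt_file, encoding="utf-8") as tgt, \
             open(out_file, "w", encoding="utf-8") as out:
            for s, t in zip(src, tgt):
                out.write(s.rstrip("\n") + "\t" + t.rstrip("\n") + "\n")

    to_tsv("en-de/train.en", "en-de/train.de", "en.de.train.tsv")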
