A good way to organize/store a lot of datasets
In machine translation, we often have bilingual datasets, e.g. for German-English and French-English we will have something that looks like this:
/en-de
train.de
train.en
dev.de
dev.en
test.de
test.en
/en-fr
train.fr
train.en
dev.fr
dev.en
test.fr
test.en
Then we have a third language pair, German-French, and we'll have:
/de-fr
train.fr
train.de
dev.fr
dev.de
test.fr
test.de
But let's say we add Spanish. Besides Spanish-English, we also need German-Spanish and French-Spanish, so we'll get:
/en-es
train.es
train.en
dev.es
dev.en
test.es
test.en
/de-es
train.es
train.de
dev.es
dev.de
test.es
test.de
/fr-es
train.es
train.fr
dev.es
dev.fr
test.es
test.fr
And if we add even more languages, maintaining all these language pairs gets even more tedious.
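To make the growth concrete, here is a minimal sketch that enumerates the per-pair layout shown above for a given list of languages. The function name `pair_layout` and the `SPLITS` constant are just illustrative names, and the sketch assumes one directory per unordered pair with a train/dev/test file for each side, as in the listings above.

```python
from itertools import combinations

# Splits used in the listings above.
SPLITS = ("train", "dev", "test")


def pair_layout(languages):
    """Map each unordered language pair to the files its directory holds."""
    layout = {}
    for a, b in combinations(languages, 2):
        pair_dir = f"{a}-{b}"
        # One file per split and per side of the pair, e.g. train.en, train.de.
        layout[pair_dir] = [
            f"{split}.{lang}" for split in SPLITS for lang in (a, b)
        ]
    return layout


layout = pair_layout(["en", "de", "fr", "es"])
print(sorted(layout))                         # the pair directories
print(len(layout))                            # number of directories
print(sum(len(f) for f in layout.values()))   # total number of files
```

With n languages this produces n(n-1)/2 directories and six files in each, so four languages already mean 6 directories and 36 files, and ten languages would mean 45 directories.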
What would be a good data structure / directory organization to store the train.*, dev.* and test.* files?
Topic best-practice machine-translation dataset
Category Data Science