How to deal with version control of large amounts of (binary) data

I am a PhD student in Geophysics and work with large amounts of image data (hundreds of GB, tens of thousands of files). I know svn and git fairly well and have come to value a project history, combined with the ability to easily work together and have protection against disk corruption. I also find git extremely helpful for keeping consistent backups, but I know that git cannot handle large amounts of binary data efficiently.

During my master's studies I worked on data sets of similar size (also images) and had a lot of problems keeping track of different versions on different servers/devices. Diffing 100 GB over the network really isn't fun, and it cost me a lot of time and effort.

I know that others in science seem to have similar problems, yet I couldn't find a good solution.

I want to use the storage facilities of my institute, so I need something that can use a "dumb" server. I also want an additional backup on a portable hard disk, because I want to avoid transferring hundreds of GB over the network wherever possible. So I need a tool that can handle more than one remote location.

Lastly, I really need something that other researchers can use, so it does not need to be super simple, but it should be learnable in a few hours.

I have evaluated a lot of different solutions, but none seem to fit the bill:

  • svn is somewhat inefficient and needs a smart server
  • hg bigfile/largefile can only use one remote
  • git bigfile/media can also only use one remote, and is not very efficient either
  • attic doesn't seem to have a log or diffing capabilities
  • bup looks really good, but needs a "smart" server to work

I've tried git-annex, which does everything I need it to do (and much more), but it is very difficult to use and not well documented. I've used it for several days and couldn't get my head around it, so I doubt any of my coworkers would take it up.

How do researchers deal with large datasets, and what are other research groups using?

To be clear, I am primarily interested in how other researchers deal with this situation, not just this specific dataset. It seems to me that almost everyone should have this problem, yet I don't know anyone who has solved it. Should I just keep a backup of the original data and forget all this version control stuff? Is that what everyone else is doing?

Tags: version-control, binary, bigdata, databases

Category: Data Science


This is a pretty common problem. I ran into this pain when I did research projects at a university, and I see it again now in industrial data science projects.

I've created and recently released an open source tool to solve this problem - DVC.

It essentially combines your code in Git with your data on local disk or in the cloud (S3 and GCP storage). DVC tracks the dependencies between data and code and builds the dependency graph (DAG), which helps you make your project reproducible.

A DVC project can be shared easily: sync your data to the cloud (the dvc sync command), share your Git repository, and provide access to your data bucket in the cloud.

"learnable in a few hours" - is a good point. You should not have any issues with DVC if you are familiar with Git. You really need to learn only three commands:

  1. dvc init - like git init. Should be done in an existing Git repository.
  2. dvc import - import your data files (sources), from a local file or a URL.
  3. dvc run - run a step of your workflow, e.g. dvc run python mycode.py data/input.jpg data/output.csv. DVC derives the dependencies between your steps automatically, builds the DAG, and keeps it in Git.
  4. dvc repro - reproduce a data file. Example: vi mycode.py - change the code, and then dvc repro data/output.csv will reproduce the file (and all of its dependencies). A short end-to-end sketch follows below.
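
For orientation, an end-to-end session might look roughly like this (a sketch only: it assumes a recent DVC release, where local files are tracked with dvc add and dependencies/outputs are declared explicitly, and mycode.py and the data paths are just placeholders):

  # inside an existing Git repository
  dvc init
  git commit -m "initialize DVC"

  # put the raw data under DVC control (dvc import in older releases)
  dvc add data/input.jpg
  git add data/input.jpg.dvc data/.gitignore
  git commit -m "track raw image with DVC"

  # define a pipeline stage: code + input -> output
  dvc run -n process \
          -d mycode.py -d data/input.jpg \
          -o data/output.csv \
          python mycode.py data/input.jpg data/output.csv
  git add dvc.yaml dvc.lock data/.gitignore
  git commit -m "add processing stage"

  # after editing mycode.py, rebuild only what is out of date
  dvc repro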

To share data through the cloud, you need to learn a couple more DVC commands plus basic S3 or GCP skills.
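
As a rough sketch of that sharing step (again assuming a recent DVC release, where remotes are configured with dvc remote add and data moves with dvc push / dvc pull; the bucket name is a placeholder):

  # point the project at a shared bucket (S3 here; GCP storage works the same way with gs://)
  dvc remote add -d storage s3://my-bucket/dvc-store
  git add .dvc/config
  git commit -m "configure shared DVC remote"

  # upload the data referenced by the current revision
  dvc push

  # a collaborator clones the Git repository, then fetches the matching data
  dvc pull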

The DVC tutorials are the best starting point.


You could try using hangar. It is a relatively new player in the data version control world, but it does a really nice job of versioning the tensors instead of versioning the blobs. The documentation is probably the best place to start. Since the data is stored as tensors, you can use it directly inside your ML code (and hangar now has data loaders for PyTorch and TensorFlow). With hangar you get all the benefits of git, such as zero-cost branching, merging, and time travel through history. One nice feature of cloning in hangar is partial cloning: if you have 10 TB of data on your remote and only need 100 MB for prototyping your model, you can fetch just those 100 MB via a partial clone instead of a full clone.


You may take a look at my project, called DOT: Distributed Object Tracker repository manager.
It is a very simple VCS for binary files, intended for personal use (no collaboration).
It uses SHA1 for checksumming and block deduplication, with full P2P syncing.
One unique feature: an ad-hoc, one-time TCP server for pull/push.
It can also use SSH for transport.

It is not yet released, but might be a good starting point.
http://borg.uu3.net/cgit/cgit.cgi/dot/about/
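
For readers unfamiliar with the idea, content-addressed storage is what makes this kind of checksumming and deduplication work; a generic shell sketch (not DOT's actual interface, and the store/ directory is made up for the example) looks like this:

  # file an object under the SHA1 of its content; identical files collapse
  # into a single stored copy, which is where the deduplication comes from
  sum=$(sha1sum bigfile.dat | cut -d' ' -f1)
  mkdir -p "store/${sum:0:2}"
  cp -n bigfile.dat "store/${sum:0:2}/${sum}"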


We don't version control the actual data files. We wouldn't want to even if we stored it as CSV instead of in a binary form. As Riccardo M. said, we're not going to spend our time reviewing row-by-row changes on a 10M row data set.

Instead, along with the processing code, I version control the metadata:

  • Modification date
  • File size
  • Row count
  • Column names

This gives me enough information to know if a data file has changed and an idea of what has changed (e.g., rows added/deleted, new/renamed columns), without stressing the VCS.
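
A small script along these lines can regenerate that summary, so only a tiny text file goes into the VCS (a sketch assuming GNU coreutils and CSV files with a header row; the data/ and output paths are just examples):

  # one metadata line per file: name, modification date, size, row count, column names
  for f in data/*.csv; do
      printf '%s\t%s\t%s\t%s\t%s\n' \
          "$f" \
          "$(stat -c %y "$f")" \
          "$(stat -c %s "$f")" \
          "$(($(wc -l < "$f") - 1))" \
          "$(head -n 1 "$f")"
  done > data_metadata.tsv
  git add data_metadata.tsv   # this small summary is what gets versioned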


What I ended up using is a sort of hybrid solution:

  • backup of the raw data
  • git of the workflow
  • manual snapshots of workflow + processed data that are of particular relevance, e.g.:
    • standard preprocessing
    • really time-consuming
    • for publication

I believe it is seldom sensible to keep a full revision history of large amounts of binary data, because the time required to review the changes will eventually be so overwhelming that it will not pay off in the long run. Maybe a semi-automatic snapshot procedure (possibly saving some disk space by not replicating the unchanged data across different snapshots) would be of help.
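
One way to get such space-saving snapshots without special tooling is rsync with --link-dest, which hard-links files that have not changed since the previous snapshot, so unchanged data is not duplicated (a sketch; the paths and date-based naming are just examples):

  # unchanged files become hard links into the previous snapshot,
  # so each new snapshot only costs disk space for what actually changed
  prev=/backups/2017-05-01            # previous snapshot (example)
  new=/backups/$(date +%F)            # today's snapshot
  rsync -a --link-dest="$prev" processed_data/ "$new/"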


I haven't used them, but there was a similar discussion in a finance group ("data repository software suggestions") that mentioned SciDB, ZFS, and http://www.urbackup.org/.


Try Git Large File Storage (LFS). It is new, but it might be worth a look.
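
The basic setup is only a few commands (a minimal sketch; the *.tif pattern and file name are just examples, and note that the server side needs LFS support, so a plain "dumb" file server is not enough on its own):

  git lfs install                # one-time, per machine
  git lfs track "*.tif"          # store matching files as LFS pointers
  git add .gitattributes
  git add images/scan_001.tif
  git commit -m "add image via Git LFS"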

There is also a discussion on Hacker News that mentions a few other ways to deal with large files.


I have used versioning on Amazon S3 buckets to manage 10-100 GB in 10-100 files. Transfers can be slow, so it has helped to compress and transfer in parallel, or to just run the computations on EC2. The boto library provides a nice Python interface.
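
For reference, the same setup can also be driven from the command line via the AWS CLI instead of boto (a sketch with a placeholder bucket name):

  # keep old versions of every object in the bucket
  aws s3api put-bucket-versioning --bucket my-data-bucket \
      --versioning-configuration Status=Enabled

  # upload/refresh the local data set; S3 retains superseded versions
  aws s3 sync ./data s3://my-data-bucket/data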


I have dealt with similar problems with very large synthetic biology datasets, where we have many, many GB of flow cytometry data spread across many, many thousands of files, and need to maintain them consistently between collaborating groups at (multiple) different institutions.

Typical version control like svn and git is not practical for this circumstance, because it's just not designed for this type of dataset. Instead, we have fallen back on "cloud storage" solutions, particularly Dropbox and BitTorrent Sync. Dropbox has the advantage that it does at least some primitive logging and version control and manages the servers for you, but the disadvantage that it's a commercial service: you have to pay for large storage, and you're putting your unpublished data on commercial storage. You don't have to pay much, though, so it's a viable option. BitTorrent Sync has a very similar interface, but you run it yourself on your own storage servers and it doesn't have any version control. Both of them hurt my programmer soul, but they're the best solutions my collaborators and I have found so far.
