What is the difference between Pachyderm and Git?

I have read that tools like Pachyderm version-control data, but I cannot see how such a tool differs from Git. I learned from this post that:

  • It holds all your data in a central accessible location
  • It updates all depending data sets when data is added to or changed in a data set
  • It can run any transformation, as long as it runs in a Docker container, accepts a file as input, and outputs a file as the result
  • It versions all your data
  • It handles both modified data and newly added fractions of data
  • It can keep branches of your data sets when you are testing new transformation pipelines
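To make the "versions all your data" and "updates all depending data sets" points concrete, here is a minimal toy sketch in Python. It is not Pachyderm's API: `ToyRepo`, `ToyPipeline`, and their methods are hypothetical names invented for illustration, and the sketch reprocesses every file on each commit for simplicity.

```python
import hashlib


class ToyRepo:
    """Hypothetical in-memory data repo: every commit is a full snapshot of files."""

    def __init__(self, name):
        self.name = name
        self.commits = []      # list of (commit_id, {path: bytes})
        self.subscribers = []  # pipelines to re-run when new data is committed

    def commit(self, files):
        """Store a new version, then trigger every pipeline that depends on this repo."""
        snapshot = dict(self.commits[-1][1]) if self.commits else {}
        snapshot.update(files)
        # Derive the commit id from the content, similar in spirit to a Git hash.
        digest = hashlib.sha256()
        for path in sorted(snapshot):
            digest.update(path.encode())
            digest.update(snapshot[path])
        commit_id = digest.hexdigest()[:8]
        self.commits.append((commit_id, snapshot))
        for pipeline in self.subscribers:
            pipeline.run(snapshot)
        return commit_id

    def head(self):
        """Return the files of the latest commit."""
        return self.commits[-1][1]


class ToyPipeline:
    """Hypothetical pipeline: applies a transform to each input file."""

    def __init__(self, source, output, transform):
        self.output = output
        self.transform = transform
        source.subscribers.append(self)

    def run(self, snapshot):
        # Assumption: recompute everything; a real system would process only changes.
        results = {path: self.transform(data) for path, data in snapshot.items()}
        self.output.commit(results)


raw = ToyRepo("raw")
upper = ToyRepo("upper")
ToyPipeline(raw, upper, bytes.upper)

raw.commit({"a.txt": b"hello"})
raw.commit({"b.txt": b"world"})
print(upper.head())
```

Committing to `raw` never touches `upper` directly; the dependent data set gets a new version only because the commit triggers the pipeline. Git can store the snapshots, but it has no built-in notion of this trigger.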

It seems that Git can handle all of these. Data is usually much larger than code, but as far as I can tell git-lfs was created for exactly that purpose.

Dolt, in contrast, takes a different direction by combining SQL and Git.

Are tools like Pachyderm still relevant in data science today?

Topic data version-control dataset tools bigdata

Category Data Science


Git is designed for code.

Pachyderm is designed for machine learning assets: data, pipelines, and notebooks.

You can put machine learning assets into Git, but Git will treat them just like code, that is, as plain text files. Notebooks are one example: they are stored as JSON, and JSON in Git quickly becomes difficult to manage because diffs and merges operate on raw JSON lines rather than on cells. Pachyderm manages them in notebook-specific ways.
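The notebook problem can be illustrated with a small Python sketch. It builds a minimal notebook in the nbformat 4 JSON layout, simulates re-running a cell, and compares a line-based diff (what Git sees) with a cell-aware comparison. The helper names `changed_lines` and `source_changed` are hypothetical, and the cell-aware check is only an illustration of the idea, not how Pachyderm itself handles notebooks.

```python
import copy
import difflib
import json

# A minimal notebook in the nbformat 4 JSON layout: one code cell with one output.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"name": "stdout", "output_type": "stream", "text": ["accuracy: 0.91\n"]}
            ],
            "source": ["print('accuracy:', score)"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

# Re-running the cell changes no code, only the execution counter and the output.
rerun = copy.deepcopy(notebook)
rerun["cells"][0]["execution_count"] = 2
rerun["cells"][0]["outputs"][0]["text"] = ["accuracy: 0.93\n"]


def serialize(nb):
    """Serialize roughly the way Jupyter writes .ipynb files, split into lines."""
    return json.dumps(nb, indent=1, sort_keys=True).splitlines()


def changed_lines(old, new):
    """Return the added and removed lines of a line-based unified diff."""
    diff = difflib.unified_diff(serialize(old), serialize(new), lineterm="", n=0)
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


def source_changed(old, new):
    """Cell-aware comparison: look only at the code, ignore outputs and counters."""
    def sources(nb):
        return [cell["source"] for cell in nb["cells"]]
    return sources(old) != sources(new)


if __name__ == "__main__":
    for line in changed_lines(notebook, rerun):
        print(line)
    print("source changed:", source_changed(notebook, rerun))
```

The line-based diff reports changes even though no code was edited, while the cell-aware check reports none. With real notebooks that contain embedded images or large outputs, the line-based noise grows quickly.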
