What is the difference between Pachyderm and Git?

Question

Lerner Zhang

June 4, 2022 05:03

I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that:

It holds all your data in a central accessible location
It updates all depending data sets when data is added to or changed in a data set
It can run any transformation, as long as it runs in a Docker container, accepts a file as input, and outputs a file as a result
It versions all your data
It handles both modified data and newly added fractions of data
It can keep branches of your data sets when you are testing new transformation pipelines

It seems that Git can handle all of these. And since data is usually much larger than code, git-lfs was created for that purpose.

In contrast, Dolt takes a different direction that combines SQL and Git.

Are tools like Pachyderm still used in data science nowadays?

Topic: data, version-control, dataset, tools, bigdata

Category: Data Science

Brian Spiering · Accepted Answer · May 3, 2022 19:54

Git is designed for code.

Pachyderm is designed for machine learning assets: data, pipelines, and notebooks.

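You can put machine learning assets into Git. However, Git will treat them like any other file, diffing and merging them line by line as if they were code. One example is notebooks, which are stored as JSON with the cell outputs and execution counters embedded next to the code. JSON in Git quickly becomes difficult to manage, because re-running a notebook changes the file even when the code stays the same. Pachyderm will manage them in notebook-specific ways.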
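To make the notebook point concrete, here is a minimal sketch in Python. It assumes the standard nbformat 4 JSON layout; the notebook contents and the helpers `rerun` and `changed_lines` are made up for illustration, and `model_score` is a hypothetical name that only appears as text inside the cell. It shows roughly what a line-based tool like Git sees when a cell is re-run without any code change.

```python
import copy
import difflib
import json

# A minimal notebook in the nbformat 4 layout: one code cell with one output.
# The cell source is only stored as text here; it is never executed.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"name": "stdout", "output_type": "stream", "text": ["0.91\n"]}
            ],
            "source": ["print(model_score())"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}


def rerun(nb, new_count, new_text):
    """Simulate re-running the cell: the source is untouched,
    only the execution counter and the stored output change."""
    updated = copy.deepcopy(nb)
    cell = updated["cells"][0]
    cell["execution_count"] = new_count
    cell["outputs"][0]["text"] = [new_text]
    return updated


def changed_lines(before, after):
    """Return the added/removed lines of a line-based diff of the JSON files."""
    a = json.dumps(before, indent=1, sort_keys=True).splitlines()
    b = json.dumps(after, indent=1, sort_keys=True).splitlines()
    return [
        line
        for line in difflib.unified_diff(a, b, lineterm="", n=0)
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


rerun_notebook = rerun(notebook, 2, "0.93\n")
diff = changed_lines(notebook, rerun_notebook)
print("\n".join(diff))
```

The printed diff contains only counter and output lines. The `source` entry is identical in both versions, yet a line-based history would still record the file as modified.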
