What is the difference between Pachyderm and Git?
I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. From this post, I learned the following about Pachyderm:
- It holds all your data in a central accessible location
- It updates all depending data sets when data is added to or changed in a data set
- It can run any transformation, as long as it runs in a Docker container, accepts a file as input, and outputs a file as a result
- It versions all your data
- It handles both modified data and newly added fractions of data
- It can keep branches of your data sets when you are testing new transformation pipelines
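To make the second and third points concrete, here is a toy Python sketch of how I understand "updating dependent data sets". The names `ToyRepo` and `connect` are made up for illustration and are not Pachyderm's API, and the transformation here is a plain Python function rather than something running in a Docker container.

```python
import hashlib
from typing import Callable, Dict, List

Snapshot = Dict[str, bytes]


class ToyRepo:
    """Hypothetical in-memory data repo; every commit is a full snapshot of files."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.commits: List[Snapshot] = []  # commit history, oldest first
        self.subscribers: List[Callable[[Snapshot], None]] = []

    def commit(self, files: Snapshot) -> str:
        """Store a new snapshot, trigger dependent pipelines, return a commit id."""
        snapshot = dict(files)
        self.commits.append(snapshot)

        # The commit id is a hash over sorted paths and their contents.
        digest = hashlib.sha256()
        for path in sorted(snapshot):
            digest.update(path.encode())
            digest.update(snapshot[path])

        # Every pipeline that reads from this repo re-runs on the new snapshot.
        for run in self.subscribers:
            run(snapshot)
        return digest.hexdigest()[:12]


def connect(source: ToyRepo, target: ToyRepo,
            transform: Callable[[bytes], bytes]) -> None:
    """Re-run `transform` on each file whenever `source` receives a commit."""

    def run(snapshot: Snapshot) -> None:
        target.commit({path: transform(data) for path, data in snapshot.items()})

    source.subscribers.append(run)


raw = ToyRepo("raw")
clean = ToyRepo("clean")
connect(raw, clean, lambda data: data.strip().lower())

first_id = raw.commit({"a.txt": b"  Hello "})
second_id = raw.commit({"a.txt": b"  Hello ", "b.txt": b"WORLD\n"})
```

After the second commit, `clean` holds two snapshots, one for each commit made to `raw`.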
It seems that Git can handle all of these points. And since data is usually much larger than code, Git LFS was presumably created for that purpose.
In contrast, Dolt takes a different direction by combining SQL with Git-style versioning.
Are tools like Pachyderm still relevant in data science today?
Topic data version-control dataset tools bigdata
Category Data Science