`dvc add` is most suitable when you want to commit large files at the start of your project. Models, large text files, and folders of images are good candidates for this command.
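For a folder of images, for example, `dvc add data/raw-images` is a one-liner (the path and hash below are illustrative placeholders). DVC moves the data into its cache and writes a small `.dvc` pointer file that you commit to Git in the data's place:

```yaml
# data/raw-images.dvc — generated by `dvc add data/raw-images`
# (path and md5 are made-up placeholders for illustration)
outs:
- md5: <md5-of-your-data>
  path: data/raw-images
```

Git versions this tiny pointer file, while the heavy data itself lives in the DVC cache and remote storage.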
In the beginning, when I tried implementing DVC, I was a little over-enthusiastic. I would `dvc add` every dataset I thought needed tracking — raw data, intermediate data, and any reference files floating around. It was only later, when I started implementing pipelines, that this method showed its flaws: I would get output errors because the outputs generated by DVC pipelines were already being tracked.
Data in its raw form is rarely usable. Often, it has to pass through multiple stages of cleaning and transformation before it can be passed into a model. This flow, and the intermediate and final datasets it generates, are best tracked with `pipelines`. Although pipelines are defined in a `dvc.yaml` file, we can save ourselves the trouble of writing `yaml` files from scratch. Instead, we have the `dvc run` command, which allows us to specify inputs, outputs, and any dependencies needed along the way. “Makefiles for machine learning projects,” says the documentation. As someone who has spent long, confusing days trying to make machine learning pipelines reproducible with Make, I find these pipelines really valuable.
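As a sketch (the stage name, script, and paths here are hypothetical), a single invocation like `dvc run -n preprocess -d src/preprocess.py -d data/raw -o data/processed python src/preprocess.py` records a stage in `dvc.yaml` along these lines:

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py   # the command dvc repro will re-run
    deps:                           # inputs: stage re-runs when these change
      - src/preprocess.py
      - data/raw
    outs:                           # outputs: tracked and cached by DVC
      - data/processed
```

Because `data/processed` is declared as a stage output here, it must not also be `dvc add`-ed separately — that is exactly the double-tracking error described above.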
`git commit` early and often, I was told when I first started learning how to use Git. Make your commits small and atomic — this makes sure no big changes are introduced to the repo all at once, and also makes the commit history readable and understandable. My experience with `dvc`, though, has been that commits should be atomic, but not necessarily small.
Again, in the beginning I would run `dvc commit` at the same time that I would run `git commit`. Then I saw how full my cache was — lots of small changes to my data that, in aggregate, weren't worth the space they were taking up in the cache 🤦‍♀️. So `dvc commit` only when data and pipelines are in a stable state (as noted in the documentation), and use the `--no-commit` option with `dvc add` and `dvc run` so data doesn't get unnecessarily cached over and over.