DVC beginner gotchas
dvc add
`dvc add` is most suitable when you want to commit large files at the start of your project. Models, large text files, or folders of images are good candidates for this command.
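As a minimal sketch of that first step, the commands below track a single file in a throwaway repository. The scratch directory and `data/raw.csv` are placeholders I made up for illustration, and the sketch assumes `git` and `dvc` are both installed.

```shell
# Work in a scratch repository (hypothetical paths throughout).
cd "$(mktemp -d)"
git init -q
dvc init -q

# Stand-in for a large raw dataset.
mkdir -p data
printf 'id,value\n1,10\n2,20\n' > data/raw.csv

# Track the file with DVC. This writes the pointer file data/raw.csv.dvc
# and lists data/raw.csv in data/.gitignore so Git ignores the data itself.
dvc add data/raw.csv

# Git versions only the small pointer file and the ignore rule.
git add data/raw.csv.dvc data/.gitignore
```

The data itself lives in the DVC cache; Git only ever sees the `.dvc` pointer.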
In the beginning, when I tried implementing DVC, I was a little over-enthusiastic. I would `dvc add` as many datasets as I thought needed tracking: raw data, intermediate data, and any reference files floating around. It was only later, when I started implementing pipelines, that this method showed its flaws. I would get output errors because the outputs generated by my DVC pipelines were already being tracked through `dvc add`.
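One way out of that conflict, sketched here under assumptions, is to stop tracking the file as a standalone `dvc add` target before declaring it as a stage output. The file name `data/intermediate.csv` is hypothetical, and the exact error wording you see may differ between DVC versions.

```shell
# Scratch repository with an over-eagerly tracked intermediate file.
cd "$(mktemp -d)"
git init -q
dvc init -q
mkdir -p data
printf 'id,value\n1,10\n' > data/intermediate.csv
dvc add data/intermediate.csv

# Stop tracking it as a standalone file. Without --outs, `dvc remove`
# deletes the .dvc pointer and the .gitignore entry but leaves the
# data file itself in the workspace.
dvc remove data/intermediate.csv.dvc

# data/intermediate.csv can now be declared as the output of a pipeline stage.
```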
dvc run
Data in its raw form is rarely usable. Often, it has to pass through multiple stages of cleaning and transformation before it can be passed into a model. This flow, and the intermediate and final datasets it generates, are best tracked with `pipelines`. Although pipelines are defined in a `dvc.yaml` file, we can save ourselves the trouble of writing `yaml` files from scratch. Instead, we have the `dvc run` command, which allows us to specify inputs, outputs, and any dependencies needed along the way. “Makefiles for machine learning projects” says the documentation. As someone who has spent long, confusing days trying to make machine learning pipelines reproducible with Make, I find `pipelines` really valuable.
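Here is a rough sketch of a one-stage pipeline. A caveat on versions: in DVC 1.x and 2.x the one-step form is `dvc run -n <name> -d <dep> -o <out> <command>`, while DVC 3 removed `dvc run` in favour of `dvc stage add` followed by `dvc repro`, which is what the sketch uses. The stage name `clean`, the file names, and the `tr` command standing in for a real cleaning script are all made up for illustration.

```shell
# Scratch repository with a small "raw" file (hypothetical paths).
cd "$(mktemp -d)"
git init -q
dvc init -q
mkdir -p data
printf 'id, value\n1, 10\n2, 20\n' > data/raw.csv

# Declare a stage named "clean": it depends on data/raw.csv and
# produces data/clean.csv. The command is quoted so the redirections
# are stored in dvc.yaml instead of being run by the current shell.
dvc stage add -n clean \
    -d data/raw.csv \
    -o data/clean.csv \
    "tr -d ' ' < data/raw.csv > data/clean.csv"

# Execute the pipeline; DVC skips stages whose dependencies are unchanged.
dvc repro

# The generated pipeline definition.
cat dvc.yaml
```

Note that `data/clean.csv` is tracked through the stage's `-o` flag, so it should not also be passed to `dvc add`.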
dvc commit
`git commit` early and often, I was told when I first started learning how to use Git. Make your commits small and atomic: this ensures no big changes are introduced to the repo all at once, and keeps the commit history readable and understandable. My experience with `dvc`, though, has been that commits should be atomic, but not necessarily small.
Again, in the beginning I would run `dvc commit` at the same time that I would run `git commit`. Then I saw how full my cache was, and how it was filled with small changes to my data that, in aggregate, weren’t worth the space they were taking up in the cache 🤦‍♀️. So `dvc commit` only when data and pipelines are in a stable state (as noted in the documentation), and use the `--no-commit` option in `dvc add` and `dvc run` so data doesn’t get unnecessarily cached repeatedly.
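A minimal sketch of that workflow with `dvc add` follows. The file name and the edits are placeholders. I pass `-f` to `dvc commit` so it does not stop to ask for confirmation when the data has changed since the `.dvc` file was written.

```shell
# Scratch repository (hypothetical paths).
cd "$(mktemp -d)"
git init -q
dvc init -q
mkdir -p data
printf 'id,value\n1,10\n' > data/raw.csv

# Start tracking, but do not copy the data into the cache yet.
dvc add --no-commit data/raw.csv

# Keep iterating on the data; none of these versions are cached.
printf '2,20\n' >> data/raw.csv
printf '3,30\n' >> data/raw.csv

# Once the data is stable, record its current state and store it in
# the cache in one go.
dvc commit -f data/raw.csv.dvc

# Version the updated pointer file alongside the code.
git add data/raw.csv.dvc data/.gitignore
```

Only the final, stable version of `data/raw.csv` ends up in the cache, rather than one copy per experiment.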