DVC beginner gotchas

littlereddotdata
2 min read · Sep 6, 2020

dvc add

dvc add is most suitable when you want to start tracking large files at the beginning of your project. Models, large text files, and folders of images are good candidates for this command.

In the beginning, when I tried implementing DVC, I was a little over-enthusiastic. I would dvc add every dataset I thought needed tracking: raw data, intermediate data, and any reference files floating around. It was only later, when I started implementing pipelines, that this method showed its flaws. I would get output errors because the outputs generated by my DVC pipelines were already being tracked by dvc add.
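Here is a minimal sketch of that conflict in a throwaway repository. It assumes the DVC 1.x-era CLI used in this post (newer releases replace dvc run with dvc stage add). The file names, the stage name `clean`, and the `tr` command are hypothetical stand-ins for a real cleaning step.

```shell
# Throwaway repo; all file and stage names below are hypothetical.
cd "$(mktemp -d)" && git init -q && dvc init -q
mkdir data && printf 'A,B\n1,2\n' > data/raw.csv

# Over-eager tracking: an intermediate file is added by hand.
tr 'A-Z' 'a-z' < data/raw.csv > data/clean.csv
dvc add data/raw.csv data/clean.csv

# Declaring the same file as a stage output is rejected, because
# data/clean.csv is already tracked through data/clean.csv.dvc.
dvc run -n clean -d data/raw.csv -o data/clean.csv \
    "tr 'A-Z' 'a-z' < data/raw.csv > data/clean.csv" \
    || echo "stage rejected: output is already tracked"

# One way out: stop tracking the file by hand, then define the stage.
dvc remove data/clean.csv.dvc
dvc run -n clean -d data/raw.csv -o data/clean.csv \
    "tr 'A-Z' 'a-z' < data/raw.csv > data/clean.csv"
```

The rule of thumb this suggests: dvc add the raw inputs, and let the pipeline own everything it generates.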

dvc run

Data in its raw form is rarely usable. Often, it has to pass through multiple stages of cleaning and transformation before it can be passed into a model. This flow, and the intermediate and final datasets it generates, are best tracked with `pipelines`. Although pipelines are defined in a dvc.yaml file, we can save ourselves the trouble of writing YAML from scratch. Instead, we have the dvc run command, which lets us specify the command to run, the outputs it produces, and any dependencies needed along the way. “Makefiles for machine learning projects” says the documentation. As someone who has spent long, confusing days trying to make machine learning pipelines reproducible with Make, I find pipelines really valuable.
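As a rough illustration, a single-stage pipeline could be defined like this. The sketch assumes the DVC 1.x-era dvc run command; the stage name `clean`, the file paths, and the `tr` command are made up for the example.

```shell
# Throwaway repo; all file and stage names below are hypothetical.
cd "$(mktemp -d)" && git init -q && dvc init -q
mkdir data && printf 'A,B\n1,2\n' > data/raw.csv
dvc add data/raw.csv

# -n names the stage, -d declares a dependency, -o declares an output.
# The quoted string at the end is the command DVC runs for this stage.
dvc run -n clean \
    -d data/raw.csv \
    -o data/clean.csv \
    "tr 'A-Z' 'a-z' < data/raw.csv > data/clean.csv"

cat dvc.yaml   # the stage definition DVC wrote for us
dvc repro      # re-runs only stages whose dependencies changed
```

Running dvc repro right after dvc run should do nothing, since no dependency has changed. Edit data/raw.csv and the `clean` stage runs again.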

dvc commit

git commit early and often, I was told when I first started learning how to use Git. Make your commits small and atomic: this ensures no big changes are introduced to the repo all at once, and keeps the commit history readable and understandable. My experience with DVC, though, has been that commits should be atomic, but not necessarily small.

Again, in the beginning I would run dvc commit every time I ran git commit. Then I saw how full my cache was, and that it was full of small changes to my data that, in aggregate, weren’t worth the space they were taking up 🤦‍♀. So dvc commit only when data and pipelines are in a stable state (as noted in the documentation), and use the --no-commit option in dvc add and dvc run so data doesn’t get unnecessarily cached over and over.
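A sketch of that workflow, again assuming the DVC 1.x-era CLI, with hypothetical file and stage names:

```shell
# Throwaway repo; all file and stage names below are hypothetical.
cd "$(mktemp -d)" && git init -q && dvc init -q
mkdir data && printf 'A,B\n1,2\n' > data/raw.csv

# Record the file in data/raw.csv.dvc without copying it into the cache.
dvc add --no-commit data/raw.csv

# Same idea for a stage: run it, but keep its outputs out of the cache.
dvc run --no-commit -n clean -d data/raw.csv -o data/clean.csv \
    "tr 'A-Z' 'a-z' < data/raw.csv > data/clean.csv"

# ...iterate on the data and the code here...

# Once things are stable, store the current versions in the cache.
# -f skips the confirmation prompt DVC shows when hashes have changed.
dvc commit -f

# Then git add and git commit the .dvc, dvc.yaml and dvc.lock files as usual.
```

The cache only grows at the dvc commit step, so intermediate experiments never take up space there.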
