Add documentation for versioning with DVC in Kedro #4443
base: main
Signed-off-by: Laura Couto <[email protected]>
To experiment with different parameter values, update the parameter in `parameters.yaml` and then run the pipelines with `dvc repro`.

Compare parameter changes between runs with `dvc params diff`.
Overall looks good so far! Would be nice to also add outputs to some of these steps.
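The experiment loop in the quoted lines could look roughly like the sketch below. It assumes a `dvc.yaml` whose stages already list the parameter file under `params`, and that you pass in the path of the file you edited. The function name `run_parameter_experiment` is hypothetical, and the commands are wrapped in a function so that nothing runs until you call it from the project root.

```bash
# Hypothetical helper: call it after editing a value in the parameter file.
# Usage: run_parameter_experiment <path-to-parameter-file>
run_parameter_experiment() (
  set -e                      # stop at the first failing command
  params_file="$1"

  # Show which parameter values differ between the workspace and the last commit (HEAD).
  dvc params diff

  # Re-run the stages in dvc.yaml whose dependencies or parameters changed.
  dvc repro

  # Record the run: dvc.lock stores the parameter values and output hashes that were used.
  git add dvc.lock "$params_file"
  git commit -m "Run pipeline with updated parameters"
)
```

Command output is deliberately not shown here, since it depends on the stages and parameter values in your project.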
@@ -0,0 +1,187 @@
# Data and pipeline versioning with Kedro and DVC

This document explains how to use [DVC](https://dvc.org/), a command line tool and VS Code Extension to help you develop reproducible machine learning projects, to version datasets and pipelines in your Kedro project.
I didn't explore the VS Code extension myself but would it work with a Kedro project? If so, would be nice to include a section on how to use it.
I can give it a try!
Co-authored-by: Ankita Katiyar <[email protected]> Signed-off-by: L. R. Couto <[email protected]>
Tried to follow the instructions but found some blockers, so I didn't review all of the document
### First commits

Suppose you have a dataset in your project, such as:
There's nothing to suppose: the `spaceflights-pandas` starter has this dataset, right? Maybe reword it as "Verify that your `conf/base/catalog.yml` contains this dataset".
Although really this is about `data/01_raw/companies.csv` being present 🤔
What about "Verify that your project catalog contains this dataset definition:" instead of "Suppose you have a dataset in your project, such as:"?
Sorry for being a pain in the neck 🙏🏼
I think we need to decide who's the audience of this guide.
- Is it users that don't know much about Kedro or DVC?
- Or is it Kedro users that already know their way around more or less?
In the first case, I think this should almost read like a tutorial, and the commands and steps should be fool-proof. For example, instead of clarifying about the `.gitignore` after `dvc add`, we should add a step before it that reads like:

> Since the `spaceflights-pandas` starter ignores everything under `data/` by default, you have to update the `.gitignore` file as follows:

Otherwise, if we assume that the reader knows a bit about Kedro, we can make the guide less prescriptive. In that case, the sentence from the beginning, "For this example, we will be using a Kedro `spaceflights-pandas` starter project", should be modified, maybe like:

> For example, you can use the `spaceflights-pandas` starter, see the documentation about starters [link]
As an example of this, I've always been inspired by the work @/melissawm did on the NumPy tutorials, for example:
- https://numpy.org/numpy-tutorials/content/tutorial-svd.html
- https://numpy.org/numpy-tutorials/content/tutorial-ma.html
We don't have to follow the same structure but at least having a clear reader persona in mind helps structure the rest of the document 💡
These are some fair points. I think the step-by-step tutorial approach might be better; I usually don't like to make assumptions about a user's knowledge level.
Still some more work needed
```bash
git add data/01_raw/companies.csv.dvc
git commit -m "Track companies.csv dataset with DVC"
```
This `git commit` comes at a weird moment because we haven't added any other file yet.
For example, we could tell the user to do `git init` followed by `git add . && git commit -m 'First commit, initial structure from starter'`.
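The ordering suggested in this comment could be sketched as follows. This is only an outline under some assumptions: the function name `first_commits` is hypothetical, the "Initialize DVC" commit message is illustrative, and the project is a fresh `spaceflights-pandas` checkout whose `.gitignore` still ignores `data/` at the time of the first commit (so the CSV is never tracked by Git itself). The commands are wrapped in a function so that nothing runs until you call it from the project root.

```bash
# Hypothetical helper showing one possible order for the first commits.
first_commits() (
  set -e                      # stop at the first failing command

  # 1. Put the unmodified starter under Git before touching anything else.
  git init
  git add .
  git commit -m "First commit, initial structure from starter"

  # 2. Initialise DVC; it stages its own config files for you.
  dvc init
  git commit -m "Initialize DVC"

  # 3. Hand the raw dataset over to DVC. This writes companies.csv.dvc
  #    and a data/01_raw/.gitignore entry for the CSV.
  dvc add data/01_raw/companies.csv

  # 4. Commit the placeholder files. Assumption: by now the blanket data/
  #    rule in the top-level .gitignore has been relaxed, otherwise Git
  #    refuses to add these paths.
  git add data/01_raw/companies.csv.dvc data/01_raw/.gitignore
  git commit -m "Track companies.csv dataset with DVC"
)
```

Committing the starter first keeps the later DVC-related commits small and easy to review.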
```bash
dvc add data/01_raw/companies.csv
```

This generates the `companies.csv.dvc` file which can be committed to git. This small, human-readable metadata file acts as a placeholder for the original data for the purpose of Git tracking.
Also important to tell the user that the `companies.csv` file will "disappear"
Oh and a `data/01_raw/.gitignore` file is added!
```bash
# ignore everything in the following folders
data/**
```
This is what the `.gitignore` looked like initially; I had to comment out this line.
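One way to apply the change described in this comment from the command line is sketched below. It works on a scratch copy containing only the two lines quoted above (the real starter `.gitignore` is longer), and the `scratch` variable is a hypothetical name. Commenting out the rule by hand in an editor is equivalent.

```bash
# Work on a scratch copy so the sketch does not touch a real project.
scratch="$(mktemp -d)"

# Recreate the two lines quoted from the starter's .gitignore.
printf '%s\n' '# ignore everything in the following folders' 'data/**' > "$scratch/.gitignore"

# Comment out the blanket rule so files such as companies.csv.dvc can be added to Git.
sed 's|^data/\*\*$|# data/**|' "$scratch/.gitignore" > "$scratch/.gitignore.tmp"
mv "$scratch/.gitignore.tmp" "$scratch/.gitignore"

# Inspect the result.
cat "$scratch/.gitignore"
```

The data file itself stays out of Git either way, because `dvc add` writes its own ignore entry next to the dataset.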
For example:

```bash
dvc remote add myremote s3://mybucket
```
`s3://mybucket` will not exist, unless we tell the user to create it.
We can either use MinIO and add the required steps, or move this section towards the bottom as an "example" of something the user could do (in other words: keep the first half of this document as a step-by-step tutorial, and the second half more like a how-to guide, see https://diataxis.fr/).
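A third possibility, not raised in the thread, would be a remote that needs no external service at all: DVC also accepts a plain directory as a remote. The sketch below assumes this swapped-in approach. The function name `setup_local_remote`, the remote name `localremote`, and the default path `/tmp/dvc-storage` are all hypothetical, and the function is meant to be called from the root of a project where `dvc init` has already been run.

```bash
# Hypothetical helper: configure a directory on the local disk as the default DVC remote.
# Usage: setup_local_remote [storage-directory]
setup_local_remote() (
  set -e                      # stop at the first failing command
  storage_dir="${1:-/tmp/dvc-storage}"

  # The directory plays the role the S3 bucket would play.
  mkdir -p "$storage_dir"

  # -d marks this remote as the default, so later pushes need no remote name.
  dvc remote add -d localremote "$storage_dir"

  # The remote definition is stored in .dvc/config, which is versioned with Git.
  git add .dvc/config
  git commit -m "Configure local DVC remote"

  # Upload the DVC-tracked data to the remote.
  dvc push
)
```

This keeps a tutorial self-contained, while an S3 or MinIO remote could still be shown later as an optional how-to.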
- Intermediate and output datasets must be added to DVC manually.
- Parameters and code changes are not explicitly tracked.
- Artefacts and metrics cannot be tracked effectively.
Suggested change: "- Artefacts and metrics cannot be tracked effectively." → "- Artifacts and metrics cannot be tracked effectively."
Also what does this mean in practice?
Description
Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
`RELEASE.md` file