Add documentation for versioning with DVC in Kedro #4443

Open
wants to merge 16 commits into
base: main

Contributor

lrcouto commented Jan 27, 2025

Description

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

lrcouto and others added 5 commits January 27, 2025 10:46
Signed-off-by: Laura Couto <[email protected]>

To experiment with different parameter values, update the parameter in `parameters.yaml` and then run the pipelines with `dvc repro`.

Compare parameter changes between runs with `dvc params diff`
Contributor

Overall looks good so far! Would be nice to also add outputs to some of these steps.

@@ -0,0 +1,187 @@
# Data and pipeline versioning with Kedro and DVC

This document explains how to use [DVC](https://dvc.org/), a command line tool and VS Code Extension to help you develop reproducible machine learning projects, to version datasets and pipelines in your Kedro project.
Contributor

I didn't explore the VS Code extension myself but would it work with a Kedro project? If so, would be nice to include a section on how to use it.

Contributor Author

I can give it a try!

@lrcouto lrcouto marked this pull request as ready for review January 29, 2025 14:02
Member

astrojuanlu left a comment

Tried to follow the instructions but found some blockers, so I didn't review all of the document

### First commits

Suppose you have a dataset in your project, such as:
Member

There's nothing to suppose: the spaceflights-pandas starter has this dataset, right? Maybe reword it as "Verify that your `conf/base/catalog.yml` contains this dataset".

Member

Although really this is about data/01_raw/companies.csv being present 🤔

Member

What about

Suggested change
- Suppose you have a dataset in your project, such as:
+ Verify that your project catalog contains this dataset definition:

Member

astrojuanlu left a comment

Sorry for being a pain in the neck 🙏🏼

I think we need to decide who's the audience of this guide.

  • Is it users that don't know much about Kedro or DVC?
  • Or is it Kedro users that already know their way around more or less?

In the first case, I think this should almost read like a tutorial, and the commands and steps should be foolproof. For example, instead of explaining the `.gitignore` after `dvc add`, we should add a step before it that reads like

> Since the spaceflights-pandas starter ignores everything under `data/` by default, you have to update the `.gitignore` file as follows:

Otherwise, if we assume that the reader knows a bit about Kedro, we can make the guide less prescriptive. In that case, the sentence from the beginning, "For this example, we will be using a Kedro spaceflights-pandas starter project", should be modified, maybe like

> For example, you can use the spaceflights-pandas starter, see the documentation about starters [link]

As an example of this, I've always been inspired by the work @/melissawm did on the NumPy tutorials, for example

* https://numpy.org/numpy-tutorials/content/tutorial-svd.html
* https://numpy.org/numpy-tutorials/content/tutorial-ma.html

We don't have to follow the same structure but at least having a clear reader persona in mind helps structure the rest of the document 💡
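
To illustrate the `.gitignore` step mentioned in this comment, here is a minimal sketch that uses only Git in a throwaway directory. It assumes the starter's `.gitignore` contains the `data/**` rule quoted elsewhere in this thread, and it uses empty placeholder files named after the dataset and its `.dvc` file; the temporary directory and the variable names are hypothetical. Commenting out the rule is the workaround described in the review, not necessarily what the final guide should recommend.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository standing in for a freshly created starter project.
proj="$(mktemp -d)"
cd "$proj"
git init -q

# Empty placeholders for the dataset and the metadata file DVC would create.
mkdir -p data/01_raw
touch data/01_raw/companies.csv data/01_raw/companies.csv.dvc

# The ignore rule quoted in this thread (assumed to match the starter).
printf '%s\n' '# ignore everything in the following folders' 'data/**' > .gitignore

# `git check-ignore -q` exits with 0 when the path is ignored.
if git check-ignore -q data/01_raw/companies.csv.dvc; then
  before="ignored"
else
  before="not ignored"
fi

# Comment out the rule (portable alternative to `sed -i`).
sed 's|^data/\*\*$|# data/**|' .gitignore > .gitignore.tmp
mv .gitignore.tmp .gitignore

if git check-ignore -q data/01_raw/companies.csv.dvc; then
  after="ignored"
else
  after="not ignored"
fi

echo "companies.csv.dvc before the change: $before"
echo "companies.csv.dvc after the change:  $after"
```

With the rule in place, Git refuses to see the `.dvc` file at all, which is why the step has to come before `dvc add` in a tutorial-style guide.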

Contributor Author

lrcouto commented Feb 4, 2025

> Sorry for being a pain in the neck 🙏🏼
>
> I think we need to decide who's the audience of this guide.
>
> * Is it users that don't know much about Kedro or DVC?
> * Or is it Kedro users that already know their way around more or less?
>
> In the first case, I think this should almost read like a tutorial and the commands and the steps should be fool-proof. For example, instead of clarifying about the .gitignore after dvc add, we should add a step before that reads like
>
> > Since the spaceflights-pandas starter ignores everything under data/ by default, you have to update the .gitignore file as follows:
>
> Otherwise, if we assume that the reader knows a bit about Kedro, we can make the guide less prescriptive. In that case, then the sentence from the beginning "For this example, we will be using a Kedro spaceflights-pandas starter project" should be modified, maybe like
>
> > For example, you can use the spaceflights-pandas starter, see the documentation about starters [link]
>
> As an example of this, I've been always inspired by the work @/melissawm did on the NumPy tutorials, for example
>
> * https://numpy.org/numpy-tutorials/content/tutorial-svd.html
> * https://numpy.org/numpy-tutorials/content/tutorial-ma.html
>
> We don't have to follow the same structure but at least having a clear reader persona in mind helps structure the rest of the document 💡

These are some fair points. I think the step-by-step tutorial approach might be better; I usually don't like to make assumptions about a user's knowledge level.

Member

astrojuanlu left a comment

Still some more work needed


```bash
git add data/01_raw/companies.csv.dvc
git commit -m "Track companies.csv dataset with DVC"
```

Member

This `git commit` comes at a weird moment because we haven't added any other file yet.

For example, we could tell the user to do `git init` followed by `git add . && git commit -m 'First commit, initial structure from starter'`.

```bash
dvc add data/01_raw/companies.csv
```

This generates the `companies.csv.dvc` file, which can be committed to Git. This small, human-readable metadata file acts as a placeholder for the original data for the purpose of Git tracking.
Member

Also important to tell the user that the companies.csv file will "disappear"

Member

Oh and a data/01_raw/.gitignore file is added!
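
The effect of that generated `.gitignore` can be simulated with Git alone. The sketch below assumes `dvc add` writes a `/companies.csv` entry into `data/01_raw/.gitignore`; that file is written by hand here as a stand-in, the data and `.dvc` files are empty placeholders, and the temporary directory is hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository; DVC itself is not needed for this simulation.
proj="$(mktemp -d)"
cd "$proj"
git init -q

# Empty placeholders for the dataset and its DVC metadata file.
mkdir -p data/01_raw
touch data/01_raw/companies.csv data/01_raw/companies.csv.dvc

# Hand-written stand-in for the .gitignore that `dvc add` is assumed to generate.
printf '%s\n' '/companies.csv' > data/01_raw/.gitignore

# List every untracked file Git still reports.
visible="$(git status --porcelain --untracked-files=all)"
echo "$visible"
```

The listing should contain the `.dvc` file and the new `.gitignore`, but not `companies.csv` itself, so from Git's point of view the data file is no longer visible even though it still sits in the working directory in this simulation.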


```bash
# ignore everything in the following folders
data/**
```
Member

This is what the `.gitignore` looked like initially; I had to comment out this line.

For example:

```bash
dvc remote add myremote s3://mybucket
```
Member

s3://mybucket will not exist, unless we tell the user to create it.

We can either use MinIO and add the required steps, or move this section towards the bottom as an "example" of something the user could do (in other words: keep the first half of this document as a step-by-step tutorial, and the second half more like a how-to guide https://diataxis.fr/)
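
A third possibility, not proposed in this thread and offered here only as an assumption, is to use a local directory as the remote so that the tutorial half stays runnable without any cloud storage. The sketch below uses hypothetical names (`localremote` and two temporary directories) and skips the DVC steps when `dvc` is not on the `PATH`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Throwaway project directory and a directory that plays the role of remote storage.
proj="$(mktemp -d)"
storage="$(mktemp -d)"
cd "$proj"

if command -v dvc >/dev/null 2>&1; then
  git init -q    # DVC expects to run inside a Git repository by default
  dvc init -q

  # -d marks this remote as the project default.
  dvc remote add -d localremote "$storage"

  # Show the configured remotes.
  dvc remote list
else
  echo "dvc not found on PATH; install it to try this sketch" >&2
fi
```

Because the storage is an ordinary directory, nothing has to be created outside the machine, and the S3 example can then be presented later as a how-to variation.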


- Intermediate and output datasets must be added to DVC manually.
- Parameters and code changes are not explicitly tracked.
- Artefacts and metrics cannot be tracked effectively.
Member

Suggested change
- - Artefacts and metrics cannot be tracked effectively.
+ - Artifacts and metrics cannot be tracked effectively.

Member

Also what does this mean in practice?


3 participants