Add documentation for versioning with DVC in Kedro #4443

lrcouto · 2025-01-27T13:47:06Z

Description

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

ankatiyar · 2025-01-27T17:21:45Z

docs/source/data/kedro_dvc_versioning.md

+
+To experiment with different parameter values, update the parameter in `parameters.yaml` and then run the pipelines with `dvc repro`.
+
+Compare parameter changes between runs with `dvc params diff`


Overall looks good so far! Would be nice to also add outputs to some of these steps.

ankatiyar · 2025-01-27T17:23:08Z

docs/source/data/kedro_dvc_versioning.md

@@ -0,0 +1,187 @@
+# Data and pipeline versioning with Kedro and DVC
+
+This document explains how to use [DVC](https://dvc.org/), a command line tool and VS Code Extension to help you develop reproducible machine learning projects, to version datasets and pipelines in your Kedro project.


I didn't explore the VS Code extension myself but would it work with a Kedro project? If so, would be nice to include a section on how to use it.

I can give it a try!

astrojuanlu · 2025-01-28T17:42:42Z

Rendered version https://kedro--4443.org.readthedocs.build/en/4443/data/kedro_dvc_versioning.html

astrojuanlu

Tried to follow the instructions but found some blockers, so I didn't review all of the document

astrojuanlu · 2025-01-30T15:17:00Z

docs/source/data/kedro_dvc_versioning.md

+
+### First commits
+
+Suppose you have a dataset in your project, such as:


There's nothing to suppose: the spaceflights-pandas starter has this dataset, right? Maybe reword it as "Verify that your `conf/base/catalog.yml` contains this dataset".

Although really this is about `data/01_raw/companies.csv` being present 🤔

What about:

Suggested change

-Suppose you have a dataset in your project, such as:
+Verify that your project catalog contains this dataset definition:

astrojuanlu

Sorry for being a pain in the neck 🙏🏼

I think we need to decide who the audience of this guide is.

* Is it users who don't know much about Kedro or DVC?
* Or is it Kedro users who already know their way around, more or less?

In the first case, I think this should read almost like a tutorial, and the commands and steps should be foolproof. For example, instead of explaining the `.gitignore` after `dvc add`, we should add an earlier step that reads like this:

Since the spaceflights-pandas starter ignores everything under `data/` by default, you have to update the `.gitignore` file as follows:

Otherwise, if we assume that the reader knows a bit about Kedro, we can make the guide less prescriptive. In that case, the sentence from the beginning, "For this example, we will be using a Kedro spaceflights-pandas starter project", should be modified, maybe like this:

For example, you can use the spaceflights-pandas starter, see the documentation about starters [link]

As an example of this, I've always been inspired by the work @/melissawm did on the NumPy tutorials, for example:

* https://numpy.org/numpy-tutorials/content/tutorial-svd.html
* https://numpy.org/numpy-tutorials/content/tutorial-ma.html

We don't have to follow the same structure, but at least having a clear reader persona in mind helps structure the rest of the document 💡

lrcouto · 2025-02-04T15:36:49Z

Sorry for being a pain in the neck 🙏🏼

I think we need to decide who the audience of this guide is.

* Is it users who don't know much about Kedro or DVC?
* Or is it Kedro users who already know their way around, more or less?

In the first case, I think this should read almost like a tutorial, and the commands and steps should be foolproof. For example, instead of explaining the `.gitignore` after `dvc add`, we should add an earlier step that reads like this:

Since the spaceflights-pandas starter ignores everything under `data/` by default, you have to update the `.gitignore` file as follows:

Otherwise, if we assume that the reader knows a bit about Kedro, we can make the guide less prescriptive. In that case, the sentence from the beginning, "For this example, we will be using a Kedro spaceflights-pandas starter project", should be modified, maybe like this:

For example, you can use the spaceflights-pandas starter, see the documentation about starters [link]

As an example of this, I've always been inspired by the work @/melissawm did on the NumPy tutorials, for example:

* https://numpy.org/numpy-tutorials/content/tutorial-svd.html
* https://numpy.org/numpy-tutorials/content/tutorial-ma.html

We don't have to follow the same structure, but at least having a clear reader persona in mind helps structure the rest of the document 💡

These are fair points. I think the step-by-step tutorial approach might be better, since I usually don't like to make assumptions about a user's knowledge level.

astrojuanlu

Still some more work needed

astrojuanlu · 2025-02-06T09:00:06Z

docs/source/data/kedro_dvc_versioning.md

+
+```bash
+git add data/01_raw/companies.csv.dvc
+git commit -m "Track companies.csv dataset with DVC"


This `git commit` comes at a weird moment because we haven't added any other file yet.

For example, we could tell the user to run `git init` followed by `git add . && git commit -m 'First commit, initial structure from starter'`.
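
As a rough sketch of that ordering, the snippet below creates the initial commit before any DVC step. It runs in a throwaway directory with two placeholder files standing in for the starter output (both are assumptions for illustration, not the real project layout), and it passes a dummy Git identity inline so that it also works on a machine without Git configuration.

```shell
set -eu

# Hypothetical stand-in for the project root generated by the starter.
project_dir="$(mktemp -d)"
cd "$project_dir"
mkdir -p conf/base data/01_raw
touch conf/base/catalog.yml data/01_raw/.gitkeep

# Initialise the repository and commit the initial structure first,
# so that later DVC-related commits have something to build on.
git init -q
git add .
# The inline identity is a placeholder; a configured user would omit the -c flags.
git -c user.name="Example User" -c user.email="user@example.com" \
    commit -q -m "First commit, initial structure from starter"

# Count the commits on the current branch.
commit_count="$(git rev-list --count HEAD)"
echo "commits: $commit_count"
```

In a real project the user would run only the three `git` commands from the project root; the temporary directory and placeholder files exist just to keep the sketch self-contained.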

astrojuanlu · 2025-02-06T09:01:05Z

docs/source/data/kedro_dvc_versioning.md

+ dvc add data/01_raw/companies.csv
+ ```
+
+This generates the `companies.csv.dvc` file which can be committed to git. This small, human-readable metadata file acts as a placeholder for the original data for the purpose of Git tracking.


Also important to tell the user that the companies.csv file will "disappear"

Oh and a data/01_raw/.gitignore file is added!

astrojuanlu · 2025-02-06T09:01:44Z

docs/source/data/kedro_dvc_versioning.md

+
+```bash
+# ignore everything in the following folders
+data/**


This is what the `.gitignore` looked like initially; I had to comment out this line.
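
One way to check the effect of that rule is `git check-ignore`, which reports whether Git excludes a given path. The sketch below uses a throwaway repository and an empty placeholder `.dvc` file (both assumptions for illustration), first with the `data/**` rule in place and then with it commented out.

```shell
set -eu

# Throwaway repository, not a real Kedro project.
repo="$(mktemp -d)"
cd "$repo"
git init -q
mkdir -p data/01_raw
touch data/01_raw/companies.csv.dvc   # empty placeholder for the DVC metadata file

# With the rule quoted above, everything under data/ is ignored.
printf '%s\n' '# ignore everything in the following folders' 'data/**' > .gitignore
if git check-ignore -q data/01_raw/companies.csv.dvc; then
    before="ignored"
else
    before="not ignored"
fi

# With the rule commented out, Git can see the .dvc file.
printf '%s\n' '# ignore everything in the following folders' '# data/**' > .gitignore
if git check-ignore -q data/01_raw/companies.csv.dvc; then
    after="ignored"
else
    after="not ignored"
fi

echo "before: $before, after: $after"
```

Commenting out the rule is only the workaround described in this comment; the guide could choose a narrower rule instead.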

astrojuanlu · 2025-02-06T09:04:33Z

docs/source/data/kedro_dvc_versioning.md

+For example:
+
+```bash
+dvc remote add myremote s3://mybucket


s3://mybucket will not exist, unless we tell the user to create it.

We can either use MinIO and add the required steps, or move this section towards the bottom as an "example" of something the user could do (in other words: keep the first half of this document as a step-by-step tutorial, and the second half more like a how-to guide https://diataxis.fr/)

astrojuanlu · 2025-02-06T09:04:48Z

docs/source/data/kedro_dvc_versioning.md

+
+- Intermediate and output datasets must be added to DVC manually.
+- Parameters and code changes are not explicitly tracked.
+- Artefacts and metrics cannot be tracked effectively.


Suggested change

-- Artefacts and metrics cannot be tracked effectively.
+- Artifacts and metrics cannot be tracked effectively.

Also what does this mean in practice?

lrcouto and others added 5 commits January 27, 2025 10:46

Add kedro dvc page

1bb2cd1

Signed-off-by: Laura Couto

Merge branch 'main' into kedro-dvc-documentation

3ec3cd6

Lint

7d88e90

Signed-off-by: Laura Couto

Add new page to index

fdc3ca9

Signed-off-by: Laura Couto

Lint

ae04428

Signed-off-by: Laura Couto

ankatiyar reviewed Jan 27, 2025

lrcouto and others added 3 commits January 27, 2025 16:29

Update docs/source/data/kedro_dvc_versioning.md

3b2fca8

Co-authored-by: Ankita Katiyar
Signed-off-by: L. R. Couto

Merge branch 'main' into kedro-dvc-documentation

fe13392

Formatting, add more examples

d5a2bdf

Signed-off-by: Laura Couto

lrcouto marked this pull request as ready for review January 29, 2025 14:02

lrcouto requested review from yetudada and astrojuanlu as code owners January 29, 2025 14:02

Merge branch 'main' into kedro-dvc-documentation

a019343

astrojuanlu requested changes Jan 30, 2025

lrcouto and others added 3 commits February 3, 2025 13:31

Add additional information about starters

5eee394

Signed-off-by: Laura Couto

Elaborate information about the gitignore file

b1f7b84

Signed-off-by: Laura Couto

Merge branch 'main' into kedro-dvc-documentation

467deba

astrojuanlu reviewed Feb 4, 2025

astrojuanlu reviewed Feb 4, 2025

Elaborate on the instructions

80822de

Signed-off-by: Laura Couto

astrojuanlu requested changes Feb 6, 2025

lrcouto and others added 3 commits February 6, 2025 10:39

Merge branch 'main' into kedro-dvc-documentation

d957b8d

Further clarification on the .gitignore file

6188949

Signed-off-by: Laura Couto

Merge branch 'main' into kedro-dvc-documentation

9d8e886
