[Tidy; Docs] Remove kedro _get_dataset call; enhance kedro docs #1014

Merged 7 commits on Feb 14, 2025
Changes from 3 commits
14 changes: 13 additions & 1 deletion vizro-core/docs/pages/user-guides/data.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,19 @@ graph TD
| Can be refreshed while dashboard is running | No | Yes |
| Production-ready | Yes | Yes |

If you have a [Kedro](https://kedro.org/) project or would like to use the [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) to manage your data independently of a Kedro project then you should use Vizro's [integration with the Kedro Data Catalog](kedro-data-catalog.md). This offers helper functions to add [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) as dynamic data in the Vizro data manager.
If you have a [Kedro](https://kedro.org/) project then you should use Vizro's [integration with the Kedro Data Catalog](kedro-data-catalog.md) to add [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) data to the Vizro data manager.

!!! note "Kedro Data Catalog as a data source registry"
Even if you do not have a Kedro project, you can still [use a Kedro Data Catalog](kedro-data-catalog.md#create-a-kedro-data-catalog) as a YAML registry of your dashboard's data sources. This separates configuration of your data sources from your app's code. Here is an example `catalog.yaml` file:

```yaml
motorbikes:
type: pandas.CSVDataset
filepath: s3://your_bucket/data/motorbikes.csv
load_args:
sep: ','
na_values: [NA]
```
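For orientation, `pandas.CSVDataset` forwards `load_args` to `pandas.read_csv`, so loading the entry above is roughly equivalent to this sketch (the inline CSV string stands in for the S3 file, which is not fetched here):

```python
import io

import pandas as pd

# Roughly what pandas.CSVDataset does under the hood: load_args become
# keyword arguments to pd.read_csv. The inline CSV stands in for the
# remote motorbikes.csv file.
csv_text = "name,price\nBike A,NA\nBike B,5000\n"
motorbikes = pd.read_csv(io.StringIO(csv_text), sep=",", na_values=["NA"])
# "NA" in the price column is parsed as a missing value, per na_values.
```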

## Static data

Expand Down
90 changes: 70 additions & 20 deletions vizro-core/docs/pages/user-guides/kedro-data-catalog.md
@@ -1,6 +1,8 @@
# How to integrate Vizro with Kedro Data Catalog
# How to integrate Vizro with the Kedro Data Catalog

This page describes how to integrate Vizro with [Kedro](https://docs.kedro.org/en/stable/index.html), an open-source Python framework to create reproducible, maintainable, and modular data science code. For Pandas datasets registered in a Kedro data catalog, Vizro provides a convenient way to visualize them.
This page describes how to integrate Vizro with [Kedro](https://docs.kedro.org/en/stable/index.html), an open-source Python framework to create reproducible, maintainable, and modular data science code. Vizro provides a convenient way to visualize pandas datasets registered in a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html).

Even if you do not have a Kedro project, you can still [use a Kedro Data Catalog](#create-a-kedro-data-catalog) to manage your dashboard's data sources.

## Installation

Expand All @@ -10,63 +12,111 @@ If you already have Kedro installed then you do not need to install any extra de
pip install vizro[kedro]
```

Vizro is currently compatible with `kedro>=0.19.0` and works with dataset factories for `kedro>=0.19.9`.

## Create a Kedro Data Catalog

You can create a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) to act as a YAML registry of your dashboard's data sources. If you have a Kedro project then you will already have this file. To use the Kedro Data Catalog outside a Kedro project, create a `catalog.yaml` file in the same directory as your `app.py`.

The Kedro Data Catalog separates configuration of your data sources from your app's code. Here is an example `catalog.yaml` file that illustrates some of the features of the Kedro Data Catalog.

```yaml
cars: # (1)!
type: pandas.CSVDataset # (2)!
filepath: cars.csv

motorbikes:
type: pandas.CSVDataset
filepath: s3://your_bucket/data/motorbikes.csv # (3)!
load_args: # (4)!
sep: ','
na_values: [NA]

trains:
type: pandas.ExcelDataset
filepath: trains.xlsx
load_args:
sheet_name: [Sheet1, Sheet2, Sheet3]

trucks:
type: pandas.ParquetDataset
filepath: trucks.parquet
load_args:
columns: [name, gear, disp, wt]
categories: list
index: name
```

1. The [minimum details needed](https://docs.kedro.org/en/stable/data/data_catalog.html#the-basics-of-catalog-yml) for a Kedro Data Catalog entry are the data source name (`cars`), the type of data (`type`), and the file's location (`filepath`).
1. Vizro supports all [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) datasets. This includes, for example, CSV, Excel and Parquet files.
1. Kedro supports a [variety of data stores](https://docs.kedro.org/en/stable/data/data_catalog.html#dataset-filepath) including local file systems, network file systems and cloud object stores.
1. You can [pass data loading arguments](https://docs.kedro.org/en/stable/data/data_catalog.html#load-save-and-filesystem-arguments) to specify how to load the data source.

For more details, refer to Kedro's [introduction to the Data Catalog](https://docs.kedro.org/en/stable/data/data_catalog.html) and their [collection of YAML examples](https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html).
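As a quick sanity check of the file format: the catalog file is plain YAML that parses to a dictionary keyed by data source name, which is exactly what `DataCatalog.from_config` consumes. A minimal sketch (the inline YAML abbreviates the file above):

```python
import yaml

# The catalog file is plain YAML: a mapping from data source name to its
# configuration. DataCatalog.from_config consumes exactly this dictionary.
catalog_yaml = """
cars:
  type: pandas.CSVDataset
  filepath: cars.csv
"""
config = yaml.safe_load(catalog_yaml)
# DataCatalog.from_config(config) would turn each entry into a loadable dataset.
```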

## Use datasets from the Kedro Data Catalog

`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). It supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:
Vizro provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) in the module `vizro.integrations.kedro`. This supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro `catalog`, the general pattern to add datasets to the Vizro data manager is:

```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager


for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

This imports all datasets of type [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) from the Kedro `catalog` into the Vizro `data_manager`.
This registers in the Vizro `data_manager` all data sources of type [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) from the Kedro `catalog`. You can now [reference each data source](data.md#reference-by-name) by name. Given the [above `catalog.yaml` file](#create-a-kedro-data-catalog), the data source names `"cars"`, `"motorbikes"`, `"trains"` and `"trucks"` become available, for example as `px.scatter("cars", ...)`.

!!! note
Data sources imported from Kedro in this way are [dynamic data](data.md#dynamic-data). This means that the data can be refreshed while your dashboard is running. For example, if you execute a Kedro pipeline run then the latest data can be shown in your dashboard without restarting it.
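The dynamic behaviour comes from registering a loader *function* rather than a loaded DataFrame. A plain-Python analogy (not Vizro's actual implementation) shows why each access picks up fresh data:

```python
# Plain-Python analogy (not Vizro's implementation): a registry of loader
# functions re-executes the load on every access, so refreshed data appears
# without re-registering anything.
store = {"rows": [1, 2]}

def load_rows():
    return list(store["rows"])

registry = {"rows": load_rows}  # register the callable, not its result

first = registry["rows"]()
store["rows"].append(3)  # simulate a Kedro pipeline run refreshing the data
second = registry["rows"]()
```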

The `catalog` variable may have been created in a number of different ways:

1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.catalog_from_project` to generate a `catalog` given the path to a Kedro project.
1. Data Catalog configuration file (`catalog.yaml`), [created as described above](#create-a-kedro-data-catalog). This generates a `catalog` variable independently of a Kedro project using [`DataCatalog.from_config`](https://docs.kedro.org/en/stable/kedro.io.DataCatalog.html#kedro.io.DataCatalog.from_config).
1. Kedro project path. Vizro exposes a helper function `catalog_from_project` to generate a `catalog` given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `catalog`.
1. Data Catalog configuration file (`catalog.yaml`). This can create a `catalog` entirely independently of a Kedro project using [`kedro.io.DataCatalog.from_config`](https://docs.kedro.org/en/stable/kedro.io.DataCatalog.html#kedro.io.DataCatalog.from_config).

The full code for these different cases is given below.

!!! example "Import a Kedro Data Catalog into the Vizro data manager"
=== "app.py (Kedro project path)"
=== "app.py (Data Catalog configuration file)"
```python
from pathlib import Path

import yaml
from kedro.io import DataCatalog # (1)!

from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager


catalog = DataCatalog.from_config(yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8"))) # (2)!

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

=== "app.ipynb (Kedro Jupyter session)"
1. Kedro's [experimental `KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature) would also work.
1. The contents of `catalog.yaml` is [described above](#create-a-kedro-data-catalog).

=== "app.py (Kedro project path)"
```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)


for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

=== "app.py (Data Catalog configuration file)"
=== "app.ipynb (Kedro Jupyter session)"
```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

# `catalog` is automatically exposed in a Kedro Jupyter session.
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```
Expand All @@ -83,7 +133,7 @@ kedro_integration.datasets_from_catalog(catalog, pipeline=pipelines["__default__

The `pipelines` variable may have been created in the following ways:

1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.pipelines_from_project` to generate a `pipelines` given the path to a Kedro project.
1. Kedro project path. Vizro exposes a helper function `pipelines_from_project` to generate `pipelines` given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `pipelines`.

The full code for these different cases is given below.
Expand Down
5 changes: 1 addition & 4 deletions vizro-core/src/vizro/integrations/kedro/_data_manager.py
Expand Up @@ -65,9 +65,6 @@ def datasets_from_catalog(catalog: CatalogProtocol, *, pipeline: Pipeline = None
for dataset_name, dataset_config in kedro_datasets.items():
# "type" key always exists because we filtered out patterns that resolve to empty dictionary above.
if "pandas" in dataset_config["type"]:
# TODO: in future update to use lambda: catalog.load(dataset_name) instead of _get_dataset
# but need to check if works with caching.
dataset = catalog._get_dataset(dataset_name, suggest=False)
vizro_data_sources[dataset_name] = dataset.load
# Bind dataset_name as a default argument so each lambda loads its own dataset.
vizro_data_sources[dataset_name] = lambda dataset_name=dataset_name: catalog.load(dataset_name)

return vizro_data_sources
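One subtlety with building lambdas in a loop, as `datasets_from_catalog` does, is Python's late binding: a bare `lambda: catalog.load(dataset_name)` closes over the loop variable itself, so every entry would load the last dataset. Binding the name as a default argument freezes it per iteration; a minimal demonstration:

```python
# Late binding: a bare lambda closes over the loop variable, so all lambdas
# see its final value. A default argument captures the value per iteration.
names = ["cars", "motorbikes", "trucks"]

late_bound = {n: (lambda: n) for n in names}
bound = {n: (lambda n=n: n) for n in names}

assert late_bound["cars"]() == "trucks"  # all closures see the last value
assert bound["cars"]() == "cars"  # default argument froze the value
```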
2 changes: 2 additions & 0 deletions vizro-core/src/vizro/managers/_data_manager.py
Expand Up @@ -170,6 +170,8 @@ def __setitem__(self, name: DataSourceName, data: Union[pd.DataFrame, pd_DataFra
# Once partial has been used, all dynamic data sources are on equal footing since they're all treated as
# functions rather than bound methods, e.g. by flask_caching.utils.function_namespace. This makes it much
# simpler to use flask-caching reliably.
# Note that for kedro>=0.19.9 we use a lambda wrapping catalog.load(dataset_name) rather than dataset.load, so
# the bound method case no longer arises when using the kedro integration.
# It's important the __qualname__ is the same across all workers, so use the data source name rather than
# e.g. the repr method that includes the id of the instance so would only work in the case that gunicorn is
# running with --preload.
Expand Down
@@ -1,6 +1,5 @@
"""Unit tests for vizro.integrations.kedro."""

import types
from importlib.metadata import version
from pathlib import Path

Expand Down Expand Up @@ -28,15 +27,12 @@ def catalog(request):
return catalog_class.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))


def test_datasets_from_catalog(catalog):
def test_datasets_from_catalog(catalog, mocker):
datasets = datasets_from_catalog(catalog)
assert isinstance(datasets, dict)
assert set(datasets) == {"pandas_excel", "pandas_parquet"}
for dataset in datasets.values():
assert isinstance(dataset, types.MethodType)
assert datasets == {"pandas_excel": mocker.ANY, "pandas_parquet": mocker.ANY}


def test_datasets_from_catalog_with_pipeline(catalog):
def test_datasets_from_catalog_with_pipeline(catalog, mocker):
pipeline = kp.pipeline(
[
kp.node(
Expand All @@ -57,10 +53,10 @@ def test_datasets_from_catalog_with_pipeline(catalog):

datasets = datasets_from_catalog(catalog, pipeline=pipeline)
# Dataset factories only work for kedro>=0.19.9.
expected_datasets = (
expected_dataset_names = (
{"pandas_excel", "pandas_parquet", "something#csv", "something_else#csv"}
if parse(version("kedro")) >= parse("0.19.9")
else {"pandas_excel", "pandas_parquet"}
)

assert set(datasets) == expected_datasets
assert datasets == {dataset_name: mocker.ANY for dataset_name in expected_dataset_names}