[Tidy; Docs] Remove kedro _get_dataset call; enhance kedro docs (#1014)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jo Stichbury <[email protected]>
3 people authored Feb 14, 2025
1 parent 69471f4 commit 913436d
Showing 8 changed files with 158 additions and 39 deletions.
48 changes: 48 additions & 0 deletions vizro-core/changelog.d/20250212_102221_antony.milne_kedro.md
@@ -0,0 +1,48 @@
<!--
A new scriv changelog fragment.
Uncomment the section that is right (remove the HTML comment wrapper).
-->

<!--
### Highlights ✨
- A bullet item for the Highlights ✨ category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
<!--
### Removed
- A bullet item for the Removed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
<!--
### Added
- A bullet item for the Added category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
<!--
### Changed
- A bullet item for the Changed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
<!--
### Deprecated
- A bullet item for the Deprecated category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
<!--
### Fixed
- A bullet item for the Fixed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
<!--
### Security
- A bullet item for the Security category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
15 changes: 14 additions & 1 deletion vizro-core/docs/pages/user-guides/data.md
@@ -38,7 +38,20 @@ graph TD
| Can be refreshed while dashboard is running | No | Yes |
| Production-ready | Yes | Yes |

-If you have a [Kedro](https://kedro.org/) project or would like to use the [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) to manage your data independently of a Kedro project then you should use Vizro's [integration with the Kedro Data Catalog](kedro-data-catalog.md). This offers helper functions to add [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) as dynamic data in the Vizro data manager.
+If you have a [Kedro](https://kedro.org/) project, you should use Vizro's [integration with the Kedro Data Catalog](kedro-data-catalog.md) to add [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) data to the Vizro data manager.
+
+!!! note "Kedro Data Catalog as a data source registry"
+    Even if you do not have a Kedro project, you can still [use a Kedro Data Catalog](kedro-data-catalog.md#create-a-kedro-data-catalog) as a YAML registry of your dashboard's data sources. This separates configuration of your data sources from your app's code and is the recommended approach if you have many data sources or a complex project. Here is an example `catalog.yaml` file:
+
+    ```yaml
+    motorbikes:
+      type: pandas.CSVDataset
+      filepath: s3://your_bucket/data/motorbikes.csv
+      load_args:
+        sep: ','
+        na_values: [NA]
+      credentials: s3_credentials
+    ```

## Static data

100 changes: 79 additions & 21 deletions vizro-core/docs/pages/user-guides/kedro-data-catalog.md
@@ -1,6 +1,8 @@
-# How to integrate Vizro with Kedro Data Catalog
+# How to integrate Vizro with the Kedro Data Catalog

-This page describes how to integrate Vizro with [Kedro](https://docs.kedro.org/en/stable/index.html), an open-source Python framework to create reproducible, maintainable, and modular data science code. For Pandas datasets registered in a Kedro data catalog, Vizro provides a convenient way to visualize them.
+This page describes how to integrate Vizro with [Kedro](https://docs.kedro.org/en/stable/index.html), an open-source Python framework to create reproducible, maintainable, and modular data science code. Vizro provides a convenient way to visualize Pandas datasets registered in a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html).

+Even if you do not have a Kedro project, you can still [use a Kedro Data Catalog](#create-a-kedro-data-catalog) to manage your dashboard's data sources. This separates configuration of your data from your app's code and is particularly useful for dashboards with many data sources or more complex data loading configuration.

## Installation

@@ -10,63 +12,119 @@ If you already have Kedro installed then you do not need to install any extra dependencies.
pip install vizro[kedro]
```

+Vizro is currently compatible with `kedro>=0.19.0` and works with dataset factories for `kedro>=0.19.9`.

+## Create a Kedro Data Catalog
+
+You can create a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) to serve as a YAML registry of your dashboard's data sources. To do so, create a new file called `catalog.yaml` in the same directory as your `app.py`. Below is an example `catalog.yaml` file that illustrates some of the key features of the Kedro Data Catalog.
+
+```yaml
+cars: # (1)!
+  type: pandas.CSVDataset # (2)!
+  filepath: cars.csv
+
+motorbikes:
+  type: pandas.CSVDataset
+  filepath: s3://your_bucket/data/motorbikes.csv # (3)!
+  load_args: # (4)!
+    sep: ','
+    na_values: [NA]
+  credentials: s3_credentials # (5)!
+
+trains:
+  type: pandas.ExcelDataset
+  filepath: trains.xlsx
+  load_args:
+    sheet_name: [Sheet1, Sheet2, Sheet3]
+
+trucks:
+  type: pandas.ParquetDataset
+  filepath: trucks.parquet
+  load_args:
+    columns: [name, gear, disp, wt]
+    categories: list
+    index: name
+```
+
+1. The [minimum details needed](https://docs.kedro.org/en/stable/data/data_catalog.html#the-basics-of-catalog-yml) for a Kedro Data Catalog entry are the data source name (`cars`), the type of data (`type`), and the file's location (`filepath`).
+1. Vizro supports all [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) datasets. This includes, for example, CSV, Excel and Parquet files.
+1. Kedro supports a [variety of data stores](https://docs.kedro.org/en/stable/data/data_catalog.html#dataset-filepath) including local file systems, network file systems and cloud object stores.
+1. You can [pass data loading arguments](https://docs.kedro.org/en/stable/data/data_catalog.html#load-save-and-filesystem-arguments) to specify how to load the data source.
+1. You can [securely inject credentials](https://docs.kedro.org/en/stable/configuration/credentials.html) into data loading functions using a [`credentials.yaml` file](https://docs.kedro.org/en/stable/data/data_catalog.html#dataset-access-credentials) or [environment variables](https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#how-to-load-credentials-through-environment-variables).
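
To make annotation (5) concrete, here is a hypothetical sketch of how a `credentials` reference is resolved: the `s3_credentials` name in `catalog.yaml` is looked up in a separate credentials mapping passed alongside the catalog configuration. The `key`/`secret` field names are illustrative s3fs parameters, not taken from this commit.

```python
from kedro.io import DataCatalog

# The same structure that loading catalog.yaml would produce.
catalog_config = {
    "motorbikes": {
        "type": "pandas.CSVDataset",
        "filepath": "s3://your_bucket/data/motorbikes.csv",
        "credentials": "s3_credentials",  # resolved against the mapping below
    }
}

# In practice this mapping comes from credentials.yaml or environment variables.
credentials = {"s3_credentials": {"key": "YOUR_ACCESS_KEY", "secret": "YOUR_SECRET_KEY"}}

catalog = DataCatalog.from_config(catalog_config, credentials)
```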

+As [shown below](#use-datasets-from-the-kedro-data-catalog), the best way to use `catalog.yaml` is with the [Kedro configuration loader](https://docs.kedro.org/en/stable/configuration/configuration_basics.html) `OmegaConfigLoader`. For simple cases, this functions much like `yaml.safe_load`. However, the Kedro configuration loader also enables more advanced functionality.

??? "Kedro configuration loader features"
Here are a few features of the Kedro configuration loader which are not possible with a `yaml.safe_load` alone. For more details, refer to Kedro's [documentation on advanced configuration](https://docs.kedro.org/en/stable/configuration/advanced_configuration.html).

- [Configuration environments](https://docs.kedro.org/en/stable/configuration/configuration_basics.html#configuration-environments) to organize settings that might be different between your different [development and production environments](run-deploy.md). For example, you might have different s3 buckets for development and production data.
- [Recursive scanning for configuration files](https://docs.kedro.org/en/stable/configuration/configuration_basics.html#configuration-loading) to merge complex configuration that is split across multiple files and folders.
- [Templating (variable interpolation)](https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#catalog) and [dynamically computed values (resolvers)](https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#how-to-use-resolvers-in-the-omegaconfigloader).
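
As a minimal sketch of that claim (assuming a flat `catalog.yaml` in the working directory with no templating in play), both approaches yield the same dictionary for the simple case:

```python
from pathlib import Path

import yaml
from kedro.config import OmegaConfigLoader

plain = yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8"))
loaded = OmegaConfigLoader(conf_source=".")["catalog"]

# Identical for a flat file; they diverge once environments,
# interpolation, or resolvers are used.
assert plain == loaded
```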

## Use datasets from the Kedro Data Catalog

-`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). It supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:
+Vizro provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) in the module [`vizro.integrations.kedro`](../API-reference/kedro-integration.md). These functions support both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro `catalog`, the general pattern to add datasets to the Vizro data manager is:

```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager
-for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
-    data_manager[dataset_name] = dataset
+for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
+    data_manager[dataset_name] = dataset_loader
```

-This imports all datasets of type [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) from the Kedro `catalog` into the Vizro `data_manager`.
+The code above registers all data sources of type [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) in the Kedro `catalog` with Vizro's `data_manager`. You can now [reference the data source](data.md#reference-by-name) by name. For example, given the [above `catalog.yaml` file](#create-a-kedro-data-catalog), you could use the data source names `"cars"`, `"motorbikes"`, `"trains"`, and `"trucks"` with `px.scatter("cars", ...)`.

+!!! note
+    Data sources imported from Kedro in this way are [dynamic data](data.md#dynamic-data). This means that the data can be refreshed while your dashboard is running. For example, if you run a Kedro pipeline, the latest data is shown in the Vizro dashboard without restarting it.
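
For example, here is a minimal sketch of a dashboard that uses the registered `"cars"` data source (the column names `mpg` and `hp` are assumed for illustration and depend on your `cars.csv`):

```python
import vizro.models as vm
import vizro.plotly.express as px
from vizro import Vizro

# "cars" is looked up in the Vizro data manager each time the figure loads,
# so a Kedro pipeline run that rewrites cars.csv shows up on refresh.
page = vm.Page(
    title="Cars",
    components=[vm.Graph(figure=px.scatter("cars", x="mpg", y="hp"))],
)

Vizro().build(vm.Dashboard(pages=[page])).run()
```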

The `catalog` variable may have been created in a number of different ways:

-1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.catalog_from_project` to generate a `catalog` given the path to a Kedro project.
+1. Data Catalog configuration file (`catalog.yaml`), [created as described above](#create-a-kedro-data-catalog). This generates a `catalog` variable independently of a Kedro project using [`DataCatalog.from_config`](https://docs.kedro.org/en/stable/kedro.io.DataCatalog.html#kedro.io.DataCatalog.from_config).
+1. Kedro project path. Vizro exposes a helper function [`catalog_from_project`](../API-reference/kedro-integration.md#vizro.integrations.kedro.catalog_from_project) to generate a `catalog` given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `catalog`.
-1. Data Catalog configuration file (`catalog.yaml`). This can create a `catalog` entirely independently of a Kedro project using [`kedro.io.DataCatalog.from_config`](https://docs.kedro.org/en/stable/kedro.io.DataCatalog.html#kedro.io.DataCatalog.from_config).

The full code for these different cases is given below.

!!! example "Import a Kedro Data Catalog into the Vizro data manager"
=== "app.py (Kedro project path)"
=== "app.py (Data Catalog configuration file)"
```python
from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog # (1)!
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager
project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)

conf_loader = OmegaConfigLoader(conf_source=".") # (2)!
catalog = DataCatalog.from_config(conf_loader["catalog"]) # (3)!
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

=== "app.ipynb (Kedro Jupyter session)"
1. Kedro's [experimental `KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature) would also work.
1. This [loads and parses configuration in `catalog.yaml`](https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#advanced-configuration-without-a-full-kedro-project). The argument `conf_source="."` specifies that `catalog.yaml` is found in the same directory as `app.py` or a subdirectory beneath this level. In a more complex setup, this could include [configuration environments](https://docs.kedro.org/en/stable/configuration/configuration_basics.html#configuration-environments), for example to organize configuration for development and production data sources.
1. If you have [credentials](https://docs.kedro.org/en/stable/configuration/credentials.html) then these can be injected with `DataCatalog.from_config(conf_loader["catalog"], conf_loader["credentials"])`.

=== "app.py (Kedro project path)"
```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager
project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

=== "app.py (Data Catalog configuration file)"
=== "app.ipynb (Kedro Jupyter session)"
```python
from kedro.io import DataCatalog
import yaml

from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager
catalog = DataCatalog.from_config(yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8")))

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```
@@ -83,7 +141,7 @@ kedro_integration.datasets_from_catalog(catalog, pipeline=pipelines["__default__"])

The `pipelines` variable may have been created in the following ways:

-1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.pipelines_from_project` to generate a `pipelines` given the path to a Kedro project.
+1. Kedro project path. Vizro exposes a helper function [`pipelines_from_project`](../API-reference/kedro-integration.md#vizro.integrations.kedro.pipelines_from_project) to generate a `pipelines` given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `pipelines`.

The full code for these different cases is given below.
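
A minimal sketch of the Kedro project path case, combining the helpers named above:

```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)
pipelines = kedro_integration.pipelines_from_project(project_path)

# Register only the datasets that the default pipeline actually uses.
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
    catalog, pipeline=pipelines["__default__"]
).items():
    data_manager[dataset_name] = dataset_loader
```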
2 changes: 2 additions & 0 deletions vizro-core/docs/pages/user-guides/run-deploy.md
@@ -7,6 +7,8 @@ Typically when you create a dashboard, there are two distinct stages:

This guide describes methods to run your dashboard _in development_ and _in production_. Follow either section based on your current need.

+If your data sources in development and production are different (for example, you have different s3 buckets for development and production data) then you might like to [use the Kedro Data Catalog](kedro-data-catalog.md#use-datasets-from-the-kedro-data-catalog) to manage your data source configuration.
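
A minimal sketch of that setup, assuming a `conf/` folder with `base` and `prod` [configuration environments](https://docs.kedro.org/en/stable/configuration/configuration_basics.html#configuration-environments) (the folder names are illustrative):

```python
from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog

# conf/base/catalog.yaml holds development defaults;
# conf/prod/catalog.yaml overrides, for example, the s3 filepaths.
conf_loader = OmegaConfigLoader(conf_source="conf", base_env="base", env="prod")
catalog = DataCatalog.from_config(conf_loader["catalog"])
```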

Vizro is built on top of [Dash](https://dash.plotly.com/), which itself uses [Flask](https://flask.palletsprojects.com/). Most of our guidance on how to run a Vizro app in development or production is very similar to guidance on Dash and Flask.

!!! note
5 changes: 1 addition & 4 deletions vizro-core/src/vizro/integrations/kedro/_data_manager.py
@@ -107,9 +107,6 @@ def datasets_from_catalog(catalog: CatalogProtocol, *, pipeline: Pipeline = None
    for dataset_name, dataset_config in kedro_datasets.items():
        # "type" key always exists because we filtered out patterns that resolve to empty dictionary above.
        if "pandas" in dataset_config["type"]:
-            # TODO: in future update to use lambda: catalog.load(dataset_name) instead of _get_dataset
-            # but need to check if works with caching.
-            dataset = catalog._get_dataset(dataset_name, suggest=False)
-            vizro_data_sources[dataset_name] = dataset.load
+            vizro_data_sources[dataset_name] = lambda dataset_name=dataset_name: catalog.load(dataset_name)

    return vizro_data_sources
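
The default-argument binding in the added line matters: a bare `lambda: catalog.load(dataset_name)` inside the loop would late-bind `dataset_name`, so every loader would load the last dataset. A self-contained sketch of the pitfall:

```python
# Every plain lambda closes over the same loop variable...
loaders = {name: (lambda: name) for name in ["cars", "trains"]}
assert loaders["cars"]() == "trains"  # all loaders see the final value

# ...while a default argument captures the current value on each iteration.
loaders = {name: (lambda name=name: name) for name in ["cars", "trains"]}
assert loaders["cars"]() == "cars"
```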
2 changes: 2 additions & 0 deletions vizro-core/src/vizro/managers/_data_manager.py
@@ -170,6 +170,8 @@ def __setitem__(self, name: DataSourceName, data: Union[pd.DataFrame, pd_DataFrameCallable]):
        # Once partial has been used, all dynamic data sources are on equal footing since they're all treated as
        # functions rather than bound methods, e.g. by flask_caching.utils.function_namespace. This makes it much
        # simpler to use flask-caching reliably.
+        # Note that for kedro>=0.19.9 we use lambda: catalog.load(dataset_name) rather than dataset.load so the
+        # bound method case no longer arises when using kedro integration.
        # It's important the __qualname__ is the same across all workers, so use the data source name rather than
        # e.g. the repr method that includes the id of the instance so would only work in the case that gunicorn is
        # running with --preload.
@@ -3,7 +3,7 @@
filepath: "{pandas_factory}.csv"

pandas_excel:
type: pandas.ExcelDataset
type: ${_pandas_excel_type}
filepath: pandas_excel.xlsx

pandas_parquet:
@@ -13,3 +13,6 @@ pandas_parquet:
not_dataframe:
  type: pickle.PickleDataset
  filepath: pickle.pkl

+# Use variable interpolation to check OmegaConfigLoader does what is expected over just yaml.safe_load.
+_pandas_excel_type: pandas.ExcelDataset
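
A quick way to see what this exercises (a sketch, run from the directory containing this `catalog.yaml`): `yaml.safe_load` keeps the `${_pandas_excel_type}` reference literal, while `OmegaConfigLoader` interpolates it and filters out the underscore-prefixed helper key.

```python
from pathlib import Path

import yaml
from kedro.config import OmegaConfigLoader

raw = yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8"))
resolved = OmegaConfigLoader(conf_source=".")["catalog"]

print(raw["pandas_excel"]["type"])       # ${_pandas_excel_type}
print(resolved["pandas_excel"]["type"])  # pandas.ExcelDataset
```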