[Tidy; Docs] Remove kedro _get_dataset call; enhance kedro docs #1014

Merged 7 commits on Feb 14, 2025
Changes from 3 commits
14 changes: 13 additions & 1 deletion vizro-core/docs/pages/user-guides/data.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,19 @@ graph TD
| Can be refreshed while dashboard is running | No | Yes |
| Production-ready | Yes | Yes |

If you have a [Kedro](https://kedro.org/) project or would like to use the [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) to manage your data independently of a Kedro project then you should use Vizro's [integration with the Kedro Data Catalog](kedro-data-catalog.md). This offers helper functions to add [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) as dynamic data in the Vizro data manager.
If you have a [Kedro](https://kedro.org/) project then you should use Vizro's [integration with the Kedro Data Catalog](kedro-data-catalog.md) to add [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) data to the Vizro data manager.

!!! note "Kedro Data Catalog as a data source registry"
Even if you do not have a Kedro project, you can still [use a Kedro Data Catalog](kedro-data-catalog.md#create-a-kedro-data-catalog) as a YAML registry of your dashboard's data sources. This separates configuration of your data sources from your app's code. Here is an example `catalog.yaml` file:

```yaml
motorbikes:
type: pandas.CSVDataset
filepath: s3://your_bucket/data/motorbikes.csv
load_args:
sep: ','
na_values: [NA]
```
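For orientation, `pandas.CSVDataset` forwards `load_args` to `pandas.read_csv`, so loading the entry above is roughly equivalent to this sketch (the inline CSV string stands in for the S3 file, which is not fetched here):

```python
import io

import pandas as pd

# Roughly what pandas.CSVDataset does under the hood: load_args become
# keyword arguments to pd.read_csv. The inline CSV stands in for the
# remote motorbikes.csv file.
csv_text = "name,price\nBike A,NA\nBike B,5000\n"
motorbikes = pd.read_csv(io.StringIO(csv_text), sep=",", na_values=["NA"])
# "NA" in the price column is parsed as a missing value, per na_values.
```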

## Static data

Expand Down
90 changes: 70 additions & 20 deletions vizro-core/docs/pages/user-guides/kedro-data-catalog.md
@@ -1,6 +1,8 @@
# How to integrate Vizro with Kedro Data Catalog
# How to integrate Vizro with the Kedro Data Catalog

This page describes how to integrate Vizro with [Kedro](https://docs.kedro.org/en/stable/index.html), an open-source Python framework to create reproducible, maintainable, and modular data science code. For Pandas datasets registered in a Kedro data catalog, Vizro provides a convenient way to visualize them.
This page describes how to integrate Vizro with [Kedro](https://docs.kedro.org/en/stable/index.html), an open-source Python framework to create reproducible, maintainable, and modular data science code. Vizro provides a convenient way to visualize pandas datasets registered in a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html).

Even if you do not have a Kedro project, you can still [use a Kedro Data Catalog](#create-a-kedro-data-catalog) to manage your dashboard's data sources.

## Installation

Expand All @@ -10,63 +12,111 @@ If you already have Kedro installed then you do not need to install any extra de
pip install vizro[kedro]
```

Vizro is currently compatible with `kedro>=0.19.0` and works with dataset factories for `kedro>=0.19.9`.

## Create a Kedro Data Catalog

You can create a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) to act as a YAML registry of your dashboard's data sources. If you have a Kedro project then you will already have this file. To use the Kedro Data Catalog outside a Kedro project, create a `catalog.yaml` file in the same directory as your `app.py`.

The Kedro Data Catalog separates configuration of your data sources from your app's code. Here is an example `catalog.yaml` file that illustrates some of the features of the Kedro Data Catalog.

```yaml
cars: # (1)!
type: pandas.CSVDataset # (2)!
filepath: cars.csv

motorbikes:
type: pandas.CSVDataset
filepath: s3://your_bucket/data/motorbikes.csv # (3)!
load_args: # (4)!
sep: ','
na_values: [NA]

trains:
type: pandas.ExcelDataset
filepath: trains.xlsx
load_args:
sheet_name: [Sheet1, Sheet2, Sheet3]

trucks:
type: pandas.ParquetDataset
filepath: trucks.parquet
load_args:
columns: [name, gear, disp, wt]
categories: list
index: name
```

1. The [minimum details needed](https://docs.kedro.org/en/stable/data/data_catalog.html#the-basics-of-catalog-yml) for a Kedro Data Catalog entry are the data source name (`cars`), the type of data (`type`), and the file's location (`filepath`).
1. Vizro supports all [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) datasets. This includes, for example, CSV, Excel and Parquet files.
1. Kedro supports a [variety of data stores](https://docs.kedro.org/en/stable/data/data_catalog.html#dataset-filepath) including local file systems, network file systems and cloud object stores.
1. You can [pass data loading arguments](https://docs.kedro.org/en/stable/data/data_catalog.html#load-save-and-filesystem-arguments) to specify how to load the data source.

For more details, refer to Kedro's [introduction to the Data Catalog](https://docs.kedro.org/en/stable/data/data_catalog.html) and their [collection of YAML examples](https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html).
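As a quick sanity check of the file format: the catalog file is plain YAML that parses to a dictionary keyed by data source name, which is exactly what `DataCatalog.from_config` consumes. A minimal sketch (the inline YAML abbreviates the file above):

```python
import yaml

# The catalog file is plain YAML: a mapping from data source name to its
# configuration. DataCatalog.from_config consumes exactly this dictionary.
catalog_yaml = """
cars:
  type: pandas.CSVDataset
  filepath: cars.csv
"""
config = yaml.safe_load(catalog_yaml)
# DataCatalog.from_config(config) would turn each entry into a loadable dataset.
```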

## Use datasets from the Kedro Data Catalog

`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). It supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:
Vizro provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) in the module `vizro.integrations.kedro`. This supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro `catalog`, the general pattern to add datasets to the Vizro data manager is:

```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager


for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

This imports all datasets of type [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) from the Kedro `catalog` into the Vizro `data_manager`.
This registers in the Vizro `data_manager` all data sources of type [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) from the Kedro `catalog`. You can now [reference each data source](data.md#reference-by-name) by name. Given the [above `catalog.yaml` file](#create-a-kedro-data-catalog), the data source names `"cars"`, `"motorbikes"`, `"trains"` and `"trucks"` become available, for example as `px.scatter("cars", ...)`.

!!! note
Data sources imported from Kedro in this way are [dynamic data](data.md#dynamic-data). This means that the data can be refreshed while your dashboard is running. For example, if you execute a Kedro pipeline run then the latest data can be shown in your dashboard without restarting it.
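The dynamic behaviour comes from registering a loader *function* rather than a loaded DataFrame. A plain-Python analogy (not Vizro's actual implementation) shows why each access picks up fresh data:

```python
# Plain-Python analogy (not Vizro's implementation): a registry of loader
# functions re-executes the load on every access, so refreshed data appears
# without re-registering anything.
store = {"rows": [1, 2]}

def load_rows():
    return list(store["rows"])

registry = {"rows": load_rows}  # register the callable, not its result

first = registry["rows"]()
store["rows"].append(3)  # simulate a Kedro pipeline run refreshing the data
second = registry["rows"]()
```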

The `catalog` variable may have been created in a number of different ways:

1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.catalog_from_project` to generate a `catalog` given the path to a Kedro project.
1. Data Catalog configuration file (`catalog.yaml`), [created as described above](#create-a-kedro-data-catalog). This generates a `catalog` variable independently of a Kedro project using [`DataCatalog.from_config`](https://docs.kedro.org/en/stable/kedro.io.DataCatalog.html#kedro.io.DataCatalog.from_config).
1. Kedro project path. Vizro exposes a helper function `catalog_from_project` to generate a `catalog` given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `catalog`.
1. Data Catalog configuration file (`catalog.yaml`). This can create a `catalog` entirely independently of a Kedro project using [`kedro.io.DataCatalog.from_config`](https://docs.kedro.org/en/stable/kedro.io.DataCatalog.html#kedro.io.DataCatalog.from_config).

The full code for these different cases is given below.

!!! example "Import a Kedro Data Catalog into the Vizro data manager"
=== "app.py (Kedro project path)"
=== "app.py (Data Catalog configuration file)"
```python
from pathlib import Path

import yaml
from kedro.io import DataCatalog # (1)!

from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager


catalog = DataCatalog.from_config(yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8"))) # (2)!

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

=== "app.ipynb (Kedro Jupyter session)"
1. Kedro's [experimental `KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature) would also work.
1. The contents of `catalog.yaml` is [described above](#create-a-kedro-data-catalog).

=== "app.py (Kedro project path)"
```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)


for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

=== "app.py (Data Catalog configuration file)"
=== "app.ipynb (Kedro Jupyter session)"
```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

# `catalog` is automatically exposed in a Kedro Jupyter session.
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```
Expand All @@ -83,7 +133,7 @@ kedro_integration.datasets_from_catalog(catalog, pipeline=pipelines["__default__

The `pipelines` variable may have been created in the following ways:

1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.pipelines_from_project` to generate a `pipelines` given the path to a Kedro project.
1. Kedro project path. Vizro exposes a helper function `pipelines_from_project` to generate `pipelines` given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `pipelines`.

The full code for these different cases is given below.
Expand Down
5 changes: 1 addition & 4 deletions vizro-core/src/vizro/integrations/kedro/_data_manager.py
Expand Up @@ -65,9 +65,6 @@ def datasets_from_catalog(catalog: CatalogProtocol, *, pipeline: Pipeline = None
for dataset_name, dataset_config in kedro_datasets.items():
# "type" key always exists because we filtered out patterns that resolve to empty dictionary above.
if "pandas" in dataset_config["type"]:
# TODO: in future update to use lambda: catalog.load(dataset_name) instead of _get_dataset
# but need to check if works with caching.
dataset = catalog._get_dataset(dataset_name, suggest=False)
vizro_data_sources[dataset_name] = dataset.load
# Bind dataset_name as a default argument so each lambda loads its own dataset.
vizro_data_sources[dataset_name] = lambda dataset_name=dataset_name: catalog.load(dataset_name)

return vizro_data_sources
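One subtlety with building lambdas in a loop, as `datasets_from_catalog` does, is Python's late binding: a bare `lambda: catalog.load(dataset_name)` closes over the loop variable itself, so every entry would load the last dataset. Binding the name as a default argument freezes it per iteration; a minimal demonstration:

```python
# Late binding: a bare lambda closes over the loop variable, so all lambdas
# see its final value. A default argument captures the value per iteration.
names = ["cars", "motorbikes", "trucks"]

late_bound = {n: (lambda: n) for n in names}
bound = {n: (lambda n=n: n) for n in names}

assert late_bound["cars"]() == "trucks"  # all closures see the last value
assert bound["cars"]() == "cars"  # default argument froze the value
```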
2 changes: 2 additions & 0 deletions vizro-core/src/vizro/managers/_data_manager.py
Expand Up @@ -170,6 +170,8 @@ def __setitem__(self, name: DataSourceName, data: Union[pd.DataFrame, pd_DataFra
# Once partial has been used, all dynamic data sources are on equal footing since they're all treated as
# functions rather than bound methods, e.g. by flask_caching.utils.function_namespace. This makes it much
# simpler to use flask-caching reliably.
# Note that for kedro>=0.19.9 we use a lambda wrapping catalog.load(dataset_name) rather than dataset.load, so
# the bound method case no longer arises when using the kedro integration.
# It's important the __qualname__ is the same across all workers, so use the data source name rather than
# e.g. the repr method that includes the id of the instance so would only work in the case that gunicorn is
# running with --preload.
Expand Down
@@ -1,6 +1,5 @@
"""Unit tests for vizro.integrations.kedro."""

import types
from importlib.metadata import version
from pathlib import Path

Expand Down Expand Up @@ -28,15 +27,12 @@ def catalog(request):
return catalog_class.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))


def test_datasets_from_catalog(catalog):
def test_datasets_from_catalog(catalog, mocker):
datasets = datasets_from_catalog(catalog)
assert isinstance(datasets, dict)
assert set(datasets) == {"pandas_excel", "pandas_parquet"}
for dataset in datasets.values():
assert isinstance(dataset, types.MethodType)
assert datasets == {"pandas_excel": mocker.ANY, "pandas_parquet": mocker.ANY}


def test_datasets_from_catalog_with_pipeline(catalog):
def test_datasets_from_catalog_with_pipeline(catalog, mocker):
pipeline = kp.pipeline(
[
kp.node(
Expand All @@ -57,10 +53,10 @@ def test_datasets_from_catalog_with_pipeline(catalog):

datasets = datasets_from_catalog(catalog, pipeline=pipeline)
# Dataset factories only work for kedro>=0.19.9.
expected_datasets = (
expected_dataset_names = (
{"pandas_excel", "pandas_parquet", "something#csv", "something_else#csv"}
if parse(version("kedro")) >= parse("0.19.9")
else {"pandas_excel", "pandas_parquet"}
)

assert set(datasets) == expected_datasets
assert datasets == {dataset_name: mocker.ANY for dataset_name in expected_dataset_names}