[Tidy; Docs] Remove kedro _get_dataset call; enhance kedro docs (#1014)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jo Stichbury <[email protected]>
3 people authored Feb 14, 2025
1 parent 69471f4 commit 913436d
Showing 8 changed files with 158 additions and 39 deletions.
48 changes: 48 additions & 0 deletions vizro-core/changelog.d/20250212_102221_antony.milne_kedro.md
@@ -0,0 +1,48 @@
<!--
A new scriv changelog fragment.
Uncomment the section that is right (remove the HTML comment wrapper).
-->

<!--
### Highlights ✨
- A bullet item for the Highlights ✨ category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
<!--
### Removed
- A bullet item for the Removed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
<!--
### Added
- A bullet item for the Added category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
<!--
### Changed
- A bullet item for the Changed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
<!--
### Deprecated
- A bullet item for the Deprecated category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
<!--
### Fixed
- A bullet item for the Fixed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
<!--
### Security
- A bullet item for the Security category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))
-->
15 changes: 14 additions & 1 deletion vizro-core/docs/pages/user-guides/data.md
@@ -38,7 +38,20 @@ graph TD
| Can be refreshed while dashboard is running | No | Yes |
| Production-ready | Yes | Yes |

-If you have a [Kedro](https://kedro.org/) project or would like to use the [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) to manage your data independently of a Kedro project then you should use Vizro's [integration with the Kedro Data Catalog](kedro-data-catalog.md). This offers helper functions to add [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) as dynamic data in the Vizro data manager.
+If you have a [Kedro](https://kedro.org/) project, you should use Vizro's [integration with the Kedro Data Catalog](kedro-data-catalog.md) to add [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) data to the Vizro data manager.
+
+!!! note "Kedro Data Catalog as a data source registry"
+    Even if you do not have a Kedro project, you can still [use a Kedro Data Catalog](kedro-data-catalog.md#create-a-kedro-data-catalog) as a YAML registry of your dashboard's data sources. This separates configuration of your data sources from your app's code and is the recommended approach if you have many data sources or a complex project. Here is an example `catalog.yaml` file:
+
+    ```yaml
+    motorbikes:
+      type: pandas.CSVDataset
+      filepath: s3://your_bucket/data/motorbikes.csv
+      load_args:
+        sep: ','
+        na_values: [NA]
+      credentials: s3_credentials
+    ```

## Static data

100 changes: 79 additions & 21 deletions vizro-core/docs/pages/user-guides/kedro-data-catalog.md
@@ -1,6 +1,8 @@
-# How to integrate Vizro with Kedro Data Catalog
+# How to integrate Vizro with the Kedro Data Catalog

-This page describes how to integrate Vizro with [Kedro](https://docs.kedro.org/en/stable/index.html), an open-source Python framework to create reproducible, maintainable, and modular data science code. For Pandas datasets registered in a Kedro data catalog, Vizro provides a convenient way to visualize them.
+This page describes how to integrate Vizro with [Kedro](https://docs.kedro.org/en/stable/index.html), an open-source Python framework to create reproducible, maintainable, and modular data science code. Vizro provides a convenient way to visualize Pandas datasets registered in a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html).

+Even if you do not have a Kedro project, you can still [use a Kedro Data Catalog](#create-a-kedro-data-catalog) to manage your dashboard's data sources. This separates configuration of your data from your app's code and is particularly useful for dashboards with many data sources or more complex data loading configuration.

## Installation

@@ -10,63 +12,119 @@ If you already have Kedro installed then you do not need to install any extra dependencies.
pip install vizro[kedro]
```

+Vizro is currently compatible with `kedro>=0.19.0` and works with dataset factories for `kedro>=0.19.9`.

+## Create a Kedro Data Catalog
+
+You can create a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) to serve as a YAML registry of your dashboard's data sources. To do so, create a new file called `catalog.yaml` in the same directory as your `app.py`. Below is an example `catalog.yaml` file that illustrates some of the key features of the Kedro Data Catalog.
+
+```yaml
+cars: # (1)!
+  type: pandas.CSVDataset # (2)!
+  filepath: cars.csv
+
+motorbikes:
+  type: pandas.CSVDataset
+  filepath: s3://your_bucket/data/motorbikes.csv # (3)!
+  load_args: # (4)!
+    sep: ','
+    na_values: [NA]
+  credentials: s3_credentials # (5)!
+
+trains:
+  type: pandas.ExcelDataset
+  filepath: trains.xlsx
+  load_args:
+    sheet_name: [Sheet1, Sheet2, Sheet3]
+
+trucks:
+  type: pandas.ParquetDataset
+  filepath: trucks.parquet
+  load_args:
+    columns: [name, gear, disp, wt]
+    categories: list
+    index: name
+```
+
+1. The [minimum details needed](https://docs.kedro.org/en/stable/data/data_catalog.html#the-basics-of-catalog-yml) for a Kedro Data Catalog entry are the data source name (`cars`), the type of data (`type`), and the file's location (`filepath`).
+1. Vizro supports all [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) datasets. This includes, for example, CSV, Excel and Parquet files.
+1. Kedro supports a [variety of data stores](https://docs.kedro.org/en/stable/data/data_catalog.html#dataset-filepath) including local file systems, network file systems and cloud object stores.
+1. You can [pass data loading arguments](https://docs.kedro.org/en/stable/data/data_catalog.html#load-save-and-filesystem-arguments) to specify how to load the data source.
+1. You can [securely inject credentials](https://docs.kedro.org/en/stable/configuration/credentials.html) into data loading functions using a [`credentials.yaml` file](https://docs.kedro.org/en/stable/data/data_catalog.html#dataset-access-credentials) or [environment variables](https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#how-to-load-credentials-through-environment-variables).
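
To make annotation (5) concrete, here is a hypothetical sketch of how a `credentials` reference is resolved: the `s3_credentials` name in `catalog.yaml` is looked up in a separate credentials mapping passed alongside the catalog configuration. The `key`/`secret` field names are illustrative s3fs parameters, not taken from this commit.

```python
from kedro.io import DataCatalog

# The same structure that loading catalog.yaml would produce.
catalog_config = {
    "motorbikes": {
        "type": "pandas.CSVDataset",
        "filepath": "s3://your_bucket/data/motorbikes.csv",
        "credentials": "s3_credentials",  # resolved against the mapping below
    }
}

# In practice this mapping comes from credentials.yaml or environment variables.
credentials = {"s3_credentials": {"key": "YOUR_ACCESS_KEY", "secret": "YOUR_SECRET_KEY"}}

catalog = DataCatalog.from_config(catalog_config, credentials)
```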

+As [shown below](#use-datasets-from-the-kedro-data-catalog), the best way to use `catalog.yaml` is with the [Kedro configuration loader](https://docs.kedro.org/en/stable/configuration/configuration_basics.html) `OmegaConfigLoader`. For simple cases, this functions much like `yaml.safe_load`. However, the Kedro configuration loader also enables more advanced functionality.

??? "Kedro configuration loader features"
Here are a few features of the Kedro configuration loader which are not possible with a `yaml.safe_load` alone. For more details, refer to Kedro's [documentation on advanced configuration](https://docs.kedro.org/en/stable/configuration/advanced_configuration.html).

- [Configuration environments](https://docs.kedro.org/en/stable/configuration/configuration_basics.html#configuration-environments) to organize settings that might be different between your different [development and production environments](run-deploy.md). For example, you might have different s3 buckets for development and production data.
- [Recursive scanning for configuration files](https://docs.kedro.org/en/stable/configuration/configuration_basics.html#configuration-loading) to merge complex configuration that is split across multiple files and folders.
- [Templating (variable interpolation)](https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#catalog) and [dynamically computed values (resolvers)](https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#how-to-use-resolvers-in-the-omegaconfigloader).
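
As a minimal sketch of that claim (assuming a flat `catalog.yaml` in the working directory with no templating in play), both approaches yield the same dictionary for the simple case:

```python
from pathlib import Path

import yaml
from kedro.config import OmegaConfigLoader

plain = yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8"))
loaded = OmegaConfigLoader(conf_source=".")["catalog"]

# Identical for a flat file; they diverge once environments,
# interpolation, or resolvers are used.
assert plain == loaded
```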

## Use datasets from the Kedro Data Catalog

-`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). It supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:
+Vizro provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html) in the module [`vizro.integrations.kedro`](../API-reference/kedro-integration.md). These functions support both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro `catalog`, the general pattern to add datasets to the Vizro data manager is:

```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager
-for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
-    data_manager[dataset_name] = dataset
+for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
+    data_manager[dataset_name] = dataset_loader
```

-This imports all datasets of type [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) from the Kedro `catalog` into the Vizro `data_manager`.
+The code above registers all data sources of type [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) in the Kedro `catalog` with Vizro's `data_manager`. You can now [reference the data source](data.md#reference-by-name) by name. For example, given the [above `catalog.yaml` file](#create-a-kedro-data-catalog), you could use the data source names `"cars"`, `"motorbikes"`, `"trains"`, and `"trucks"` with `px.scatter("cars", ...)`.

+!!! note
+    Data sources imported from Kedro in this way are [dynamic data](data.md#dynamic-data). This means that the data can be refreshed while your dashboard is running. For example, if you run a Kedro pipeline, the latest data is shown in the Vizro dashboard without restarting it.
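
For example, here is a minimal sketch of a dashboard that uses the registered `"cars"` data source (the column names `mpg` and `hp` are assumed for illustration and depend on your `cars.csv`):

```python
import vizro.models as vm
import vizro.plotly.express as px
from vizro import Vizro

# "cars" is looked up in the Vizro data manager each time the figure loads,
# so a Kedro pipeline run that rewrites cars.csv shows up on refresh.
page = vm.Page(
    title="Cars",
    components=[vm.Graph(figure=px.scatter("cars", x="mpg", y="hp"))],
)

Vizro().build(vm.Dashboard(pages=[page])).run()
```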

The `catalog` variable may have been created in a number of different ways:

-1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.catalog_from_project` to generate a `catalog` given the path to a Kedro project.
+1. Data Catalog configuration file (`catalog.yaml`), [created as described above](#create-a-kedro-data-catalog). This generates a `catalog` variable independently of a Kedro project using [`DataCatalog.from_config`](https://docs.kedro.org/en/stable/kedro.io.DataCatalog.html#kedro.io.DataCatalog.from_config).
+1. Kedro project path. Vizro exposes a helper function [`catalog_from_project`](../API-reference/kedro-integration.md#vizro.integrations.kedro.catalog_from_project) to generate a `catalog` given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `catalog`.
-1. Data Catalog configuration file (`catalog.yaml`). This can create a `catalog` entirely independently of a Kedro project using [`kedro.io.DataCatalog.from_config`](https://docs.kedro.org/en/stable/kedro.io.DataCatalog.html#kedro.io.DataCatalog.from_config).

The full code for these different cases is given below.

!!! example "Import a Kedro Data Catalog into the Vizro data manager"
=== "app.py (Kedro project path)"
=== "app.py (Data Catalog configuration file)"
```python
from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog # (1)!
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager
project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)

conf_loader = OmegaConfigLoader(conf_source=".") # (2)!
catalog = DataCatalog.from_config(conf_loader["catalog"]) # (3)!
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

=== "app.ipynb (Kedro Jupyter session)"
1. Kedro's [experimental `KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature) would also work.
1. This [loads and parses configuration in `catalog.yaml`](https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#advanced-configuration-without-a-full-kedro-project). The argument `conf_source="."` specifies that `catalog.yaml` is found in the same directory as `app.py` or a subdirectory beneath this level. In a more complex setup, this could include [configuration environments](https://docs.kedro.org/en/stable/configuration/configuration_basics.html#configuration-environments), for example to organize configuration for development and production data sources.
1. If you have [credentials](https://docs.kedro.org/en/stable/configuration/credentials.html) then these can be injected with `DataCatalog.from_config(conf_loader["catalog"], conf_loader["credentials"])`.

=== "app.py (Kedro project path)"
```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager
project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```

=== "app.py (Data Catalog configuration file)"
=== "app.ipynb (Kedro Jupyter session)"
```python
from kedro.io import DataCatalog
import yaml

from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager
catalog = DataCatalog.from_config(yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8")))

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
data_manager[dataset_name] = dataset_loader
```
@@ -83,7 +141,7 @@ kedro_integration.datasets_from_catalog(catalog, pipeline=pipelines["__default__"])

The `pipelines` variable may have been created in the following ways:

-1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.pipelines_from_project` to generate a `pipelines` given the path to a Kedro project.
+1. Kedro project path. Vizro exposes a helper function [`pipelines_from_project`](../API-reference/kedro-integration.md#vizro.integrations.kedro.pipelines_from_project) to generate a `pipelines` given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `pipelines`.

The full code for these different cases is given below.
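
A minimal sketch of the Kedro project path case, combining the helpers named above:

```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)
pipelines = kedro_integration.pipelines_from_project(project_path)

# Register only the datasets that the default pipeline actually uses.
for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
    catalog, pipeline=pipelines["__default__"]
).items():
    data_manager[dataset_name] = dataset_loader
```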
2 changes: 2 additions & 0 deletions vizro-core/docs/pages/user-guides/run-deploy.md
@@ -7,6 +7,8 @@ Typically when you create a dashboard, there are two distinct stages:

This guide describes methods to run your dashboard _in development_ and _in production_. Follow either section based on your current need.

+If your data sources in development and production are different (for example, you have different s3 buckets for development and production data) then you might like to [use the Kedro Data Catalog](kedro-data-catalog.md#use-datasets-from-the-kedro-data-catalog) to manage your data source configuration.
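
A minimal sketch of that setup, assuming a `conf/` folder with `base` and `prod` [configuration environments](https://docs.kedro.org/en/stable/configuration/configuration_basics.html#configuration-environments) (the folder names are illustrative):

```python
from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog

# conf/base/catalog.yaml holds development defaults;
# conf/prod/catalog.yaml overrides, for example, the s3 filepaths.
conf_loader = OmegaConfigLoader(conf_source="conf", base_env="base", env="prod")
catalog = DataCatalog.from_config(conf_loader["catalog"])
```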

Vizro is built on top of [Dash](https://dash.plotly.com/), which itself uses [Flask](https://flask.palletsprojects.com/). Most of our guidance on how to run a Vizro app in development or production is very similar to guidance on Dash and Flask.

!!! note
5 changes: 1 addition & 4 deletions vizro-core/src/vizro/integrations/kedro/_data_manager.py
@@ -107,9 +107,6 @@ def datasets_from_catalog(catalog: CatalogProtocol, *, pipeline: Pipeline = None
    for dataset_name, dataset_config in kedro_datasets.items():
        # "type" key always exists because we filtered out patterns that resolve to empty dictionary above.
        if "pandas" in dataset_config["type"]:
-            # TODO: in future update to use lambda: catalog.load(dataset_name) instead of _get_dataset
-            # but need to check if works with caching.
-            dataset = catalog._get_dataset(dataset_name, suggest=False)
-            vizro_data_sources[dataset_name] = dataset.load
+            vizro_data_sources[dataset_name] = lambda dataset_name=dataset_name: catalog.load(dataset_name)

    return vizro_data_sources
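
The default-argument binding in the added line matters: a bare `lambda: catalog.load(dataset_name)` inside the loop would late-bind `dataset_name`, so every loader would load the last dataset. A self-contained sketch of the pitfall:

```python
# Every plain lambda closes over the same loop variable...
loaders = {name: (lambda: name) for name in ["cars", "trains"]}
assert loaders["cars"]() == "trains"  # all loaders see the final value

# ...while a default argument captures the current value on each iteration.
loaders = {name: (lambda name=name: name) for name in ["cars", "trains"]}
assert loaders["cars"]() == "cars"
```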
2 changes: 2 additions & 0 deletions vizro-core/src/vizro/managers/_data_manager.py
@@ -170,6 +170,8 @@ def __setitem__(self, name: DataSourceName, data: Union[pd.DataFrame, pd_DataFrameCallable]):
        # Once partial has been used, all dynamic data sources are on equal footing since they're all treated as
        # functions rather than bound methods, e.g. by flask_caching.utils.function_namespace. This makes it much
        # simpler to use flask-caching reliably.
+        # Note that for kedro>=0.19.9 we use lambda: catalog.load(dataset_name) rather than dataset.load so the
+        # bound method case no longer arises when using kedro integration.
        # It's important the __qualname__ is the same across all workers, so use the data source name rather than
        # e.g. the repr method that includes the id of the instance so would only work in the case that gunicorn is
        # running with --preload.
@@ -3,7 +3,7 @@
filepath: "{pandas_factory}.csv"

pandas_excel:
type: pandas.ExcelDataset
type: ${_pandas_excel_type}
filepath: pandas_excel.xlsx

pandas_parquet:
@@ -13,3 +13,6 @@ pandas_parquet:
not_dataframe:
  type: pickle.PickleDataset
  filepath: pickle.pkl

+# Use variable interpolation to check OmegaConfigLoader does what is expected over just yaml.safe_load.
+_pandas_excel_type: pandas.ExcelDataset
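
A quick way to see what this exercises (a sketch, run from the directory containing this `catalog.yaml`): `yaml.safe_load` keeps the `${_pandas_excel_type}` reference literal, while `OmegaConfigLoader` interpolates it and filters out the underscore-prefixed helper key.

```python
from pathlib import Path

import yaml
from kedro.config import OmegaConfigLoader

raw = yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8"))
resolved = OmegaConfigLoader(conf_source=".")["catalog"]

print(raw["pandas_excel"]["type"])       # ${_pandas_excel_type}
print(resolved["pandas_excel"]["type"])  # pandas.ExcelDataset
```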