Best practices for sharing configuration across multiple catalog files #3625

sbrugman opened this issue Feb 16, 2024 · 5 comments
Labels
Community Issue/PR opened by the open-source community


@sbrugman
Contributor

There are currently at least three dynamic techniques for defining the Data Catalog in Kedro: YAML anchors, variable interpolation via the OmegaConfigLoader, and dataset factories.

There is a common usage pattern for which even a combination of these three is not expressive enough to specify the configuration satisfactorily.

Scenario

Each pipeline consists of a set of nodes of a few distinct types, e.g. source (read-only), intermediate (written locally) and final (written to some database).

Multiple people are working on different pipelines, so the catalog is split into one file per pipeline:

conf/base/catalog/
    pipeline_a.yml
    pipeline_b.yml
    pipeline_c.yml

The split files provide a clear overview and prevent conflicts when working in parallel.
So far so good.

Before long, each of these files starts to look like this:

# Templates
_source_table: &source_table
  type: datasets.SourceTable
  read_only: true
  metadata:
    kedro-viz:
      layer: source

# More templates…

_output_table: &output_table
  type: datasets.ProductionDB
  database: constant
  mode: overwrite
  metadata:
    kedro-viz:
      layer: primary

# Datasets
a_ds1:
  <<: *source_table
  database: hello
  table: world

# More datasets…

a_ds_15:
  <<: *output_table
  table: foobar

Each file contains more or less the same YAML anchors, which according to the spec cannot be shared across files. Is there a way to store this common information in a single file instead of repeating it in every one of them, while keeping a single file per pipeline?
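For reference, the YAML spec resolves anchors per document at parse time, which is exactly why they cannot be shared across files. A small sketch with PyYAML (assuming `pyyaml` is installed) shows the merge key working within one document and an alias failing when its anchor lives in a different document:

```python
import yaml

# An anchor, the merge key (<<) and a per-entry override all work
# within a single YAML document.
doc = """
_source_table: &source_table
  type: datasets.SourceTable
  read_only: true

a_ds1:
  <<: *source_table
  database: hello
  table: world
"""
catalog = yaml.safe_load(doc)
print(catalog["a_ds1"])  # merged entry: type, read_only, database, table

# An alias referring to an anchor defined in a *different* document fails
# at parse time: anchors are scoped to one document, never shared.
try:
    yaml.safe_load("b_ds1:\n  <<: *source_table\n")
except yaml.YAMLError as exc:
    print("cross-document alias fails:", type(exc).__name__)
```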

Perhaps variable interpolation?
Even though the OmegaConfigLoader can interpolate a dict, this (afaik) does not allow partially reusing a dict, such as we can do in Python (or with the anchors above):

source = {"read_only": True}


my_table = {
    **source,
    "database": "db",
}

The dataset factories also do not support this pattern.
The following would be close, but is too restrictive:

"{name}_source”:
  type: datasets.SourceTable
  read_only: true
  metadata:
    kedro-viz:
      layer: source
  Table: {name}

Restrictions:

  • All information needs to be in the dataset name. This is undesirable for long or badly named tables, and even more so if the database name and the layer information must be encoded there too. The dataset name should be an alias.
  • The flexibility to override individual properties that we had with the YAML anchors does not seem possible.

Desiderata

The user needs to be able to template catalog entries across multiple files. It must be possible to override individual entries, and it should be possible for the dataset name to be an alias.
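To make the desiderata concrete, here is a hypothetical sketch in plain Python (the `TEMPLATES`/`expand` names are invented for illustration) of the merge semantics being asked for: templates defined once in a shared place, per-pipeline entries that reference a template by name, and entry-level keys winning on conflict:

```python
# Hypothetical: shared templates would live in one file, while each
# per-pipeline catalog entry names a template and overrides individual
# keys. Entry keys take precedence, as with YAML merge keys.
TEMPLATES = {
    "source_table": {"type": "datasets.SourceTable", "read_only": True},
    "output_table": {"type": "datasets.ProductionDB", "mode": "overwrite"},
}


def expand(entry):
    """Merge a catalog entry onto its named template; entry keys win."""
    entry = dict(entry)  # do not mutate the caller's dict
    base = TEMPLATES.get(entry.pop("template", ""), {})
    return {**base, **entry}


a_ds1 = expand({"template": "source_table", "database": "hello", "table": "world"})
print(a_ds1["type"], a_ds1["read_only"])  # -> datasets.SourceTable True
```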

@MigQ2
Contributor

MigQ2 commented Sep 14, 2024

Hi @sbrugman, I am having the same issue you mention here: I want more flexibility and would like to avoid copying common YAML configs over and over, but I don't see an easy way to do this in the current Kedro ecosystem.

Did you find any elegant solution?

I think playing with OmegaConf resolvers might make it work, but it makes things quite unreadable and complicated.

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Sep 17, 2024
@MigQ2
Contributor

MigQ2 commented Sep 18, 2024

I am adding an interesting Slack discussion here, where the interaction between factories and OmegaConf resolvers is discussed, so that we can have it as a reference on this topic:

https://linen-slack.kedro.org/t/22708331/hi-is-it-possible-to-use-a-dataset-factory-in-config-resolve

Initial message:

Hi, is it possible to use a dataset factory in config resolver?
An example:

"{name}_feature":
 type: pandas.ParquetDataset
 filepath: data/04_feature/{name}_feature.parquet
 metadata:
   pandera:
     schema: ${pa.python:my_kedro_project.pipelines.feature_preprocessing.schemas.{name}_feature_schema}

The above gives me this:

omegaconf.errors.GrammarParseError: mismatched input '{' expecting BRACE_CLOSE
   full_key: {name}_feature.metadata.pandera.schema
   object_type=dict

@astrojuanlu
Member

@sbrugman @MigQ2 sorry for the slightly slow reply here. We're going over old, unaddressed issues.

Would globals.yml work for your use case?

@sbrugman
Contributor Author

sbrugman commented Nov 8, 2024

If globals.yml injects the YAML anchors in the same file under the hood, then that could work. Is that what you had in mind?

@astrojuanlu
Member

It's unclear to me whether YAML anchors can be shared. But variables and blocks of YAML definitely can. The way of using them deviates from normal YAML syntax though, and I don't think the "inheritance" provided by YAML anchors is supported in OmegaConf.
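To illustrate the shape of that, a hedged sketch: Kedro's OmegaConfigLoader exposes a `${globals:...}` resolver for values defined in `conf/base/globals.yml`, so a whole template block can be shared across catalog files. But the interpolation substitutes the entire value, so overriding individual keys per entry is still not expressible this way:

```yaml
# conf/base/globals.yml  (shared across all catalog files)
_source_table:
  type: datasets.SourceTable
  read_only: true

# conf/base/catalog/pipeline_a.yml
# Works: the whole template block is injected as the dataset definition...
a_ds0: ${globals:_source_table}

# ...but there is no way to also set e.g. `database: hello` on top of it,
# because ${globals:...} replaces the entire value rather than merging.
```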

@astrojuanlu astrojuanlu moved this from Needs more info to Wizard inbox in Kedro Wizard 🪄 Nov 8, 2024