Best practices for sharing configuration across multiple catalog files #3625

sbrugman opened this issue Feb 16, 2024 · 5 comments
Labels
Community Issue/PR opened by the open-source community


@sbrugman
Contributor

There are currently at least three dynamic techniques for defining the Data Catalog in Kedro: YAML anchors, variable interpolation via the OmegaConfigLoader, and dataset factories.

There is a common usage pattern for which even a combination of these three is not expressive enough to specify the configuration satisfactorily.

Scenario

Each pipeline consists of a set of nodes of a few distinct types, e.g. source (read-only), intermediate (written locally) and final (written to some database).

Multiple people are working on different pipelines, so the catalog is split into one file per pipeline:

conf/base/catalog/
    pipeline_a.yml
    pipeline_b.yml
    pipeline_c.yml

The split files provide a clear overview and prevent conflicts when working in parallel.
So far so good.

Before long, each of these files starts to look like this:

# Templates
_source_table: &source_table
  type: datasets.SourceTable
  read_only: true
  metadata:
    kedro-viz:
      layer: source

# More templates…

_output_table: &output_table
  type: datasets.ProductionDB
  database: constant
  mode: overwrite
  metadata:
    kedro-viz:
      layer: primary

# Datasets
a_ds1:
  <<: *source_table
  database: hello
  table: world

# More datasets…

a_ds_15:
  <<: *output_table
  table: foobar

Each file contains more or less the same YAML anchors, which according to the spec cannot be shared across files. Is there a way to store this common information in a single file instead of repeating it in every one of them, while keeping a single file per pipeline?
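For reference, the YAML spec resolves anchors per document at parse time, which is exactly why they cannot be shared across files. A small sketch with PyYAML (assuming `pyyaml` is installed) shows the merge key working within one document and an alias failing when its anchor lives in a different document:

```python
import yaml

# An anchor, the merge key (<<) and a per-entry override all work
# within a single YAML document.
doc = """
_source_table: &source_table
  type: datasets.SourceTable
  read_only: true

a_ds1:
  <<: *source_table
  database: hello
  table: world
"""
catalog = yaml.safe_load(doc)
print(catalog["a_ds1"])  # merged entry: type, read_only, database, table

# An alias referring to an anchor defined in a *different* document fails
# at parse time: anchors are scoped to one document, never shared.
try:
    yaml.safe_load("b_ds1:\n  <<: *source_table\n")
except yaml.YAMLError as exc:
    print("cross-document alias fails:", type(exc).__name__)
```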

Perhaps variable interpolation?
Even though the OmegaConfigLoader can interpolate a dict, this (afaik) does not allow partially reusing a dict, such as we can do in Python (or with the anchors above):

source = {"read_only": True}


my_table = {
    **source,
    "database": "db",
}

The dataset factories also do not support this pattern.
The following would be close, but is too restrictive:

"{name}_source”:
  type: datasets.SourceTable
  read_only: true
  metadata:
    kedro-viz:
      layer: source
  Table: {name}

Restrictions:

  • All information needs to be in the dataset name. This is undesirable for long or badly named tables, and even more so if the database name and the layer information must be encoded there too. The dataset name should be an alias.
  • The flexibility to override individual properties that we had with the YAML anchors does not seem possible.

Desiderata

The user needs to be able to template catalog entries across multiple files. It must be possible to override individual entries, and it should be possible for the dataset name to be an alias.
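To make the desiderata concrete, here is a hypothetical sketch in plain Python (the `TEMPLATES`/`expand` names are invented for illustration) of the merge semantics being asked for: templates defined once in a shared place, per-pipeline entries that reference a template by name, and entry-level keys winning on conflict:

```python
# Hypothetical: shared templates would live in one file, while each
# per-pipeline catalog entry names a template and overrides individual
# keys. Entry keys take precedence, as with YAML merge keys.
TEMPLATES = {
    "source_table": {"type": "datasets.SourceTable", "read_only": True},
    "output_table": {"type": "datasets.ProductionDB", "mode": "overwrite"},
}


def expand(entry):
    """Merge a catalog entry onto its named template; entry keys win."""
    entry = dict(entry)  # do not mutate the caller's dict
    base = TEMPLATES.get(entry.pop("template", ""), {})
    return {**base, **entry}


a_ds1 = expand({"template": "source_table", "database": "hello", "table": "world"})
print(a_ds1["type"], a_ds1["read_only"])  # -> datasets.SourceTable True
```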

@MigQ2
Contributor

MigQ2 commented Sep 14, 2024

Hi @sbrugman, I am having the same issue you mention here: I want more flexibility and would like to avoid copying common YAML configs over and over, but I don't see an easy way to do this in the current Kedro ecosystem.

Did you find any elegant solution?

I think playing with OmegaConf resolvers might make it work, but it makes things quite unreadable and complicated.

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Sep 17, 2024
@MigQ2
Contributor

MigQ2 commented Sep 18, 2024

I am adding an interesting Slack discussion here, where the interaction between factories and OmegaConf resolvers is discussed, so that we can have it as a reference on this topic:

https://linen-slack.kedro.org/t/22708331/hi-is-it-possible-to-use-a-dataset-factory-in-config-resolve

Initial message:

Hi, is it possible to use a dataset factory in config resolver?
An example:

"{name}_feature":
 type: pandas.ParquetDataset
 filepath: data/04_feature/{name}_feature.parquet
 metadata:
   pandera:
     schema: ${pa.python:my_kedro_project.pipelines.feature_preprocessing.schemas.{name}_feature_schema}

The above gives me this:

omegaconf.errors.GrammarParseError: mismatched input '{' expecting BRACE_CLOSE
   full_key: {name}_feature.metadata.pandera.schema
   object_type=dict

@astrojuanlu
Member

@sbrugman @MigQ2 sorry for the slightly slow reply here. We're going over old, unaddressed issues.

Would globals.yml work for your use case?

@sbrugman
Contributor Author

sbrugman commented Nov 8, 2024

If globals.yml injects the YAML anchors in the same file under the hood, then that could work. Is that what you had in mind?

@astrojuanlu
Member

It's unclear to me whether YAML anchors can be shared. But variables and blocks of YAML definitely can. The way of using them deviates from normal YAML syntax though, and I don't think the "inheritance" provided by YAML anchors is supported in OmegaConf.
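To illustrate the shape of that, a hedged sketch: Kedro's OmegaConfigLoader exposes a `${globals:...}` resolver for values defined in `conf/base/globals.yml`, so a whole template block can be shared across catalog files. But the interpolation substitutes the entire value, so overriding individual keys per entry is still not expressible this way:

```yaml
# conf/base/globals.yml  (shared across all catalog files)
_source_table:
  type: datasets.SourceTable
  read_only: true

# conf/base/catalog/pipeline_a.yml
# Works: the whole template block is injected as the dataset definition...
a_ds0: ${globals:_source_table}

# ...but there is no way to also set e.g. `database: hello` on top of it,
# because ${globals:...} replaces the entire value rather than merging.
```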

@astrojuanlu astrojuanlu moved this from Needs more info to Wizard inbox in Kedro Wizard 🪄 Nov 8, 2024