use source scoped configs for all pipeline steps #816

rudolfix · 2023-12-11T11:34:48Z

Background
The extract step allows configuration and secrets to be specified per source section/name. For example, parallelism settings, buffer sizes etc. may differ for each processed source (e.g. https://dlthub.com/docs/reference/performance#controlling-in-memory-buffers).

The same rules do not apply to the normalize and load steps. E.g. the loader file format (#815) may be set only globally (or scoped to a particular pipeline). You cannot declare that a certain source prefers a given file format.

In this ticket we'll unify that behavior. We introduce the following default section layout to configure pipeline steps:
<pipeline_name>.sources.<source_section>.<source_name>.<step_name> where <pipeline_name> and <step_name> are sticky and the other sections are eliminated in the standard way. For example: if I want to set the number of workers in the normalize stage, all of the following settings will work:

zendesk_pipeline.sources.zendesk.normalize.workers=1
sources.zendesk.normalize.workers=1
normalize.workers=1
zendesk_pipeline.normalize.workers=1

the following won't:

workers=1  # breaking change from 0.3.x - step settings are scoped by sticky sections
zendesk_pipeline.workers=1  # same!
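
A minimal sketch of the lookup order described above, assuming that "sticky" means the step name is always kept, the pipeline name is tried first and then dropped, and the remaining sections are eliminated from the right. The function name `candidate_keys` and its parameters are hypothetical and are not part of dlt; the exact ordering is also an assumption.

```python
from typing import Iterator, Sequence


def candidate_keys(
    pipeline_name: str,
    source_sections: Sequence[str],
    step_name: str,
    key: str,
) -> Iterator[str]:
    """Yield dotted config keys to try, most specific first (hypothetical helper)."""
    # Try with the pipeline name first, then without it (assumed order).
    for prefix in ((pipeline_name,), ()):
        # Eliminate the source-related sections one by one, from the right.
        for n in range(len(source_sections), -1, -1):
            # The step name is sticky: it is never eliminated.
            yield ".".join((*prefix, *source_sections[:n], step_name, key))


keys = list(
    candidate_keys("zendesk_pipeline", ("sources", "zendesk"), "normalize", "workers")
)
for k in keys:
    print(k)
```

Under these assumptions, a bare `workers` or `zendesk_pipeline.workers` is never generated, which matches the breaking change noted above.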

a few more examples:

[sources.zendesk.extract]
[sources.zendesk.extract.data_writer]

[sources.zendesk.normalize]
[sources.zendesk.normalize.data_writer]

[sources.zendesk.load]

Breaking changes to 0.3.x:

The [sources.zendesk.data_writer] section will stop working. It was implicitly applied to the extract step. It must be replaced with
[sources.zendesk.extract.data_writer] or [sources.zendesk.normalize.data_writer]

[data_writer], which was applied to all steps (which btw. did not make much sense), now requires an exact scope:
[extract.data_writer] or [normalize.data_writer]

normalize and load settings could be defined at the top level (which also didn't make sense!), i.e.
workers=1 applied to both the load and normalize steps. Now an explicit step name must be used (which was the official way anyway):
normalize.workers=1
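
To find settings affected by these breaking changes, something like the following check could be run over the dotted keys of an existing config. This is only a sketch: `is_legacy_key` is a hypothetical helper, not a dlt function, and it covers only the `data_writer` and `workers` cases listed above. The `buffer_max_items` key is used purely as an example setting.

```python
STEP_NAMES = {"extract", "normalize", "load"}
# Per the list above, data_writer must now be scoped to extract or normalize.
WRITER_STEPS = {"extract", "normalize"}


def is_legacy_key(key: str) -> bool:
    """Return True if a dotted config key relies on the old, unscoped layout."""
    parts = key.split(".")
    if "data_writer" in parts:
        i = parts.index("data_writer")
        # data_writer must be directly preceded by a step name.
        return i == 0 or parts[i - 1] not in WRITER_STEPS
    if parts[-1] == "workers":
        # workers must be directly preceded by a step name.
        return len(parts) == 1 or parts[-2] not in STEP_NAMES
    return False


old_keys = [
    "sources.zendesk.data_writer.buffer_max_items",
    "data_writer.buffer_max_items",
    "workers",
    "normalize.workers",
]
legacy = [k for k in old_keys if is_legacy_key(k)]
print(legacy)
```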

Tasks

- allow injecting an obligatory config section with the step name (e.g. normalize) that will always be present at the end of the section list when resolving a config value. this will allow for things like normalize.workers.
- inject the source name and section in the normalize and load steps, the same way we do it in extract. that will require a code refactor where we first read a load package, take the source name from the schema and then instantiate the Load and Normalize objects, so the code becomes similar to the extract code.
- the schema should also store the section from the source so we have the full injection section, like the source does.
- existing Normalize and Load configurations (and their corresponding storages) should still use the same sections in with_config but they should prefer existing sections (to yield to the sections set by the pipeline with_section decorator)
- these changes should not impact destination settings. they


github-project-automation bot added this to dlt core library Dec 11, 2023

github-project-automation bot moved this to Todo in dlt core library Dec 11, 2023

rudolfix added the devel label Dec 11, 2023

rudolfix changed the title ~~[WIP] use source scoped configs for all pipeline steps~~ use source scoped configs for all pipeline steps Dec 12, 2023

rudolfix removed the devel label Dec 23, 2023
