use source scoped configs for all pipeline steps #816

Open
rudolfix opened this issue Dec 11, 2023 · 0 comments

rudolfix commented Dec 11, 2023

Background
The extract step allows configuration and secrets to be specified per source section/name. For example, parallelism settings, buffer sizes, etc. may be different for each processed source (e.g. https://dlthub.com/docs/reference/performance#controlling-in-memory-buffers).

The same rules do not apply to the normalize and load steps. For example, the loader file format (#815) may be set only globally (or scoped to a particular pipeline). You cannot declare that a certain source prefers a given file format.

In this ticket we'll unify that behavior. We introduce the following default section layout to configure pipeline steps:
<pipeline_name>.sources.<source_section>.<source_name>.<step_name>, where <pipeline_name> and <step_name> are sticky and the other sections are eliminated in the standard way. For example, if I want to set the number of workers in the normalize stage, all of the following settings will work:

zendesk_pipeline.sources.zendesk.normalize.workers=1
sources.zendesk.normalize.workers=1
normalize.workers=1
zendesk_pipeline.normalize.workers=1

The following won't:

workers=1  # breaking change from 0.3.x - step settings are scoped by sticky sections
zendesk_pipeline.workers=1  # same!
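
The lookup rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not dlt's actual resolver: `candidate_keys` is a hypothetical helper, "sticky" is assumed to mean that the pipeline name may only appear as the leading prefix and the step name always sits directly before the key, and "eliminated in the standard way" is assumed to mean that the middle sections are dropped one at a time from the right.

```python
from typing import List, Optional


def candidate_keys(
    key: str,
    step_name: str,
    pipeline_name: Optional[str] = None,
    source_section: Optional[str] = None,
    source_name: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: list the config paths that could resolve `key`."""
    # Middle sections that may be eliminated one by one, starting from the right
    # (an assumption about what "the standard way" means).
    optional = ["sources"] + [s for s in (source_section, source_name) if s]
    # The pipeline name is tried as a leading prefix first, then without it.
    prefixes = [[pipeline_name], []] if pipeline_name else [[]]
    paths: List[str] = []
    for prefix in prefixes:
        for keep in range(len(optional), -1, -1):
            # The step name is never eliminated: it always precedes the key.
            path = ".".join(prefix + optional[:keep] + [step_name, key])
            if path not in paths:
                paths.append(path)
    return paths


if __name__ == "__main__":
    for path in candidate_keys(
        "workers", "normalize", "zendesk_pipeline", "zendesk", "zendesk"
    ):
        print(path)
```

Under these assumptions the generated list contains normalize.workers and zendesk_pipeline.normalize.workers, but never a bare workers or zendesk_pipeline.workers, because the step name is always kept.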

a few more examples:

[sources.zendesk.extract]
[sources.zendesk.extract.data_writer]

[sources.zendesk.normalize]
[sources.zendesk.normalize.data_writer]

[sources.zendesk.load]

Breaking changes to 0.3.x:

The [sources.zendesk.data_writer] section will stop working. It was implicitly applied to the extract step. It must be replaced with
[sources.zendesk.extract.data_writer] or [sources.zendesk.normalize.data_writer]

[data_writer], which was applied to all steps (which, by the way, did not make much sense), now requires an exact scope:
[extract.data_writer] or [normalize.data_writer]

normalize and load settings could be defined at the top level (which also didn't make sense!), e.g.
workers=1 applied to both the load and normalize steps. Now an explicit step name must be used (which was the official way anyway):
normalize.workers=1
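
To make the breaking changes concrete, here is a rough sketch of a check that flags 0.3.x-style keys that would stop resolving. It is not part of dlt: `needs_step_scope` and the two step tuples are hypothetical names, the rules are just the three cases listed above, and `buffer_max_items` appears in the usage lines only as a sample data_writer key.

```python
# Steps that, per the list above, may own a data_writer section.
DATA_WRITER_STEPS = ("extract", "normalize")
# Steps that have a workers setting.
WORKER_STEPS = ("normalize", "load")


def needs_step_scope(key_path: str) -> bool:
    """Hypothetical check: True if a dotted config path lacks a step section."""
    parts = key_path.split(".")
    sections, key = parts[:-1], parts[-1]
    if "data_writer" in sections:
        idx = sections.index("data_writer")
        # data_writer must be directly preceded by extract or normalize.
        return idx == 0 or sections[idx - 1] not in DATA_WRITER_STEPS
    if key == "workers":
        # workers must be directly preceded by normalize or load.
        return not sections or sections[-1] not in WORKER_STEPS
    return False


if __name__ == "__main__":
    for path in (
        "workers",
        "normalize.workers",
        "sources.zendesk.data_writer.buffer_max_items",
        "sources.zendesk.extract.data_writer.buffer_max_items",
    ):
        print(path, "->", "needs step scope" if needs_step_scope(path) else "ok")
```

A check like this could be run over the flattened keys of an existing config file before upgrading. It only covers the cases named in this ticket.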

Tasks

    • allow injecting an obligatory config section with the step name (e.g. normalize) that will always be present at the end of the section list when resolving a config value. This will allow for things like normalize.workers.
    • inject the source name and section in the normalize and load steps, the same way we do it in extract. This will require a code refactor where we first read a load package, take the source name from the schema and then instantiate the Load and Normalize objects, so the code becomes similar to the extract code.
    • the schema should also store the section from the source so we have the full injection section, as for a source.
    • existing Normalize and Load configurations (and their corresponding storages) should still use the same sections in with_config, but they should prefer existing sections (to yield to sections as set by the pipeline with_section decorator).
    • these changes should not impact destination settings. they
@rudolfix rudolfix added the devel label Dec 11, 2023
@rudolfix rudolfix changed the title [WIP] use source scoped configs for all pipeline steps use source scoped configs for all pipeline steps Dec 12, 2023
@rudolfix rudolfix removed the devel label Dec 23, 2023