local dlt pipeline cli runner #800
Comments
Nice!
Hi @rudolfix, and thanks for considering our feedback from Slack. Like @mehd-io, I'm already using your second scenario with a custom CLI wrapper based on Click, so I'd support that option as well. I also agree with @mehd-io's concerns about the first option. Furthermore, my own concern is that creating a CLI runner for a source/resource might lead to confusion, especially for those less familiar with the tool. In this approach, a pipeline is created behind the scenes, but on the surface it might blur the distinction between a pipeline and a source/resource, as the latter might also function as a pipeline in practice, given the option to run it as such. Just my 2 cents!
@sultaniman please read the code in |
Background
We are looking for a convenient way to execute dlt pipelines from the command line, possibly with minimal or no additional code.
There are two options to investigate (not mutually exclusive, we just need to start somewhere):
1. a CLI runner that instantiates and runs a source/resource directly, creating a pipeline behind the scenes
2. a CLI runner that executes an existing pipeline script, overriding the destination/dataset for the run
In case of (1) the user would specify the module and the name of the source(s), and then the parameters needed to instantiate the source (we can use the fire library to create CLI interfaces automatically for source/resource functions: https://github.com/google/python-fire). The command would create an instance of a dlt pipeline, attach a destination and dataset to it, import the desired source, create an instance with the passed parameters and then run it (see the sketch below).
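A minimal sketch of what option (1) could look like, assuming python-fire for the CLI layer. The module path `my_sources.github`, the `github_source` function and all default values are made up for illustration; a real runner would import the source dynamically rather than hard-code it.

```python
# Hypothetical sketch of option (1): expose a source function as a CLI with
# python-fire and run it on a pipeline created behind the scenes.
import dlt
import fire

from my_sources.github import github_source  # hypothetical @dlt.source function


def run_source(destination: str = "duckdb", dataset_name: str = "github_data", **source_kwargs):
    """Instantiate the source with CLI-provided kwargs and run it on an ad-hoc pipeline."""
    pipeline = dlt.pipeline(
        pipeline_name="github_source_runner",
        destination=destination,
        dataset_name=dataset_name,
    )
    load_info = pipeline.run(github_source(**source_kwargs))
    print(load_info)


if __name__ == "__main__":
    # fire maps the function arguments to CLI flags, e.g.
    #   python run_source.py --destination bigquery --dataset_name prod --api_token ...
    fire.Fire(run_source)
```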
In case of (2) the user would write a pipeline script where the source(s) and pipeline are instantiated, and then pass the script name plus the pipeline and source names to the runner, which would run them (overriding the destination/dataset etc. for the actual run, so it is possible to switch from a dev destination to the production one); a sketch of this follows.
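A rough sketch of one possible reading of option (2), not the actual proposal: the runner imports the user's script, picks up the pipeline and source objects declared at module level by name, and overrides destination/dataset for the run. The module, pipeline and source names in the usage example are invented.

```python
# Hypothetical sketch of option (2): run a pipeline declared in a user script
# with destination/dataset overridden from the command line.
import importlib
import sys

import dlt


def run_from_script(script_module: str, pipeline_name: str, source_name: str,
                    destination: str, dataset_name: str) -> None:
    # importing the script assumes it only declares objects at module level
    # (or guards its own run under `if __name__ == "__main__"`)
    module = importlib.import_module(script_module)

    pipeline: dlt.Pipeline = getattr(module, pipeline_name)
    source = getattr(module, source_name)

    # destination/dataset are overridden for this run only, so the same script
    # can be switched from a dev destination to the production one
    load_info = pipeline.run(source, destination=destination, dataset_name=dataset_name)
    print(load_info)


if __name__ == "__main__":
    # e.g. python runner.py my_project.github_pipeline pipeline github_source bigquery prod_data
    run_from_script(*sys.argv[1:6])
```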
Both (1) and (2) have a few common features that we already have in our Airflow helper (https://github.com/dlt-hub/dlt/blob/master/dlt/helpers/airflow_helper.py#L39 and runner example: ).
An option to backfill could be available for resources that use the Incremental class for incremental loading and are aware of external schedulers. In that case a start and end value could be passed from the CLI (not only dates but also timestamps or integers, whatever is used as the incremental cursor); see the sketch below.
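A minimal sketch of how the backfill option could map CLI-provided start/end values onto dlt's incremental loading, assuming the `initial_value`/`end_value` arguments of `dlt.sources.incremental`. The `events` resource, its cursor field and the placeholder data are invented for illustration.

```python
# Hypothetical backfill sketch: the CLI runner passes --start/--end and the
# runner rebinds the resource's Incremental instance to that range.
import dlt


@dlt.resource
def events(updated_at=dlt.sources.incremental("updated_at", initial_value=0)):
    # placeholder rows; a real resource would query its API/database for the
    # range between updated_at.last_value and updated_at.end_value
    end = updated_at.end_value if updated_at.end_value is not None else 100
    yield from ({"id": i, "updated_at": i} for i in range(updated_at.last_value, end))


def backfill(start: int, end: int) -> None:
    # the cursor can be anything comparable: dates, timestamps or plain integers
    pipeline = dlt.pipeline(
        pipeline_name="events_backfill", destination="duckdb", dataset_name="events"
    )
    info = pipeline.run(
        events(updated_at=dlt.sources.incremental("updated_at", initial_value=start, end_value=end))
    )
    print(info)
```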