
Allow to backup and restore pipeline working directory #944

Open
rudolfix opened this issue Feb 7, 2024 · 0 comments
Labels
enhancement New feature or request

rudolfix commented Feb 7, 2024

Background
Backup and restore functions allow working with snapshots of the pipeline working directory. The working directory contains the full pipeline state and the load packages, so having such functions enables, for example:

  1. taking a snapshot after the extract step and using it to load the same extracted data into several destinations (we can call this "forking" a pipeline; a usage sketch follows this list)
  2. taking a snapshot after the normalize step and using it to load data into several destinations of the same type (e.g. two postgres databases) or of different types, if the file format is compatible: the same parquet files can be loaded into duckdb, bigquery and snowflake
  3. taking a snapshot after load/run to save the pipeline state (and schemas) for destinations that do not support state sync (e.g. sinks)
  4. taking a snapshot after a failed load to finalize it later
  5. taking snapshots after extract or normalize and keeping them to be able to "replay" loading
  6. using snapshots to pass packages between independently running extract, normalize and load steps, e.g. to scale horizontally
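
As a usage sketch of use case 1, assuming the hypothetical `backup_pipeline` / `restore_pipeline` helpers proposed under Requirements below (names, module path and signatures are illustrative, not an existing dlt API), forking a pipeline after extract could look like this:

```python
import dlt

# hypothetical helpers proposed in this issue; names, module path and
# signatures are illustrative (one possible implementation is sketched below)
from pipeline_snapshots import backup_pipeline, restore_pipeline


@dlt.resource
def events():
    yield [{"id": 1, "kind": "click"}]


# extract once
pipeline = dlt.pipeline(pipeline_name="events", destination="duckdb")
pipeline.extract(events)

# snapshot the working directory (state + extracted load packages) after extract
backup_pipeline(pipeline, bucket_url="s3://my-backups", snapshot_name="events_extract")

# "fork": restore the same snapshot under a new pipeline name and load elsewhere
fork = restore_pipeline(
    bucket_url="s3://my-backups",
    snapshot_name="events_extract",
    pipeline_name="events_fork",
    destination="bigquery",
)
fork.normalize()
fork.load()
```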

Note: a snapshot must be restored into a local file system; dlt does not support fsspec for its working folder (though it does work on FUSE mounts).

Requirements
- The implementation idea is a pair of helper functions that take a pipeline instance, a filesystem destination and a snapshot name as input, zip the pipeline state and load packages, and upload the archive to a bucket (a minimal sketch of the backup half follows this list).
- Restore works the opposite way.
- It would be cool to have a context manager that wraps a pipeline run and makes sure state is restored before the run and backed up after it (sink support); a second sketch after the PoC link outlines this.
- It would be cool to somehow chunk the zipped files if they get too large.
- The user should be able to tell the helper function what to back up (only state; also data; also completed packages).
- We should support forking a pipeline on restore by allowing the pipeline name and destination to be changed.
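
A minimal sketch of the backup half, assuming the local working directory is exposed as `pipeline.working_dir` and using `fsspec` directly for the upload; a real helper would likely reuse the filesystem destination's configured credentials instead:

```python
import os
import shutil
import tempfile

import dlt
import fsspec


def backup_pipeline(pipeline: dlt.Pipeline, bucket_url: str, snapshot_name: str) -> str:
    """Zip the pipeline working directory and upload it to a bucket."""
    with tempfile.TemporaryDirectory() as tmp:
        # zip the whole working dir: pipeline state, schemas and load packages
        archive = shutil.make_archive(
            os.path.join(tmp, snapshot_name), "zip", root_dir=pipeline.working_dir
        )
        # resolve the filesystem implementation and root path from the bucket url
        fs, _, (root,) = fsspec.get_fs_token_paths(bucket_url)
        remote_path = f"{root}/{snapshot_name}.zip"
        fs.put(archive, remote_path)
    return remote_path
```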

PoC: https://gist.github.com/rudolfix/ee6e16d8671f26ac4b9ffc915ad24b6e
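
And a sketch of the proposed context manager for the sink case, built on the helpers above; `restore_pipeline_into` is a hypothetical variant that unpacks a snapshot into an existing pipeline's working directory, and everything here is illustrative rather than an existing dlt API:

```python
from contextlib import contextmanager

# hypothetical helpers, see the sketch above
from pipeline_snapshots import backup_pipeline, restore_pipeline_into


@contextmanager
def synced_state(pipeline, bucket_url: str, snapshot_name: str):
    """Restore the working directory before a run and back it up afterwards."""
    try:
        restore_pipeline_into(pipeline, bucket_url, snapshot_name)
    except FileNotFoundError:
        pass  # no snapshot yet on the very first run
    try:
        yield pipeline
    finally:
        # back up even after a failed load so it can be finalized later (use case 4)
        backup_pipeline(pipeline, bucket_url=bucket_url, snapshot_name=snapshot_name)


# usage with a sink destination that cannot store state itself:
# with synced_state(pipeline, "s3://my-backups", "events_state") as p:
#     p.run(events)
```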
