
Allow to backup and restore pipeline working directory #944

Open
rudolfix opened this issue Feb 7, 2024 · 0 comments
Labels
enhancement New feature or request

rudolfix commented Feb 7, 2024

Background
Backup and restore functions allow working with snapshots of the pipeline working directory. The working directory contains the full pipeline state and the load packages, so having such functions enables, for example:

  1. taking a snapshot after the extract step and using it to load the same extracted data into several destinations (we can call this "forking" a pipeline; a usage sketch follows this list)
  2. taking a snapshot after the normalize step and using it to load data into several destinations of the same type (e.g. two postgres databases) or of different types, if the file format is compatible: the same parquet files can be loaded into duckdb, bigquery and snowflake
  3. taking a snapshot after load/run to save the pipeline state (and schemas) for destinations that do not support state sync (e.g. sinks)
  4. taking a snapshot after a failed load to finalize it later
  5. taking snapshots after extract or normalize and keeping them to be able to "replay" loading
  6. using snapshots to pass packages between independently running extract, normalize and load steps, e.g. to scale horizontally
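
As a usage sketch of use case 1, assuming the hypothetical `backup_pipeline` / `restore_pipeline` helpers proposed under Requirements below (names, module path and signatures are illustrative, not an existing dlt API), forking a pipeline after extract could look like this:

```python
import dlt

# hypothetical helpers proposed in this issue; names, module path and
# signatures are illustrative (one possible implementation is sketched below)
from pipeline_snapshots import backup_pipeline, restore_pipeline


@dlt.resource
def events():
    yield [{"id": 1, "kind": "click"}]


# extract once
pipeline = dlt.pipeline(pipeline_name="events", destination="duckdb")
pipeline.extract(events)

# snapshot the working directory (state + extracted load packages) after extract
backup_pipeline(pipeline, bucket_url="s3://my-backups", snapshot_name="events_extract")

# "fork": restore the same snapshot under a new pipeline name and load elsewhere
fork = restore_pipeline(
    bucket_url="s3://my-backups",
    snapshot_name="events_extract",
    pipeline_name="events_fork",
    destination="bigquery",
)
fork.normalize()
fork.load()
```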

Note: a snapshot must be restored into a local file system; dlt does not support fsspec for its working folder (though it does work on FUSE mounts).

Requirements
- The implementation idea is a pair of helper functions that take a pipeline instance, a filesystem destination and a snapshot name as input, zip the pipeline state and load packages, and upload the archive to a bucket (a minimal sketch of the backup half follows this list).
- Restore works the opposite way.
- It would be cool to have a context manager that wraps a pipeline run and makes sure state is restored before the run and backed up after it (sink support); a second sketch after the PoC link outlines this.
- It would be cool to somehow chunk the zipped files if they get too large.
- The user should be able to tell the helper function what to back up (only state; also data; also completed packages).
- We should support forking a pipeline on restore by allowing the pipeline name and destination to be changed.
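
A minimal sketch of the backup half, assuming the local working directory is exposed as `pipeline.working_dir` and using `fsspec` directly for the upload; a real helper would likely reuse the filesystem destination's configured credentials instead:

```python
import os
import shutil
import tempfile

import dlt
import fsspec


def backup_pipeline(pipeline: dlt.Pipeline, bucket_url: str, snapshot_name: str) -> str:
    """Zip the pipeline working directory and upload it to a bucket."""
    with tempfile.TemporaryDirectory() as tmp:
        # zip the whole working dir: pipeline state, schemas and load packages
        archive = shutil.make_archive(
            os.path.join(tmp, snapshot_name), "zip", root_dir=pipeline.working_dir
        )
        # resolve the filesystem implementation and root path from the bucket url
        fs, _, (root,) = fsspec.get_fs_token_paths(bucket_url)
        remote_path = f"{root}/{snapshot_name}.zip"
        fs.put(archive, remote_path)
    return remote_path
```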

PoC: https://gist.github.com/rudolfix/ee6e16d8671f26ac4b9ffc915ad24b6e
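
And a sketch of the proposed context manager for the sink case, built on the helpers above; `restore_pipeline_into` is a hypothetical variant that unpacks a snapshot into an existing pipeline's working directory, and everything here is illustrative rather than an existing dlt API:

```python
from contextlib import contextmanager

# hypothetical helpers, see the sketch above
from pipeline_snapshots import backup_pipeline, restore_pipeline_into


@contextmanager
def synced_state(pipeline, bucket_url: str, snapshot_name: str):
    """Restore the working directory before a run and back it up afterwards."""
    try:
        restore_pipeline_into(pipeline, bucket_url, snapshot_name)
    except FileNotFoundError:
        pass  # no snapshot yet on the very first run
    try:
        yield pipeline
    finally:
        # back up even after a failed load so it can be finalized later (use case 4)
        backup_pipeline(pipeline, bucket_url=bucket_url, snapshot_name=snapshot_name)


# usage with a sink destination that cannot store state itself:
# with synced_state(pipeline, "s3://my-backups", "events_state") as p:
#     p.run(events)
```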
