Background
Backup and restore functions allow working with snapshots of the pipeline working directory. The working directory contains the full pipeline state and the load packages, so such functions allow us, for example, to:

1. take a snapshot after the extract step and use it to load the same extracted data into several destinations (we can call this "forking" the pipeline; see the sketch below)
2. take a snapshot after the normalize step and use it to load data to several destinations of the same type (i.e. two postgres databases) or of different types (if the file format is compatible: you can load the same parquet files to duckdb, bigquery and snowflake)
3. take a snapshot after load/run to save the pipeline state (and schemas) for destinations that do not support state sync (i.e. sinks)
4. take a snapshot after a failed load to finalize it later
5. take snapshots after extract or normalize and keep them to be able to "replay" loading
6. use a snapshot to pass packages between independently running extract, normalize and load steps, i.e. to scale things horizontally
Note: a snapshot must be restored to the local file system; dlt does not support fsspec for its working folder (but it does work on fuse).
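To make the "forking" case concrete, here is a minimal sketch of the intended workflow. `backup_pipeline` and `restore_pipeline` are placeholders for the proposed helpers (they are not existing dlt API; see the sketch under Requirements and the PoC gist), and the bucket URL, pipeline and resource names are made up:

```python
import dlt

# `backup_pipeline` / `restore_pipeline` are the helpers proposed in this issue,
# not existing dlt API -- a sketch of them is given under Requirements below
from backup_helpers import backup_pipeline, restore_pipeline  # hypothetical module


@dlt.resource(table_name="items")
def items():
    yield from [{"id": 1}, {"id": 2}]


# extract once
pipeline = dlt.pipeline(pipeline_name="events", destination="duckdb")
pipeline.extract(items())

# snapshot the working directory (state + extracted packages) to a bucket
backup_pipeline(pipeline, "s3://my-bucket/snapshots", "after_extract")

# "fork" the pipeline: restore the same snapshot under new names and destinations,
# then finish normalize + load independently for each fork
for name, destination in [("events_bq", "bigquery"), ("events_sf", "snowflake")]:
    fork = restore_pipeline("s3://my-bucket/snapshots", "after_extract",
                            pipeline_name=name, destination=destination)
    fork.normalize()
    fork.load()
```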
Requirements
- The implementation idea is to have helper functions that take a pipeline instance, a filesystem destination and a snapshot name as input, zip the pipeline state and load packages, and push the archive to a bucket (a sketch follows this list).
- Restore works the opposite way.
- It would be cool to have a context manager wrapping a pipeline run that makes sure state is restored before the run and backed up after it (sink support).
- It would be cool to chunk the zipped files if they get too large.
- The user should be able to tell the helper function what to back up (only state; also data; also completed packages).
- We should support forking the pipeline on restore by allowing the pipeline name and destination to be changed.
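A minimal sketch of what such helpers and the context manager could look like, assuming `pipeline.working_dir` points at the pipeline's local working folder and that fsspec is used to talk to the bucket. The names `backup_pipeline`, `restore_pipeline` and `backed_up_run` are placeholders; the PoC gist below shows the real mechanics (e.g. how the pipeline name is patched inside the restored state when forking):

```python
import os
import shutil
import tempfile
from contextlib import contextmanager

import dlt
import fsspec  # already a dlt dependency via the filesystem destination


def backup_pipeline(pipeline: dlt.Pipeline, bucket_url: str, snapshot_name: str) -> str:
    """Zip the pipeline working directory (state, schemas, load packages) and upload it."""
    fs, _, (remote_dir,) = fsspec.get_fs_token_paths(bucket_url)
    with tempfile.TemporaryDirectory() as tmp:
        # pipeline.working_dir is assumed to be the pipeline's local working folder
        archive = shutil.make_archive(
            os.path.join(tmp, snapshot_name), "zip", pipeline.working_dir
        )
        fs.makedirs(remote_dir, exist_ok=True)
        fs.put(archive, f"{remote_dir}/{snapshot_name}.zip")
    return f"{remote_dir}/{snapshot_name}.zip"


def restore_pipeline(
    bucket_url: str, snapshot_name: str, pipeline_name: str, destination=None
) -> dlt.Pipeline:
    """Download a snapshot and unpack it into the working dir of a (possibly forked) pipeline."""
    fs, _, (remote_dir,) = fsspec.get_fs_token_paths(bucket_url)
    # passing a different pipeline_name / destination than the one backed up "forks" it;
    # a real implementation must also patch the pipeline name inside the restored state
    pipeline = dlt.pipeline(pipeline_name=pipeline_name, destination=destination)
    with tempfile.TemporaryDirectory() as tmp:
        local_zip = os.path.join(tmp, f"{snapshot_name}.zip")
        fs.get(f"{remote_dir}/{snapshot_name}.zip", local_zip)
        shutil.unpack_archive(local_zip, pipeline.working_dir)
    # re-create the pipeline so it picks up the restored state and packages
    return dlt.pipeline(pipeline_name=pipeline_name, destination=destination)


@contextmanager
def backed_up_run(pipeline: dlt.Pipeline, bucket_url: str, snapshot_name: str):
    """Restore state before the run and back it up afterwards (sink support)."""
    try:
        pipeline = restore_pipeline(
            bucket_url, snapshot_name, pipeline.pipeline_name, destination=pipeline.destination
        )
    except FileNotFoundError:
        pass  # first run: nothing to restore yet
    try:
        yield pipeline
    finally:
        backup_pipeline(pipeline, bucket_url, snapshot_name)
```

Chunking of large archives and selecting what to back up (state only, state plus data, completed packages) could hang off extra parameters of `backup_pipeline`.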
PoC: https://gist.github.com/rudolfix/ee6e16d8671f26ac4b9ffc915ad24b6e