[feature request] Support __<dataset_name> suffix in individual resources written to staging datasets
#2234
Labels
question
Feature description
I'd like to be able to share a single staging dataset across many dlt pipelines / resources, such that each resource is written to the same staging dataset with an identifying suffix that disambiguates resources whose names collide.
For example, a single dataset named dlt_staging_data, which all dlt pipelines that require a staging dataset can write to, and where each resource gets a suffix that disambiguates it from other sources (for example, dlt_staging_data.<resource_name>__<source_dataset_name>).
Are you a dlt user?
Yes, I run dlt in production.
Use case
In my case, I have multiple SQL databases as sources and a single Snowflake database as the destination. Currently, my Snowflake destination database has a separate dataset (schema) for each SQL database, as well as a separate staging dataset (schema) for each. (I am using incremental loading with the upsert strategy, so each destination dataset (schema) also gets a staging dataset (schema).)
Since individual tables from the source databases might share the same name (e.g. source_sql_db1 and source_sql_db2 might both have a table named users), I can't reuse the same dlt staging schema in the destination, because this might lead to name collisions and race conditions where one incremental pipeline upserts data from another incremental pipeline into its main schema.
Example Current State Pattern
My current solution to this issue is to fully isolate the staging schemas, one per dataset, but this is not ideal because it leads to significantly more schemas and more object management overhead in Snowflake (more objects, roles, grants, etc.; noisy IaC plans; increased Cloud Services costs).
Source Databases:
source_sql_db1
source_sql_db2
Destination Schemas (in a single Snowflake database named RAW):
source_sql_db1 ("main" schema where data from staging schema gets upserted into)
source_sql_db1_staging
source_sql_db2 ("main" schema where data from staging schema gets upserted into)
source_sql_db2_staging
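To make the collision concrete, here is a minimal plain-Python sketch. It does not call dlt at all; the function names and the orders / invoices tables are made up for illustration. It models each staging write as a (schema, table) pair and compares the isolated layout listed above with a naively shared staging schema where table names are left unchanged.

```python
from collections import Counter

# Hypothetical inventory: dataset name -> tables loaded from that source.
# Only "users" comes from the description above; the rest is illustrative.
SOURCES = {
    "source_sql_db1": ["users", "orders"],
    "source_sql_db2": ["users", "invoices"],
}


def isolated_staging_targets(sources):
    """Current pattern: each dataset writes to its own <dataset>_staging schema."""
    return [
        (f"{dataset}_staging", table)
        for dataset, tables in sources.items()
        for table in tables
    ]


def shared_staging_targets(sources, staging_schema="dlt_staging_data"):
    """Naive sharing: one staging schema, table names left unchanged."""
    return [
        (staging_schema, table)
        for tables in sources.values()
        for table in tables
    ]


def collisions(targets):
    """Return the (schema, table) targets that more than one pipeline writes to."""
    counts = Counter(targets)
    return sorted(target for target, n in counts.items() if n > 1)


print(collisions(isolated_staging_targets(SOURCES)))  # []
print(collisions(shared_staging_targets(SOURCES)))    # [('dlt_staging_data', 'users')]
```

The isolated layout is collision-free but needs one extra schema per dataset; the shared layout needs a single schema but two pipelines end up targeting the same users staging table.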
Proposed solution
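A rough sketch of the naming rule I have in mind, using the <resource_name>__<source_dataset_name> layout from the feature description. This is plain Python and not an existing dlt API: staging_table_name and STAGING_SCHEMA are hypothetical names used only to illustrate the idea.

```python
STAGING_SCHEMA = "dlt_staging_data"  # the single shared staging dataset


def staging_table_name(resource_name: str, dataset_name: str) -> str:
    """Proposed layout (hypothetical helper): <resource_name>__<dataset_name>."""
    return f"{resource_name}__{dataset_name}"


# (dataset_name, resource_name) pairs for two pipelines running at the same time.
loads = [
    ("source_sql_db1", "users"),
    ("source_sql_db2", "users"),
]

targets = [
    f"{STAGING_SCHEMA}.{staging_table_name(resource, dataset)}"
    for dataset, resource in loads
]

for target in targets:
    print(target)
# dlt_staging_data.users__source_sql_db1
# dlt_staging_data.users__source_sql_db2

# Each pipeline gets its own staging table inside the shared schema.
assert len(set(targets)) == len(targets)
```

With the suffix in place, the staging table is unique per (resource, dataset) pair, so one staging schema can serve every pipeline.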
Example Desired Behavior
Source Databases:
source_sql_db1
source_sql_db2
Destination Schemas (in a single Snowflake database named RAW):
source_sql_db1
source_sql_db2
dlt_staging_data (all pipelines write staging data here, regardless of which source SQL db they use)
Each table written to the staging schema is suffixed with the dataset_name. For example, a table named users which is sourced from source_sql_db1 will be written to users__source_sql_db1 (assuming the dataset_name is named after the database). If source_sql_db2 also has a table named users, and is running an incremental pipeline at the same time as source_sql_db1, there are no collisions.
Related issues
No response