[feature request] Support __<dataset_name> suffix in individual resources written to staging datasets #2234

Open
acaruso7 opened this issue Jan 24, 2025 · 2 comments

@acaruso7

Feature description

I'd like to be able to share a single staging dataset across many dlt pipelines/resources, such that each resource is written to the same staging dataset with an identifying suffix to disambiguate resources whose names collide

For example, a single dataset named dlt_staging_data that all dlt pipelines requiring a staging dataset can write to, where each resource gets a suffix to disambiguate it from other sources (for example, dlt_staging_data.<resource_name>__<source_dataset_name>)
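To make the request concrete, here is a hypothetical sketch. Neither `staging_dataset_name` nor `staging_table_name_layout` exists in dlt today; they are only meant to illustrate the behavior I'm asking for:

```python
import dlt

# HYPOTHETICAL: neither of the two staging_* arguments below exists in dlt;
# they only illustrate the requested behavior.
pipeline = dlt.pipeline(
    pipeline_name="load_source_sql_db1",
    destination="snowflake",
    dataset_name="source_sql_db1",
    staging_dataset_name="dlt_staging_data",   # shared staging schema (hypothetical)
    staging_table_name_layout="%t__%d",        # users -> users__source_sql_db1 (hypothetical)
)
```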

Are you a dlt user?

Yes, I run dlt in production.

Use case

In my case, I have multiple SQL databases as sources and a single Snowflake database as the destination. Currently, in my Snowflake destination database, I have a separate dataset (schema) for each source SQL database, as well as a separate staging dataset (schema) for each one (I am using incremental loading with the upsert strategy, so each destination dataset also gets a staging dataset)

Since individual tables from the source databases might share the same name (e.g. source_sql_db1 and source_sql_db2 might both have a table named users), I can't reuse the same dlt staging schema in the destination: doing so could lead to name collisions and race conditions where one incremental pipeline upserts another pipeline's data into its main schema

Example Current State Pattern

My current solution is full isolation of staging schemas per dataset, but this is not ideal because it leads to significantly more schemas and object-management overhead in Snowflake (more objects, roles, and grants; noisy IaC plans; increased Cloud Services costs)

Source Databases:

  • source_sql_db1
  • source_sql_db2

Destination Schemas (in a single Snowflake database named RAW):

  • source_sql_db1 ("main" schema where data from staging schema gets upserted into)
  • source_sql_db1_staging
  • source_sql_db2 ("main" schema where data from staging schema gets upserted into)
  • source_sql_db2_staging
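For context, this is roughly the setup that produces the layout above — a sketch assuming dlt's built-in sql_database source and the merge write disposition (connection strings are illustrative; dlt derives the `_staging` schema name automatically for merge loads):

```python
import dlt
from dlt.sources.sql_database import sql_database

# One pipeline per source database; dlt creates <dataset_name>_staging
# automatically for merge/upsert loads (assuming primary keys are reflected
# from the source tables).
for db in ("source_sql_db1", "source_sql_db2"):
    pipeline = dlt.pipeline(
        pipeline_name=f"load_{db}",
        destination="snowflake",
        dataset_name=db,  # main schema; staging becomes f"{db}_staging"
    )
    source = sql_database(credentials=f"mysql+pymysql://user:pass@host/{db}")
    pipeline.run(source, write_disposition="merge")
```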

Proposed solution

Example Desired Behavior

Source Databases:

  • source_sql_db1
  • source_sql_db2

Destination Schemas (in a single Snowflake database named RAW):

  • source_sql_db1
  • source_sql_db2
  • dlt_staging_data (all pipelines write staging data here, regardless of which source SQL DB they use)
    • Individual tables in this schema are suffixed with the dataset_name. For example, a table named users sourced from source_sql_db1 would be written to users__source_sql_db1 (assuming the dataset_name is named after the database)
    • This way, if source_sql_db2 also has a table named users and runs an incremental pipeline at the same time as source_sql_db1, there are no collisions (see the sketch after this list)
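A minimal sketch of the proposed table-naming rule (purely illustrative; `staging_table_name` is a hypothetical helper, not a dlt API):

```python
def staging_table_name(table_name: str, dataset_name: str) -> str:
    """Hypothetical naming rule for tables in the shared staging schema:
    suffix each table with the pipeline's dataset_name to avoid collisions."""
    return f"{table_name}__{dataset_name}"

# users from source_sql_db1 -> dlt_staging_data.users__source_sql_db1
# users from source_sql_db2 -> dlt_staging_data.users__source_sql_db2
assert staging_table_name("users", "source_sql_db1") == "users__source_sql_db1"
```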

Related issues

No response

@rudolfix
Collaborator

@acaruso7 is a single staging dataset more of an "aesthetic" requirement, or does it have some negative impact in production? Internally, our code maintains identical schemas on the staging and final datasets, and this assumption is baked in in many places, e.g. those generating merge SQL. Changing that will need a lot of new tests, so I expect this task to be labor intensive. I can't give an ETA for it... but we are happy to review and merge a PR if you'd like to submit one. We can also help with writing tests

@rudolfix rudolfix self-assigned this Jan 27, 2025
@rudolfix rudolfix added the question Further information is requested label Jan 27, 2025
@acaruso7
Author

> is a single staging dataset more of an "aesthetic" requirement, or does it have some negative impact in production?

In my case there is some impact in terms of object-management overhead in Snowflake: the number of schemas I need to maintain is much higher, which results in many more roles and grants with my specific IaC setup (I'm using SnowDDL, a tool like Terraform but dedicated to Snowflake)

There's no negative impact on the actual dlt pipelines, though, so this might not affect other users the way it does me, depending on their setup

In the current state, there's nothing stopping me from using a single staging schema; it's just that I risk table name collisions and race conditions if multiple dlt pipelines for different SQL DB sources run simultaneously and write to the same place.

I could also address this at the orchestration level with some locking mechanism, to ensure that no two dlt pipelines loading source tables with the same name ever write to the same staging schema simultaneously (a rough sketch follows)
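For example, something like the following at the orchestration layer — a rough sketch where the `filelock` dependency and the lock-per-staging-schema scheme are my own assumptions, not anything dlt provides:

```python
import dlt
from filelock import FileLock  # third-party: pip install filelock

def run_with_staging_lock(pipeline: dlt.Pipeline, source, staging_schema: str) -> None:
    # Serialize every pipeline that shares `staging_schema` on this host.
    # Pipelines running on different machines would need a distributed lock
    # instead (e.g. in the orchestrator or the warehouse itself).
    with FileLock(f"/tmp/dlt_{staging_schema}.lock"):
        pipeline.run(source, write_disposition="merge")
```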

I was hoping to add a disambiguating suffix, though, to guarantee those collisions never occur without needing careful orchestration

I'd be happy to look into this and evaluate the effort if you could point me in the general direction of where to start (which source files/modules manage destination table naming conventions, etc.)!

Or, if you have any other suggestions for a potential solution, I'm interested in hearing them
