[feature request] Support __<dataset_name> suffix in individual resources written to staging datasets #2234

Open
acaruso7 opened this issue Jan 24, 2025 · 2 comments

@acaruso7

Feature description

I'd like to be able to share a single staging dataset across many dlt pipelines/resources, such that each resource is written to the same staging dataset with an identifying suffix to disambiguate resources whose names collide

For example, a single dataset named dlt_staging_data that all dlt pipelines requiring a staging dataset can write to, where each resource gets a suffix to disambiguate it from other sources (for example, dlt_staging_data.<resource_name>__<source_dataset_name>)
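To make the request concrete, here is a hypothetical sketch. Neither `staging_dataset_name` nor `staging_table_name_layout` exists in dlt today; they are only meant to illustrate the behavior I'm asking for:

```python
import dlt

# HYPOTHETICAL: neither of the two staging_* arguments below exists in dlt;
# they only illustrate the requested behavior.
pipeline = dlt.pipeline(
    pipeline_name="load_source_sql_db1",
    destination="snowflake",
    dataset_name="source_sql_db1",
    staging_dataset_name="dlt_staging_data",   # shared staging schema (hypothetical)
    staging_table_name_layout="%t__%d",        # users -> users__source_sql_db1 (hypothetical)
)
```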

Are you a dlt user?

Yes, I run dlt in production.

Use case

In my case, I have multiple SQL databases as sources and a single Snowflake database as the destination. Currently, in my Snowflake destination database, I have a separate dataset (schema) for each source SQL database, as well as a separate staging dataset (schema) for each one (I am using incremental loading with the upsert strategy, so each destination dataset also gets a staging dataset)

Since individual tables from the source databases might share the same name (e.g. source_sql_db1 and source_sql_db2 might both have a table named users), I can't reuse the same dlt staging schema in the destination: doing so could lead to name collisions and race conditions where one incremental pipeline upserts another pipeline's data into its main schema

Example Current State Pattern

My current solution is full isolation of staging schemas per dataset, but this is not ideal because it leads to significantly more schemas and object-management overhead in Snowflake (more objects, roles, and grants; noisy IaC plans; increased Cloud Services costs)

Source Databases:

  • source_sql_db1
  • source_sql_db2

Destination Schemas (in a single Snowflake database named RAW):

  • source_sql_db1 ("main" schema where data from staging schema gets upserted into)
  • source_sql_db1_staging
  • source_sql_db2 ("main" schema where data from staging schema gets upserted into)
  • source_sql_db2_staging
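For context, this is roughly the setup that produces the layout above — a sketch assuming dlt's built-in sql_database source and the merge write disposition (connection strings are illustrative; dlt derives the `_staging` schema name automatically for merge loads):

```python
import dlt
from dlt.sources.sql_database import sql_database

# One pipeline per source database; dlt creates <dataset_name>_staging
# automatically for merge/upsert loads (assuming primary keys are reflected
# from the source tables).
for db in ("source_sql_db1", "source_sql_db2"):
    pipeline = dlt.pipeline(
        pipeline_name=f"load_{db}",
        destination="snowflake",
        dataset_name=db,  # main schema; staging becomes f"{db}_staging"
    )
    source = sql_database(credentials=f"mysql+pymysql://user:pass@host/{db}")
    pipeline.run(source, write_disposition="merge")
```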

Proposed solution

Example Desired Behavior

Source Databases:

  • source_sql_db1
  • source_sql_db2

Destination Schemas (in a single Snowflake database named RAW):

  • source_sql_db1
  • source_sql_db2
  • dlt_staging_data (all pipelines write staging data here, regardless of which source SQL DB they use)
    • Individual tables in this schema are suffixed with the dataset_name. For example, a table named users sourced from source_sql_db1 would be written to users__source_sql_db1 (assuming the dataset_name is named after the database)
    • This way, if source_sql_db2 also has a table named users and runs an incremental pipeline at the same time as source_sql_db1, there are no collisions (see the sketch after this list)
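A minimal sketch of the proposed table-naming rule (purely illustrative; `staging_table_name` is a hypothetical helper, not a dlt API):

```python
def staging_table_name(table_name: str, dataset_name: str) -> str:
    """Hypothetical naming rule for tables in the shared staging schema:
    suffix each table with the pipeline's dataset_name to avoid collisions."""
    return f"{table_name}__{dataset_name}"

# users from source_sql_db1 -> dlt_staging_data.users__source_sql_db1
# users from source_sql_db2 -> dlt_staging_data.users__source_sql_db2
assert staging_table_name("users", "source_sql_db1") == "users__source_sql_db1"
```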

Related issues

No response

@rudolfix
Collaborator

@acaruso7 is a single staging dataset more of an "aesthetic" requirement, or does it have some negative impact in production? Internally, our code maintains identical schemas on the staging and final datasets, and this assumption is baked in in many places, e.g. those generating merge SQL. Changing that will need a lot of new tests, so I expect this task to be labor intensive. I can't give an ETA for it... but we are happy to review and merge a PR if you'd like to submit one. We can also help with writing tests

@rudolfix rudolfix self-assigned this Jan 27, 2025
@rudolfix rudolfix added the question Further information is requested label Jan 27, 2025
@acaruso7
Author

> is a single staging dataset more of an "aesthetic" requirement, or does it have some negative impact in production?

In my case there is some impact in terms of object-management overhead in Snowflake: the number of schemas I need to maintain is much higher, which results in many more roles and grants with my specific IaC setup (I'm using SnowDDL, a tool like Terraform but dedicated to Snowflake)

There's no negative impact on the actual dlt pipelines, though, so this might not affect other users the way it does me, depending on their setup

In the current state, there's nothing stopping me from using a single staging schema; it's just that I risk table name collisions and race conditions if multiple dlt pipelines for different SQL DB sources run simultaneously and write to the same place.

I could also address this at the orchestration level with some locking mechanism, to ensure that no two dlt pipelines loading source tables with the same name ever write to the same staging schema simultaneously (a rough sketch follows)
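For example, something like the following at the orchestration layer — a rough sketch where the `filelock` dependency and the lock-per-staging-schema scheme are my own assumptions, not anything dlt provides:

```python
import dlt
from filelock import FileLock  # third-party: pip install filelock

def run_with_staging_lock(pipeline: dlt.Pipeline, source, staging_schema: str) -> None:
    # Serialize every pipeline that shares `staging_schema` on this host.
    # Pipelines running on different machines would need a distributed lock
    # instead (e.g. in the orchestrator or the warehouse itself).
    with FileLock(f"/tmp/dlt_{staging_schema}.lock"):
        pipeline.run(source, write_disposition="merge")
```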

I was hoping to add a disambiguating suffix, though, to guarantee those collisions never occur without needing careful orchestration

I'd be happy to look into this and evaluate the effort if you could point me in the general direction of where to start (which source files/modules manage destination table naming conventions, etc.)!

Or, if you have any other suggestions for a potential solution, I'm interested in hearing them
