-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add upsert
merge strategy
#1294
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK for a new merge strategy! My overall comments
- why do we need to modify the normalizer?
upsert
should be compatible withinsert-delete
- when generating SQL, the only difference I see is that we can replace INSERT / DELETE into root table with MERGE statement and we do not need a temp table. the child tables however still need temp table to know which rows to delete. maybe our original merge sql should be restructured to separate the child tables processing so it can be reused?
- we'll need to support merge keys and hard deletes as well.
- we still need temp tables. was the problem with ATHENA a lack of temp tables?
First addressing a main consideration here: Then addressing your points in order:
Two options:
|
@jorritsandbrink Athena has some kind of concept of temp tables when you use the WITH statement. I'm not sure wether that is helpful in your case, but that is what I found when looking at your pr this morning and doing some research. |
Also why do we need a primary key? The merge into has this ON clause which should support AND and OR conditions for merging on composite keys. Or am I mistaken here? Maybe we can make the athena merge work first like the other destinations without a new merge strategy? I am not sure why I did not do that, there was some road block, but it may have been that I did not fully understand our merges at the time. |
@sh-rp We already discussed this on Slack, but I'll add it here too for transparency and completeness.
We need "The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows."
Consider this case to understand why these errors make sense: staging table:
destination table:
What should be the result of the destination table after upserting the staging table into it? It's not properly defined.
Yes, these are the CTEs I mention in my previous comment under (4). We can definitely use that to make it work, but it would imply recalculation for each table, rather than calculating once and reusing results as is the case with a temp table. As such, it won't be efficient. |
@jorritsandbrink thx for the explanation. I reviewed the code and now I understand your concept better.
now regarding the upsert itself:
|
table_name, staging_table_name = sql_client.get_qualified_table_names(table["name"]) | ||
|
||
# delete records for elements no longer in the list | ||
sql.append(f""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dummy comment to create link.
Closed PR. This has moved to #1466. |
Description
This started as building
merge
support for Athena Iceberg, but the SQL logic runs on some other engines out-of-the-box too (e.g.postgres
,mssql
,snowflake
,databricks
,bigquery
). It primarily uses theMERGE INTO
statement to implement upsert (update + insert) logic. Since bothdelete-insert
andscd2
don't apply, I introduced a new merge strategy calledupsert
.Characteristics:
primary_key
primary_key
as deterministic row identifier (_dlt_id
) in root tableRelated Issues
Additional Context