Deltalake dedup_sort support #2202
base: devel
✅ Deploy Preview for dlt-hub-docs ready!
@guitcastro right now dedup_sort has no function for the upsert merge strategy. Indeed, we do not issue a warning when it is defined; it is just ignored.
We assume that in the case of upsert the input dataset is deduplicated on primary_key. In that case, the target dataset is deduplicated as well.
We could indeed use dedup_sort to deduplicate the source, but your code is doing something else, and #2201 does not help to understand your use case.
Are you trying to use dedup_sort as some kind of record version, and to skip the update on rows for which the version has not changed?
That is typically done using Incremental during the extract phase, so records that have not changed are simply not present.
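To make the "skip unchanged records during extract" idea concrete, here is a minimal plain-Python sketch. It is not dlt's Incremental implementation; the function name only_changed and the column name version are hypothetical and chosen only for illustration. It assumes each record carries a comparable cursor value and that the highest value seen in the previous run is stored somewhere.

```python
from typing import Any, Dict, Iterable, Iterator, Optional

Row = Dict[str, Any]


def only_changed(
    rows: Iterable[Row], cursor_column: str, last_value: Optional[Any]
) -> Iterator[Row]:
    """Yield only rows whose cursor value is strictly greater than last_value.

    Hypothetical helper: last_value is the highest cursor value seen in a
    previous run, or None on the first run (in which case every row passes).
    """
    for row in rows:
        if last_value is None or row[cursor_column] > last_value:
            yield row


# Records as they might arrive from a source; "version" is an assumed column.
extracted = [
    {"id": 1, "version": 1},
    {"id": 2, "version": 3},
]

# Only the row with a version newer than the stored cursor reaches the load step.
changed = list(only_changed(extracted, "version", last_value=1))
```

With this kind of filter applied at extract time, rows whose version did not advance never reach the merge, so the merge itself does not need a version condition.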
The issue was indeed confusing. English is not my native language, and I opened the issue before digging into the code and did not update it afterwards. Sorry about that.
I might have written the code wrong, but my intention is to deduplicate using dedup_sort. It's working locally and is based on this example from the Databricks docs:
Could you point out what might be wrong with my code? Cheers!
Your problem is: your source data contains duplicates and you want to get rid of them. Correct? This should happen BEFORE you execute the merge statement, so:
you need to deduplicate current implementation in |
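The ordering described here (deduplicate the source first, then merge) can be sketched in plain Python. This is an illustration only, not the Deltalake or dlt implementation: the helpers deduplicate and upsert, and the column names id and v, are hypothetical. It assumes that for each primary key the row with the highest sort value should win, with a flag to flip that to the lowest.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def deduplicate(
    rows: Iterable[Row], primary_key: str, sort_column: str, descending: bool = True
) -> List[Row]:
    """Keep one row per primary key, chosen by the sort column.

    With descending=True the row with the highest sort value wins;
    with descending=False the row with the lowest value wins.
    """
    best: Dict[Any, Row] = {}
    for row in rows:
        key = row[primary_key]
        current = best.get(key)
        if current is None:
            best[key] = row
            continue
        if descending:
            wins = row[sort_column] > current[sort_column]
        else:
            wins = row[sort_column] < current[sort_column]
        if wins:
            best[key] = row
    return list(best.values())


def upsert(target: Dict[Any, Row], source: Iterable[Row], primary_key: str) -> None:
    """Plain upsert: overwrite matching keys, insert the rest."""
    for row in source:
        target[row[primary_key]] = row


# Source batch with a duplicate for id=1; "v" plays the role of dedup_sort.
source = [
    {"id": 1, "v": 1, "name": "a"},
    {"id": 1, "v": 2, "name": "b"},
    {"id": 2, "v": 1, "name": "c"},
]
target = {1: {"id": 1, "v": 0, "name": "old"}}

deduped = deduplicate(source, "id", "v")  # step 1: deduplicate BEFORE the merge
upsert(target, deduped, "id")             # step 2: merge the deduplicated batch
```

Because the batch holds at most one row per key by the time it reaches upsert, the merge can stay an unconditional overwrite, which matches the assumption that the input is already deduplicated on primary_key.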
Description
Deltalake is ignoring the dedup_sort column and always overriding rows with matching ids. This PR adds support for dedup_sort.
Additional Context
This is my first PR, and I am very unfamiliar with the code base, so I expect that this PR might need a few iterations of review to get it right. For example, I did not find an easy way to test it. I am very open to making changes if someone points me in the right direction.