Deltalake dedup_sort support #2202

guitcastro · 2025-01-09T22:59:41Z

Description

Deltalake is ignoring dedup_sort column and always overriding rows with matching ids. This PR added support to dedup_sort

Additional Context

This is my first PR, and I am very unfamiliar with the code base. So, I expected that this PR might need a few interactions of reviews to get it right. Per example, I didn't not find an easy way to test it. I am very open to make changes if someone point's me the right direction.

netlify · 2025-01-09T23:00:00Z

✅ Deploy Preview for dlt-hub-docs ready!

Name	Link
🔨 Latest commit	`f613ac6`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/678685a325de08000858889e
😎 Deploy Preview	https://deploy-preview-2202--dlt-hub-docs.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

rudolfix

@guitcastro right now dedup_sort has no function for upsert merge strategy. indeed we do not issue a warning when it is defined. it is just ignored.

we assume that in case of upsert the input dataset is deduplicated on primary_key. in that case, the target dataset is deduplicated as well.

we could indeed use dedup_sort to deduplicate the source. but your code is doing something else. #2201 does not help to understand your use case.

are you trying to use dedup_sort as some kind of record version? and skip the update on rows for which version is not changed?

that is typically done using Incremental during extract phase so records that are not changed are simply not present.

guitcastro · 2025-01-21T19:10:35Z

@rudolfix

The issue was indeed confused. English is not my native language and I opened the issues before digging in the code and didn't not updated it. Sorry about that.

we could indeed use dedup_sort to deduplicate the source. but your code is doing something else. #2201 does not help to understand your use case.

I might have written the code wrong, but my intention is to deduplicate using dedup_sort. It's working locally and is based on this example from data brics docs:

deltaTable.alias("logs").merge(
    newDedupedLogs.alias("newDedupedLogs"),
    "logs.uniqueId = newDedupedLogs.uniqueId AND logs.date > current_date() - INTERVAL 7 DAYS") \
  .whenNotMatchedInsertAll("newDedupedLogs.date > current_date() - INTERVAL 7 DAYS") \
  .execute()

Could you point out what might be wrong with my code?

cheers!

rudolfix · 2025-01-22T10:46:08Z

Your problems is: your source data contains duplicates and you want to get rid of them. correct? This should happen BEFORE you execute merge statement so:

def merge_delta_table(
    table: DeltaTable,
    data: Union[pa.Table, pa.RecordBatchReader],
    schema: TTableSchema,
) -> None:

you need to deduplicate data using pyarrow or duckdb.

current implementation in dlt assumes that your source data is deduplicated

Deltalake dedup_sort support

daae03f

rudolfix added the ci from fork run ci workflows on a pr even if they are from a fork label Jan 12, 2025

fix fmt deltalake.py

f613ac6

rudolfix self-requested a review January 20, 2025 11:16

rudolfix self-assigned this Jan 20, 2025

rudolfix reviewed Jan 21, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deltalake dedup_sort support #2202

Deltalake dedup_sort support #2202

guitcastro commented Jan 9, 2025

netlify bot commented Jan 9, 2025 •

edited

Loading

rudolfix left a comment

guitcastro commented Jan 21, 2025

rudolfix commented Jan 22, 2025

Deltalake dedup_sort support #2202

Are you sure you want to change the base?

Deltalake dedup_sort support #2202

Conversation

guitcastro commented Jan 9, 2025

Description

Additional Context

netlify bot commented Jan 9, 2025 • edited Loading

✅ Deploy Preview for dlt-hub-docs ready!

rudolfix left a comment

Choose a reason for hiding this comment

guitcastro commented Jan 21, 2025

rudolfix commented Jan 22, 2025

netlify bot commented Jan 9, 2025 •

edited

Loading