
[documentation] Incremental loads which extract data from a source database table using a timestamp column may miss rows #2269

Open
pee-kay-bee opened this issue Feb 5, 2025 · 6 comments
Labels
documentation Improvements or additions to documentation

Comments

@pee-kay-bee

pee-kay-bee commented Feb 5, 2025

Proposed solution

Update the incremental loading documentation to explain how to use the lag feature to acquire records created within a certain time window, on systems with frequent and concurrent updates coming from the application layer.

see:
#2269 (comment)

dlt version

1.5.0

Describe the problem

Assume the source database has a table called events, with a timestamp column called created, where rows are only ever inserted (never updated or deleted).

The snippet of code below is intended to load new rows into the destination table incrementally:

import dlt
from dlt.sources.sql_database import sql_database

# "credentials" and "pipeline" are assumed to be defined elsewhere
source_1 = sql_database(credentials).with_resources("events")
source_1.events.apply_hints(incremental=dlt.sources.incremental("created"))

info = pipeline.run(source_1, write_disposition="append")

At the start of a pipeline run, assume the maximum value of events.created is '2025-02-05 02:30:00'.

It appears that dlt stores this value ('2025-02-05 02:30:00') as a 'high-water mark' to be used in the next pipeline run. However, on a busy application/database, it's quite possible that a new row is committed to the database AFTER the pipeline started, with an earlier events.created value (say, '2025-02-05 02:29:59').

As a result, this row would not be included in the subsequent pipeline run, since dlt appears to apply a filter along the lines of SELECT * FROM events WHERE created > '2025-02-05 02:30:00'.

I would assume the same kind of issue occurs when using an auto-increment column.

Expected behavior

The best thing I can suggest is that dlt allow developers to access/modify the high-water-mark value, to allow for the lag/latency that can occur between the point at which an application assigns a timestamp to a column and the time the database actually commits that value. This latency can vary from system to system.

A side effect of this is that consecutive pipeline runs may fetch the same subset of rows. This means the destination table will contain duplicates (unless dlt takes measures to deduplicate), for example:

INSERT INTO destination.table
SELECT * FROM <incoming_rows> AS src
WHERE NOT EXISTS
  (SELECT 1 FROM destination.table AS tgt
   WHERE tgt.<primary_key_column> = src.<primary_key_column>)

Steps to reproduce

  1. Create a table in the source database.

  2. Manually insert events into the table such that the maximum event.created = <some_timestamp>.

  3. Run the dlt pipeline.

  4. Manually insert new events into the source database table (steps 2 and 4 are sketched in code after this list) where:

  • some events have event.created > <some_timestamp>
  • some events have event.created < <some_timestamp>

  5. Run the dlt pipeline again.

  6. Check the corresponding destination database table: the rows in the source having event.created < <some_timestamp> will not be present in the destination.
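
For concreteness, a minimal sketch of the inserts in steps 2 and 4, using SQLAlchemy against the Postgres source; the table layout, connection string, and exact timestamps are illustrative assumptions:

import sqlalchemy as sa

# hypothetical connection string for the source database
engine = sa.create_engine("postgresql://user:password@localhost/source_db")

# step 2: seed the table; after the first pipeline run the stored
# high-water mark will be '2025-02-05 02:30:00'
with engine.begin() as conn:
    conn.execute(sa.text(
        "INSERT INTO events (created) VALUES ('2025-02-05 02:30:00')"
    ))

# ... run the pipeline (step 3) ...

# step 4: one row above and one row below the recorded high-water mark
with engine.begin() as conn:
    conn.execute(sa.text(
        "INSERT INTO events (created) VALUES "
        "('2025-02-05 02:31:00'), "  # > mark: loaded by the next run
        "('2025-02-05 02:29:59')"    # < mark: silently skipped
    ))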

Operating system

macOS

Runtime environment

Local

Python version

3.10

dlt data source

postgresql

dlt destination

No response

Other deployment details

postgres

Additional information

No response

@jkoninger

Forgive me if I've misunderstood, but would adding a lag window help prevent this issue?

@pee-kay-bee
Author

You understood perfectly. Looks like the lag window will do the trick (I should have read the documentation more thoroughly). Thank you for the quick reply!

@jkoninger

No problem, though I wonder if this behaviour should be better documented, or if it's worth adding some default behaviour to the incremental loading functionality, as this is likely to be a common issue and is something I have thought about myself as well. Unless dlt already considers and corrects for this behind the scenes, in which case I stand to be corrected. Anyone have any deeper knowledge on this?

@rudolfix
Collaborator

Yes, lag will do the trick, but it does so by re-acquiring events within a configured window. The intended use case is to refresh data that got updated (e.g. reports in Google Ads). Obviously you need to use the merge write disposition to avoid duplicates.
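
For reference, a minimal sketch of lag combined with the merge write disposition, building on the snippet from the issue description; the one-hour lag and the id primary key column are illustrative assumptions:

import dlt
from dlt.sources.sql_database import sql_database

source_1 = sql_database(credentials).with_resources("events")
source_1.events.apply_hints(
    primary_key="id",  # assumed key column; merge deduplicates on it
    # for datetime cursors, lag is given in seconds: re-acquire the last hour
    incremental=dlt.sources.incremental("created", lag=3600),
)

# merge instead of append, so re-acquired rows do not become duplicates
info = pipeline.run(source_1, write_disposition="merge")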

What you could also do is exclude records that are fresher than 1 hour (or more); your SQL query then lags behind the newest timestamps, giving the app layer e.g. 1 hour to insert all missing records. Please take a look at this to write/modify the query:
https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database/usage#write-custom-sql-custom-queries
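
Roughly, a sketch of that approach via the query_adapter_callback described in the linked docs; the one-hour cutoff and the assumption that created is stored in UTC are illustrative:

import datetime as dt

import dlt
from dlt.sources.sql_database import sql_database

def query_adapter_callback(query, table):
    if table.name == "events":
        # exclude rows fresher than one hour, so commits that arrive late
        # still sit above the saved cursor and are picked up next run
        cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(hours=1)
        return query.where(table.c.created <= cutoff)
    return query

source_1 = sql_database(
    credentials, query_adapter_callback=query_adapter_callback
).with_resources("events")
source_1.events.apply_hints(incremental=dlt.sources.incremental("created"))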

@rudolfix rudolfix moved this from Todo to In Progress in dlt core library Feb 10, 2025
@rudolfix rudolfix added the question Further information is requested label Feb 10, 2025
@rudolfix rudolfix self-assigned this Feb 10, 2025
@pee-kay-bee
Author

pee-kay-bee commented Feb 10, 2025

> No problem, though I wonder if this behaviour should be better documented, or if it's worth adding some default behaviour to the incremental loading functionality, as this is likely to be a common issue and is something I have thought about myself as well. Unless dlt already considers and corrects for this behind the scenes, in which case I stand to be corrected. Anyone have any deeper knowledge on this?

I personally think it would be wise to document that an incremental load might 'miss' rows: rows whose value in the 'high water mark' column falls below the recorded mark may not yet have been committed to the database at the time the pipeline ran, and will be skipped by all subsequent runs. The documentation could suggest ways to mitigate the problem (such as using lag).

@rudolfix
Collaborator

OK! We will convert this into a docs request.

@rudolfix rudolfix added documentation Improvements or additions to documentation and removed question Further information is requested labels Feb 17, 2025
@rudolfix rudolfix removed their assignment Feb 17, 2025
@rudolfix rudolfix moved this from In Progress to Todo in dlt core library Feb 17, 2025
@rudolfix rudolfix changed the title Incremental loads which extract data from a source database table using a timestamp column may miss rows [documentation] Incremental loads which extract data from a source database table using a timestamp column may miss rows Feb 17, 2025