
[documentation] Incremental loads which extract data from a source database table using a timestamp column may miss rows #2269

Open
pee-kay-bee opened this issue Feb 5, 2025 · 6 comments
Labels
documentation Improvements or additions to documentation

Comments

@pee-kay-bee

pee-kay-bee commented Feb 5, 2025

Proposed solution

Update the incremental loading documentation to explain how to use the lag feature to acquire records created within a certain time window, on systems with frequent and concurrent updates coming from the application layer.

see:
#2269 (comment)

dlt version

1.5.0

Describe the problem

Assume the source database has a table called events, with a timestamp column called created, where rows are only ever inserted (never updated or deleted).

The snippet of code below is intended to load new rows into the destination table incrementally:

import dlt
from dlt.sources.sql_database import sql_database

# "credentials" and "pipeline" are assumed to be defined elsewhere
source_1 = sql_database(credentials).with_resources("events")
source_1.events.apply_hints(incremental=dlt.sources.incremental("created"))

info = pipeline.run(source_1, write_disposition="append")

At the start of a pipeline run, assume the maximum value of events.created is '2025-02-05 02:30:00'.

It appears that dlt stores this value ('2025-02-05 02:30:00') as a 'high-water mark' to be used in the next pipeline run. However, on a busy application/database, it's quite possible that a new row is committed to the database AFTER the pipeline started, with an earlier events.created value (say, '2025-02-05 02:29:59').

As a result, this row would not be included in the subsequent pipeline run, since dlt appears to apply a filter along the lines of SELECT * FROM events WHERE created > '2025-02-05 02:30:00'.

I would assume the same kind of issue occurs when using an auto-increment column.

Expected behavior

The best thing I can suggest is that dlt allow developers to access/modify the high-water-mark value, to allow for the lag/latency that can occur between the point at which an application assigns a timestamp to a column and the time the database actually commits that value. This latency can vary from system to system.

A side effect of this is that consecutive pipeline runs may fetch the same subset of rows. This means the destination table will contain duplicates (unless dlt takes measures to deduplicate), for example:

INSERT INTO destination.table
SELECT * FROM <incoming_rows> AS src
WHERE NOT EXISTS
  (SELECT 1 FROM destination.table AS tgt
   WHERE tgt.<primary_key_column> = src.<primary_key_column>)

Steps to reproduce

  1. Create a table in the source database.

  2. Manually insert events into the table such that the maximum event.created = <some_timestamp>.

  3. Run the dlt pipeline.

  4. Manually insert new events into the source database table (steps 2 and 4 are sketched in code after this list) where:

  • some events have event.created > <some_timestamp>
  • some events have event.created < <some_timestamp>

  5. Run the dlt pipeline again.

  6. Check the corresponding destination database table: the rows in the source having event.created < <some_timestamp> will not be present in the destination.
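
For concreteness, a minimal sketch of the inserts in steps 2 and 4, using SQLAlchemy against the Postgres source; the table layout, connection string, and exact timestamps are illustrative assumptions:

import sqlalchemy as sa

# hypothetical connection string for the source database
engine = sa.create_engine("postgresql://user:password@localhost/source_db")

# step 2: seed the table; after the first pipeline run the stored
# high-water mark will be '2025-02-05 02:30:00'
with engine.begin() as conn:
    conn.execute(sa.text(
        "INSERT INTO events (created) VALUES ('2025-02-05 02:30:00')"
    ))

# ... run the pipeline (step 3) ...

# step 4: one row above and one row below the recorded high-water mark
with engine.begin() as conn:
    conn.execute(sa.text(
        "INSERT INTO events (created) VALUES "
        "('2025-02-05 02:31:00'), "  # > mark: loaded by the next run
        "('2025-02-05 02:29:59')"    # < mark: silently skipped
    ))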

Operating system

macOS

Runtime environment

Local

Python version

3.10

dlt data source

postgresql

dlt destination

No response

Other deployment details

postgres

Additional information

No response

@jkoninger

Forgive me if I've misunderstood, but would adding a lag window help prevent this issue?

@pee-kay-bee
Author

You understood perfectly. Looks like the lag window will do the trick (I should have read the documentation more thoroughly). Thank you for the quick reply!

@jkoninger

No problem, though I wonder if this behaviour should be better documented, or if it's worth adding some default behaviour to the incremental loading functionality, as this is likely to be a common issue and is something I have thought about myself as well. Unless dlt already considers and corrects for this behind the scenes, in which case I stand to be corrected. Anyone have any deeper knowledge on this?

@rudolfix
Collaborator

Yes, lag will do the trick, but it does so by re-acquiring events within a configured window. The intended use case is to refresh data that got updated (e.g. reports in Google Ads). Obviously you need to use the merge write disposition to avoid duplicates.
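
For reference, a minimal sketch of lag combined with the merge write disposition, building on the snippet from the issue description; the one-hour lag and the id primary key column are illustrative assumptions:

import dlt
from dlt.sources.sql_database import sql_database

source_1 = sql_database(credentials).with_resources("events")
source_1.events.apply_hints(
    primary_key="id",  # assumed key column; merge deduplicates on it
    # for datetime cursors, lag is given in seconds: re-acquire the last hour
    incremental=dlt.sources.incremental("created", lag=3600),
)

# merge instead of append, so re-acquired rows do not become duplicates
info = pipeline.run(source_1, write_disposition="merge")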

What you could also do is exclude records that are fresher than 1 hour (or more); your SQL query then lags behind the newest timestamps, giving the app layer e.g. 1 hour to insert all missing records. Please take a look at this to write/modify the query:
https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database/usage#write-custom-sql-custom-queries
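
Roughly, a sketch of that approach via the query_adapter_callback described in the linked docs; the one-hour cutoff and the assumption that created is stored in UTC are illustrative:

import datetime as dt

import dlt
from dlt.sources.sql_database import sql_database

def query_adapter_callback(query, table):
    if table.name == "events":
        # exclude rows fresher than one hour, so commits that arrive late
        # still sit above the saved cursor and are picked up next run
        cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(hours=1)
        return query.where(table.c.created <= cutoff)
    return query

source_1 = sql_database(
    credentials, query_adapter_callback=query_adapter_callback
).with_resources("events")
source_1.events.apply_hints(incremental=dlt.sources.incremental("created"))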

@rudolfix rudolfix moved this from Todo to In Progress in dlt core library Feb 10, 2025
@rudolfix rudolfix added the question Further information is requested label Feb 10, 2025
@rudolfix rudolfix self-assigned this Feb 10, 2025
@pee-kay-bee
Author

pee-kay-bee commented Feb 10, 2025

> No problem, though I wonder if this behaviour should be better documented, or if it's worth adding some default behaviour to the incremental loading functionality, as this is likely to be a common issue and is something I have thought about myself as well. Unless dlt already considers and corrects for this behind the scenes, in which case I stand to be corrected. Anyone have any deeper knowledge on this?

I personally think it would be wise to document that an incremental load might 'miss' rows: rows whose value in the 'high water mark' column falls below the recorded mark may not yet have been committed to the database at the time the pipeline ran, and will be skipped by all subsequent runs. The documentation could suggest ways to mitigate the problem (such as using lag).

@rudolfix
Collaborator

OK! We will convert this into a docs request.

@rudolfix rudolfix added documentation Improvements or additions to documentation and removed question Further information is requested labels Feb 17, 2025
@rudolfix rudolfix removed their assignment Feb 17, 2025
@rudolfix rudolfix moved this from In Progress to Todo in dlt core library Feb 17, 2025
@rudolfix rudolfix changed the title Incremental loads which extract data from a source database table using a timestamp column may miss rows [documentation] Incremental loads which extract data from a source database table using a timestamp column may miss rows Feb 17, 2025