[documentation] Incremental loads which extract data from a source database table using a timestamp column may miss rows #2269
Comments
Forgive me if I've misunderstood, but would adding a lag window help prevent this issue?
You understood perfectly. Looks like the lag window will do the trick (I should have read the documentation more thoroughly). Thank you for the quick reply!
No problem, though I wonder if this behaviour should be better documented, or if it's worth adding some default behaviour into the incremental loading functionality, as this is likely to be a common issue and is something I have thought about myself as well. Unless dlt already considers and corrects for this behind the scenes, in which case I stand to be corrected. Anyone have any deeper knowledge on this?
Yes, what you could also do is exclude records that are fresher than 1 hour (or more); then your SQL query lags behind the newest timestamps, giving the app layer e.g. 1 hour to insert all missing records. Please take a look at this:
I personally think it would be wise to document that an incremental load might 'miss' rows because rows with a higher value for the 'high water mark' column had not been committed to the database at the time the pipeline ran. The documentation could suggest ways to mitigate the problem (such as using lag).
OK! We will convert it into a docs request.
Proposed solution
Update the incremental loading documentation to hint how to use the `lag` feature to acquire records created in a certain time window - on systems with frequent and concurrent updates coming from the application layer. See: #2269 (comment)
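For reference, a minimal sketch of what such a hint might show, assuming the built-in `sql_table` source, `created` as the cursor column, and a one-hour window (the connection string and table name are illustrative; for datetime cursors `lag` is given in seconds):

```python
import dlt
from dlt.sources.sql_database import sql_table

# Re-open the cursor window by one hour on every run so that rows committed
# late (with an older `created` value) are still picked up.
events = sql_table(
    credentials="postgresql://user:password@localhost:5432/appdb",  # hypothetical DSN
    table="events",
    incremental=dlt.sources.incremental("created", lag=3600),
)
```

Because the lag window is re-read on every run, rows inside it can be fetched twice, so this pairs naturally with a primary key and the merge write disposition to avoid duplicates in the destination.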
dlt version
1.5.0
Describe the problem
Assume the source database has a table called `events` where rows are only ever inserted (never updated or deleted), with a timestamp column called `created`.
The snippet of code below is intended to load new rows into the destination table incrementally.
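A minimal sketch of what such a snippet might look like, assuming the built-in `sql_table` source and `created` as the incremental cursor (the connection string, pipeline name, and destination are illustrative):

```python
import dlt
from dlt.sources.sql_database import sql_table

# Read only rows whose `created` value is greater than the high-water mark
# stored in the pipeline state after the previous run.
events = sql_table(
    credentials="postgresql://user:password@localhost:5432/appdb",  # hypothetical DSN
    table="events",
    incremental=dlt.sources.incremental("created"),
)

pipeline = dlt.pipeline(
    pipeline_name="events_incremental",
    destination="postgres",
    dataset_name="analytics",
)

# Append the newly fetched rows to the destination table.
pipeline.run(events, write_disposition="append")
```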
At the start of a pipeline run, assume the maximum value of `events.created` is '2025-02-05 02:30:00'.
It appears that dlt stores this value ('2025-02-05 02:30:00') as a 'high water mark' to be used in the next pipeline run. However, on a busy application / database, it is quite possible that a new row is committed to the database AFTER the pipeline started, with an earlier `events.created` value (of, say, '2025-02-05 02:29:59'). As a result, this row would not be included in the subsequent pipeline run, since it appears a filter like `SELECT * FROM events WHERE created > '2025-02-05 02:30:00'` is applied.
I would assume the same kind of issue occurs when using an auto-increment column.
Expected behavior
The best thing I can suggest is that dlt allow developers to access / modify the high-water-mark value to allow for the lags / latencies which can occur between the point at which an application assigns a timestamp to a column and the time the database actually commits that value. This latency can vary from system to system.
A side effect of this is that consecutive pipeline runs may fetch the same subset of rows. This means the destination table will contain duplicates, unless dlt takes measures to deduplicate - for example:
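One such measure, sketched here under the assumption that the table has a unique `event_id` column: declare a primary key and load with the merge write disposition, so rows that are fetched twice update existing records rather than creating duplicates.

```python
import dlt
from dlt.sources.sql_database import sql_table

events = sql_table(
    credentials="postgresql://user:password@localhost:5432/appdb",  # hypothetical DSN
    table="events",
    incremental=dlt.sources.incremental("created", lag=3600),
)
# With a primary key and the merge write disposition, rows re-read inside the
# lag window are deduplicated in the destination instead of being appended again.
events.apply_hints(primary_key="event_id", write_disposition="merge")

pipeline = dlt.pipeline(
    pipeline_name="events_incremental",
    destination="postgres",
    dataset_name="analytics",
)
pipeline.run(events)
```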
Steps to reproduce
create a database table in the source database
manually insert events into the table where max `event.created` = <some_timestamp>
run the dlt pipeline
manually insert new events into the source database table where:
event.created > <some_timestamp>
event.created < <some_timestamp>
run the dlt pipeline
check the corresponding destination database table - the rows in the source having `event.created < <some_timestamp>` will not be present in the destination
Operating system
macOS
Runtime environment
Local
Python version
3.10
dlt data source
postgresql
dlt destination
No response
Other deployment details
postgres
Additional information
No response