Partition pruning does not work when values in predicate do not match column datatype #3278
Comments
Thanks for the repro, I'll look at it :)
Yep - the main issue I am facing is actually JUST the quotes I added around my predicate values. If I remove those quotes, this works great (I was able to use the Rust engine to merge my big, wide table successfully for the first time ever).
Indeed - so streaming exec itself works fine. The issue (or not-yet-supported functionality) is that the DataFusion `pruning_predicate` doesn't coerce the expression values to the column type while pruning.
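A minimal illustration of the mismatch, assuming the `int32` partition column from the repro below (these are merge predicate strings, not standalone SQL):

```python
# month_id is an int32 partition column. With a bare integer literal the
# pruning predicate's literal type matches the column and files can be
# skipped; with a quoted literal the value stays a string, no coercion
# happens, and every file is scanned.
pruned_predicate = "t.month_id = 202502"      # typed literal -> partitions pruned
unpruned_predicate = "t.month_id = '202502'"  # string literal -> full scan
```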
Updated repro:

```python
import os
import shutil

import pyarrow as pa
from deltalake import DeltaTable

table_path1 = "data/test_delta_incorrect_predicate"
table_path2 = "data/test_delta_correct_predicate"

# Start from a clean slate for both tables.
for path in [table_path1, table_path2]:
    if os.path.exists(path):
        shutil.rmtree(path)
    os.makedirs(path)

schema = pa.schema([
    pa.field("month_id", pa.int32()),
    pa.field("date_id", pa.int32()),
    pa.field("unique_row_hash", pa.string()),
    pa.field("col1", pa.string()),
])

dt1 = DeltaTable.create(
    table_uri=table_path1,
    schema=schema,
    partition_by=["month_id", "date_id"],
)
dt2 = DeltaTable.create(
    table_uri=table_path2,
    schema=schema,
    partition_by=["month_id", "date_id"],
)

# One row per (month_id, date_id) partition.
partitions = [
    (202501, 20250101),
    (202502, 20250201),
    (202502, 20250226),
    (202503, 20250301),
]
rows = []
for month_id, date_id in partitions:
    rows.append([month_id, date_id, f"hash_{month_id}_{date_id}", "value"])

source_table = pa.Table.from_arrays(
    [
        pa.array([row[0] for row in rows], type=pa.int32()),
        pa.array([row[1] for row in rows], type=pa.int32()),
        pa.array([row[2] for row in rows], type=pa.string()),
        pa.array([row[3] for row in rows], type=pa.string()),
    ],
    names=["month_id", "date_id", "unique_row_hash", "col1"],
)

# Seed both tables with the four partitions.
for dt in [dt1, dt2]:
    dt.merge(
        source=source_table,
        predicate="s.unique_row_hash = t.unique_row_hash",
        source_alias="s",
        target_alias="t",
    ).when_not_matched_insert_all().execute()

print(f"Table 1 has {len(dt1.files())} files")
print(f"Table 2 has {len(dt2.files())} files")

# A single new row targeting exactly one partition.
source_table_new = pa.Table.from_arrays(
    [
        pa.array([202502], type=pa.int32()),
        pa.array([20250226], type=pa.int32()),
        pa.array(["new_hash"], type=pa.string()),
        pa.array(["new_value"], type=pa.string()),
    ],
    names=["month_id", "date_id", "unique_row_hash", "col1"],
)

# The two predicates are identical except that the "incorrect" one quotes
# the integer partition values, making the literals strings.
correct_merge_predicate = (
    "(s.unique_row_hash = t.unique_row_hash) AND "
    "(s.month_id = t.month_id AND t.month_id = 202502 AND "
    "s.date_id = t.date_id AND t.date_id = 20250226)"
)
incorrect_merge_predicate = (
    "(s.unique_row_hash = t.unique_row_hash) AND "
    "(s.month_id = t.month_id AND t.month_id = '202502' AND "
    "s.date_id = t.date_id AND t.date_id = '20250226')"
)

result1 = dt1.merge(
    source=source_table_new,
    predicate=incorrect_merge_predicate,
    source_alias="s",
    target_alias="t",
).when_not_matched_insert_all().execute()
result2 = dt2.merge(
    source=source_table_new,
    predicate=correct_merge_predicate,
    source_alias="s",
    target_alias="t",
).when_not_matched_insert_all().execute()

print(f"Files scanned (incorrect predicate): {result1['num_target_files_scanned']}")
print(f"Files skipped (incorrect predicate): {result1['num_target_files_skipped_during_scan']}")
print(f"Files scanned (correct predicate): {result2['num_target_files_scanned']}")
print(f"Files skipped (correct predicate): {result2['num_target_files_skipped_during_scan']}")
```
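The original report ran its merges with streamed execution on. A sketch of that variant, assuming your deltalake 0.25.x build exposes a `streamed_exec` flag on `merge` (check your version's signature before relying on it):

```python
# Same merge as result1 above, but with streamed execution explicitly
# enabled. (streamed_exec as a merge kwarg is an assumption based on the
# report; it may be the default, or absent, in other releases.)
result1_streamed = dt1.merge(
    source=source_table_new,
    predicate=incorrect_merge_predicate,
    source_alias="s",
    target_alias="t",
    streamed_exec=True,
).when_not_matched_insert_all().execute()
```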
Environment
Delta-rs version: 0.25.2
Binding: Python
Environment:
Bug
What happened:
When I upgraded to 0.25, a lot of my delta tables began to fail with pod OOM errors despite providing more and more memory to them. Disabling `streamed_exec=True` fixed this for me. It seems like partition pruning is not happening: tables with hundreds or thousands of partitions are being fully scanned even when the merge predicates are specific and point directly to a single partition. For example, one of my tables is just a single date of data (76k rows) which is merged into a table partitioned by month_id and date_id. This should be (and used to be) a quick merge, since the predicate we use is explicit. With `streamed_exec=True`, though, this fails, and debugging shows all 2600+ files are scanned.
What you expected to happen:
Predicates which filter out partitions should be respected.
How to reproduce it:
See the repro script in the comments above.
More details:
The results show that when `streamed_exec=True`, the merge is not skipping all files.
Results:
FYI - I noticed that when I quoted the partition values in the predicate, it made the situation worse. month_id and date_id are integers in the data, but my custom merge function adds partitions to the predicate, and I am quoting all values. So my real-world example is impacted by this. I was under the impression that we always need to use strings with delta-rs. Maybe that is just for partition filters, like `[("month_id", "=", "202502")]`, and not predicates?
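For custom merge builders like the one described above, a possible workaround is to quote a value only when the column's Arrow type is a string type. A hypothetical sketch (`format_partition_literal` and `build_partition_predicate` are not deltalake APIs; only integer and string columns are handled):

```python
import pyarrow as pa

def format_partition_literal(field: pa.Field, value) -> str:
    # Quote only string-typed columns; render everything else bare.
    # (Hypothetical helper; extend for dates, decimals, etc. as needed.)
    if pa.types.is_string(field.type) or pa.types.is_large_string(field.type):
        return f"'{value}'"
    return str(value)

def build_partition_predicate(schema: pa.Schema, parts: dict) -> str:
    # Build the partition-pinning clauses of a merge predicate with
    # literals typed to match each partition column.
    clauses = []
    for name, value in parts.items():
        literal = format_partition_literal(schema.field(name), value)
        clauses.append(f"s.{name} = t.{name} AND t.{name} = {literal}")
    return " AND ".join(clauses)

# Example with the repro's schema:
# build_partition_predicate(schema, {"month_id": 202502, "date_id": 20250226})
# -> "s.month_id = t.month_id AND t.month_id = 202502 AND
#     s.date_id = t.date_id AND t.date_id = 20250226"
```

On the string question: the tuple-style partition filters operate on the string-encoded partition values stored in the Delta log, which is presumably why string values work there, whereas merge predicates are parsed as SQL by DataFusion, where the literal's type matters.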