list.any/all block predicate pushdown #21014

Open
2 tasks done
amotzop opened this issue Jan 30, 2025 · 0 comments
Labels
A-optimizer (Area: plan optimization), bug (Something isn't working), P-medium (Priority: medium)

amotzop commented Jan 30, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

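import polars as pl

# 100 rows, each holding a single-element boolean list
df = pl.DataFrame({"lst": [[True]] * 100})
# Show the optimized plan for list.any() followed by head()
print(df.lazy().select(check=pl.col("lst").list.any()).head().explain())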

prints

SLICE[offset: 0, len: 5]
   SELECT [col("lst").list.any().alias("check")] FROM
    DF ["lst"]; PROJECT 1/1 COLUMNS

while

print(df.lazy().select(check=pl.col("lst").list.contains(True)).head().explain())

prints

SELECT [col("lst").list.contains([true]).alias("check")] FROM
  DF ["lst"]; PROJECT 1/1 COLUMNS

Log output

Issue description

If you use list.any/all and then filter on something else, the filtering step isn't pushed down to before the any/all call. The example above shows the same effect for head(): with list.any the SLICE node stays above the SELECT, while with list.contains(True) no separate SLICE node remains in the plan.
I also saw similar optimization problems around these functions. For example, filtering on list.any and then calling head should be quite fast, since execution could stop as soon as 5 matching rows are found, but it can be very slow for large frames (I think because the filter is evaluated on all rows).

Expected behavior

These optimizations should be applied here as well. list.any behaves slightly differently from list.contains(True), but not enough to explain the difference in the plans.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-6.8.0-49-generic-x86_64-with-glibc2.39
Python:              3.12.3 (main, Jan 17 2025, 18:03:48) [GCC 13.3.0]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           3.6.3
numpy                1.26.4
openpyxl             <not installed>
pandas               <not installed>
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
amotzop added the bug, needs triage, and python labels Jan 30, 2025
coastalwhite added the P-medium and A-optimizer labels and removed the needs triage and python labels Jan 30, 2025