Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow native Delta reader to prune partitions before scanning underlying Parquet #20998

Open
29antonioac opened this issue Jan 30, 2025 · 0 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@29antonioac
Copy link
Contributor

29antonioac commented Jan 30, 2025

Description

When scanning a Delta Table, the native Parquet reader scans all the Parquet files without pruning partitions. This is achieved with file_uris() call while scanning.

https://github.com/pola-rs/polars/blob/main/py-polars/polars/io/delta.py#L393

This is not a problem when the number of partitions is small. However, when the number is large (in my case 12 years of daily partitions) the Parquet reader scans every single file. The result will be the same, but lots of Parquet metadata have to be downloaded and checked.

Since the scan_delta function can't have access to filters applied afterwards, I was wondering if an extra partition_filters argument could be added, so it could be passed through file_uris(), as per the Delta documentation.

I'd be happy to open a PR myself if the issue gets accepted (I'd need some guidance to not break type checks though!). I'd start with something like

FilterLiteralType = Tuple[str, str, Any]
FilterConjunctionType = List[FilterLiteralType]

def scan_delta(
    source: str | DeltaTable,
    *,
    version: int | str | datetime | None = None,
    storage_options: dict[str, Any] | None = None,
    credential_provider: CredentialProviderFunction | Literal["auto"] | None = "auto",
    delta_table_options: dict[str, Any] | None = None,
    partition_filters: Optional[FilterConjunctionType] = None,
    use_pyarrow: bool = False,
    pyarrow_options: dict[str, Any] | None = None,
    rechunk: bool | None = None,
) -> LazyFrame:

# Rest of the code

file_uris = dl_tbl.file_uris(partition_filters=partition_filters)

# Rest of the code

Thanks for the hard work on Polars!

@29antonioac 29antonioac added the enhancement New feature or an improvement of an existing feature label Jan 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

1 participant