pyarrow.dataset should be used rather than pyarrow.parquet #79

Open
barbuz opened this issue Sep 13, 2024 · 0 comments
Labels
enhancement New feature or request

barbuz commented Sep 13, 2024

The pyarrow.dataset module is a newer implementation than pyarrow.parquet that aims to offer an interface less specific to Python or to Parquet.

More interestingly, pyarrow.dataset also boasts improved performance and new features (e.g. filtering on rows within files rather than only on partition keys).

In general, we should try to use the new module rather than the old one whenever possible. From what I've seen, the parquet module still has more powerful tools to write parquet files, but the dataset module is better at reading them. This change should then mostly affect the notebooks, switching from using the ParquetDataset class to reading data with the pyarrow.dataset.dataset function.

Also, when reading data from S3, a good performance improvement (especially for datasets with many partitions) can be obtained by explicitly passing a filesystem argument to the dataset function, using an object created via s3fs.S3FileSystem or pyarrow's own S3 filesystem. I have found this to be true for both the parquet and dataset modules, but I could not find documentation about this.

@barbuz barbuz added the enhancement New feature or request label Sep 13, 2024