pyarrow.dataset should be used rather than pyarrow.parquet #79

barbuz · 2024-09-13T02:07:17Z

The pyarrow.dataset module is a newer implementation than pyarrow.parquet that aims to provide an interface less specific to Python or to Parquet.

More interestingly, pyarrow.dataset also boasts improved performance and new features (e.g. filtering within files rather than only on partition keys).

In general, we should try to use the new module rather than the old one whenever possible. From what I've seen, the parquet module still has more powerful tools for writing Parquet files, but the dataset module is better at reading them. This change should therefore mostly affect the notebooks, which would switch from the ParquetDataset class to reading data with the pyarrow.dataset.dataset function.

Also, when reading data from S3, a good performance improvement (especially for datasets with many partitions) can be obtained by explicitly passing a filesystem argument to the dataset function, using an object created via s3fs.S3FileSystem or pyarrow's own S3 filesystem. I have found this to be true for both the parquet and dataset modules, but I could not find documentation about this.


barbuz added the enhancement New feature or request label Sep 13, 2024
