The pyarrow.dataset module is a newer implementation than pyarrow.parquet that aims to offer an interface less specific to Python or to parquet.
More interestingly, pyarrow.dataset also boasts improved performance and new features (e.g. filtering within files rather than only on partition keys).
In general, we should prefer the new module over the old one wherever possible. From what I've seen, the parquet module still has more powerful tools for writing parquet files, but the dataset module is better at reading them. This change should therefore mostly affect the notebooks, which would switch from the ParquetDataset class to reading data with the pyarrow.dataset.dataset function.
Also, when reading data from S3, a good performance improvement (especially for datasets with many partitions) can be obtained by explicitly passing a filesystem argument to the dataset function, using an object created via s3fs.S3FileSystem or pyarrow's own S3 filesystem. I have found this to be true for both the parquet and dataset modules, but I could not find documentation about it.