
Minor: explain why we recommend a certain size for row groups (#3122)
lhoestq authored Dec 18, 2024
1 parent 096abc2 commit 04f9b1e
Showing 1 changed file with 1 addition and 1 deletion.
docs/source/parquet.md: 1 addition & 1 deletion
```diff
@@ -200,7 +200,7 @@ To read and query the Parquet files, take a look at the [Query datasets from the
 ## Partially converted datasets
 
 The Parquet version can be partial in two cases:
-- if the dataset is already in Parquet format but it contains row groups bigger than the recommended size (100-300MB uncompressed)
+- if the dataset is already in Parquet format but it contains row groups bigger than the recommended size (100-300MB uncompressed). This size is better for memory usage since Parquet is streamed row group per row group in most data libraries.
 - if the dataset is not already in Parquet format or if it is bigger than 5GB.
 
 In that case the Parquet files are generated up to 5GB and placed in a split directory prefixed with "partial", e.g. "partial-train" instead of "train".
```
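For context on the added sentence: readers such as pyarrow load a Parquet file one row group at a time, so the uncompressed size of each row group sets the peak memory needed while streaming. The snippet below is a minimal sketch (not part of this commit) showing how to inspect row-group sizes with pyarrow; the file name `data.parquet` is a placeholder.

```python
import pyarrow.parquet as pq

# Hypothetical local file, used only for illustration.
parquet_file = pq.ParquetFile("data.parquet")

# Each row group is materialized as a unit when streaming, so its
# uncompressed size drives peak memory usage per reader.
for i in range(parquet_file.num_row_groups):
    row_group = parquet_file.metadata.row_group(i)
    size_mb = row_group.total_byte_size / (1024 * 1024)  # uncompressed column data
    print(f"row group {i}: {row_group.num_rows} rows, ~{size_mb:.0f} MB uncompressed")
```

When writing files yourself, `pyarrow.parquet.write_table(table, path, row_group_size=...)` caps the number of rows per row group, which is one way to keep groups within the recommended 100-300MB uncompressed range.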
