From 04f9b1efee148593271bbcc200a0db6252d22a9b Mon Sep 17 00:00:00 2001
From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Date: Wed, 18 Dec 2024 16:55:25 +0100
Subject: [PATCH] Minor: explain why we recommend a certain size for row groups
 (#3122)

minor: explain why we recommend a certain size for row groups
---
 docs/source/parquet.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/parquet.md b/docs/source/parquet.md
index 12ef6bd5f..bd4741bb8 100644
--- a/docs/source/parquet.md
+++ b/docs/source/parquet.md
@@ -200,7 +200,7 @@ To read and query the Parquet files, take a look at the [Query datasets from the
 
 ## Partially converted datasets
 
 The Parquet version can be partial in two cases:
-- if the dataset is already in Parquet format but it contains row groups bigger than the recommended size (100-300MB uncompressed)
+- if the dataset is already in Parquet format but it contains row groups bigger than the recommended size (100-300MB uncompressed). This size is better for memory usage since Parquet is streamed row group per row group in most data libraries.
 - if the dataset is not already in Parquet format or if it is bigger than 5GB. In that case the Parquet files are generated up to 5GB and placed in a split directory prefixed with "partial", e.g. "partial-train" instead of "train".