
Optimize ParquetReader #40

Merged
merged 3 commits into main from optimize-parquet-reader on Dec 15, 2023

Conversation

@mariosasko (Contributor) commented Dec 12, 2023

Optimize the ParquetReader component by reading files in chunks (of size 1000) and decoding only the content_key column (this reads fewer bytes and makes the Arrow-to-Python conversion faster).

This Colab shows the difference in speed:
https://colab.research.google.com/drive/14c1lvasWYg0ScsIxeTZcqIUVV8ulCYUi?usp=sharing
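For illustration, a minimal sketch of the chunked, column-projected read described above, using pyarrow directly; the file path, the "content" column name, and the 1000-row batch size are placeholders, not the reader's actual configuration:

    import pyarrow.parquet as pq

    def iter_texts(path: str, content_key: str = "content", batch_size: int = 1000):
        """Stream a Parquet file in fixed-size batches, decoding only one column."""
        parquet_file = pq.ParquetFile(path)
        # columns=[content_key] skips decoding every other column, and
        # iter_batches avoids materializing the whole file in memory at once.
        for batch in parquet_file.iter_batches(batch_size=batch_size, columns=[content_key]):
            # only one column was requested, so it sits at index 0
            for text in batch.column(0).to_pylist():
                yield text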

@mariosasko mariosasko requested a review from guipenedo December 12, 2023 18:20
@guipenedo (Collaborator)
Hi!
I suppose we can indeed increase the batch size a bit; I just don't want it to be so big that it considerably increases memory consumption. 1k is fine I guess, but maybe we can make it configurable?

Regarding the columns, the general approach with readers is to dump the remaining columns into metadata, as people might want to keep them (and even the id could come from the parquet file). Maybe also add an option to only load the text column?

Otherwise LGTM
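A rough sketch of what the comment above asks for, with the batch size exposed as a parameter and the remaining columns dumped into a metadata dict; the function and parameter names are illustrative, not the actual datatrove reader API:

    import pyarrow.parquet as pq

    def read_rows(path, content_key="content", batch_size=1000, keep_metadata=True):
        """Yield dicts of {text, metadata}, where metadata holds all non-text columns."""
        parquet_file = pq.ParquetFile(path)
        # columns=None reads everything; restricting to [content_key] loads only the text
        columns = None if keep_metadata else [content_key]
        for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
            for row in batch.to_pylist():
                text = row.pop(content_key)
                yield {"text": text, "metadata": row}  # leftover columns kept as metadata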

@mariosasko (Contributor, PR author)

The datasets lib uses a default batch size of 10_000, and its primary focus is not processing data at scale (we have received no complaints there about this 🙂), so I think 1000 here should be okay (we can expect users to have enough RAM).

@guipenedo (Collaborator) commented on this diff:

    if not document:
        continue
    li += 1
    yield document
Just a small note: the yield should be outside the track_time block, otherwise other blocks' execution time will be counted here.
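To illustrate the concern (track_time here is a stand-in for the pipeline's timing context manager, not its real implementation): a generator suspends at the yield, so if the yield sits inside the timed block, all the work the consumer does before requesting the next item is counted when the block finally exits.

    import time
    from contextlib import contextmanager

    @contextmanager
    def track_time(stats):
        # stand-in timer: accumulate elapsed seconds into a dict
        start = time.perf_counter()
        yield
        stats["seconds"] = stats.get("seconds", 0.0) + time.perf_counter() - start

    stats = {}

    def read(rows):
        for row in rows:
            with track_time(stats):
                document = row  # pretend the actual parsing happens here
                # the generator suspends here, inside the timed block, so the
                # caller's processing time gets added to this step's stats
                yield document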

@guipenedo (Collaborator)

Looks great, thanks a lot! Regarding the RAM comment: just as an example, I'm currently running a data processing job on a few thousand CPUs where each has ~2 GB of RAM, and usage is almost the full 2 GB per CPU, hence my concern about keeping memory usage low when possible. But with the configurable option we can now tune the performance/memory trade-off, so all good.
I've made a small change which addresses my yield comment above. Unfortunately it also adds a new list; if you find a better way to do it feel free to change it, if not we can merge.
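Presumably the change looks something like the following sketch (reusing the stand-in track_time from above; parse is a hypothetical per-row helper, not the actual commit): documents are collected into a list inside the timed block and only yielded after it exits, which is what introduces the extra list.

    def read_batch(rows, stats):
        documents = []
        with track_time(stats):
            for row in rows:
                document = parse(row)  # hypothetical per-row parsing
                if not document:
                    continue
                documents.append(document)
        # yielding outside the timed block keeps downstream time out of this step's stats
        yield from documents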

@guipenedo guipenedo merged commit 46750dd into main Dec 15, 2023
3 checks passed
@mariosasko mariosasko deleted the optimize-parquet-reader branch December 18, 2023 19:24