
Optimize ParquetReader #40

Merged
merged 3 commits into main from optimize-parquet-reader on Dec 15, 2023

Conversation

@mariosasko (Contributor) commented Dec 12, 2023

Optimize the ParquetReader component by reading files in chunks (of size 1000) and decoding only the content_key column (this reads fewer bytes and makes the Arrow-to-Python conversion faster).

This Colab shows the difference in speed:
https://colab.research.google.com/drive/14c1lvasWYg0ScsIxeTZcqIUVV8ulCYUi?usp=sharing
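For illustration, a minimal sketch of the chunked, column-projected read described above, using pyarrow directly; the file path, the "content" column name, and the 1000-row batch size are placeholders, not the reader's actual configuration:

    import pyarrow.parquet as pq

    def iter_texts(path: str, content_key: str = "content", batch_size: int = 1000):
        """Stream a Parquet file in fixed-size batches, decoding only one column."""
        parquet_file = pq.ParquetFile(path)
        # columns=[content_key] skips decoding every other column, and
        # iter_batches avoids materializing the whole file in memory at once.
        for batch in parquet_file.iter_batches(batch_size=batch_size, columns=[content_key]):
            # only one column was requested, so it sits at index 0
            for text in batch.column(0).to_pylist():
                yield text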

@mariosasko mariosasko requested a review from guipenedo December 12, 2023 18:20
@guipenedo (Collaborator)
Hi!
I suppose we can indeed increase the batch size a bit; I just don't want it to be so big that it considerably increases memory consumption. 1k is fine I guess, but maybe we can make it configurable?

Regarding the columns, the general approach with readers is to dump the remaining columns into metadata, as people might want to keep them (and even the id could come from the parquet file). Maybe also add an option to only load the text column?

Otherwise LGTM
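A rough sketch of what the comment above asks for, with the batch size exposed as a parameter and the remaining columns dumped into a metadata dict; the function and parameter names are illustrative, not the actual datatrove reader API:

    import pyarrow.parquet as pq

    def read_rows(path, content_key="content", batch_size=1000, keep_metadata=True):
        """Yield dicts of {text, metadata}, where metadata holds all non-text columns."""
        parquet_file = pq.ParquetFile(path)
        # columns=None reads everything; restricting to [content_key] loads only the text
        columns = None if keep_metadata else [content_key]
        for batch in parquet_file.iter_batches(batch_size=batch_size, columns=columns):
            for row in batch.to_pylist():
                text = row.pop(content_key)
                yield {"text": text, "metadata": row}  # leftover columns kept as metadata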

@mariosasko (Contributor, PR author)

The datasets lib uses a default batch size of 10_000, and its primary focus is not processing data at scale (we have received no complaints there about this 🙂), so I think 1000 here should be okay (we can expect users to have enough RAM).

@guipenedo (Collaborator) commented on this diff:

    if not document:
        continue
    li += 1
    yield document
Just a small note: the yield should be outside the track_time block, otherwise other blocks' execution time will be counted here.
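To illustrate the concern (track_time here is a stand-in for the pipeline's timing context manager, not its real implementation): a generator suspends at the yield, so if the yield sits inside the timed block, all the work the consumer does before requesting the next item is counted when the block finally exits.

    import time
    from contextlib import contextmanager

    @contextmanager
    def track_time(stats):
        # stand-in timer: accumulate elapsed seconds into a dict
        start = time.perf_counter()
        yield
        stats["seconds"] = stats.get("seconds", 0.0) + time.perf_counter() - start

    stats = {}

    def read(rows):
        for row in rows:
            with track_time(stats):
                document = row  # pretend the actual parsing happens here
                # the generator suspends here, inside the timed block, so the
                # caller's processing time gets added to this step's stats
                yield document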

@guipenedo (Collaborator)

Looks great, thanks a lot! Regarding the RAM comment: just as an example, I'm currently running a data processing job on a few thousand CPUs where each has ~2 GB of RAM, and usage is almost the full 2 GB per CPU, hence my concern about keeping memory usage low when possible. But with the configurable option we can now tune the performance/memory trade-off, so all good.
I've made a small change which addresses my yield comment above. Unfortunately it also adds a new list; if you find a better way to do it feel free to change it, if not we can merge.
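Presumably the change looks something like the following sketch (reusing the stand-in track_time from above; parse is a hypothetical per-row helper, not the actual commit): documents are collected into a list inside the timed block and only yielded after it exits, which is what introduces the extra list.

    def read_batch(rows, stats):
        documents = []
        with track_time(stats):
            for row in rows:
                document = parse(row)  # hypothetical per-row parsing
                if not document:
                    continue
                documents.append(document)
        # yielding outside the timed block keeps downstream time out of this step's stats
        yield from documents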

@guipenedo guipenedo merged commit 46750dd into main Dec 15, 2023
3 checks passed
@mariosasko mariosasko deleted the optimize-parquet-reader branch December 18, 2023 19:24