Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch all IO to fsspec #49

Merged
merged 34 commits into from
Jan 18, 2024
Merged

Switch all IO to fsspec #49

merged 34 commits into from
Jan 18, 2024

Conversation

guipenedo
Copy link
Collaborator

@guipenedo guipenedo commented Jan 10, 2024

WIP, a better solution needs to be found for tokenization and other memmap dependent blocks for sources like s3.
Paths can be passed as simple strings, or as tuples (str, fs options dictionary) or as a DataFolder object directly.

addresses #41

@guipenedo guipenedo requested a review from mariosasko January 10, 2024 16:07
@guipenedo
Copy link
Collaborator Author

Conflicts fixed

@guipenedo
Copy link
Collaborator Author

guipenedo commented Jan 12, 2024

Should be ready now. Replaced most mmaps with simple seek on streams (there is quite a bit of jumping around, mostly on the shuffling part of tokenization). I tried to optimize block_size a bit (and disabled caching) for this part, but it's still much slower than before (write file to disk, memmap it, create new one by shuffling that previous one and only upload the final one)
Would love to hear your thoughts/comments @mariosasko

Copy link
Contributor

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Regarding the mmap issue, I think we should be able to address it by using the LocalFileSystem's underlying file descriptor in the mmap functions.

src/datatrove/datafolder.py Outdated Show resolved Hide resolved
src/datatrove/datafolder.py Outdated Show resolved Hide resolved
src/datatrove/datafolder.py Outdated Show resolved Hide resolved
src/datatrove/datafolder.py Outdated Show resolved Hide resolved
src/datatrove/datafolder.py Outdated Show resolved Hide resolved
src/datatrove/datafolder.py Outdated Show resolved Hide resolved
src/datatrove/datafolder.py Outdated Show resolved Hide resolved
src/datatrove/datafolder.py Outdated Show resolved Hide resolved
src/datatrove/pipeline/dedup/minhash.py Outdated Show resolved Hide resolved
src/datatrove/pipeline/dedup/sentence_dedup.py Outdated Show resolved Hide resolved
@guipenedo
Copy link
Collaborator Author

Nice! Regarding the mmap issue, I think we should be able to address it by using the LocalFileSystem's underlying file descriptor in the mmap functions.

I will add a local "working_directory" for this part then I suppose

Copy link
Contributor

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more comments

src/datatrove/io.py Show resolved Hide resolved
src/datatrove/io.py Show resolved Hide resolved
@guipenedo guipenedo requested a review from mariosasko January 17, 2024 14:18
@guipenedo guipenedo merged commit 25b84b9 into main Jan 18, 2024
4 checks passed
@guipenedo guipenedo deleted the io-fsspec branch January 18, 2024 12:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants