-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use registry when creating Stream in StreamingDataset #858
Conversation
Hoping to request a review from @XiaohanZhangCMU too, but Github wouldn't let me do that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code and tests all lgtm, thanks for adding this! Could you please add a few sentences about registries in the streaming docs, likely in this section: https://docs.mosaicml.com/projects/streaming/en/stable/dataset_configuration/mixing_data_sources.html#using-streams
Can be very similar to what you have in the PR description. And a link to LLM Foundry's READMEs on registry usage would be useful too.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding tests and updating the docs! Just minor changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looking good! Left a few minor comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit, other than that LGTM
Co-authored-by: Saaketh Narayan <[email protected]>
Description of changes:
Add registry support to allow custom implementations of components like
Stream
, (in the future) batching methods and shuffling algorithms, similar to LLM foundry's registry.Changes made in this PR:
streams_registry
that contains theStream
classStreamingDataset
to use the stream instance constructed fromstreams_registry
Existing
Stream
andStreamingDataset
functionality remains unchanged, with added support for custom Stream implementations. Seetest_registry.py
and LLM foundry's README of registry for more usage examples. Minimal example:Issue #, if available:
N/A
Merge Checklist:
Put an
x
without space in the boxes that apply. If you are unsure about any checklist, please don't hesitate to ask. We are here to help! This is simply a reminder of what we are going to look for before merging your pull request.General
Tests
pre-commit
on my change. (check out thepre-commit
section of prerequisites)