Add image documentation #238

Merged
merged 62 commits into from
Oct 24, 2024

Conversation

Collaborator

@ryantwolf ryantwolf commented Sep 10, 2024

Description

This PR adds

  • Docstrings for all classes in the image curation.
  • API docs for the docstrings.
  • Pages in the user guide about image curation.
  • An updated README that covers image curation. I felt the README was getting too long with all the image curation stuff in it, so I removed a lot of the fine-grained details.

Usage

N/A

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Ryan Wolf <[email protected]>
@sarahyurick sarahyurick added the documentation label Oct 7, 2024
@ryantwolf
Collaborator Author

The docs are ready to review.

@ryantwolf ryantwolf marked this pull request as ready for review October 18, 2024 20:35
Collaborator

@sarahyurick sarahyurick left a comment

Great tutorial! Added general comments and proofreading.

@ryantwolf ryantwolf requested a review from sarahyurick October 22, 2024 21:49
@ryantwolf
Collaborator Author

@sarahyurick I addressed most of your feedback. I left comments where I disagreed / wanted clarification. Please take another look when you can.

Collaborator

@sarahyurick sarahyurick left a comment

Thanks for updating! I believe there should just be 3 more comments from my previous review that still need to be updated, otherwise LGTM.

@ryantwolf ryantwolf requested a review from sarahyurick October 24, 2024 17:16
@ryantwolf
Collaborator Author

I think I addressed all your concerns @sarahyurick, let me know if I missed anything though.

Collaborator

@sarahyurick sarahyurick left a comment

Thanks!

@ryantwolf ryantwolf merged commit 717da18 into main Oct 24, 2024
3 checks passed
@ryantwolf ryantwolf deleted the rywolf/image-docs branch October 24, 2024 20:30
VibhuJawa pushed a commit to VibhuJawa/NeMo-Curator that referenced this pull request Oct 29, 2024
* Add partial image implementation

Signed-off-by: Ryan Wolf <[email protected]>

* Refactor requirements

Signed-off-by: Ryan Wolf <[email protected]>

* Fix bugs

Signed-off-by: Ryan Wolf <[email protected]>

* Change from_map to map_partitions

Signed-off-by: Ryan Wolf <[email protected]>

* Add super constructor

Signed-off-by: Ryan Wolf <[email protected]>

* Add kwargs for load_object_on_worker

Signed-off-by: Ryan Wolf <[email protected]>

* Get proper epoch size

Signed-off-by: Ryan Wolf <[email protected]>

* Complete embedding creation loop

Signed-off-by: Ryan Wolf <[email protected]>

* Change devices

Signed-off-by: Ryan Wolf <[email protected]>

* Add device

Signed-off-by: Ryan Wolf <[email protected]>

* Refactor embedding creation and add classifier

Signed-off-by: Ryan Wolf <[email protected]>

* Fix bugs in classifiers

Signed-off-by: Ryan Wolf <[email protected]>

* Refactor model names

Signed-off-by: Ryan Wolf <[email protected]>

* Add model name

Signed-off-by: Ryan Wolf <[email protected]>

* Fix classifier bugs

Signed-off-by: Ryan Wolf <[email protected]>

* Allow postprocessing for classifiers

Signed-off-by: Ryan Wolf <[email protected]>

* Fix name and add print

Signed-off-by: Ryan Wolf <[email protected]>

* Fix variable name

Signed-off-by: Ryan Wolf <[email protected]>

* Add NSFW

Signed-off-by: Ryan Wolf <[email protected]>

* Update init for import

Signed-off-by: Ryan Wolf <[email protected]>

* Fix embedding size

Signed-off-by: Ryan Wolf <[email protected]>

* Add fused classifiers

Signed-off-by: Ryan Wolf <[email protected]>

* Fix missing index

Signed-off-by: Ryan Wolf <[email protected]>

* Update metadata for fused classifiers

Signed-off-by: Ryan Wolf <[email protected]>

* Add export to webdataset

Signed-off-by: Ryan Wolf <[email protected]>

* Fix missing id col

Signed-off-by: Ryan Wolf <[email protected]>

* Sort embeddings by id

Signed-off-by: Ryan Wolf <[email protected]>

* Add timm

Signed-off-by: Ryan Wolf <[email protected]>

* Update init file

Signed-off-by: Ryan Wolf <[email protected]>

* Add autocast to timm

Signed-off-by: Ryan Wolf <[email protected]>

* Update requirements and transform

Signed-off-by: Ryan Wolf <[email protected]>

* Add additional interpolation support

Signed-off-by: Ryan Wolf <[email protected]>

* Fix transform normalization

Signed-off-by: Ryan Wolf <[email protected]>

* Remove open_clip

Signed-off-by: Ryan Wolf <[email protected]>

* Add index path support to wds

Signed-off-by: Ryan Wolf <[email protected]>

* Address Vibhu's feedback

Signed-off-by: Ryan Wolf <[email protected]>

* Add import guard for image dataset

Signed-off-by: Ryan Wolf <[email protected]>

* Change default device

Signed-off-by: Ryan Wolf <[email protected]>

* Remove commented code

Signed-off-by: Ryan Wolf <[email protected]>

* Remove device id

Signed-off-by: Ryan Wolf <[email protected]>

* Fix index issue

Signed-off-by: Ryan Wolf <[email protected]>

* Add docstrings and standardize variable names

Signed-off-by: Ryan Wolf <[email protected]>

* Add image curation tutorial

Signed-off-by: Ryan Wolf <[email protected]>

* Add initial image docs

Signed-off-by: Ryan Wolf <[email protected]>

* Remove tutorial

Signed-off-by: Ryan Wolf <[email protected]>

* Add dataset docs

Signed-off-by: Ryan Wolf <[email protected]>

* Add embedder documentation

Signed-off-by: Ryan Wolf <[email protected]>

* Revert embedding column name change

Signed-off-by: Ryan Wolf <[email protected]>

* Update user guide for images

Signed-off-by: Ryan Wolf <[email protected]>

* Update README

Signed-off-by: Ryan Wolf <[email protected]>

* Update README with RAPIDS nightly instructions

Signed-off-by: Ryan Wolf <[email protected]>

* Fix formatting issues in image documentation

Signed-off-by: Ryan Wolf <[email protected]>

* Remove extra newline in README

Signed-off-by: Ryan Wolf <[email protected]>

* Address most of Sarah's feedback

Signed-off-by: Ryan Wolf <[email protected]>

* Add section summary

Signed-off-by: Ryan Wolf <[email protected]>

* Fix errors and REWORD GPU bullets in README

Signed-off-by: Ryan Wolf <[email protected]>

* Fix how table of contents displays with new sections

Signed-off-by: Ryan Wolf <[email protected]>

---------

Signed-off-by: Ryan Wolf <[email protected]>
ayushdg pushed a commit to ayushdg/NeMo-Curator that referenced this pull request Oct 30, 2024
Signed-off-by: Ayush Dattagupta <[email protected]>
vinay-raman pushed a commit to vinay-raman/NeMo-Curator that referenced this pull request Nov 26, 2024
Signed-off-by: Vinay Raman <[email protected]>
ruchaa-apte pushed a commit to ruchaa-apte/NeMo-Curator that referenced this pull request Dec 13, 2024
Signed-off-by: Rucha Apte <[email protected]>
Labels
documentation Improvements or additions to documentation
2 participants