Container Streaming/Retriever #3173

Merged
merged 11 commits into from
Jan 30, 2025

Collaborator

@nvidianz nvidianz commented Jan 23, 2025

Description

  1. ContainerStreamer to stream containers.
  2. ContainerRetriever to retrieve containers from a remote site.
  3. Added examples for file and container streaming.
  4. Merged the class loading functions in class_utils and FOBS.
  5. Fixed an F3 bug that wiped out the original exception.
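
To make the idea behind items 1 and 2 concrete, here is a minimal conceptual sketch of sending a container (a dict) one entry at a time and rebuilding it on the receiving side. This is not NVFlare's `ContainerStreamer` or `ContainerRetriever` API: the function names are hypothetical, and `pickle` stands in for the real serializer purely for illustration. It only shows why the largest serialized payload shrinks from the whole container to the largest single entry.

```python
import pickle
from typing import Any, Dict, Iterator, Tuple


def stream_entries(container: Dict[str, Any]) -> Iterator[Tuple[str, bytes]]:
    """Yield one serialized entry at a time instead of one blob for the whole dict.

    Hypothetical helper; pickle is an assumption used only for this sketch.
    """
    for key, value in container.items():
        yield key, pickle.dumps(value)


def retrieve_entries(chunks: Iterator[Tuple[str, bytes]]) -> Dict[str, Any]:
    """Rebuild the container on the receiving side, one entry at a time."""
    result: Dict[str, Any] = {}
    for key, payload in chunks:
        result[key] = pickle.loads(payload)
    return result


def peak_payload_sizes(container: Dict[str, Any]) -> Tuple[int, int]:
    """Return (size of a single whole-dict blob, size of the largest per-entry blob)."""
    whole = len(pickle.dumps(container))
    largest = max(len(payload) for _, payload in stream_entries(container))
    return whole, largest


if __name__ == "__main__":
    # Toy "model" with two layers of different sizes.
    model = {"layer1": b"x" * 1000, "layer2": b"y" * 4000}
    assert retrieve_entries(stream_entries(model)) == model
    whole, largest = peak_payload_sizes(model)
    print(f"whole-dict payload: {whole} bytes, largest entry payload: {largest} bytes")
```

With entry-wise streaming, the serialized data held in flight is bounded by the largest entry rather than by the full container.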

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@nvidianz
Collaborator Author

/build

Collaborator

@yanchengnv yanchengnv left a comment

Looks good in general. See my comments for enhancement.

@ZiyueXu77
Collaborator

A general comment: shall we compare the memory footprint of standard transmission vs. dict streaming vs. file streaming? I will experiment with ways to track the memory usage; I think using JobAPI should work.
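
One possible way to track peak memory during such a comparison is a background sampler that polls a reading function at a fixed interval and keeps the maximum. The sketch below is only an illustration: the class name, the default 0.5 s interval, and the injected `read_used_mb` callable are assumptions, not part of this PR or of JobAPI.

```python
import threading
import time
from typing import Callable


class PeakMemoryTracker:
    """Poll a memory-reading callable in a background thread and keep the peak."""

    def __init__(self, read_used_mb: Callable[[], float], interval: float = 0.5):
        self._read = read_used_mb          # hypothetical: returns used memory in MB
        self._interval = interval          # seconds between samples (assumed default)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak_mb = 0.0

    def _run(self) -> None:
        while True:
            self.peak_mb = max(self.peak_mb, self._read())
            # wait() returns True once stop is requested, ending the loop
            if self._stop.wait(self._interval):
                break

    def __enter__(self) -> "PeakMemoryTracker":
        self._thread.start()
        return self

    def __exit__(self, *exc) -> bool:
        self._stop.set()
        self._thread.join()
        # Take one final sample after the sampler thread has stopped.
        self.peak_mb = max(self.peak_mb, self._read())
        return False


if __name__ == "__main__":
    # Fake readings stand in for a real system-memory query.
    readings = iter([100.0, 300.0, 200.0])
    with PeakMemoryTracker(lambda: next(readings, 150.0), interval=0.01) as tracker:
        time.sleep(0.2)  # the workload under test would run here
    print(f"peak usage: {tracker.peak_mb} MB")
```

If `psutil` is installed, a reader such as `lambda: psutil.virtual_memory().used / 2**20` could be passed in. That measures the whole system, so only differences between runs on the same machine are comparable.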

@ZiyueXu77
Collaborator

I did some experiments, but basic transmission without dict streaming seems to have a memory footprint similar to streaming. I am not sure whether this is expected. (@nvidianz I created a PR on your branch; let me know if you find any mistakes in my code.)

@ZiyueXu77
Collaborator

ZiyueXu77 commented Jan 27, 2025

I misinterpreted the results last Friday; they actually make sense.

I ran a test with a 6 GB model:
Communicator - INFO - Received from simulator_server server. getTask: retrieve_dict size: 6GB (5994010153 Bytes)
System memory usage was printed every 0.5 s. Peak usage:
Simulator w/o dict streaming: 57451 MB
Simulator w/ dict streaming: 53147 MB
POC w/o dict streaming: 40910 MB
POC w/ dict streaming: 35422 MB

Since the largest layer is ~1 GB, the expected difference is ~5 GB, so the measured differences above are reasonable. Note that the recorded memory usage is for the whole system, so only the difference is meaningful.
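
As a quick check of the arithmetic, the snippet below computes the peak-memory difference for each mode from the four reported numbers. The dictionary layout and function name are just for illustration.

```python
# Peak system memory (MB) as reported in the comment above.
peak_mb = {
    "simulator": {"without": 57451, "with": 53147},
    "poc": {"without": 40910, "with": 35422},
}


def streaming_savings_mb(results):
    """Peak usage without dict streaming minus peak usage with it, per mode."""
    return {mode: r["without"] - r["with"] for mode, r in results.items()}


savings = streaming_savings_mb(peak_mb)
for mode, saved in savings.items():
    print(f"{mode}: {saved} MB lower peak with dict streaming")
```

This gives 4304 MB for the simulator and 5488 MB for POC, both in the neighborhood of the ~5 GB expected when at most one ~1 GB layer is in flight instead of the full 6 GB model.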

Collaborator

@ZiyueXu77 ZiyueXu77 left a comment

Looks good, and I can confirm the functionality. This PR is ready to merge once Yan's comment is addressed. I will create a separate PR to refine the example.

@nvidianz
Collaborator Author

/build

@nvidianz nvidianz force-pushed the container-streaming branch from 0bbc352 to 344d606 Compare January 30, 2025 10:21
@nvidianz
Collaborator Author

/build

@ZiyueXu77 ZiyueXu77 requested a review from yanchengnv January 30, 2025 14:54
Collaborator

@yanchengnv yanchengnv left a comment

LGTM

@nvidianz nvidianz merged commit dbaef05 into NVIDIA:main Jan 30, 2025
20 checks passed
@nvidianz nvidianz deleted the container-streaming branch January 30, 2025 15:56
3 participants