
[Ray Core] - Add ability to specify gpu memory resources in addition to gpu units #37574

Open
achordia20 opened this issue Jul 19, 2023 · 9 comments · May be fixed by #41147
Labels: core (Issues that should be addressed in Ray Core), core-hardware (Hardware support in Ray core: e.g. accelerators), core-scheduler, enhancement (Request for new feature and/or capability), P2 (Important issue, but not time-critical)

Comments

achordia20 commented Jul 19, 2023

Description

Rather than only allowing a num_gpus resource, it would be great to also be able to specify GPU memory as a logical resource requirement.

Use case

This would allow us to port workloads across different GPU types very easily. Right now, we have to adjust GPU resource requests for each GPU type depending on how much GPU memory it has available.
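
To make the use case concrete, here is a hypothetical sketch: today the fractional GPU request has to be retuned for every GPU type, whereas a memory-based request could stay the same everywhere. The `gpu_memory` argument shown in the final comment is illustrative only and is not part of the Ray API at the time this issue was opened.

```python
import ray

# Today: the same 10 GiB model needs a different fractional GPU request
# depending on which GPU type the cluster has, so the value must be
# retuned whenever the workload moves to different hardware.
@ray.remote(num_gpus=0.25)   # tuned for a 40 GiB GPU (e.g. A100-40G)
def infer_on_large_gpu(batch):
    ...

@ray.remote(num_gpus=0.65)   # retuned for a 16 GiB GPU (e.g. T4)
def infer_on_small_gpu(batch):
    ...

# What this issue asks for (hypothetical, not an existing ray.remote
# argument): state the memory requirement once and let Ray pick an
# appropriate GPU and fraction, e.g.
#   @ray.remote(gpu_memory=10 * 1024**3)
#   def infer(batch): ...
```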

@achordia20 achordia20 added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 19, 2023
@jjyao jjyao added the core Issues that should be addressed in Ray Core label Jul 19, 2023
rkooo567 (Contributor)

cc @ericl Do you have any thoughts on this issue?

ericl (Contributor) commented Jul 24, 2023

We've been discussing this for LLM serving use cases; it would solve some, but not all, of the problems of scheduling large models.

It's not a bad idea to add a logical "gpu_memory" resource automatically though, similar to how we add the "memory" logical resource. This could be done in the same code that adds the "accelerator_type" resource.
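
For reference, a rough approximation of this is already possible today with an ordinary custom resource, though nothing is detected or enforced automatically. The resource name and numbers below are arbitrary examples, not anything Ray provides out of the box.

```python
import ray

# Workaround sketch using a custom resource: advertise GPU memory (in MiB
# here) when starting the node, then request it together with num_gpus.
# Ray treats "gpu_memory_mb" as an opaque counter; it is not read from the
# hardware and not enforced against actual memory usage.
ray.init(num_gpus=1, resources={"gpu_memory_mb": 16_000})

@ray.remote(num_gpus=0.5, resources={"gpu_memory_mb": 8_000})
def run_inference(batch):
    # ... load the model and run the batch on the allotted GPU slice ...
    return batch
```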

cadedaniel (Member)

@achordia20 could you add more detail here? What kind of workloads are you able to port between GPU types?

> This would allow us to port workloads across different GPU types very easily. Right now, we have to adjust GPU resource requests for each GPU type depending on how much GPU memory it has available.

Asking because in our experience there's a lot of per-GPU-type configuration that needs to change 😄 . Are there workloads that can easily move between GPU types?

@rkooo567 rkooo567 added P1.5 Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 2, 2023
@jjyao jjyao added core-scheduler core-hardware Hardware support in Ray core: e.g. accelerators labels Sep 25, 2023
martystack

I would also like to be able to schedule on GPU memory. I think it provides a better utilization strategy for Ray users than an admin segmenting off certain GPUs for one task vs. another. My team sees all kinds of configurations, from very advanced GPU rigs to simple ones. GPU memory seems to be the most logical way to request and allocate resources.

ericl (Contributor) commented Oct 5, 2023

Maybe one way we could support this is by translating gpu_memory into GPU requests for specific accelerator type label(s) under the hood (i.e., it's syntactic sugar for manually specifying accelerator types). That way we wouldn't have to make changes to the scheduler internals.
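
To illustrate the idea, the translation could look roughly like the sketch below, which rewrites a gpu_memory request as a fractional num_gpus request for each accelerator type with enough memory. The memory table and helper are made up for illustration; they are not part of Ray's scheduler.

```python
# Hypothetical sketch of the "syntactic sugar" translation described above.
GPU_MEMORY_BYTES = {
    "T4": 16 * 1024**3,
    "A10G": 24 * 1024**3,
    "A100-40G": 40 * 1024**3,
}

def gpu_memory_to_requests(gpu_memory: int) -> dict[str, float]:
    """Return {accelerator_type: num_gpus_fraction} options that could
    satisfy the requested amount of GPU memory."""
    options = {
        accel: round(gpu_memory / total, 4)
        for accel, total in GPU_MEMORY_BYTES.items()
        if gpu_memory <= total
    }
    if not options:
        raise ValueError("No known accelerator type has enough GPU memory")
    return options

# Example: a 10 GiB request could be satisfied by 0.625 of a T4,
# ~0.417 of an A10G, or 0.25 of an A100-40G.
print(gpu_memory_to_requests(10 * 1024**3))
```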

thatcort

It's a bit strange to specify a percentage of a GPU that's required, since you don't know in advance the specs of the GPU the task will be scheduled on.

jonathan-anyscale (Contributor)

Hi, a quick update on this: the REP and a prototype are ready for review. Please try them out and leave feedback!
Prototype: #41147
REP: ray-project/enhancements#47

jonathan-anyscale (Contributor)

@thatcort @martystack @achordia20 have you had a chance to take a look at the REP and try the prototype?

thatcort

I added a comment on the REP doc. Overall it looks good! It would be a nice improvement to be able to specify that a task needs multiple GPUs with a certain amount of memory on each.

@jjyao jjyao added P2 Important issue, but not time-critical and removed P1.5 Issues that will be fixed in a couple releases. It will be bumped once all P1s are cleared labels Nov 12, 2024