fix: Enhance checks around KIND_GPU and tensor parallelism #42
Conversation
Didn't mean for the LoRA stuff to be removed - will fix that.
LGTM until/if we find a different way. It would be good to add a section on deploying on multiple GPUs (TP = 1, TP > 1), but I'll defer to others on other ways to tackle it.
Marking draft while I fix a couple of things.
I need LoRA back before I can approve.
src/model.py
Outdated
)
# NOTE: this only affects this process and its subprocesses, not other processes.
# vLLM doesn't currently seem to expose selecting a specific device in the APIs.
os.environ["CUDA_VISIBLE_DEVICES"] = triton_device_id
I'm going to leave my observations here for the record.
I also tried LOCAL_RANK. It is very flaky, and the GPU block calculations do not correspond to what is calculated in the "GPU-isolated" case.
LGTM, my only ask is to try re-running the tests a couple of times to see if any flakiness is happening.
Thanks for getting the fix in quickly!
…odel instance_count
935bf92
I think we want to allow the model to interact with other GPUs even if KIND_GPU and a specific device are specified. The reason is that the model can be part of an ensemble pipeline, and it may want to copy tensors from other devices even though the actual execution is happening on device_id. I think it is better to set the default context to the device ID specified by Triton. For vLLM, this can be achieved using
@Tabrizian Sure, I'll try setting the device instead. There may be fewer unintended consequences that way.
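A minimal sketch of the approach being discussed, assuming the suggestion refers to setting the default CUDA device with torch.cuda.set_device (the function name below is illustrative):

```python
import torch


def pin_default_device(triton_device_id: str) -> None:
    # Make the GPU assigned by Triton the default device for this process
    # instead of hiding the other GPUs with CUDA_VISIBLE_DEVICES. Tensors on
    # other devices stay visible, so d2d copies (e.g. from other models in an
    # ensemble) still work while execution happens on the assigned device.
    torch.cuda.set_device(int(triton_device_id))
```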
…ding other GPUs for d2d copies
Co-authored-by: Olga Andreeva <[email protected]>
function run_multi_gpu_test() {
    export KIND="${1}"
    export TENSOR_PARALLELISM="${2}"
    export INSTANCE_COUNT="${3}"
Do you need export? Looks like all usages are local.
Okay, I see it now. We should try to move the server setup into the pytest (setup/teardown); @jbkyang-nvi had done something similar.
Do you or @jbkyang-nvi have a reference for that? If not, I can probably do all of this inside the pytest a bit more easily using the in-process Python API, as long as we don't need any frontend features and @oandreeva-nv doesn't mind.
It's just turning what we do in bash into Python (spawning processes, file system manipulation, etc.). Seems like the change was reverted, sadly:
triton-inference-server/server#7195 (comment)
I agree a common set of utils to prep/start/stop the server via python+subprocess would be great. That would probably take me some time to write well, though. Can I merge these tests using bash and follow up on this after we deal with the P0s and pipeline failures? I'll take this test as a specific example to refactor using the common util I write. @GuanLuo @oandreeva-nv
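For reference, a rough sketch of what moving the server setup into the Python test could look like, spawning tritonserver with subprocess in a unittest setup/teardown; the model repository path, readiness handling, and class name are illustrative:

```python
import os
import subprocess
import time
import unittest


class VLLMMultiGPUTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Spawn tritonserver instead of starting it from the bash script.
        cls._server = subprocess.Popen(
            [
                "tritonserver",
                "--model-repository",
                os.environ.get("MODEL_REPO", "./models"),
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
        # A real utility would poll the readiness endpoint instead of sleeping.
        time.sleep(30)

    @classmethod
    def tearDownClass(cls):
        cls._server.terminate()
        cls._server.wait()
```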
# Run unit tests
set +e
CLIENT_LOG="./vllm_multi_gpu_test--${KIND}_tp${TENSOR_PARALLELISM}_count${INSTANCE_COUNT}--client.log"
python3 $CLIENT_PY -v > $CLIENT_LOG 2>&1
Running all unit tests against different settings? Is that necessary?
There's only a single test right now, just lots of helpers. If I move the server/model setup into the python test like you mentioned, then the bash part can be simplified.
if int(tp) * int(instance_count) != 2:
    msg = "TENSOR_PARALLELISM and INSTANCE_COUNT must have a product of 2 for this 2-GPU test"
    print("Skipping Test:", msg)
    self.skipTest(msg)
Can we put this into @unittest.skipIf? It would be easier to locate then.
I think I'd have to move the tp and instance_count values to be global, or passed directly to the test somehow, to do this. I was trying to avoid being too fancy with these tests, but it looks like I'll need to rethink them based on the comments so far.
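For illustration, a sketch of the @unittest.skipIf variant being discussed, assuming the settings are read from environment variables at module level (variable names mirror the bash script above):

```python
import os
import unittest

KIND = os.environ.get("KIND", "KIND_GPU")
TP = int(os.environ.get("TENSOR_PARALLELISM", "1"))
INSTANCE_COUNT = int(os.environ.get("INSTANCE_COUNT", "1"))


class VLLMMultiGPUTest(unittest.TestCase):
    @unittest.skipIf(
        TP * INSTANCE_COUNT != 2,
        "TENSOR_PARALLELISM and INSTANCE_COUNT must have a product of 2 for this 2-GPU test",
    )
    @unittest.skipIf(
        KIND == "KIND_MODEL" and INSTANCE_COUNT > 1,
        "Testing multiple model instances of KIND_MODEL is not implemented at this time",
    )
    def test_multi_gpu_model(self):
        ...
```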
if kind == "KIND_MODEL" and int(instance_count) > 1: | ||
msg = "Testing multiple model instances of KIND_MODEL is not implemented at this time" | ||
print("Skipping Test:", msg) | ||
self.skipTest(msg) |
ditto
Follow-up ticket for the threads I'm leaving unresolved: DLIS-6804
What does the PR do?
Problem
When loading multiple instances of a vLLM model on a multi-GPU system (the default behavior with KIND_GPU, which is the default instance group kind when left unspecified), all model instances will default to the same device and can cause a CUDA OOM, rather than loading a model on each GPU device assigned by the instance group settings.
This is rooted in Triton's KIND_GPU behavior, which assumes the model is assigned only one GPU. In the future, KIND_GPU may be expanded to define a set of multiple GPUs for a single model. However, for now the recommendation is to use KIND_MODEL when a model can have multiple GPUs and use them freely (such as Python models).
Solution
These changes account for this assumption by isolating the assigned GPU device ID when KIND_GPU is used, and by raising an error that recommends KIND_MODEL when the vLLM config implies that this is a multi-GPU model (such as tensor_parallel_size > 1). A rough sketch of this kind of check is shown below.
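This is an illustrative sketch only, not the exact code in this PR; the function name is hypothetical, and the kind string assumes the value the Python backend receives for KIND_GPU instances:

```python
def check_instance_kind(kind: str, tensor_parallel_size: int) -> None:
    # A KIND_GPU instance is pinned to a single GPU by Triton, so a vLLM
    # config that needs multiple GPUs should use KIND_MODEL instead and
    # manage its devices itself.
    if kind == "GPU" and tensor_parallel_size > 1:
        raise ValueError(
            "tensor_parallel_size > 1 requires multiple GPUs, but KIND_GPU "
            "assigns a single GPU per model instance. Please use instance "
            "group kind KIND_MODEL for multi-GPU models."
        )
```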
Checklist

- <commit_type>: <Title>

Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Test plan
- Change: Success, specify the device ID assigned by Triton Core (from the instance group) when initializing vLLM to avoid OOM from all instances defaulting to device 0.
- Change: Failure, with a clear error to specify KIND_MODEL instead for multi-GPU models.
- Success, and automatically uses the Ray worker.
Caveats

- If there are other multi-GPU-related vLLM settings besides tensor_parallel_size, we can add checks for those too.
- For KIND_MODEL models with tensor_parallelism == 1 and model_instance_count > 1, we run into the same issue where vLLM will try to allocate each instance on the same GPU. This could be enhanced with a similar check in this PR (see the sketch after this list), or deferred to a future enhancement.
- For tensor_parallel_size: 2 with 8 GPUs at your disposal, there's currently no explicit check or validation around this. I don't know how this would behave; it would likely default to how the RayWorker logic in vLLM attempts to assign GPUs.
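As a follow-up to the second caveat, a hedged sketch of what a similar check might look like for KIND_MODEL; the function name and parameters are hypothetical, and the kind string assumes the value the Python backend receives for KIND_MODEL instances:

```python
def check_kind_model_instances(
    kind: str, tensor_parallel_size: int, instance_count: int
) -> None:
    # Multiple KIND_MODEL instances with tensor_parallel_size == 1 would all
    # try to allocate on the same GPU today, so flag that combination rather
    # than letting each instance OOM on device 0.
    if kind == "MODEL" and tensor_parallel_size == 1 and instance_count > 1:
        raise ValueError(
            "Multiple KIND_MODEL instances with tensor_parallel_size == 1 will "
            "all be placed on the same GPU. Reduce the instance count or use "
            "KIND_GPU so Triton assigns one GPU per instance."
        )
```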