fix: Enhance checks around KIND_GPU and tensor parallelism #42
Conversation
Didn't mean for the LoRA stuff to be removed - will fix that.
LGTM until/if we find a different way. It would be good to add a section on deploying on multiple GPUs (TP = 1, TP > 1), but I'll defer to others on other ways to tackle it.
Marking draft while I fix a couple of things.
I need LoRA back before I can approve.
src/model.py
Outdated
)
# NOTE: this only affects this process and its subprocesses, not other processes.
# vLLM doesn't currently seem to expose selecting a specific device in the APIs.
os.environ["CUDA_VISIBLE_DEVICES"] = triton_device_id
I'm going to leave my observations here for the record.
I also tried LOCAL_RANK. It is very flaky, and the GPU block calculations do not correspond to what is calculated in the "GPU-isolated" case.
LGTM, my only ask is to try re-running the tests a couple of times to see if any flakiness is happening.
Thanks for getting the fix in quickly!
…odel instance_count
935bf92
I think we want to allow the model to interact with other GPUs even if KIND_GPU and a specific device are specified. The reason is that the model can be part of an ensemble pipeline, and it may want to copy tensors from other devices even though the actual execution is happening on device_id. I think it is better to set the default context to the device ID specified by Triton. For vLLM, this can be achieved using
@Tabrizian Sure, I'll try setting the device instead. There may be fewer unintended consequences that way.
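A minimal sketch of the approach being discussed, assuming the suggestion refers to setting the default CUDA device with torch.cuda.set_device (the function name below is illustrative):

```python
import torch


def pin_default_device(triton_device_id: str) -> None:
    # Make the GPU assigned by Triton the default device for this process
    # instead of hiding the other GPUs with CUDA_VISIBLE_DEVICES. Tensors on
    # other devices stay visible, so d2d copies (e.g. from other models in an
    # ensemble) still work while execution happens on the assigned device.
    torch.cuda.set_device(int(triton_device_id))
```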
…ding other GPUs for d2d copies
Co-authored-by: Olga Andreeva <[email protected]>
function run_multi_gpu_test() {
    export KIND="${1}"
    export TENSOR_PARALLELISM="${2}"
    export INSTANCE_COUNT="${3}"
Do you need export? Looks like all usages are local.
Okay, I see it now. We should try to move the server setup into the pytest (setup/teardown); @jbkyang-nvi had done something similar.
Do you or @jbkyang-nvi have a reference for that? If not, I can probably do all of this inside the pytest a bit more easily using the in-process Python API, as long as we don't need any frontend features and @oandreeva-nv doesn't mind.
It's just turning what we do in bash into Python (spawning processes, file system manipulation, etc.). Seems like the change was reverted, sadly:
triton-inference-server/server#7195 (comment)
I agree a common set of utils to prep/start/stop the server via python+subprocess would be great. That would probably take me some time to write well, though. Can I merge these tests using bash and follow up on this after we deal with the P0s and pipeline failures? I'll take this test as a specific example to refactor using the common util I write. @GuanLuo @oandreeva-nv
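For reference, a rough sketch of what moving the server setup into the Python test could look like, spawning tritonserver with subprocess in a unittest setup/teardown; the model repository path, readiness handling, and class name are illustrative:

```python
import os
import subprocess
import time
import unittest


class VLLMMultiGPUTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Spawn tritonserver instead of starting it from the bash script.
        cls._server = subprocess.Popen(
            [
                "tritonserver",
                "--model-repository",
                os.environ.get("MODEL_REPO", "./models"),
            ],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
        # A real utility would poll the readiness endpoint instead of sleeping.
        time.sleep(30)

    @classmethod
    def tearDownClass(cls):
        cls._server.terminate()
        cls._server.wait()
```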
# Run unit tests
set +e
CLIENT_LOG="./vllm_multi_gpu_test--${KIND}_tp${TENSOR_PARALLELISM}_count${INSTANCE_COUNT}--client.log"
python3 $CLIENT_PY -v > $CLIENT_LOG 2>&1
Running all unit tests against different settings? Is that necessary?
There's only a single test right now, just lots of helpers. If I move the server/model setup into the python test like you mentioned, then the bash part can be simplified.
if int(tp) * int(instance_count) != 2:
    msg = "TENSOR_PARALLELISM and INSTANCE_COUNT must have a product of 2 for this 2-GPU test"
    print("Skipping Test:", msg)
    self.skipTest(msg)
Can we put this into @unittest.skipIf? It would be easier to locate then.
I think I'd have to move the tp and instance_count values to be global, or passed directly to the test somehow, to do this. I was trying to avoid being too fancy with these tests, but it looks like I'll need to rethink them based on the comments so far.
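For illustration, a sketch of the @unittest.skipIf variant being discussed, assuming the settings are read from environment variables at module level (variable names mirror the bash script above):

```python
import os
import unittest

KIND = os.environ.get("KIND", "KIND_GPU")
TP = int(os.environ.get("TENSOR_PARALLELISM", "1"))
INSTANCE_COUNT = int(os.environ.get("INSTANCE_COUNT", "1"))


class VLLMMultiGPUTest(unittest.TestCase):
    @unittest.skipIf(
        TP * INSTANCE_COUNT != 2,
        "TENSOR_PARALLELISM and INSTANCE_COUNT must have a product of 2 for this 2-GPU test",
    )
    @unittest.skipIf(
        KIND == "KIND_MODEL" and INSTANCE_COUNT > 1,
        "Testing multiple model instances of KIND_MODEL is not implemented at this time",
    )
    def test_multi_gpu_model(self):
        ...
```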
if kind == "KIND_MODEL" and int(instance_count) > 1: | ||
msg = "Testing multiple model instances of KIND_MODEL is not implemented at this time" | ||
print("Skipping Test:", msg) | ||
self.skipTest(msg) |
ditto
Follow-up ticket for the threads I'm leaving unresolved: DLIS-6804
What does the PR do?
Problem
When loading multiple instances of a vLLM model on a multi-GPU system (the default behavior with KIND_GPU, which is the default instance group kind when left unspecified), all model instances will default to the same device and can cause a CUDA OOM, rather than loading a model on each GPU device assigned by the instance group settings.
This is rooted in Triton's KIND_GPU behavior, which assumes the model is assigned only one GPU. In the future, KIND_GPU may be expanded to define a set of multiple GPUs for a single model. However, for now the recommendation is to use KIND_MODEL when a model can have multiple GPUs and use them freely (such as Python models).
Solution
These changes account for this assumption by isolating the assigned GPU device ID when KIND_GPU is used, and by raising an error that recommends KIND_MODEL when the vLLM config implies that this is a multi-GPU model (such as tensor_parallel_size > 1). A rough sketch of this kind of check is shown below.
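This is an illustrative sketch only, not the exact code in this PR; the function name is hypothetical, and the kind string assumes the value the Python backend receives for KIND_GPU instances:

```python
def check_instance_kind(kind: str, tensor_parallel_size: int) -> None:
    # A KIND_GPU instance is pinned to a single GPU by Triton, so a vLLM
    # config that needs multiple GPUs should use KIND_MODEL instead and
    # manage its devices itself.
    if kind == "GPU" and tensor_parallel_size > 1:
        raise ValueError(
            "tensor_parallel_size > 1 requires multiple GPUs, but KIND_GPU "
            "assigns a single GPU per model instance. Please use instance "
            "group kind KIND_MODEL for multi-GPU models."
        )
```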
Checklist

- <commit_type>: <Title>

Commit Type:
Check the conventional commit type box here and add the label to the GitHub PR.
Test plan
- Change: Success, specify the device ID assigned by Triton Core (from the instance group) when initializing vLLM to avoid OOM from all instances defaulting to device 0.
- Change: Failure, with a clear error to specify KIND_MODEL instead for multi-GPU models.
- Success, and automatically uses the Ray worker.
Caveats

- If there are other multi-GPU-related vLLM settings besides tensor_parallel_size, we can add checks for those too.
- For KIND_MODEL models with tensor_parallelism == 1 and model_instance_count > 1, we run into the same issue where vLLM will try to allocate each instance on the same GPU. This could be enhanced with a similar check in this PR (see the sketch after this list), or deferred to a future enhancement.
- For tensor_parallel_size: 2 with 8 GPUs at your disposal, there's currently no explicit check or validation around this. I don't know how this would behave; it would likely default to how the RayWorker logic in vLLM attempts to assign GPUs.
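As a follow-up to the second caveat, a hedged sketch of what a similar check might look like for KIND_MODEL; the function name and parameters are hypothetical, and the kind string assumes the value the Python backend receives for KIND_MODEL instances:

```python
def check_kind_model_instances(
    kind: str, tensor_parallel_size: int, instance_count: int
) -> None:
    # Multiple KIND_MODEL instances with tensor_parallel_size == 1 would all
    # try to allocate on the same GPU today, so flag that combination rather
    # than letting each instance OOM on device 0.
    if kind == "MODEL" and tensor_parallel_size == 1 and instance_count > 1:
        raise ValueError(
            "Multiple KIND_MODEL instances with tensor_parallel_size == 1 will "
            "all be placed on the same GPU. Reduce the instance count or use "
            "KIND_GPU so Triton assigns one GPU per instance."
        )
```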