What is the most efficient way to deploy a 72B model on 8 × A100? #3002

Chandler-Bing · 2025-01-20T08:16:27Z

Hi, say my baseline is 1 instance with TP=4, and its throughput is x.

Suppose I have 8 A100 GPUs and want to deploy a 72B model. I have the following two options:

| Method | Instances | TP | Throughput |
| --- | --- | --- | --- |
| baseline | 1 | 4 | x |
| A | 2 | 4 | 2x (apparently) |
| B | 1 | 8 | 1.5x |

I am a little confused: option B gives worse throughput, less than 2x. Is that normal?
Or how can I get throughput greater than 2x with just 8 A100 GPUs? (Or is that not possible?)
Thanks for helping!


JohnnyBoyzzz · 2025-01-20T09:40:30Z

Maybe you can try plan C: instance=4, TP=2. One 72B model can be served on two A100 GPUs.

zhaochenyang20 · 2025-01-21T19:26:21Z

I think this is normal. TP introduces communication overhead between GPUs, especially if your devices lack a fast interconnect such as NVLink. That overhead slows things down and keeps the speedup below 2x.
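To make the comparison concrete, here is a minimal sketch that normalizes the relative throughput figures from the table in the question by the number of GPUs each option uses. The numbers are the ones reported in the question (the 2x for option A is the asker's expectation, not a measurement), and the helper name `per_gpu_throughput` is only illustrative.

```python
# Relative throughput from the question's table (baseline = 1.0).
# Option A's 2.0 is the asker's expectation, not a measured value.
configs = {
    "baseline": {"instances": 1, "tp": 4, "throughput": 1.0},
    "A": {"instances": 2, "tp": 4, "throughput": 2.0},
    "B": {"instances": 1, "tp": 8, "throughput": 1.5},
}


def per_gpu_throughput(cfg):
    """Throughput divided by the total number of GPUs the option occupies."""
    gpus = cfg["instances"] * cfg["tp"]
    return cfg["throughput"] / gpus


baseline = per_gpu_throughput(configs["baseline"])

# Scaling efficiency: per-GPU throughput relative to the baseline.
efficiency = {
    name: per_gpu_throughput(cfg) / baseline for name, cfg in configs.items()
}

for name, eff in efficiency.items():
    print(f"{name}: {eff:.2f}x per-GPU throughput vs. baseline")
```

Under these figures, option B retains about 75% of the baseline's per-GPU throughput, which is consistent with extra communication cost at TP=8.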

I suggest using the router:

https://docs.sglang.ai/router/router.html

It would be better to test DP 4 × TP 2 or DP 2 × TP 4.
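As a rough aid for choosing which layouts to benchmark, the sketch below enumerates the DP × TP splits of 8 GPUs and estimates the weight memory each GPU would hold. It assumes FP16/BF16 weights (2 bytes per parameter) and the 80 GB A100 variant; both are assumptions, not details given in this thread. It ignores KV cache, activations, and runtime overhead, so it only indicates whether the weights fit, not how well a layout will perform.

```python
TOTAL_GPUS = 8
PARAMS = 72e9          # 72B parameters
BYTES_PER_PARAM = 2    # assumption: FP16/BF16 weights
GPU_MEMORY_GB = 80     # assumption: 80 GB A100 (set to 40 for the 40 GB variant)


def layouts(total_gpus):
    """Return (dp, tp) pairs that use exactly total_gpus GPUs."""
    return [(total_gpus // tp, tp) for tp in (1, 2, 4, 8) if total_gpus % tp == 0]


def weight_gb_per_gpu(tp):
    """Weight memory per GPU when the model is sharded evenly across tp GPUs."""
    return PARAMS * BYTES_PER_PARAM / tp / 1e9


for dp, tp in layouts(TOTAL_GPUS):
    weights = weight_gb_per_gpu(tp)
    headroom = GPU_MEMORY_GB - weights  # what is left for KV cache and activations
    status = "fits" if headroom > 0 else "does not fit"
    print(f"DP={dp} TP={tp}: {weights:.0f} GB weights/GPU, "
          f"{headroom:.0f} GB headroom ({status})")
```

Under these assumptions, DP 4 × TP 2 leaves only about 8 GB per GPU after the weights, so the KV cache (and therefore the batch size) would be small; that is one reason to measure both DP 4 × TP 2 and DP 2 × TP 4 rather than assume the higher DP wins.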
