Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
5. Please use English, otherwise it will be closed.
Describe the bug
Hello! I was running some online serving benchmarks with Qwen2-VL deployed with the latest release on 1 x H100. The dataset is the recently released vision arena dataset. The benchmark was conducted with the default output max_tokens=128.
For some reason, the last few requests always generate until the max context window (i.e., they do not respect max_tokens=128), and this issue is request-independent: I have tried different seeds (for shuffling) and numbers of requests to confirm this, but unfortunately I don't have the bandwidth to dig deeper into what exactly triggers this bug.
Below are the logs when this happens: the generated token count for the decode batch is way above 128, indicating something is wrong with the stopping criteria / abort operation.
Other (early) requests finish as expected; a minimal per-request check is sketched below.
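For reference, here is a minimal sketch (not part of the original report) for probing a single request. It assumes an sglang server with the OpenAI-compatible API already running at http://localhost:30000 with a dummy API key, and uses a text-only prompt; adjust the base URL, key, and prompt to your setup:

```python
# Minimal sketch: probe one request to see whether the server respects
# max_tokens. Assumes an OpenAI-compatible sglang server at localhost:30000
# (adjust as needed) serving Qwen/Qwen2-VL-7B-Instruct.
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{"role": "user", "content": "Tell me a long story."}],
    max_tokens=128,
)

used = resp.usage.completion_tokens
print(f"completion_tokens={used}, finish_reason={resp.choices[0].finish_reason}")
assert used <= 128, f"max_tokens violated: generated {used} tokens"
```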
Reproduction
You can run the benchmark script from this soon-to-be-merged PR: vllm-project/vllm#12389. (Happy to port it over to this repository as well if needed.)
Server launch command:
Benchmark launch command:
python3 benchmark_serving.py --model Qwen/Qwen2-VL-7B-Instruct --backend openai-chat --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --num-prompts 250 --request-rate 1 --percentile-metrics ttft,tpot,e2el
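Since the overruns only show up on the last few of many concurrent requests, a concurrent probe may reproduce the pattern more reliably than a single call. The following is an assumed stand-in, not the benchmark script itself: it fires N text-only requests with max_tokens=128 against the same hypothetical localhost:30000 endpoint and reports any responses that exceed the cap:

```python
# Async sketch (an assumption, not the benchmark script): send N requests
# concurrently with max_tokens=128 and flag any that overrun the cap,
# mirroring the "last few requests overrun" pattern described above.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

async def probe(i: int) -> int:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",
        messages=[{"role": "user", "content": f"Prompt #{i}: tell me a long story."}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

async def main(n: int = 50) -> None:
    counts = await asyncio.gather(*(probe(i) for i in range(n)))
    overruns = [(i, c) for i, c in enumerate(counts) if c > 128]
    print(f"{len(overruns)}/{n} requests exceeded max_tokens=128: {overruns}")

asyncio.run(main())
```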
Environment
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
'fields' has been removed
warnings.warn(message, UserWarning)
Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0: NVIDIA H100 80GB HBM3
GPU 0 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.103
CUDA Driver Version: 550.90.12
PyTorch: 2.5.1+cu124
sglang: 0.4.1.post7
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.1
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.7
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.60.0
anthropic: 0.45.0
decord: 0.6.0
NVIDIA Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 48-95,144-191 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 1048576