[Bug] Qwen2-VL Online Serving Issue #3098

Open · 5 tasks done
ywang96 (Contributor) opened this issue Jan 24, 2025 · 0 comments

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Hello! I was running some online serving benchmarks with Qwen2-VL deployed with the latest release on 1 x H100. The dataset is the recently released vision arena dataset. The benchmark was conducted with a default output max_tokens=128.

For some reason, the last few requests always generate until the max context window (i.e., they do not respect max_tokens=128). The issue appears to be request-independent: I have tried different seeds (for shuffling) and different numbers of requests to confirm this, but unfortunately I don't have the bandwidth to dig deeper into what exactly triggers this bug.
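
For reference, each benchmark request goes through the OpenAI-compatible chat completions endpoint with max_tokens=128. A minimal sketch of one such request is below (the image URL is a placeholder, not an actual vision-arena sample); for the affected requests, usage.completion_tokens in the response comes back far above 128:

import requests

# One benchmark-style request against the local sglang server.
# The image URL is a placeholder, not a real dataset sample.
payload = {
    "model": "Qwen/Qwen2-VL-7B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 128,  # the cap that the affected requests appear to ignore
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload).json()
# A healthy request should never exceed 128 here.
print(resp["usage"]["completion_tokens"])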

Below are the logs when this happens: the number of generated tokens for the decode batch is way above 128, indicating something is wrong with the stopping criteria / abort handling:

[2025-01-24 07:34:10 TP0] Decode batch. #running-req: 8, #token: 135115, token usage: 0.14, gen throughput (token/s): 620.36, #queue-req: 0
[2025-01-24 07:34:11 TP0] Decode batch. #running-req: 8, #token: 135435, token usage: 0.14, gen throughput (token/s): 619.59, #queue-req: 0
[2025-01-24 07:34:12 TP0] Decode batch. #running-req: 8, #token: 135755, token usage: 0.15, gen throughput (token/s): 619.51, #queue-req: 0
[2025-01-24 07:34:12 TP0] Decode batch. #running-req: 8, #token: 136075, token usage: 0.15, gen throughput (token/s): 619.73, #queue-req: 0
[2025-01-24 07:34:13 TP0] Decode batch. #running-req: 8, #token: 136395, token usage: 0.15, gen throughput (token/s): 619.12, #queue-req: 0
[2025-01-24 07:34:13 TP0] Decode batch. #running-req: 8, #token: 136715, token usage: 0.15, gen throughput (token/s): 618.66, #queue-req: 0
[2025-01-24 07:34:14 TP0] Decode batch. #running-req: 8, #token: 137035, token usage: 0.15, gen throughput (token/s): 616.76, #queue-req: 0
[2025-01-24 07:34:14 TP0] Decode batch. #running-req: 8, #token: 137355, token usage: 0.15, gen throughput (token/s): 616.73, #queue-req: 0
[2025-01-24 07:34:15 TP0] Decode batch. #running-req: 8, #token: 137675, token usage: 0.15, gen throughput (token/s): 615.61, #queue-req: 0
[2025-01-24 07:34:15 TP0] Decode batch. #running-req: 8, #token: 137995, token usage: 0.15, gen throughput (token/s): 615.67, #queue-req: 0
[2025-01-24 07:34:16 TP0] Decode batch. #running-req: 8, #token: 138315, token usage: 0.15, gen throughput (token/s): 614.87, #queue-req: 0
[2025-01-24 07:34:16 TP0] Decode batch. #running-req: 8, #token: 138635, token usage: 0.15, gen throughput (token/s): 615.22, #queue-req: 0
[2025-01-24 07:34:17 TP0] Decode batch. #running-req: 8, #token: 138955, token usage: 0.15, gen throughput (token/s): 615.42, #queue-req: 0
[2025-01-24 07:34:17 TP0] Decode batch. #running-req: 8, #token: 139275, token usage: 0.15, gen throughput (token/s): 614.43, #queue-req: 0
[2025-01-24 07:34:18 TP0] Decode batch. #running-req: 8, #token: 139595, token usage: 0.15, gen throughput (token/s): 613.67, #queue-req: 0
[2025-01-24 07:34:18 TP0] Decode batch. #running-req: 8, #token: 139915, token usage: 0.15, gen throughput (token/s): 612.97, #queue-req: 0
[2025-01-24 07:34:19 TP0] Decode batch. #running-req: 8, #token: 140235, token usage: 0.15, gen throughput (token/s): 610.82, #queue-req: 0
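
To quantify the overrun from the log alone, here is a rough sanity-check sketch (my own, not part of sglang or the benchmark) that estimates generated tokens per request from consecutive "Decode batch" lines. It assumes the set of running requests stays stable between lines, which holds for the stuck tail above where #running-req is pinned at 8:

import re
import sys

# Estimate tokens generated per request from consecutive "Decode batch" lines,
# assuming the running-request set does not change between lines.
pattern = re.compile(r"#running-req: (\d+), #token: (\d+)")

prev_tokens = None
per_request_total = 0.0
for line in sys.stdin:
    m = pattern.search(line)
    if not m:
        continue
    running, tokens = int(m.group(1)), int(m.group(2))
    if prev_tokens is not None and running > 0:
        per_request_total += (tokens - prev_tokens) / running
    prev_tokens = tokens

print(f"~{per_request_total:.0f} generated tokens per request across these lines")
# Feeding it the block above already gives ~640 tokens per request, far past max_tokens=128.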

Other (earlier) requests finish as expected:

[2025-01-24 07:30:37 TP0] Decode batch. #running-req: 7, #token: 6829, token usage: 0.01, gen throughput (token/s): 438.17, #queue-req: 0
[2025-01-24 07:30:37 TP0] Decode batch. #running-req: 7, #token: 7109, token usage: 0.01, gen throughput (token/s): 802.13, #queue-req: 0
[2025-01-24 07:30:37 TP0] Decode batch. #running-req: 4, #token: 5114, token usage: 0.01, gen throughput (token/s): 540.89, #queue-req: 0
[2025-01-24 07:30:38 TP0] Decode batch. #running-req: 3, #token: 5234, token usage: 0.01, gen throughput (token/s): 352.43, #queue-req: 0
[2025-01-24 07:30:38 TP0] Decode batch. #running-req: 2, #token: 4310, token usage: 0.00, gen throughput (token/s): 301.13, #queue-req: 0

Reproduction

You can run the benchmark script from this soon-to-be-merged PR: vllm-project/vllm#12389 (I'm happy to port it over to this repository as well if needed).

Server launch command:

python3 -m sglang.launch_server --model-path Qwen/Qwen2-VL-7B-Instruct --port=8000  --chat-template=qwen2-vl

Benchmark launch command:

python3 benchmark_serving.py --model Qwen/Qwen2-VL-7B-Instruct --backend openai-chat --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --num-prompts 250 --request-rate 1 --percentile-metrics ttft,tpot,e2el
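
If porting the full benchmark script is inconvenient, a stripped-down concurrent repro along these lines should surface the same symptom. This is my own sketch rather than the benchmark itself: the image URL is a placeholder instead of vision-arena samples, and the request-rate shaping is replaced with a single concurrent burst:

import asyncio

from openai import AsyncOpenAI  # openai 1.60.0 per the environment below

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder image; the real benchmark draws prompts and images from
# lmarena-ai/vision-arena-bench-v0.1.
IMAGE_URL = "https://example.com/sample.jpg"


async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                {"type": "text", "text": f"Describe this image in detail. (request {i})"},
            ],
        }],
        max_tokens=128,
    )
    used = resp.usage.completion_tokens
    if used > 128:
        print(f"request {i}: completion_tokens={used} exceeds max_tokens=128")
    return used


async def main() -> None:
    # The runaway generation showed up on the tail of the benchmark run,
    # so fire a batch of concurrent requests rather than one at a time.
    results = await asyncio.gather(*(one_request(i) for i in range(32)))
    print("max completion_tokens observed:", max(results))


asyncio.run(main())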

Environment

Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0: NVIDIA H100 80GB HBM3
GPU 0 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.103
CUDA Driver Version: 550.90.12
PyTorch: 2.5.1+cu124
sglang: 0.4.1.post7
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.1
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.7
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.60.0
anthropic: 0.45.0
decord: 0.6.0
NVIDIA Topology:
      GPU0  CPU Affinity   NUMA Affinity  GPU NUMA ID
GPU0  X     48-95,144-191  1              N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576
