Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
5. Please use English, otherwise it will be closed.
Describe the bug
Hello! I was running some online serving benchmarks with Qwen2-VL deployed with the latest release on 1 x H100. The dataset is the recently released vision arena dataset. The benchmark was conducted with the default output max_tokens=128.
For some reason, the last few requests always generate until the max context window (i.e., they do not respect max_tokens=128), and this issue is request-independent: I have tried different seeds (for shuffling) and numbers of requests to confirm this, but unfortunately I don't have the bandwidth to dig deeper into what exactly triggers this bug.
Below are the logs when this happens: the generated token count for the decode batch is way above 128, indicating something is wrong with the stopping criteria / abort operation.
Other (early) requests finish as expected; a minimal per-request check is sketched below.
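For reference, here is a minimal sketch (not part of the original report) for probing a single request. It assumes an sglang server with the OpenAI-compatible API already running at http://localhost:30000 with a dummy API key, and uses a text-only prompt; adjust the base URL, key, and prompt to your setup:

```python
# Minimal sketch: probe one request to see whether the server respects
# max_tokens. Assumes an OpenAI-compatible sglang server at localhost:30000
# (adjust as needed) serving Qwen/Qwen2-VL-7B-Instruct.
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{"role": "user", "content": "Tell me a long story."}],
    max_tokens=128,
)

used = resp.usage.completion_tokens
print(f"completion_tokens={used}, finish_reason={resp.choices[0].finish_reason}")
assert used <= 128, f"max_tokens violated: generated {used} tokens"
```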
Reproduction
You can run the benchmark script from this soon-to-be-merged PR: vllm-project/vllm#12389. (Happy to port it over to this repository as well if needed.)
Server launch command:
Benchmark launch command:
python3 benchmark_serving.py --model Qwen/Qwen2-VL-7B-Instruct --backend openai-chat --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --num-prompts 250 --request-rate 1 --percentile-metrics ttft,tpot,e2el
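Since the overruns only show up on the last few of many concurrent requests, a concurrent probe may reproduce the pattern more reliably than a single call. The following is an assumed stand-in, not the benchmark script itself: it fires N text-only requests with max_tokens=128 against the same hypothetical localhost:30000 endpoint and reports any responses that exceed the cap:

```python
# Async sketch (an assumption, not the benchmark script): send N requests
# concurrently with max_tokens=128 and flag any that overrun the cap,
# mirroring the "last few requests overrun" pattern described above.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

async def probe(i: int) -> int:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen2-VL-7B-Instruct",
        messages=[{"role": "user", "content": f"Prompt #{i}: tell me a long story."}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

async def main(n: int = 50) -> None:
    counts = await asyncio.gather(*(probe(i) for i in range(n)))
    overruns = [(i, c) for i, c in enumerate(counts) if c > 128]
    print(f"{len(overruns)}/{n} requests exceeded max_tokens=128: {overruns}")

asyncio.run(main())
```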
Environment
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
'fields' has been removed
warnings.warn(message, UserWarning)
Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0: NVIDIA H100 80GB HBM3
GPU 0 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.3, V12.3.103
CUDA Driver Version: 550.90.12
PyTorch: 2.5.1+cu124
sglang: 0.4.1.post7
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.1
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.7
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.6
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.60.0
anthropic: 0.45.0
decord: 0.6.0
NVIDIA Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 48-95,144-191 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 1048576