[Bug] ImportError: undefined symbol: cuModuleGetFunction when using lmsysorg/sglang:v0.4.1.post7-cu124 #3065
Comments
Apologies, I just realized I made a mistake. The correct image version I am using is lmsysorg/sglang:v0.4.1.post7-cu124. The issue remains the same: the ImportError: undefined symbol: cuModuleGetFunction error still occurs. Looking forward to any insights on this!
Hi @aooxin, it works well for me.
Hello, could you share your environment details? Additionally, would using a cu121 image for DeepSeek V3 have any performance impact compared to cu124? Thanks!
I use the H200. Here are the commands I used:
docker pull lmsysorg/sglang:v0.4.1.post7-cu124
docker run -itd --shm-size 32g --gpus all -v $HOME/.cache:/root/.cache --ipc=host --name sglang_test lmsysorg/sglang:v0.4.1.post7-cu124 /bin/bash
docker exec -it sglang_test /bin/bash
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct
What is the difference between v0.4.1.post7-cu124 and v0.4.1.post7-cu124-srt?
SRT means SGLang Runtime Engine; I'm not sure of the exact difference. @zhyncs
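To sanity-check the version-mismatch theory discussed above, a small ctypes probe can confirm whether the CUDA driver library visible inside the container actually exports the symbol Triton fails to resolve. This is a minimal diagnostic sketch, not from the original thread; it assumes the driver is exposed as libcuda.so.1, the usual name on Linux:

import ctypes

# Load the CUDA driver library; this raises OSError if no driver is
# mounted into the container at all.
libcuda = ctypes.CDLL("libcuda.so.1")

# cuModuleGetFunction belongs to the CUDA Driver API, so it should resolve
# directly from libcuda. ctypes resolves symbols lazily via attribute
# access and raises AttributeError when a symbol is not exported.
try:
    print("cuModuleGetFunction resolved:", libcuda.cuModuleGetFunction)
except AttributeError:
    print("cuModuleGetFunction NOT exported by libcuda.so.1")

If the symbol resolves here but Triton still fails, a stale cuda_utils.so in the Triton cache (see the traceback below) is the more likely culprit.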
Describe the bug
While using the lmsysorg/sglang:v0.4.1.post7-cu124 Docker image to launch the server, the following error occurred:
Error Log:
Thu Jan 23 11:55:50 2025[1,1]: scheduler.event_loop_overlap()
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
Thu Jan 23 11:55:50 2025[1,1]: return func(*args, **kwargs)
Thu Jan 23 11:55:50 2025[1,1]: File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 489, in event_loop_overlap
Thu Jan 23 11:55:50 2025[1,1]: batch = self.get_next_batch_to_run()
Thu Jan 23 11:55:50 2025[1,1]: File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 854, in get_next_batch_to_run
Thu Jan 23 11:55:50 2025[1,1]: new_batch = self.get_new_batch_prefill()
Thu Jan 23 11:55:50 2025[1,1]: File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 971, in get_new_batch_prefill
Thu Jan 23 11:55:50 2025[1,1]: new_batch.prepare_for_extend()
Thu Jan 23 11:55:50 2025[1,1]: File "/sgl-workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 821, in prepare_for_extend
Thu Jan 23 11:55:50 2025[1,1]: write_req_to_token_pool_triton[(bs,)](
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 345, in
Thu Jan 23 11:55:50 2025[1,1]: return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 607, in run
Thu Jan 23 11:55:50 2025[1,1]: device = driver.active.get_current_device()
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/driver.py", line 23, in getattr
Thu Jan 23 11:55:50 2025[1,1]: self._initialize_obj()
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/driver.py", line 20, in _initialize_obj
Thu Jan 23 11:55:50 2025[1,1]: self._obj = self._init_fn()
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/driver.py", line 9, in _create_driver
Thu Jan 23 11:55:50 2025[1,1]: return actives[0]()
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 371, in init
Thu Jan 23 11:55:50 2025[1,1]: self.utils = CudaUtils() # TODO: make static
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 80, in init
Thu Jan 23 11:55:50 2025[1,1]: mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 62, in compile_module_from_src
Thu Jan 23 11:55:50 2025[1,1]: mod = importlib.util.module_from_spec(spec)
Thu Jan 23 11:55:50 2025[1,1]: File "", line 571, in module_from_spec
Thu Jan 23 11:55:50 2025[1,1]: File "", line 1176, in create_module
Thu Jan 23 11:55:50 2025[1,1]: File "", line 241, in _call_with_frames_removed
Thu Jan 23 11:55:50 2025[1,1]:ImportError: /root/.triton/cache_38806/41ce1f58e0a8aa9865e66b90d58b3307bb64c5a006830e49543444faf56202fc/cuda_utils.so: undefined symbol: cuModuleGetFunction
Thu Jan 23 11:55:50 2025[1,1]:
ImportError: /root/.triton/cache_xxxxxx/41ce1f58e0a8aa9865e66b90d58b3307bb64c5a006830e49543444faf56202fc/cuda_utils.so: undefined symbol: cuModuleGetFunction
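Since the failing cuda_utils.so lives in the Triton cache (/root/.triton/cache_.../ in the log above), one workaround worth trying is to delete that cache so Triton recompiles the module against the driver library actually mounted in the container. A hedged sketch, not a confirmed fix from this thread; ~/.triton is Triton's default cache root and can be relocated via the TRITON_CACHE_DIR environment variable:

import shutil
from pathlib import Path

# The error log shows the broken module under /root/.triton/cache_38806/...
# Removing the cache directories forces Triton to rebuild cuda_utils.so on
# the next server launch.
triton_root = Path.home() / ".triton"
for cache_dir in triton_root.glob("cache*"):
    shutil.rmtree(cache_dir, ignore_errors=True)
    print(f"removed {cache_dir}")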
Launch Command:
launch_server_command = [
    "python3", "-m", "sglang.launch_server",
    "--model-path", model_name,
    "--tp", str(tp_size),
    "--dist-init-addr", dist_init_addr,
    "--nnodes", str(nnodes),
    "--node-rank", str(rank),  # rank is used directly as the node rank
    "--trust-remote-code", "--host", "0.0.0.0", "--port", str(port),
    "--enable-torch-compile", "--disable-cuda-graph",
    "--torch-compile-max-bs", "96",
    "--mem-fraction-static", "0.8",
]
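For context, here is a minimal sketch of how this argument list would be executed from a launcher script. The concrete values below (model_name, tp_size, dist_init_addr, nnodes, rank, port) are illustrative placeholders, not values taken from the report:

import subprocess

# Illustrative placeholders standing in for the reporter's configuration.
model_name = "deepseek-ai/DeepSeek-V3"
tp_size, nnodes, rank, port = 8, 2, 0, 30000
dist_init_addr = "10.0.0.1:5000"

launch_server_command = [
    "python3", "-m", "sglang.launch_server",
    "--model-path", model_name,
    "--tp", str(tp_size),
    "--dist-init-addr", dist_init_addr,
    "--nnodes", str(nnodes),
    "--node-rank", str(rank),
    "--trust-remote-code", "--host", "0.0.0.0", "--port", str(port),
    "--enable-torch-compile", "--disable-cuda-graph",
    "--torch-compile-max-bs", "96",
    "--mem-fraction-static", "0.8",
]

# check=True makes a non-zero exit from the server process raise an error.
subprocess.run(launch_server_command, check=True)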
Reproduction
command:
(same launch command as shown under "Launch Command:" above)
model:
DeepSeek-V3
Environment
python3 -m sglang.check_env
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
warnings.warn(message, UserWarning)
Python: 3.10.16 (main, Dec 4 2024, 08:53:37) [GCC 9.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: CF-NG-HZZ1-O
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.183.06
PyTorch: 2.5.1+cu124
sglang: 0.4.1.post7
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.0
torchao: 0.8.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.3
orjson: 3.10.15
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.8
anthropic: 0.43.1
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE PIX SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE PIX NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE PIX NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE PIX NODE NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS NODE PIX NODE NODE 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS PIX NODE NODE NODE 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS NODE NODE NODE PIX 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS NODE NODE PIX NODE 48-95,144-191 1 N/A
NIC0 NODE NODE NODE NODE SYS SYS SYS SYS X PIX NODE NODE NODE NODE SYS SYS SYS SYS
NIC1 NODE NODE NODE NODE SYS SYS SYS SYS PIX X NODE NODE NODE NODE SYS SYS SYS SYS
NIC2 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE X NODE NODE NODE SYS SYS SYS SYS
NIC3 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE NODE X NODE NODE SYS SYS SYS SYS
NIC4 NODE PIX NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE X NODE SYS SYS SYS SYS
NIC5 PIX NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE NODE X SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS SYS SYS X NODE NODE NODE
NIC7 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS NODE X NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS SYS SYS NODE NODE X NODE
NIC9 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS SYS SYS NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
ulimit soft: 1048576