
[Bug]ImportError: undefined symbol: cuModuleGetFunction when using lmsysorg/sglang:v0.4.1.post7-cu124 #3065

Open · help wanted (Extra attention is needed)
aooxin opened this issue Jan 23, 2025 · 6 comments

aooxin commented Jan 23, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

Description:
While using the lmsysorg/sglang:v0.4.1.post7-cu124 Docker image to launch the server, the following error occurred:

Error Log:
Thu Jan 23 11:55:50 2025[1,1]: scheduler.event_loop_overlap()
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
Thu Jan 23 11:55:50 2025[1,1]: return func(*args, **kwargs)
Thu Jan 23 11:55:50 2025[1,1]: File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 489, in event_loop_overlap
Thu Jan 23 11:55:50 2025[1,1]: batch = self.get_next_batch_to_run()
Thu Jan 23 11:55:50 2025[1,1]: File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 854, in get_next_batch_to_run
Thu Jan 23 11:55:50 2025[1,1]: new_batch = self.get_new_batch_prefill()
Thu Jan 23 11:55:50 2025[1,1]: File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 971, in get_new_batch_prefill
Thu Jan 23 11:55:50 2025[1,1]: new_batch.prepare_for_extend()
Thu Jan 23 11:55:50 2025[1,1]: File "/sgl-workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 821, in prepare_for_extend
Thu Jan 23 11:55:50 2025[1,1]: write_req_to_token_pool_triton[(bs,)](
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 345, in
Thu Jan 23 11:55:50 2025[1,1]: return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 607, in run
Thu Jan 23 11:55:50 2025[1,1]: device = driver.active.get_current_device()
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/driver.py", line 23, in getattr
Thu Jan 23 11:55:50 2025[1,1]: self._initialize_obj()
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/driver.py", line 20, in _initialize_obj
Thu Jan 23 11:55:50 2025[1,1]: self._obj = self._init_fn()
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/runtime/driver.py", line 9, in _create_driver
Thu Jan 23 11:55:50 2025[1,1]: return actives[0]()
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 371, in init
Thu Jan 23 11:55:50 2025[1,1]: self.utils = CudaUtils() # TODO: make static
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 80, in init
Thu Jan 23 11:55:50 2025[1,1]: mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
Thu Jan 23 11:55:50 2025[1,1]: File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 62, in compile_module_from_src
Thu Jan 23 11:55:50 2025[1,1]: mod = importlib.util.module_from_spec(spec)
Thu Jan 23 11:55:50 2025[1,1]: File "", line 571, in module_from_spec
Thu Jan 23 11:55:50 2025[1,1]: File "", line 1176, in create_module
Thu Jan 23 11:55:50 2025[1,1]: File "", line 241, in _call_with_frames_removed
Thu Jan 23 11:55:50 2025[1,1]:ImportError: /root/.triton/cache_38806/41ce1f58e0a8aa9865e66b90d58b3307bb64c5a006830e49543444faf56202fc/cuda_utils.so: undefined symbol: cuModuleGetFunction
Thu Jan 23 11:55:50 2025[1,1]:

ImportError: /root/.triton/cache_xxxxxx/41ce1f58e0a8aa9865e66b90d58b3307bb64c5a006830e49543444faf56202fc/cuda_utils.so: undefined symbol: cuModuleGetFunction
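For reference, cuModuleGetFunction is a CUDA driver API symbol exported by libcuda, and the failing cuda_utils.so is a helper module that Triton JIT-compiles and caches under /root/.triton. A minimal diagnostic sketch (assuming libcuda.so.1 is on the loader path and the default Triton cache location; clearing the cache forces the helper to be rebuilt and relinked on the next launch):

import ctypes
import pathlib
import shutil

# Check whether the driver library visible to this process actually exports
# the symbol that the cached cuda_utils.so failed to resolve.
libcuda = ctypes.CDLL("libcuda.so.1")
print("cuModuleGetFunction exported:", hasattr(libcuda, "cuModuleGetFunction"))

# Clear Triton's JIT cache (the /root/.triton path from the log above) so
# that cuda_utils.so is recompiled on the next server launch.
cache_root = pathlib.Path.home() / ".triton"
if cache_root.exists():
    shutil.rmtree(cache_root)
    print("removed", cache_root)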

Launch Command:
launch_server_command = [
"python3", "-m", "sglang.launch_server",
"--model-path", model_name,
"--tp", str(tp_size),
"--dist-init-addr", dist_init_addr,
"--nnodes", str(nnodes),
"--node-rank", str(rank), # rank is directly used
"--trust-remote-code", "--host", "0.0.0.0", "--port", str(port),
"--enable-torch-compile", "--disable-cuda-graph",
"--torch-compile-max-bs", "96",
"--mem-fraction-static", "0.8"
]
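For completeness, the list above is then executed from the launcher script; a minimal sketch of that step (assuming model_name, tp_size, dist_init_addr, nnodes, rank, and port are defined earlier in the script, as in the fragment above):

import subprocess

# Launch the sglang server with the command list built above; check=True
# raises an exception if the server process exits with a non-zero code.
subprocess.run(launch_server_command, check=True)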

Reproduction

command: the launch_server_command shown under Launch Command above
model: deepseek_v3

Environment

python3 -m sglang.check_env
/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_config.py:345: UserWarning: Valid config keys have changed in V2:
* 'fields' has been removed
    warnings.warn(message, UserWarning)
    Python: 3.10.16 (main, Dec 4 2024, 08:53:37) [GCC 9.4.0]
    CUDA available: True
    GPU 0,1,2,3,4,5,6,7: CF-NG-HZZ1-O
    GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.4, V12.4.131
    CUDA Driver Version: 535.183.06
    PyTorch: 2.5.1+cu124
    sglang: 0.4.1.post7
    flashinfer: 0.1.6+cu124torch2.4
    triton: 3.1.0
    transformers: 4.48.0
    torchao: 0.8.0
    numpy: 1.26.4
    aiohttp: 3.11.11
    fastapi: 0.115.6
    hf_transfer: 0.1.9
    huggingface_hub: 0.27.1
    interegular: 0.3.3
    modelscope: 1.22.3
    orjson: 3.10.15
    packaging: 24.2
    psutil: 6.1.1
    pydantic: 2.10.5
    multipart: 0.0.20
    zmq: 26.2.0
    uvicorn: 0.34.0
    uvloop: 0.21.0
    vllm: 0.6.4.post1
    openai: 1.59.8
    anthropic: 0.43.1
    decord: 0.6.0
    NVIDIA Topology:
    GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity GPU NUMA ID
    GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE NODE PIX SYS SYS SYS SYS 0-47,96-143 0 N/A
    GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE NODE PIX NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
    GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE PIX NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
    GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE PIX NODE NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
    GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS NODE PIX NODE NODE 48-95,144-191 1 N/A
    GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS PIX NODE NODE NODE 48-95,144-191 1 N/A
    GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS NODE NODE NODE PIX 48-95,144-191 1 N/A
    GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS NODE NODE PIX NODE 48-95,144-191 1 N/A
    NIC0 NODE NODE NODE NODE SYS SYS SYS SYS X PIX NODE NODE NODE NODE SYS SYS SYS SYS
    NIC1 NODE NODE NODE NODE SYS SYS SYS SYS PIX X NODE NODE NODE NODE SYS SYS SYS SYS
    NIC2 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE X NODE NODE NODE SYS SYS SYS SYS
    NIC3 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE NODE X NODE NODE SYS SYS SYS SYS
    NIC4 NODE PIX NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE X NODE SYS SYS SYS SYS
    NIC5 PIX NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE NODE X SYS SYS SYS SYS
    NIC6 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS SYS SYS X NODE NODE NODE
    NIC7 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS NODE X NODE NODE
    NIC8 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS SYS SYS NODE NODE X NODE
    NIC9 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS SYS SYS NODE NODE NODE X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9

ulimit soft: 1048576

aooxin changed the title from "[Bug]ImportError: undefined symbol: cuModuleGetFunction when using lmsysorg/sglang:v0.4.1.post6-cu124" to "[Bug]ImportError: undefined symbol: cuModuleGetFunction when using lmsysorg/sglang:v0.4.1.post7-cu124" on Jan 23, 2025
aooxin (Author) commented Jan 23, 2025

Apologies, I just realized I made a mistake. The correct image version I am using is lmsysorg/sglang:v0.4.1.post7-cu124. The issue remains the same, with the ImportError: undefined symbol: cuModuleGetFunction error occurring. Looking forward to any insights on this!

zhyncs (Member) commented Jan 23, 2025

[screenshot attached] Hi @aooxin It works well for me

aooxin (Author) commented Jan 23, 2025

[screenshot attached] Hi @aooxin It works well for me

Hello, could you share your environment details? Additionally, I wanted to ask if using a cu121 image for DeepSeek V3 would have any performance impact compared to cu124. Thanks!

zhyncs (Member) commented Jan 23, 2025

I use the H200. Here is the command I used:

docker pull lmsysorg/sglang:v0.4.1.post7-cu124
docker run -itd --shm-size 32g --gpus all -v $HOME/.cache:/root/.cache --ipc=host --name sglang_test lmsysorg/sglang:v0.4.1.post7-cu124 /bin/bash
docker exec -it sglang_test /bin/bash
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct
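If the container starts but Triton still fails, it may help to first confirm the PyTorch/CUDA pairing inside the container. A minimal check using standard PyTorch calls (the cu124 image should report a 12.4 toolkit):

import torch

print("torch:", torch.__version__)             # expected 2.5.1+cu124 in this image
print("built with CUDA:", torch.version.cuda)  # expected 12.4
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))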

zhaochenyang20 self-assigned this Jan 23, 2025
zhaochenyang20 added the help wanted (Extra attention is needed) label Jan 23, 2025
Swipe4057 commented

I use the H200. Here is the command I used:

docker pull lmsysorg/sglang:v0.4.1.post7-cu124
docker run -itd --shm-size 32g --gpus all -v $HOME/.cache:/root/.cache --ipc=host --name sglang_test lmsysorg/sglang:v0.4.1.post7-cu124 /bin/bash
docker exec -it sglang_test /bin/bash
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct

v0.4.1.post7-cu124 vs. v0.4.1.post7-cu124-srt: what is the difference between these images? What does srt mean?

zhaochenyang20 (Collaborator) commented

v0.4.1.post7-cu124 vs. v0.4.1.post7-cu124-srt: what is the difference between these images? What does srt mean?

SRT means SGLang Runtime Engine. I don't know the exact difference @zhyncs
