
Modelopt-v0.23.2 not support Qwen2.5 series LLM model? #142

Closed
white-wolf-tech opened this issue Feb 27, 2025 · 7 comments

@white-wolf-tech

When I quantize the Qwen2.5-3B model with the int8_sq algorithm using the checkpoint_convert.py script that comes with the TensorRT-LLM library (that is, their own int8_sq implementation, without the ModelOpt library), the compiled engine can be used normally by the tritonserver tensorrtllm-backend.

However, when I use the ModelOpt library with the same algorithm, the compiled engine cannot be used by the tritonserver tensorrtllm-backend. Is this because the current version does not support this model, or could there be some other problem?
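For reference, the ModelOpt side of my flow looked roughly like this (a minimal sketch, not my exact script; the config name, `decoder_type="qwen"`, and the export call follow the ModelOpt docs as I understand them and may differ across versions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint

model_dir = "Qwen/Qwen2.5-3B"  # local checkpoint path in my case
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_dir)


def calibrate(m):
    # A few prompts so ModelOpt can collect activation ranges for SmoothQuant;
    # my real run used a larger calibration set.
    for prompt in ["Hello, how are you?", "Briefly explain int8 quantization."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        with torch.no_grad():
            m(**inputs)


model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop=calibrate)

# Export a TensorRT-LLM checkpoint, then build the engine with trtllm-build
# and serve it through the tritonserver tensorrtllm-backend.
export_tensorrt_llm_checkpoint(
    model, decoder_type="qwen", dtype=torch.float16, export_dir="qwen2.5-3b-int8sq"
)
```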

@kevalmorabia97
Collaborator

What error do you see when using ModelOpt's quantized checkpoint with tritonserver?
Note that TensorRT-LLM also uses the ModelOpt library under the hood for quantization.

@white-wolf-tech
Author

The detailed situation is here:
NVIDIA/TensorRT-LLM#2810

With the same algorithm, the output is normal when using the conversion script that comes with TensorRT-LLM. However, after compiling with ModelOpt, every output token is 1023, and the decoded text is:

"xx.Componentlocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklocklock"

@cjluo-nv
Collaborator

cjluo-nv commented Feb 27, 2025

Also, have you tried the llm_ptq examples in this repo?

@white-wolf-tech
Author

> Also, have you tried the llm_ptq examples in this repo?

The result is the same.

@white-wolf-tech
Author

Are the FP8-related quantization features only available on the Hopper architecture, e.g. on the H100 or H200? Are they unsupported on Ada-architecture GPUs such as the L20 and L40S?
@cjluo-nv @kevalmorabia97

@kevalmorabia97
Collaborator

Yes, the Ada architecture supports FP8.
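If it helps, here is a quick way to check what your GPU reports (just a sketch; FP8 E4M3 tensor cores are available from SM 8.9, which covers Ada GPUs such as the L20/L40S, and SM 9.0 for Hopper H100/H200):

```python
import torch

# Ada = SM 8.9, Hopper = SM 9.0; both generations have FP8 tensor cores.
major, minor = torch.cuda.get_device_capability(0)
sm = major * 10 + minor
print(f"{torch.cuda.get_device_name(0)}: SM{sm}")
print("FP8 expected to be supported" if sm >= 89 else "FP8 not expected on this GPU")
```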

@white-wolf-tech
Author

There may have been problems with my earlier installation of the dependent libraries. After reinstalling tensorrt-llm and modelopt, the ModelOpt quantization results appear to be correct; the main issue lies with the FP8 operators. To narrow it down, I ran the following experiments:

  1. With w4a8_awq quantization, FP8 KV cache quantization off, and use_fp8_context_fmha off, the model output is normal.
  2. With FP8 KV cache quantization on and use_fp8_context_fmha off, the compiled model's output is garbled.
  3. With w4a8_awq, use_fp8_context_fmha on, and FP8 KV cache quantization off, the engine build fails with: "[TensorRT-LLM][ERROR] TllmXqaJit runtime error in tllmXqaJitCreateAndCompileProgram(&program, &context): NVRTC Internal Error".
  4. With w4a8_awq quantization, FP8 KV cache quantization on, and use_fp8_context_fmha on, quantization succeeds but the output is garbled.

Judging from these experiments, there seems to be a problem in the tensorrt-llm library's handling of FP8-related operators. Perhaps these operations are not yet supported on the L20? I may repeat the experiments on H-series GPUs later.
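For reference, the ModelOpt-side configuration I was toggling in these experiments looked roughly like the sketch below. `W4A8_AWQ_BETA_CFG` is the config name from the ModelOpt docs as I understand them; the KV-cache wildcard keys and the `build_w4a8_cfg` helper are only my illustration (loosely mirroring what the llm_ptq example does) and may not match the exact patterns in the current release. The use_fp8_context_fmha toggle is a trtllm-build flag, so it is applied at engine-build time, not here.

```python
# Sketch: w4a8_awq weight quantization with FP8 KV cache optionally enabled
# by turning on output quantizers for the attention k/v projections.
import copy

import modelopt.torch.quantization as mtq


def build_w4a8_cfg(fp8_kv_cache: bool):
    cfg = copy.deepcopy(mtq.W4A8_AWQ_BETA_CFG)
    if fp8_kv_cache:
        # num_bits=(4, 3) selects the FP8 E4M3 format for the cached K/V tensors.
        kv_quant = {"num_bits": (4, 3), "axis": None, "enable": True}
        cfg["quant_cfg"]["*k_proj.output_quantizer"] = dict(kv_quant)  # assumed key pattern
        cfg["quant_cfg"]["*v_proj.output_quantizer"] = dict(kv_quant)  # assumed key pattern
    return cfg


# Experiment 1:      model = mtq.quantize(model, build_w4a8_cfg(False), forward_loop=calibrate)
# Experiments 2 / 4: model = mtq.quantize(model, build_w4a8_cfg(True), forward_loop=calibrate)
```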
