Does ModelOpt v0.23.2 not support Qwen2.5 series LLM models? #142
Comments
What error do you see when using ModelOpt's quantized checkpoint with tritonserver?
Here is the detailed situation. With the same algorithm, when I use the conversion script that ships with TensorrtLLM, the output is normal. However, after quantizing with ModelOpt and compiling, every output token is 1023, and the decoded text is "xx.Component" followed by "lock" repeated over and over.
Have you also tried the llm_ptq examples in this repo?
The result is the same.
Are the FP8-related quantization features only available on the Hopper architecture, e.g. when using an H100 or H200? Are Ada-architecture GPUs such as the L20 and L40S not supported?
Yes, the Ada architecture supports FP8.
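For reference, FP8 Tensor Core support starts at compute capability 8.9, which covers Ada (L20/L40S, SM 8.9) as well as Hopper (H100/H200, SM 9.0). A quick sanity check on the target GPU, assuming a CUDA build of PyTorch is installed:

```python
import torch

# FP8 Tensor Cores require compute capability >= 8.9
# (Ada such as L20/L40S is SM 8.9; Hopper such as H100/H200 is SM 9.0).
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: SM {major}.{minor}, "
      f"FP8 capable: {(major, minor) >= (8, 9)}")
```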
There might have been problems with how the dependent libraries were installed before. I reinstalled tensorrt-llm and modelopt, and the quantization results from modelopt now appear to be correct. The main issue lies with the FP8 operators. To narrow this down, I ran the following experiments:
Judging from these experiments, there seems to be a problem in the tensorrt-llm library when handling FP8-related operators. Maybe these operations are not currently supported on the L20? I may run the same experiments on H-series GPUs later.
When I quantize the Qwen2.5-3B model with the int8_sq algorithm using the checkpoint_convert.py script that comes with the TensorrtLLM library (i.e., their own int8_sq implementation, without the Modelopt library), the compiled engine works fine with the tritonserver tensorrtllm-backend.
However, when I use the Modelopt library with the same algorithm, the compiled engine cannot be used by the tritonserver tensorrtllm-backend. Is this because the current version does not support this model, or could there be some other problem?
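For context, the ModelOpt path I am describing is roughly the llm_ptq flow: quantize the Hugging Face model with INT8 SmoothQuant and export a TensorRT-LLM checkpoint, which is then built with trtllm-build. A minimal sketch of that flow is below; the model path, calibration texts, and output directory are placeholders, and the config name and export helper follow the public ModelOpt examples but may differ slightly across versions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint

model_id = "Qwen/Qwen2.5-3B"  # placeholder; a local checkpoint path also works
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tiny illustrative calibration set; a real run should forward several
# hundred samples from a calibration dataset.
calib_texts = ["Hello, my name is", "The capital of France is"]

def forward_loop(m):
    # Forward the calibration data through the model so ModelOpt can
    # collect the activation statistics needed by SmoothQuant.
    with torch.no_grad():
        for text in calib_texts:
            inputs = tokenizer(text, return_tensors="pt").to("cuda")
            m(**inputs)

# Post-training quantization with INT8 SmoothQuant (the "int8_sq" format).
model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)

# Export a TensorRT-LLM checkpoint; the engine is then built with trtllm-build.
with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,
        decoder_type="qwen",
        dtype=torch.float16,
        export_dir="qwen2.5-3b-int8_sq-ckpt",
    )
```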