Checklist
Motivation
Hi,
I was using vLLM for inference on an A10 GPU, which does not have w8a8 FP8 support. But when I run (without quantizing the model beforehand)
./vllm_docker.sh meta-llama/Llama-3.1-8B-Instruct --quantization fp8
the server starts with the warning:

Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
I am fine with the performance of w8a16, as my model does not degrade much at this quantization level. Is there a way to achieve the same in SGLang?
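For reference, this is roughly what I would like to be able to run. This is only a sketch: I am assuming SGLang's launch_server accepts a --quantization fp8 flag and that it could fall back to weight-only FP8 (w8a16, e.g. via the Marlin kernel) on GPUs without native FP8 compute, which is exactly the behavior I am asking about:

python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --quantization fp8 --port 30000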
Thanks
Related resources
No response