[Feature] FP8 weight only w8a16 quantization native support #3007

arunpatala · 2025-01-20T10:50:31Z

Checklist

1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
2. Please use English, otherwise it will be closed.

Motivation

Hi,

I was using vLLM for inference on an A10 GPU, which doesn't have w8a8 FP8 support. But when I run the following (without quantizing the model beforehand):

./vllm_docker.sh meta-llama/Llama-3.1-8B-Instruct --quantization fp8

the server starts with

Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
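For readers unfamiliar with the term, the idea behind weight-only FP8 (w8a16) can be sketched in a few lines. This is only a pure-Python simulation under stated assumptions: it is not the Marlin kernel, not vLLM's or SGLang's code, and the names `round_to_e4m3`, `quantize_weights`, and `w8a16_linear` are hypothetical. It assumes per-tensor scaling and the E4M3 format (3 mantissa bits, max finite value 448); real kernels keep activations in FP16/BF16, whereas here they are plain Python floats.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value representable in FP8 E4M3 (simulated)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)      # saturate instead of overflowing
    _, e = math.frexp(a)           # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)           # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)        # 3 mantissa bits: 8 steps per power of two
    return sign * round(a / step) * step


def quantize_weights(w):
    """Per-tensor weight-only quantization: returns (fp8_values, scale)."""
    amax = max(abs(v) for row in w for v in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    q = [[round_to_e4m3(v / scale) for v in row] for row in w]
    return q, scale


def w8a16_linear(x, q, scale):
    """Compute y = W @ x, dequantizing W on the fly; x is never quantized."""
    return [sum(xi * (qi * scale) for xi, qi in zip(x, row)) for row in q]


# Hypothetical 2x3 weight matrix and a 3-element activation vector.
weights = [[0.5, -1.25, 0.03], [2.0, 0.7, -0.9]]
q, scale = quantize_weights(weights)
y = w8a16_linear([1.0, 2.0, 3.0], q, scale)
```

Only the stored weights shrink to 8 bits; the arithmetic still runs at the activation precision, which is why this mode saves memory but not compute.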

I am OK with the performance gains of w8a16, as my model doesn't degrade much at this quantization level. Is there a way to achieve the same in SGLang?

Thanks

Related resources

No response


zhaochenyang20 · 2025-01-21T18:47:43Z

I've asked fan for help. Stay tuned!

zhaochenyang20 added the quant LLM Quantization label Jan 21, 2025

zhaochenyang20 self-assigned this Jan 21, 2025
