Restore functionality: lm_head option to disable quantization #138

Open

michaelfeil opened this issue Feb 20, 2025 · 4 comments

michaelfeil commented Feb 20, 2025

Release notes of 0.23.0:

Allows exporting an lm_head-quantized TensorRT-LLM checkpoint. Quantizing lm_head could benefit smaller models at a potential cost of additional accuracy loss.

Until 0.21.0:
The default behavior was no lm_head quantization. Quantizing lm_head was not possible without monkey-patching the config, which had an explicit option to disable lm_head quantization.

Starting from 0.23.0:
lm_head quantization is forced, with no option to disable it; this is a breaking change. The only way to avoid it is to not update to tensorrt-llm 0.17.0.post1, or to not use quantization at all.
This makes the following warning unavoidable; there is no setting to suppress it:

UserWarning: Enable lm_head quantization. lm_head quantization may lead to additional accuracy loss.

# Solution
Please expose an option, defaulting to:

disable_lm_head_quantization = False
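
A minimal sketch of what the requested opt-out could look like; disable_lm_head_quantization is a hypothetical parameter name that does not exist in the current quantize_and_export signature, and the argument values are placeholders:

from tensorrt_llm.quantization import quantize_and_export

quantize_and_export(
    model_dir="path/to/hf_checkpoint",  # placeholder path
    device="cuda",
    dtype="auto",
    qformat="fp8",                      # example format, assumption
    output_dir="./dummy",
    # hypothetical flag requested by this issue; True would keep lm_head
    # unquantized, and the proposed default of False preserves current behavior
    disable_lm_head_quantization=True,
    # ... remaining arguments as in the repro later in this thread ...
)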

@kevalmorabia97
Collaborator

@cjluo-nv can you please take a look at this?

@cjluo-nv
Collaborator

There is research on lm_head quantization, so we updated our export code to support it. There is a similar change in TRT-LLM to allow a quantized lm_head as well. However, the default behavior still leaves lm_head quantization disabled (https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/quantization/config.py#L279). Is your question specific to diffusion models or LLMs?
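
For reference, a hedged sketch of how that opt-out looks when calling ModelOpt directly instead of going through tensorrt_llm's quantize_and_export; the wildcard key and config name below are assumptions that may differ between ModelOpt releases, so check modelopt/torch/quantization/config.py:

import copy

import modelopt.torch.quantization as mtq

# Start from a stock config (FP8 chosen as an example); the stock configs
# referenced above already carry a wildcard entry disabling lm_head quantization.
quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)

# Explicitly keep/restore the opt-out (assumed pattern key, see config.py above).
quant_cfg["quant_cfg"]["*lm_head*"] = {"enable": False}

# 'model' and 'calibrate_loop' would come from the caller's context (not shown):
# model = mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)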

@michaelfeil
Author

michaelfeil commented Feb 22, 2025

Thanks for the fast response!

@cjluo-nv we are using 0.17.0; the default behavior there is to enable lm_head quantization.

e.g.

tensorrt_llm = 0.17.0.post1
nvidia-modelopt = 0.23.2

repro:

from tensorrt_llm.quantization import quantize_and_export

quantize_and_export(
    model_dir=str(self.checkpoint_dir),  # git lfs clone into self.checkpoint_dir
    device="cuda",
    dtype="auto",
    qformat=self.quant_config.get_modelopt_qformat(),
    kv_cache_dtype=None,
    calib_size=64,
    batch_size=1,
    calib_max_seq_length=42,
    awq_block_size=128,
    output_dir="./dummy",
    tp_size=1,
    pp_size=1,
    cp_size=1,
    seed=1234,
    tokenizer_max_seq_length=42,
    max_draft_len=1,
)

will print:

UserWarning: Enable lm_head quantization. lm_head quantization may lead to additional accuracy loss.

It seems the code path above is hit.

@cjluo-nv
Collaborator

Have you tried using the hf_ptq.py script in the llm_ptq example?
