Restore functionality: lm_head option to disable quantization #138

Open

michaelfeil opened this issue Feb 20, 2025 · 4 comments

michaelfeil commented Feb 20, 2025

Release notes of 0.23.0:

Allows exporting an lm_head-quantized TensorRT-LLM checkpoint. Quantizing lm_head could benefit smaller models at a potential cost of additional accuracy loss.

Until 0.21.0:
The default behavior was no lm_head quantization. Quantizing lm_head was not possible without monkey-patching the config, which had an explicit option to disable lm_head quantization.

Starting from 0.23.0:
lm_head quantization is forced, with no option to disable it; this is a breaking change. The only way to avoid it is to not update to tensorrt-llm 0.17.0.post1, or to not use quantization at all.
This makes the following warning unavoidable; there is no setting to suppress it:

UserWarning: Enable lm_head quantization. lm_head quantization may lead to additional accuracy loss.

# Solution
Please expose an option, defaulting to:

disable_lm_head_quantization = False
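
A minimal sketch of what the requested opt-out could look like; disable_lm_head_quantization is a hypothetical parameter name that does not exist in the current quantize_and_export signature, and the argument values are placeholders:

from tensorrt_llm.quantization import quantize_and_export

quantize_and_export(
    model_dir="path/to/hf_checkpoint",  # placeholder path
    device="cuda",
    dtype="auto",
    qformat="fp8",                      # example format, assumption
    output_dir="./dummy",
    # hypothetical flag requested by this issue; True would keep lm_head
    # unquantized, and the proposed default of False preserves current behavior
    disable_lm_head_quantization=True,
    # ... remaining arguments as in the repro later in this thread ...
)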

@kevalmorabia97
Collaborator

@cjluo-nv can you please take a look at this?

@cjluo-nv
Collaborator

There is research on lm_head quantization, so we updated our export code to support it. There is a similar change in TRT-LLM to allow a quantized lm_head as well. However, the default behavior still leaves lm_head quantization disabled (https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/quantization/config.py#L279). Is your question specific to diffusion models or LLMs?
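
For reference, a hedged sketch of how that opt-out looks when calling ModelOpt directly instead of going through tensorrt_llm's quantize_and_export; the wildcard key and config name below are assumptions that may differ between ModelOpt releases, so check modelopt/torch/quantization/config.py:

import copy

import modelopt.torch.quantization as mtq

# Start from a stock config (FP8 chosen as an example); the stock configs
# referenced above already carry a wildcard entry disabling lm_head quantization.
quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)

# Explicitly keep/restore the opt-out (assumed pattern key, see config.py above).
quant_cfg["quant_cfg"]["*lm_head*"] = {"enable": False}

# 'model' and 'calibrate_loop' would come from the caller's context (not shown):
# model = mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop)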

@michaelfeil
Author

michaelfeil commented Feb 22, 2025

Thanks for the fast response!

@cjluo-nv we are using 0.17.0; the default behavior there is to enable lm_head quantization.

e.g.

tensorrt_llm = 0.17.0.post1
nvidia-modelopt = 0.23.2

repro:

from tensorrt_llm.quantization import quantize_and_export

quantize_and_export(
    model_dir=str(self.checkpoint_dir),  # git lfs clone into self.checkpoint_dir
    device="cuda",
    dtype="auto",
    qformat=self.quant_config.get_modelopt_qformat(),
    kv_cache_dtype=None,
    calib_size=64,
    batch_size=1,
    calib_max_seq_length=42,
    awq_block_size=128,
    output_dir="./dummy",
    tp_size=1,
    pp_size=1,
    cp_size=1,
    seed=1234,
    tokenizer_max_seq_length=42,
    max_draft_len=1,
)

will print:

UserWarning: Enable lm_head quantization. lm_head quantization may lead to additional accuracy loss.

It seems the code path above is hit.

@cjluo-nv
Collaborator

Have you tried using the hf_ptq.py script in the llm_ptq example?
