Restore functionality: lm_head option to disable quantization #138
Comments
@cjluo-nv can you please take a look at this?
There is research on lm_head quantization, so we updated our export code to support quantizing lm_head. There is a similar change in TRT LLM to allow a quantized lm_head as well. The default behavior, however, still leaves lm_head quantization disabled (https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/quantization/config.py#L279). Is your question specific to diffusion models or to LLMs?
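For context, a minimal sketch of how the lm_head entry can be toggled in a ModelOpt quantization config. This assumes the predefined config dicts (e.g. `INT4_AWQ_CFG`) carry a `"*lm_head*"` pattern with an `"enable"` flag, as suggested by the linked config.py; the exact keys may differ between releases.

```python
# Minimal sketch: keep lm_head unquantized by toggling the quant config entry.
# Assumption: the predefined ModelOpt configs expose a "*lm_head*" pattern with
# an "enable" flag, as in the linked config.py; names may vary by release.
import copy

import modelopt.torch.quantization as mtq

cfg = copy.deepcopy(mtq.INT4_AWQ_CFG)

# Default behavior: lm_head stays in its original precision.
cfg["quant_cfg"]["*lm_head*"] = {"enable": False}

# To experiment with a quantized lm_head instead, drop the override:
# cfg["quant_cfg"].pop("*lm_head*", None)

# model = mtq.quantize(model, cfg, forward_loop=calibration_loop)
```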
Thanks for the fast response! @cjluo-nv We are using 0.17.0; the default behavior there is shown by the repro below.
Repro:

```python
from tensorrt_llm.quantization import quantize_and_export

quantize_and_export(
    model_dir=str(self.checkpoint_dir),  # git lfs clone into self.checkpoint_dir
    device="cuda",
    dtype="auto",
    qformat=self.quant_config.get_modelopt_qformat(),
    kv_cache_dtype=None,
    calib_size=64,
    batch_size=1,
    calib_max_seq_length=42,
    awq_block_size=128,
    output_dir="./dummy",
    tp_size=1,
    pp_size=1,
    cp_size=1,
    seed=1234,
    tokenizer_max_seq_length=42,
    max_draft_len=1,
)
```

This will print the lm_head quantization warning.
It seems the code path above is hit.
Have you tried using the hf_ptq.py script in the llm_ptq example?
Release notes of 0.23.0:
Until 0.21.0:
The default behavior is no lm_head quantization. lm_head quantization is not possible without monkey-patching the config here:
TensorRT-Model-Optimizer/examples/diffusers/quantization/config.py, line 22 (commit 25090b0)
Starting from 0.23.0:
lm_head quantization is forced, with no option to disable it; this is a breaking change. The only way to avoid it is to not update to tensorrt-llm 0.17.0.post1, or to not use quantization at all.
This makes the warning shown above inevitable; there is no setting to turn it off.
# Solution
Please restore an option to disable lm_head quantization (and keep it disabled by default) in:
TensorRT-Model-Optimizer/modelopt/torch/export/postprocess.py, line 698 (commit 25090b0)
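For illustration only, a possible shape for such an option. `postprocess_state_dict`, `quantize_lm_head`, and `_quantize_tensor` are hypothetical names sketched here to show the requested knob; they do not reflect the actual code at line 698.

```python
# Hypothetical sketch of the requested knob; names and structure are
# illustrative only and do not reflect the real postprocess.py implementation.
from typing import Dict

import torch


def _quantize_tensor(t: torch.Tensor) -> torch.Tensor:
    """Stand-in for the real quantization path (illustration only)."""
    return t


def postprocess_state_dict(
    state_dict: Dict[str, torch.Tensor], *, quantize_lm_head: bool = False
) -> Dict[str, torch.Tensor]:
    """Export-time postprocessing with an opt-in switch for lm_head quantization."""
    out = {}
    for name, tensor in state_dict.items():
        if "lm_head" in name and not quantize_lm_head:
            # Pre-0.23.0 default: leave lm_head in its original precision.
            out[name] = tensor
        else:
            out[name] = _quantize_tensor(tensor)
    return out
```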