
ModelOpt 0.23.0 - First OSS Release!

29 Jan 19:05

Backward Breaking Changes

  • NVIDIA TensorRT Model Optimizer has changed its license from NVIDIA Proprietary (library wheel) and MIT (examples) to Apache 2.0 in this first full OSS release.
  • Deprecated Python 3.8, Torch 2.0, and CUDA 11.x support.
  • ONNX Runtime dependency upgraded to 1.20, which no longer supports Python 3.9.
  • In the Hugging Face examples, trust_remote_code now defaults to false; users must explicitly enable it with the --trust_remote_code flag (see the sketch after this list).
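
As a hedged illustration, opting in might look like the following invocation of the llm_ptq example. Only the --trust_remote_code flag is taken from this release note; the script name and the other arguments are assumptions and may differ from the shipped example.

```sh
# Hypothetical llm_ptq invocation; only --trust_remote_code is confirmed by this
# release note, the script name and other flags are illustrative assumptions.
python hf_ptq.py \
    --pyt_ckpt_path <path/to/hf/checkpoint> \
    --qformat fp8 \
    --trust_remote_code   # explicit opt-in now required to run model-provided code
```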

New Features

  • Added OCP Microscaling Formats (MX) for fake quantization support, including FP8 (E5M2, E4M3), FP6 (E3M2, E2M3), FP4, and INT8.
  • Added NVFP4 quantization support for NVIDIA Blackwell GPUs along with updated examples (see the sketch after this list).
  • Allows exporting TensorRT-LLM checkpoints with a quantized lm_head. Quantizing lm_head can benefit smaller models at the potential cost of additional accuracy loss.
  • TensorRT-LLM now supports MoE FP8 and w4a8_awq inference on SM89 (Ada) GPUs.
  • New model support in the llm_ptq example: Llama 3.3 and Phi 4.
  • Added Minitron pruning support for NeMo 2.0 GPT models.
  • Excluded modules in TensorRT-LLM export configs are now specified as wildcards.
  • The unified Llama 3.1 FP8 Hugging Face checkpoints can be deployed with SGLang.
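
As a rough sketch of the new NVFP4 path, a calibrate-and-quantize flow with modelopt.torch.quantization might look like the code below. The model name, the calibration loop, and the NVFP4_DEFAULT_CFG config name are assumptions for illustration; consult the llm_ptq example for the supported configs and end-to-end export steps.

```python
# Minimal sketch of NVFP4 fake quantization with ModelOpt; names marked as
# assumptions may differ from the shipped examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # assumption: HF model source

import modelopt.torch.quantization as mtq

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any HF causal LM
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

def forward_loop(model):
    # Run a few calibration batches so activation ranges can be collected.
    prompts = ["The quick brown fox", "TensorRT Model Optimizer"]
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# NVFP4_DEFAULT_CFG is assumed to be the Blackwell NVFP4 recipe mentioned in this release.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```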