ModelOpt 0.23.0 - First OSS Release!
Backward Breaking Changes
- NVIDIA TensorRT Model Optimizer has changed its license from NVIDIA Proprietary (library wheel) and MIT (examples) to Apache 2.0 in this first full OSS release.
- Deprecated Python 3.8, Torch 2.0, and CUDA 11.x support.
- ONNX Runtime dependency upgraded to 1.20, which no longer supports Python 3.9.
- In the Hugging Face examples, `trust_remote_code` now defaults to false; users must explicitly enable it with the `--trust_remote_code` flag (see the sketch after this list).
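For illustration only, a minimal sketch of what the flag governs when the examples load a Hugging Face model. The model id and the surrounding script are assumptions rather than the actual example code; the flag ultimately maps to the `trust_remote_code` argument of the `transformers` loaders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id standing in for any checkpoint whose repo ships custom modeling code.
model_id = "some-org/custom-architecture-model"

# With the new default (no --trust_remote_code flag), custom repo code is NOT executed,
# so loading a model that requires it will raise an error.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)

# Passing --trust_remote_code in the examples corresponds to trust_remote_code=True here,
# which allows the repository's custom modeling code to run.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```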
New Features
- Added OCP Microscaling Formats (MX) for fake quantization support, including FP8 (E5M2, E4M3), FP6 (E3M2, E2M3), FP4, INT8.
- Added NVFP4 quantization support for NVIDIA Blackwell GPUs along with updated examples (see the sketch after this list).
- Allows exporting TensorRT-LLM checkpoints with a quantized lm_head. Quantizing lm_head can benefit smaller models at the potential cost of additional accuracy loss.
- TensorRT-LLM now supports MoE FP8 and w4a8_awq inference on SM89 (Ada) GPUs.
- New model support in the `llm_ptq` example: Llama 3.3, Phi 4.
- Added Minitron pruning support for NeMo 2.0 GPT models.
- Exclude modules in TensorRT-LLM export configs now support wildcards.
- The unified Llama 3.1 FP8 Hugging Face checkpoints can be deployed on SGLang.
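For the NVFP4 item above, a minimal post-training quantization sketch using the `modelopt.torch.quantization` API. The model id, the one-sample calibration loop, and the `NVFP4_DEFAULT_CFG` config name are assumptions made for illustration; refer to the updated `llm_ptq` example for the supported workflow.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq

# Any Hugging Face causal LM; the model id here is only an illustration.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tiny stand-in calibration loop; the real examples calibrate on a proper dataset.
def forward_loop(m):
    inputs = tokenizer("Calibration sample for NVFP4 PTQ.", return_tensors="pt").to(m.device)
    m(**inputs)

# Fake-quantize the model to NVFP4 (assumed config name: NVFP4_DEFAULT_CFG),
# then export/deploy with TensorRT-LLM as in the llm_ptq example.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```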