
ModelOpt 0.23.0 - First OSS Release!

29 Jan 19:05

Backward Breaking Changes

  • NVIDIA TensorRT Model Optimizer has changed its license from NVIDIA Proprietary (library wheel) and MIT (examples) to Apache 2.0 in this first full OSS release.
  • Deprecated Python 3.8, Torch 2.0, and CUDA 11.x support.
  • ONNX Runtime dependency upgraded to 1.20, which no longer supports Python 3.9.
  • In the Hugging Face examples, trust_remote_code now defaults to false; users must explicitly enable it with the --trust_remote_code flag (see the sketch after this list).
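
As a hedged illustration, opting in might look like the following invocation of the llm_ptq example. Only the --trust_remote_code flag is taken from this release note; the script name and the other arguments are assumptions and may differ from the shipped example.

```sh
# Hypothetical llm_ptq invocation; only --trust_remote_code is confirmed by this
# release note, the script name and other flags are illustrative assumptions.
python hf_ptq.py \
    --pyt_ckpt_path <path/to/hf/checkpoint> \
    --qformat fp8 \
    --trust_remote_code   # explicit opt-in now required to run model-provided code
```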

New Features

  • Added OCP Microscaling Formats (MX) for fake quantization support, including FP8 (E5M2, E4M3), FP6 (E3M2, E2M3), FP4, and INT8.
  • Added NVFP4 quantization support for NVIDIA Blackwell GPUs along with updated examples (see the sketch after this list).
  • Allows exporting TensorRT-LLM checkpoints with a quantized lm_head. Quantizing lm_head can benefit smaller models at the potential cost of additional accuracy loss.
  • TensorRT-LLM now supports MoE FP8 and w4a8_awq inference on SM89 (Ada) GPUs.
  • New model support in the llm_ptq example: Llama 3.3 and Phi 4.
  • Added Minitron pruning support for NeMo 2.0 GPT models.
  • Excluded modules in TensorRT-LLM export configs are now specified as wildcards.
  • The unified Llama 3.1 FP8 Hugging Face checkpoints can be deployed with SGLang.
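
As a rough sketch of the new NVFP4 path, a calibrate-and-quantize flow with modelopt.torch.quantization might look like the code below. The model name, the calibration loop, and the NVFP4_DEFAULT_CFG config name are assumptions for illustration; consult the llm_ptq example for the supported configs and end-to-end export steps.

```python
# Minimal sketch of NVFP4 fake quantization with ModelOpt; names marked as
# assumptions may differ from the shipped examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # assumption: HF model source

import modelopt.torch.quantization as mtq

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any HF causal LM
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

def forward_loop(model):
    # Run a few calibration batches so activation ranges can be collected.
    prompts = ["The quick brown fox", "TensorRT Model Optimizer"]
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# NVFP4_DEFAULT_CFG is assumed to be the Blackwell NVFP4 recipe mentioned in this release.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```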