Update documentation for 0.15.0 release
kevalmorabia97 committed Jul 26, 2024
1 parent 822d7c6 commit 6de9560
Showing 277 changed files with 11,577 additions and 2,729 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 89ada319c94fcb1610b7f80d777e8b12
config: 0ea2334c76c1e774d577e20446a79224
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file modified .doctrees/deployment/1_tensorrt_llm_deployment.doctree
Binary file modified .doctrees/environment.pickle
Binary file modified .doctrees/examples/0_all_examples.doctree
Binary file modified .doctrees/getting_started/1_overview.doctree
Binary file modified .doctrees/getting_started/2_installation.doctree
Binary file modified .doctrees/getting_started/3_quantization.doctree
Binary file added .doctrees/getting_started/5_distillation.doctree
Binary file modified .doctrees/getting_started/6_sparsity.doctree
Binary file modified .doctrees/guides/1_quantization.doctree
Binary file added .doctrees/guides/4_distillation.doctree
Binary file modified .doctrees/guides/5_sparsity.doctree
Binary file modified .doctrees/guides/_basic_quantization.doctree
Binary file modified .doctrees/guides/_onnx_quantization.doctree
Binary file modified .doctrees/guides/_pytorch_quantization.doctree
Binary file modified .doctrees/index.doctree
Binary file modified .doctrees/reference/0_versions.doctree
Binary file modified .doctrees/reference/generated/modelopt.deploy.doctree
Binary file modified .doctrees/reference/generated/modelopt.deploy.llm.doctree
Binary file modified .doctrees/reference/generated/modelopt.onnx.op_types.doctree
Binary file modified .doctrees/reference/generated/modelopt.onnx.quantization.doctree
Binary file modified .doctrees/reference/generated/modelopt.onnx.utils.doctree
Binary file modified .doctrees/reference/generated/modelopt.torch.doctree
Binary file modified .doctrees/reference/generated/modelopt.torch.export.doctree
Binary file modified .doctrees/reference/generated/modelopt.torch.opt.hparam.doctree
Binary file modified .doctrees/reference/generated/modelopt.torch.opt.utils.doctree
Binary file modified .doctrees/support/1_contact.doctree
Binary file modified .doctrees/support/2_faqs.doctree
Binary file modified (additional .doctrees/reference/generated/* files; filenames not shown in this view)
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
.doctrees/environment.pickle filter=lfs diff=lfs merge=lfs -text
27 changes: 16 additions & 11 deletions _sources/deployment/1_tensorrt_llm_deployment.rst.txt
@@ -90,50 +90,55 @@ If the :meth:`export_tensorrt_llm_checkpoint <modelopt.torch.export.model_config
- Yes
- Yes
- No
* - Falcon RW 1B, 7B
* - MPT 7B, 30B
- Yes
- Yes
- Yes
- Yes
* - MPT 7B, 30B
* - Baichuan 1, 2
- Yes
- Yes
- Yes
- Yes
* - Baichuan 1, 2
* - ChatGLM2, 3 6B
- Yes
- No
- No
- Yes
* - Bloom
- Yes
- Yes
* - Qwen 7B, 14B
- Yes
- Yes
* - Phi-1, 2, 3
- Yes
- Yes
- Yes
* - ChatGLM2, 3 6B
- Yes
* - Nemotron 8
- Yes
- Yes
- No
- Yes
* - Bloom
* - Gemma 2B, 7B
- Yes
- Yes
- No
- Yes
* - Recurrent Gemma
- Yes
* - Phi-1, 2, 3
- Yes
- Yes
- Yes
* - StarCoder 2
- Yes
* - Nemotron 8
- Yes
- Yes
- No
- Yes
* - Gemma 2B, 7B
* - Qwen-1, 1.5
- Yes
- Yes
- Yes
- No
- Yes

Convert to TensorRT-LLM
8 changes: 4 additions & 4 deletions _sources/examples/0_all_examples.rst.txt
@@ -1,5 +1,5 @@
All ModelOpt Examples
=====================
GitHub Examples
===============

Please visit the `TensorRT-Model-Optimizer GitHub repository <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_
for all ModelOpt examples.
All examples can be accessed from the ModelOpt GitHub repository at
`github.com/NVIDIA/TensorRT-Model-Optimizer <https://github.com/NVIDIA/TensorRT-Model-Optimizer/>`_.
19 changes: 11 additions & 8 deletions _sources/getting_started/1_overview.rst.txt
@@ -7,16 +7,16 @@ Overview
Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size.
The `NVIDIA TensorRT Model Optimizer <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_ (referred to as Model Optimizer, or ModelOpt)
is a library comprising state-of-the-art model optimization techniques including quantization and sparsity to compress models.
It accepts a torch or ONNX model as inputs and provides Python APIs for users to easily stack different model optimization
techniques to produce quantized checkpoint. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized
It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization
techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized
checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like
`TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_.
Further integrations are planned for `NVIDIA NeMo <https://github.com/NVIDIA/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_
for training-in-the-loop optimization techniques. For enterprise users, 8-bit quantization with Stable Diffusion is also available on
`NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.

Model Optimizer is available for free for all developers on `NVIDIA PyPI <https://pypi.org/project/nvidia-modelopt/>`_.
Visit `/NVIDIA/TensorRT-Model-Optimizer repository <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_ for end-to-end
Visit the `TensorRT Model Optimizer GitHub repository <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_ for end-to-end
example scripts and recipes optimized for NVIDIA GPUs.

Techniques
@@ -34,8 +34,11 @@ for list of formats supported.
Sparsity
^^^^^^^^
Sparsity is a technique to further reduce the memory footprint of deep learning models and accelerate inference.
Model Optimizer provides Python API :meth:`mts.sparsify() <modelopt.torch.sparsity.sparsification.sparsify>` to apply
weight sparsity to a given model. The ``mts.sparsify()`` API supports `NVIDIA 2:4 <https://arxiv.org/pdf/2104.0837>`_
sparsity pattern and various sparsification methods, such as NVIDIA `ASP <https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity>`_
and `SparseGPT <https://arxiv.org/abs/2301.00774>`_. It supports both post-training sparsity and sparsity with fine-tuning.
The latter workflow is recommended to minimize accuracy degradation.
Model Optimizer provides the Python API :meth:`mts.sparsify() <modelopt.torch.sparsity.sparsification.sparsify>` to
automatically apply weight sparsity to a given model. The
:meth:`mts.sparsify() <modelopt.torch.sparsity.sparsification.sparsify>` API supports the
`NVIDIA 2:4 <https://arxiv.org/pdf/2104.08378>`_ sparsity pattern and various sparsification methods,
such as `NVIDIA ASP <https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity>`_ and
`SparseGPT <https://arxiv.org/abs/2301.00774>`_. It supports both post-training sparsity (PTS) and
sparsity-aware training (SAT). The latter workflow is recommended to minimize accuracy
degradation.
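
For example, a minimal post-training sparsification sketch (the ``sparse_magnitude`` mode name
is an assumption here; data-driven methods like SparseGPT additionally require calibration data,
as shown in the quick-start guide):

.. code-block:: python

    import modelopt.torch.sparsity as mts

    # Data-free post-training sparsity via magnitude-based pruning
    # (mode name assumed; see the sparsity guide for the supported modes)
    model = mts.sparsify(model, mode="sparse_magnitude")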
35 changes: 20 additions & 15 deletions _sources/getting_started/2_installation.rst.txt
@@ -7,17 +7,19 @@ System requirements

Model Optimizer (``nvidia-modelopt``) currently has the following system requirements:

+----------------------+-----------------------------+
| OS | Linux, Windows |
+----------------------+-----------------------------+
| Architecture | x86_64, aarch64, win_amd64 |
+----------------------+-----------------------------+
| Python | >=3.8,<3.12 |
+----------------------+-----------------------------+
| PyTorch | >=1.11 |
+----------------------+-----------------------------+
| CUDA | >=11.8 (Recommended) |
+----------------------+-----------------------------+
+-------------------------+-----------------------------+
| OS | Linux |
+-------------------------+-----------------------------+
| Architecture | x86_64 |
+-------------------------+-----------------------------+
| Python | >=3.8,<3.13 |
+-------------------------+-----------------------------+
| CUDA | >=11.8 (Recommended) |
+-------------------------+-----------------------------+
| PyTorch (Optional) | >=1.11 |
+-------------------------+-----------------------------+
| TensorRT-LLM (Optional) | 0.11 |
+-------------------------+-----------------------------+

Install Model Optimizer
=======================
@@ -34,11 +36,11 @@ license terms of ModelOpt and any dependencies before use.
**Setting up a virtual environment**

We recommend setting up a virtual environment if you don't have one already. Run the following
command to set up and activate a ``conda`` virtual environment named ``modelopt`` with Python 3.11:
command to set up and activate a ``conda`` virtual environment named ``modelopt`` with Python 3.12:

.. code-block:: bash

    conda create -n modelopt python=3.11 pip
    conda create -n modelopt python=3.12 pip

.. code-block:: bash
@@ -89,11 +91,14 @@ license terms of ModelOpt and any dependencies before use.
* - ``transformers`` (Huggingface)
- ``[hf]``

If you want to install only a subset of the dependencies, replace ``[all]`` with the desired
optional dependencies in the ``pip`` install command below.

**Install Model Optimizer** (``nvidia-modelopt``)

.. code-block:: bash

    pip install "nvidia-modelopt[all]" --no-cache-dir --extra-index-url https://pypi.nvidia.com
    pip install "nvidia-modelopt[all]" --extra-index-url https://pypi.nvidia.com
Check installation
==================
@@ -103,7 +108,7 @@ Check installation
When you use ModelOpt's PyTorch quantization APIs for the first time, it will compile the fast quantization kernels
using your installed torch and CUDA if available.
This may take a few minutes but subsequent quantization calls will be much faster.
To invoke the compilation now and check if it is successful, run the following command:
To invoke the compilation and check that it succeeds, or to pre-compile the kernels for Docker builds, run the following command:

.. code-block:: bash
10 changes: 5 additions & 5 deletions _sources/getting_started/3_quantization.rst.txt
@@ -9,8 +9,8 @@ Quantization is an effective technique to reduce the memory footprint of deep le
accelerate the inference speed.

ModelOpt's :meth:`mtq.quantize() <modelopt.torch.quantization.model_quant.quantize>` API enables
users to quantize a model with advanced algorithms like SmoothQuant, AWQ etc. ModelOpt supports both
Post Training Quantization (PTQ) and Quantization Aware Training (QAT).
users to quantize a model with advanced algorithms like SmoothQuant, AWQ, and more. ModelOpt
supports both Post Training Quantization (PTQ) and Quantization Aware Training (QAT).

.. tip::

@@ -21,7 +21,7 @@ PTQ for PyTorch models
-----------------------------

:meth:`mtq.quantize <modelopt.torch.quantization.model_quant.quantize>` requires the model,
the appropriate quantization configuration and a forward loop as inputs. Here is a quick example of
the appropriate quantization configuration, and a forward loop as inputs. Here is a quick example of
quantizing a model with int8 SmoothQuant using
:meth:`mtq.quantize <modelopt.torch.quantization.model_quant.quantize>`:

@@ -55,8 +55,8 @@ Deployment
The quantized model is just like a regular PyTorch model and is ready for evaluation or deployment.

Hugging Face or NeMo LLM models can be exported to TensorRT-LLM using ModelOpt.
Please see :doc:`TensorRT-LLM Deployment <../deployment/1_tensorrt_llm_deployment>` guide for more
details.
Please see the :doc:`TensorRT-LLM Deployment <../deployment/1_tensorrt_llm_deployment>` guide for
more details.

The model can also be exported to ONNX using
`torch.onnx.export <https://pytorch.org/docs/stable/onnx_torchscript.html#torch.onnx.export>`_.
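For instance, a minimal export sketch (the input shape and file name are illustrative):

.. code-block:: python

    import torch

    dummy_input = torch.randn(1, 3, 224, 224)  # example input; match your model's expected shape
    torch.onnx.export(model, dummy_input, "quantized_model.onnx")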
115 changes: 115 additions & 0 deletions _sources/getting_started/5_distillation.rst.txt
@@ -0,0 +1,115 @@

=========================
Quick Start: Distillation
=========================

ModelOpt's :doc:`Distillation <../guides/4_distillation>` is a set of wrappers and utilities
to easily perform Knowledge Distillation between teacher and student models.
Given a pretrained teacher model, Distillation has the potential to train a smaller student model
faster and/or with higher accuracy than the student model could achieve on its own.

This quick-start guide shows the necessary steps to integrate Distillation into your
training pipeline.

Set up your base models
-----------------------

First obtain both a pretrained model to act as the teacher and a (usually smaller) model to serve
as the student.

.. code-block:: python

    from torchvision.models import resnet50, resnet18

    # Define student
    student_model = resnet18()

    # Define callable which returns teacher
    def teacher_factory():
        teacher_model = resnet50()
        teacher_model.load_state_dict(pretrained_weights)  # pretrained_weights assumed to be available
        return teacher_model
Set up the meta model
---------------------

As Knowledge Distillation involves (at least) two models, ModelOpt simplifies the integration
process by wrapping both student and teacher into one meta model.

Please see an example Distillation setup below. This example assumes the outputs
of ``teacher_model`` and ``student_model`` are logits.

.. code-block:: python

    import modelopt.torch.distill as mtd

    distillation_config = {
        "teacher_model": teacher_factory,  # model initializer
        "criterion": mtd.LogitsDistillationLoss(),  # callable receiving student and teacher outputs, in order
        "loss_balancer": mtd.StaticLossBalancer(),  # combines multiple losses; omit if only one distillation loss is used
    }

    distillation_model = mtd.convert(student_model, mode=[("kd_loss", distillation_config)])
The ``teacher_model`` can be either a callable which returns an ``nn.Module`` or a tuple of ``(model_cls, args, kwargs)``.
The ``criterion`` is the distillation loss used between student and teacher tensors.
The ``loss_balancer`` determines how the original and distillation losses are combined (if needed).
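
For instance, a hypothetical tuple-style teacher specification could look like this (the
constructor arguments are illustrative):

.. code-block:: python

    # Equivalent tuple form: (model_cls, args, kwargs)
    distillation_config["teacher_model"] = (resnet50, (), {"num_classes": 1000})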

See :doc:`Distillation <../guides/4_distillation>` for more info.
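
As background, a common logits-distillation criterion (not necessarily the exact form used by
``LogitsDistillationLoss``) is the temperature-softened KL divergence between teacher logits
:math:`z_t` and student logits :math:`z_s`:

.. math::

    \mathcal{L}_{KD} = T^2 \, \mathrm{KL}\!\left(\mathrm{softmax}(z_t / T) \,\|\, \mathrm{softmax}(z_s / T)\right)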


Distill during training
-----------------------

To distill from teacher to student, use the meta model in the usual training loop, calling the
meta model's ``.compute_kd_loss()`` method to combine the distillation loss with the original
user loss.

An example of Distillation training is given below:

.. code-block:: python
    :emphasize-lines: 14

    # Set up the data loaders. As an example:
    train_loader = get_train_loader()

    # Define user loss function. As an example:
    loss_fn = get_user_loss_fn()

    for input, labels in train_loader:
        distillation_model.zero_grad()
        # Forward through the wrapped models
        out = distillation_model(input)
        # Same loss as originally present
        loss = loss_fn(out, labels)
        # Combine distillation and user losses
        loss_total = distillation_model.compute_kd_loss(student_loss=loss)
        loss_total.backward()
.. note::

    `DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`_ may
    break ModelOpt's Distillation feature.
    Note that `HuggingFace Trainer <https://huggingface.co/docs/transformers/en/main_classes/trainer>`_
    uses DataParallel by default.


Export trained model
--------------------

The model can easily be reverted to its original class for further use (i.e., deployment)
without any ModelOpt modifications attached.

.. code-block:: python

    model = mtd.export(distillation_model)
--------------------------------

**Next steps**
* Learn more about :doc:`Distillation <../guides/4_distillation>`.
* See ModelOpt's :doc:`API documentation <../reference/1_modelopt_api>` for detailed
functionality and usage information.
21 changes: 15 additions & 6 deletions _sources/getting_started/6_sparsity.rst.txt
@@ -6,13 +6,13 @@ Sparsity
--------

ModelOpt's :doc:`sparsity<../guides/5_sparsity>` feature is an effective technique to reduce the
memory footprint of deep learning models and accelerate the inference speed. ModelOpt provides an
memory footprint of deep learning models and accelerate the inference speed. ModelOpt provides the
easy-to-use API :meth:`mts.sparsify() <modelopt.torch.sparsity.sparsification.sparsify>` to apply
weight sparsity to a given model.
:meth:`mts.sparsify() <modelopt.torch.sparsity.sparsification.sparsify>` supports
`NVIDIA 2:4 Sparsity <https://arxiv.org/abs/2104.08378>`_ sparsity pattern and various sparsification
methods, such as (`NVIDIA ASP <https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity>`_)
and (`SparseGPT <https://arxiv.org/abs/2301.00774>`_).
methods, such as `NVIDIA ASP <https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity>`_
and `SparseGPT <https://arxiv.org/abs/2301.00774>`_.

This guide provides a quick start to apply weight sparsity to a PyTorch model using ModelOpt.

@@ -38,7 +38,7 @@ Here is a quick example of sparsifying a model to 2:4 sparsity pattern with Spar
sparsity_config = {"data_loader": data_loader, "collect_func": lambda x: x}
# Sparsify the model and perform calibration (PTS)
model = mts.sparsity(model, mode="sparsegpt", config=sparsity_config)
model = mts.sparsify(model, mode="sparsegpt", config=sparsity_config)
.. note::
`data_loader` is only required in case of data-driven sparsity, e.g., SparseGPT for calibration.
@@ -48,10 +48,19 @@ Here is a quick example of sparsifying a model to 2:4 sparsity pattern with Spar
`data_loader` and `collect_func` can be substituted with a `forward_loop` that iterates the model through the
calibration dataset, as sketched below.
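
A sketch of that alternative (the ``forward_loop`` config key and the calibration dataloader
are assumptions based on the note above):

.. code-block:: python

    # Alternative: drive calibration with an explicit forward loop
    def forward_loop(model):
        for batch in calib_dataloader:  # calib_dataloader is assumed to be defined
            model(batch)

    model = mts.sparsify(model, mode="sparsegpt", config={"forward_loop": forward_loop})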

Sparsity-aware Training (SAT) for PyTorch models
------------------------------------------------

After sparsifying the model, you can save its checkpoint and use it to fine-tune the
sparsified model. Check out the
`GitHub end-to-end example <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/llm_sparsity>`_
to learn more about SAT.


--------------------------------

**Next Steps**
* Learn more about sparsity and advanced usage of ModelOpt sparsity in
:doc:`Sparsity guide <../guides/5_sparsity>`.
* Check out the end-to-end examples on GitHub for PTQ and QAT
`here <https://github.com/NVIDIA/TensorRT-Model-Optimizer?tab=readme-ov-file#examples>`_.
* Check out the `end-to-end example on GitHub <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/llm_sparsity>`_
for PTS and SAT.