Remove most of the PTQ recipe (pytorch#1219)
Co-authored-by: Evan Smothers <[email protected]>
jerryzh168 and ebsmothers authored Jul 29, 2024
1 parent 7dae04d commit 2dc11d9
Showing 6 changed files with 37 additions and 286 deletions.
128 changes: 12 additions & 116 deletions docs/source/tutorials/e2e_flow.rst
@@ -319,126 +319,22 @@ Bay Area!
Speeding up Generation using Quantization
-----------------------------------------

We saw that the generation recipe took around 11.6 seconds to generate 300 tokens.
One technique commonly used to speed up inference is quantization. torchtune provides
an integration with the `TorchAO <https://github.com/pytorch-labs/ao>`_
quantization APIs. Let's first quantize the model using 4-bit weights-only quantization
and see if this improves generation speed.
We rely on `torchao <https://github.com/pytorch-labs/ao>`_ for `post-training quantization <https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization>`_.
To quantize the fine-tuned model after installing torchao, we can run the following command::

# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only
quantize_(model, int4_weight_only())
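
For reference, a minimal, self-contained version of the same call might look like the
following (the toy ``nn.Linear`` stand-in and the CUDA/bf16 setup are illustrative
assumptions, not part of the tutorial)::

    import torch
    from torch import nn
    from torchao.quantization.quant_api import quantize_, int4_weight_only

    # toy stand-in for the fine-tuned model; the int4 weight-only kernels
    # expect bf16 weights on a CUDA device
    model = nn.Sequential(nn.Linear(4096, 4096)).to(device="cuda", dtype=torch.bfloat16)

    # replaces each linear layer's weight in place with an int4 weight-only
    # quantized tensor subclass
    quantize_(model, int4_weight_only())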

For this, we'll use the
`quantization recipe <https://github.com/pytorch/torchtune/blob/main/recipes/quantize.py>`_.


Let's first copy over the config to our local working directory so we can make changes.

.. code-block:: bash
tune cp quantization ./custom_quantization_config.yaml
Let's modify ``custom_quantization_config.yaml`` to include the following changes.

.. code-block:: yaml
checkpointer:
_component_: torchtune.utils.FullModelHFCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# finetuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files for the fine-tuned model. This should
# match what's shown in the logs above
checkpoint_files: [
hf_model_0001_0.pt,
hf_model_0002_0.pt,
]
output_dir: <checkpoint_dir>
model_type: LLAMA2
Once the config is updated, let's kick off quantization! We'll use the default
quantization method from the config.


.. code-block:: bash
tune run quantize --config ./custom_quantization_config.yaml
Once quantization is complete, you'll see the following in the logs.

.. code-block:: bash
[quantize.py:68] Time for quantization: 19.76 sec
[quantize.py:69] Memory used: 13.95 GB
[quantize.py:82] Model checkpoint of size 3.67 GB saved to <checkpoint_dir>/hf_model_0001_0-4w.pt
.. note::
Unlike the fine-tuned checkpoints, this outputs a single checkpoint file. This is
because our quantization APIs currently don't support any conversion across formats.
As a result you won't be able to use these quantized models outside of torchtune.
But you should be able to use these with the generation and evaluation recipes within
torchtune. These results will help inform which quantization methods you should use
with your favorite inference engine.

Now that we have the quantized model, let's re-run generation.

Modify ``custom_generation_config.yaml`` to include the following changes.

.. code-block:: yaml
checkpointer:
# we need to use the custom torchtune checkpointer
# instead of the HF checkpointer for loading
# quantized models
_component_: torchtune.utils.FullModelTorchTuneCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# finetuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files point to the quantized model
checkpoint_files: [
hf_model_0001_0-4w.pt,
]
output_dir: <checkpoint_dir>
model_type: LLAMA2
# we also need to update the quantizer to what was used during
# quantization
quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
groupsize: 256
Once the config is updated, let's kick off generation! We'll use the
same sampling parameters as before. We'll also use the same prompt we did with the
unquantized model.

.. code-block:: bash
tune run generate --config ./custom_generation_config.yaml \
prompt="What are some interesting sites to visit in the Bay Area?"
Once generation is complete, you'll see the following in the logs.


.. code-block:: bash
[generate.py:92] A park in San Francisco that sits at the top of a big hill.
There are lots of trees and a beautiful view of San Francisco...
After quantization, we rely on torch.compile for speedups. For more details, please see `this example usage <https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#quantization-flow-example>`_.
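
A rough sketch of that step (the ``model``/``tokens`` names are placeholders, and
``mode="max-autotune"`` is just one possible setting)::

    import torch

    # compile the already-quantized model; the first forward pass triggers
    # compilation, subsequent passes reuse the optimized kernels
    model = torch.compile(model, mode="max-autotune")

    with torch.no_grad():
        output = model(tokens)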

[generate.py:96] Time for inference: 4.13 sec total, 72.62 tokens/sec
[generate.py:99] Memory used: 17.85 GB
torchao also provides `this table <https://github.com/pytorch/ao#inference>`_ listing performance and accuracy results for ``llama2`` and ``llama3``.

With quantization (and torch compile under the hood), we've sped up generation
by almost 3x!
For Llama models, you can run generation directly in torchao on the quantized model using their ``generate.py`` script as
discussed in `this readme <https://github.com/pytorch/ao/tree/main/torchao/_models/llama>`_. This way you can compare your own results
to those in the previously-linked table.

|
106 changes: 12 additions & 94 deletions docs/source/tutorials/llama3.rst
@@ -237,105 +237,23 @@ Running generation with our LoRA-finetuned model, we see the following output:
Faster generation via quantization
----------------------------------

We can see that the model took just under 11 seconds, generating almost 19 tokens per second.
We can speed this up a bit by quantizing our model. Here we'll use 4-bit weights-only quantization
as provided by `torchao <https://github.com/pytorch-labs/ao>`_.
We rely on `torchao <https://github.com/pytorch-labs/ao>`_ for `post-training quantization <https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization>`_.
To quantize the fine-tuned model after installing torchao, we can run the following command::

If you've been following along this far, you know the drill by now.
Let's copy the quantization config and point it at our fine-tuned model.
# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only
quantize_(model, int4_weight_only())
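
If you want to keep the quantized weights around outside of this session, one possible
approach is to serialize the state dict (a sketch based on torchao's serialization notes;
the path and the ``assign=True`` load are assumptions to check against your torchao
version)::

    import torch

    # the quantized tensor subclasses serialize through the regular state_dict
    torch.save(model.state_dict(), "/tmp/llama3-8b-int4wo.pt")

    # later: rebuild the original bf16 model, then load with assign=True so the
    # quantized tensors replace the float parameters instead of being copied into them
    state_dict = torch.load("/tmp/llama3-8b-int4wo.pt", mmap=True)
    model.load_state_dict(state_dict, assign=True)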

.. code-block:: bash
tune cp quantization ./custom_quantization_config.yaml
And update ``custom_quantization_config.yaml`` with the following:

.. code-block:: yaml
# Model arguments
model:
_component_: torchtune.models.llama3.llama3_8b
checkpointer:
_component_: torchtune.utils.FullModelMetaCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# fine-tuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files for the fine-tuned model. These will be logged
# at the end of your fine-tune
checkpoint_files: [
meta_model_0.pt
]
output_dir: <checkpoint_dir>
model_type: LLAMA3
To quantize the model, we can now run:

.. code-block:: bash
After quantization, we rely on torch.compile for speedups. For more details, please see `this example usage <https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#quantization-flow-example>`_.

tune run quantize --config ./custom_quantization_config.yaml
torchao also provides `this table <https://github.com/pytorch/ao#inference>`_ listing performance and accuracy results for ``llama2`` and ``llama3``.

[quantize.py:90] Time for quantization: 2.93 sec
[quantize.py:91] Memory used: 23.13 GB
[quantize.py:104] Model checkpoint of size 4.92 GB saved to /tmp/Llama-3-8B-Instruct-hf/consolidated-4w.pt
We can see that the model is now under 5 GB, or just over four bits for each of the 8B parameters.

.. note::
Unlike the fine-tuned checkpoints, the quantization recipe outputs a single checkpoint file. This is
because our quantization APIs currently don't support any conversion across formats.
As a result you won't be able to use these quantized models outside of torchtune.
But you should be able to use these with the generation and evaluation recipes within
torchtune. These results will help inform which quantization methods you should use
with your favorite inference engine.

Let's take our quantized model and run the same generation again.
First, we'll make one more change to our ``custom_generation_config.yaml``.

.. code-block:: yaml
checkpointer:
# we need to use the custom torchtune checkpointer
# instead of the HF checkpointer for loading
# quantized models
_component_: torchtune.utils.FullModelTorchTuneCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# fine-tuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files point to the quantized model
checkpoint_files: [
consolidated-4w.pt,
]
output_dir: <checkpoint_dir>
model_type: LLAMA3
# we also need to update the quantizer to what was used during
# quantization
quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
groupsize: 256
Let's re-run generation!

.. code-block:: bash
tune run generate --config ./custom_generation_config.yaml \
prompt="Hello, my name is"
[generate.py:122] Hello, my name is Jake.
I am a multi-disciplined artist with a passion for creating, drawing and painting.
...
Time for inference: 1.62 sec total, 57.95 tokens/sec
For Llama models, you can run generation directly in torchao on the quantized model using their ``generate.py`` script as
discussed in `this readme <https://github.com/pytorch/ao/tree/main/torchao/_models/llama>`_. This way you can compare your own results
to those in the previously-linked table.

By quantizing the model and running ``torch.compile`` we get over a 3x speedup!

This is just the beginning of what you can do with Meta Llama3 using torchtune and the broader ecosystem.
We look forward to seeing what you build!
2 changes: 1 addition & 1 deletion recipes/configs/quantization.yaml
@@ -24,5 +24,5 @@ dtype: bf16
seed: 1234

quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
_component_: torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
groupsize: 256
22 changes: 10 additions & 12 deletions recipes/quantization.md
@@ -1,14 +1,11 @@
# Quantization and Sparsity

torchtune integrates with [torchao](https://github.com/pytorch/ao/) for architecture optimization techniques including quantization and sparsity. Currently only some quantization techniques are integrated; see the docstrings in the [quantization recipe](quantize.py) and the [QAT recipe](qat_distributed.py) for more details.
torchtune integrates with [torchao](https://github.com/pytorch/ao/) for QAT and QLoRA. Currently only some quantization techniques are integrated; see the docstrings in the [quantization recipe](quantize.py) and the [QAT recipe](qat_distributed.py) for more details.

#### Quantize
To quantize a model (default is int4 weight only quantization):
```
tune run quantize --config quantization
```
For post-training quantization, we recommend using `torchao` directly to quantize your model: https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md,
and running eval/benchmark in torchao as well: https://github.com/pytorch/ao/tree/main/torchao/_models/llama.

#### Quantization-Aware Training (QAT)
## Quantization-Aware Training (QAT)

(PyTorch 2.4+)

@@ -35,8 +32,7 @@ is supported. This refers to int8 dynamic per token activation quantization
combined with int4 grouped per axis weight quantization. For more details,
please refer to the [torchao implementation](https://github.com/pytorch/ao/blob/950a89388e88e10f26bbbbe2ec0b1710ba3d33d1/torchao/quantization/prototype/qat.py#L22).
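
At a high level, the torchao flow this builds on looks roughly like the following (a sketch of the underlying torchao API, not the torchtune recipe itself; check the import path and arguments against your torchao version):

```python
import torch
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# insert "fake quantize" ops so that training sees the quantization error
model = qat_quantizer.prepare(model)

# ... fine-tune `model` as usual ...

# swap the fake-quantized ops for actually quantized weights/activations
model = qat_quantizer.convert(model)
```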


#### Eval
## Eval
To evaluate a quantized model, make the following changes to the default [evaluation config](configs/eleuther_evaluation.yaml)


@@ -52,16 +48,18 @@ checkpointer:
# Quantization specific args
quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
_component_: torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
groupsize: 256
```

Note: we can use `Int8DynActInt4WeightQuantizer` to load a QAT-quantized model since it's the same type of quantization.

and run evaluation:
```bash
tune run eleuther_eval --config eleuther_evaluation
```

#### Generate
## Generate
To run inference using a quantized model, make the following changes to the default [generation config](configs/generation.yaml)


@@ -77,7 +75,7 @@ checkpointer:
# Quantization Arguments
quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
_component_: torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
groupsize: 256
```

22 changes: 0 additions & 22 deletions recipes/quantize.py
@@ -25,28 +25,6 @@ class QuantizationRecipe:
Uses quantizer classes from torchao to quantize a model.
Supported quantization modes are:
8w:
torchtune.utils.quantization.Int8WeightOnlyQuantizer
int8 weight only per axis group quantization
4w:
torchtune.utils.quantization.Int4WeightOnlyQuantizer
int4 weight only per axis group quantization
Args:
`groupsize` (int): a parameter of int4 weight only quantization,
it refers to the size of quantization groups which get independent quantization parameters
e.g. 32, 64, 128, 256; smaller numbers mean more fine-grained quantization and higher accuracy
4w-gptq:
torchtune.utils.quantization.Int4WeightOnlyGPTQQuantizer
int4 weight only per axis group quantization with GPTQ
Args:
`groupsize`: see description in `4w`
`blocksize`: GPTQ is applied to a 'block' of columns at a time,
larger blocks trade off memory for perf, recommended to be a constant
multiple of groupsize.
`percdamp`: GPTQ stabilization hyperparameter, recommended to be .01
8da4w (PyTorch 2.3+):
torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
int8 per token dynamic activation with int4 weight only per axis group quantization
