Remove most of the PTQ recipe (pytorch#1219)
Co-authored-by: Evan Smothers <[email protected]>
jerryzh168 and ebsmothers authored Jul 29, 2024
1 parent 7dae04d commit 2dc11d9
Showing 6 changed files with 37 additions and 286 deletions.
128 changes: 12 additions & 116 deletions docs/source/tutorials/e2e_flow.rst
@@ -319,126 +319,22 @@ Bay Area!
Speeding up Generation using Quantization
-----------------------------------------

We saw that the generation recipe took around 11.6 seconds to generate 300 tokens.
One technique commonly used to speed up inference is quantization. torchtune provides
an integration with the `TorchAO <https://github.com/pytorch-labs/ao>`_
quantization APIs. Let's first quantize the model using 4-bit weights-only quantization
and see if this improves generation speed.
We rely on `torchao <https://github.com/pytorch-labs/ao>`_ for `post-training quantization <https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization>`_.
To quantize the fine-tuned model after installing torchao, we can run the following command::

# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only
quantize_(model, int4_weight_only())
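
For reference, a minimal, self-contained version of the same call might look like the
following (the toy ``nn.Linear`` stand-in and the CUDA/bf16 setup are illustrative
assumptions, not part of the tutorial)::

    import torch
    from torch import nn
    from torchao.quantization.quant_api import quantize_, int4_weight_only

    # toy stand-in for the fine-tuned model; the int4 weight-only kernels
    # expect bf16 weights on a CUDA device
    model = nn.Sequential(nn.Linear(4096, 4096)).to(device="cuda", dtype=torch.bfloat16)

    # replaces each linear layer's weight in place with an int4 weight-only
    # quantized tensor subclass
    quantize_(model, int4_weight_only())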

For this, we'll use the
`quantization recipe <https://github.com/pytorch/torchtune/blob/main/recipes/quantize.py>`_.


Let's first copy over the config to our local working directory so we can make changes.

.. code-block:: bash
tune cp quantization ./custom_quantization_config.yaml
Let's modify ``custom_quantization_config.yaml`` to include the following changes.

.. code-block:: yaml
checkpointer:
_component_: torchtune.utils.FullModelHFCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# finetuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files for the fine-tuned model. This should
# match what's shown in the logs above
checkpoint_files: [
hf_model_0001_0.pt,
hf_model_0002_0.pt,
]
output_dir: <checkpoint_dir>
model_type: LLAMA2
Once the config is updated, let's kick off quantization! We'll use the default
quantization method from the config.


.. code-block:: bash
tune run quantize --config ./custom_quantization_config.yaml
Once quantization is complete, you'll see the following in the logs.

.. code-block:: bash
[quantize.py:68] Time for quantization: 19.76 sec
[quantize.py:69] Memory used: 13.95 GB
[quantize.py:82] Model checkpoint of size 3.67 GB saved to <checkpoint_dir>/hf_model_0001_0-4w.pt
.. note::
Unlike the fine-tuned checkpoints, this outputs a single checkpoint file. This is
because our quantization APIs currently don't support any conversion across formats.
As a result you won't be able to use these quantized models outside of torchtune.
But you should be able to use these with the generation and evaluation recipes within
torchtune. These results will help inform which quantization methods you should use
with your favorite inference engine.

Now that we have the quantized model, let's re-run generation.

Modify ``custom_generation_config.yaml`` to include the following changes.

.. code-block:: yaml
checkpointer:
# we need to use the custom torchtune checkpointer
# instead of the HF checkpointer for loading
# quantized models
_component_: torchtune.utils.FullModelTorchTuneCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# finetuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files point to the quantized model
checkpoint_files: [
hf_model_0001_0-4w.pt,
]
output_dir: <checkpoint_dir>
model_type: LLAMA2
# we also need to update the quantizer to what was used during
# quantization
quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
groupsize: 256
Once the config is updated, let's kick off generation! We'll use the
same sampling parameters as before. We'll also use the same prompt we did with the
unquantized model.

.. code-block:: bash
tune run generate --config ./custom_generation_config.yaml \
prompt="What are some interesting sites to visit in the Bay Area?"
Once generation is complete, you'll see the following in the logs.


.. code-block:: bash
[generate.py:92] A park in San Francisco that sits at the top of a big hill.
There are lots of trees and a beautiful view of San Francisco...
After quantization, we rely on torch.compile for speedups. For more details, please see `this example usage <https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#quantization-flow-example>`_.
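
A rough sketch of that step (the ``model``/``tokens`` names are placeholders, and
``mode="max-autotune"`` is just one possible setting)::

    import torch

    # compile the already-quantized model; the first forward pass triggers
    # compilation, subsequent passes reuse the optimized kernels
    model = torch.compile(model, mode="max-autotune")

    with torch.no_grad():
        output = model(tokens)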

[generate.py:96] Time for inference: 4.13 sec total, 72.62 tokens/sec
[generate.py:99] Memory used: 17.85 GB
torchao also provides `this table <https://github.com/pytorch/ao#inference>`_ listing performance and accuracy results for ``llama2`` and ``llama3``.

With quantization (and torch compile under the hood), we've sped up generation
by almost 3x!
For Llama models, you can run generation directly in torchao on the quantized model using their ``generate.py`` script as
discussed in `this readme <https://github.com/pytorch/ao/tree/main/torchao/_models/llama>`_. This way you can compare your own results
to those in the previously-linked table.

|
106 changes: 12 additions & 94 deletions docs/source/tutorials/llama3.rst
@@ -237,105 +237,23 @@ Running generation with our LoRA-finetuned model, we see the following output:
Faster generation via quantization
----------------------------------

We can see that the model took just under 11 seconds, generating almost 19 tokens per second.
We can speed this up a bit by quantizing our model. Here we'll use 4-bit weights-only quantization
as provided by `torchao <https://github.com/pytorch-labs/ao>`_.
We rely on `torchao <https://github.com/pytorch-labs/ao>`_ for `post-training quantization <https://github.com/pytorch/ao/tree/main/torchao/quantization#quantization>`_.
To quantize the fine-tuned model after installing torchao, we can run the following command::

If you've been following along this far, you know the drill by now.
Let's copy the quantization config and point it at our fine-tuned model.
# we also support `int8_weight_only()` and `int8_dynamic_activation_int8_weight()`, see
# https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques
# for a full list of techniques that we support
from torchao.quantization.quant_api import quantize_, int4_weight_only
quantize_(model, int4_weight_only())
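
If you want to keep the quantized weights around outside of this session, one possible
approach is to serialize the state dict (a sketch based on torchao's serialization notes;
the path and the ``assign=True`` load are assumptions to check against your torchao
version)::

    import torch

    # the quantized tensor subclasses serialize through the regular state_dict
    torch.save(model.state_dict(), "/tmp/llama3-8b-int4wo.pt")

    # later: rebuild the original bf16 model, then load with assign=True so the
    # quantized tensors replace the float parameters instead of being copied into them
    state_dict = torch.load("/tmp/llama3-8b-int4wo.pt", mmap=True)
    model.load_state_dict(state_dict, assign=True)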

.. code-block:: bash
tune cp quantization ./custom_quantization_config.yaml
And update ``custom_quantization_config.yaml`` with the following:

.. code-block:: yaml
# Model arguments
model:
_component_: torchtune.models.llama3.llama3_8b
checkpointer:
_component_: torchtune.utils.FullModelMetaCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# fine-tuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files for the fine-tuned model. These will be logged
# at the end of your fine-tune
checkpoint_files: [
meta_model_0.pt
]
output_dir: <checkpoint_dir>
model_type: LLAMA3
To quantize the model, we can now run:

.. code-block:: bash
After quantization, we rely on torch.compile for speedups. For more details, please see `this example usage <https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#quantization-flow-example>`_.

tune run quantize --config ./custom_quantization_config.yaml
torchao also provides `this table <https://github.com/pytorch/ao#inference>`_ listing performance and accuracy results for ``llama2`` and ``llama3``.

[quantize.py:90] Time for quantization: 2.93 sec
[quantize.py:91] Memory used: 23.13 GB
[quantize.py:104] Model checkpoint of size 4.92 GB saved to /tmp/Llama-3-8B-Instruct-hf/consolidated-4w.pt
We can see that the model is now under 5 GB, or just over four bits for each of the 8B parameters.

.. note::
Unlike the fine-tuned checkpoints, the quantization recipe outputs a single checkpoint file. This is
because our quantization APIs currently don't support any conversion across formats.
As a result you won't be able to use these quantized models outside of torchtune.
But you should be able to use these with the generation and evaluation recipes within
torchtune. These results will help inform which quantization methods you should use
with your favorite inference engine.

Let's take our quantized model and run the same generation again.
First, we'll make one more change to our ``custom_generation_config.yaml``.

.. code-block:: yaml
checkpointer:
# we need to use the custom torchtune checkpointer
# instead of the HF checkpointer for loading
# quantized models
_component_: torchtune.utils.FullModelTorchTuneCheckpointer
# directory with the checkpoint files
# this should match the output_dir specified during
# fine-tuning
checkpoint_dir: <checkpoint_dir>
# checkpoint files point to the quantized model
checkpoint_files: [
consolidated-4w.pt,
]
output_dir: <checkpoint_dir>
model_type: LLAMA3
# we also need to update the quantizer to what was used during
# quantization
quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
groupsize: 256
Let's re-run generation!

.. code-block:: bash
tune run generate --config ./custom_generation_config.yaml \
prompt="Hello, my name is"
[generate.py:122] Hello, my name is Jake.
I am a multi-disciplined artist with a passion for creating, drawing and painting.
...
Time for inference: 1.62 sec total, 57.95 tokens/sec
For Llama models, you can run generation directly in torchao on the quantized model using their ``generate.py`` script as
discussed in `this readme <https://github.com/pytorch/ao/tree/main/torchao/_models/llama>`_. This way you can compare your own results
to those in the previously-linked table.

By quantizing the model and running ``torch.compile`` we get over a 3x speedup!

This is just the beginning of what you can do with Meta Llama3 using torchtune and the broader ecosystem.
We look forward to seeing what you build!
2 changes: 1 addition & 1 deletion recipes/configs/quantization.yaml
@@ -24,5 +24,5 @@ dtype: bf16
seed: 1234

quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
_component_: torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
groupsize: 256
22 changes: 10 additions & 12 deletions recipes/quantization.md
@@ -1,14 +1,11 @@
# Quantization and Sparsity

torchtune integrates with [torchao](https://github.com/pytorch/ao/) for architecture optimization techniques including quantization and sparsity. Currently only some quantization techniques are integrated; see the docstrings in the [quantization recipe](quantize.py) and the [QAT recipe](qat_distributed.py) for more details.
torchtune integrates with [torchao](https://github.com/pytorch/ao/) for QAT and QLoRA. Currently only some quantization techniques are integrated; see the docstrings in the [quantization recipe](quantize.py) and the [QAT recipe](qat_distributed.py) for more details.

#### Quantize
To quantize a model (default is int4 weight only quantization):
```
tune run quantize --config quantization
```
For post-training quantization, we recommend using `torchao` directly to quantize your model: https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md,
and running eval/benchmark in torchao as well: https://github.com/pytorch/ao/tree/main/torchao/_models/llama.

#### Quantization-Aware Training (QAT)
## Quantization-Aware Training (QAT)

(PyTorch 2.4+)

@@ -35,8 +32,7 @@ is supported. This refers to int8 dynamic per token activation quantization
combined with int4 grouped per axis weight quantization. For more details,
please refer to the [torchao implementation](https://github.com/pytorch/ao/blob/950a89388e88e10f26bbbbe2ec0b1710ba3d33d1/torchao/quantization/prototype/qat.py#L22).
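
At a high level, the torchao flow this builds on looks roughly like the following (a sketch of the underlying torchao API, not the torchtune recipe itself; check the import path and arguments against your torchao version):

```python
import torch
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# insert "fake quantize" ops so that training sees the quantization error
model = qat_quantizer.prepare(model)

# ... fine-tune `model` as usual ...

# swap the fake-quantized ops for actually quantized weights/activations
model = qat_quantizer.convert(model)
```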


#### Eval
## Eval
To evaluate a quantized model, make the following changes to the default [evaluation config](configs/eleuther_evaluation.yaml)


@@ -52,16 +48,18 @@ checkpointer:
# Quantization specific args
quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
_component_: torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
groupsize: 256
```

Note: we can use `Int8DynActInt4WeightQuantizer` to load a QAT-quantized model since it's the same type of quantization.

and run evaluation:
```bash
tune run eleuther_eval --config eleuther_evaluation
```

#### Generate
## Generate
To run inference using a quantized model, make the following changes to the default [generation config](configs/generation.yaml)


@@ -77,7 +75,7 @@ checkpointer:
# Quantization Arguments
quantizer:
_component_: torchtune.utils.quantization.Int4WeightOnlyQuantizer
_component_: torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
groupsize: 256
```

22 changes: 0 additions & 22 deletions recipes/quantize.py
@@ -25,28 +25,6 @@ class QuantizationRecipe:
Uses quantizer classes from torchao to quantize a model.
Supported quantization modes are:
8w:
torchtune.utils.quantization.Int8WeightOnlyQuantizer
int8 weight only per axis group quantization
4w:
torchtune.utils.quantization.Int4WeightOnlyQuantizer
int4 weight only per axis group quantization
Args:
`groupsize` (int): a parameter of int4 weight only quantization,
it refers to the size of quantization groups which get independent quantization parameters
e.g. 32, 64, 128, 256; smaller numbers mean more fine-grained quantization and higher accuracy
4w-gptq:
torchtune.utils.quantization.Int4WeightOnlyGPTQQuantizer
int4 weight only per axis group quantization with GPTQ
Args:
`groupsize`: see description in `4w`
`blocksize`: GPTQ is applied to a 'block' of columns at a time,
larger blocks trade off memory for perf, recommended to be a constant
multiple of groupsize.
`percdamp`: GPTQ stabilization hyperparameter, recommended to be .01
8da4w (PyTorch 2.3+):
torchtune.utils.quantization.Int8DynActInt4WeightQuantizer
int8 per token dynamic activation with int4 weight only per axis group quantization
