Implement HQQ quantization (#677)
* Add dequant kernels and sketch forward for hqq

* Implement dequant for the rest

* Comments

* Add cpu kernels, still need simd

* Automatically use cpu impl otherwise

* Add quantization, need to add optimizer

* Add proximal legacy optimizer

* Clippy

* Refactor

* Connect everything up

* Refactor isq

* Create the IsqType

* Add isq for hqq

* Clippy

* WIP

* Works for HQQ8

* WIP

* Works for q4k

* Improve it

* Apply isq to lm head

* Fix cpu

* Handle minimum max threads

* Bump candle

* Complete merge

* Complete merge

* Disable hqq 3,2,1 for now

* Clippy

* Update docs

* Typo
EricLBuehler authored Aug 16, 2024
1 parent a09a68a commit fd90d9a
Showing 19 changed files with 2,098 additions and 18 deletions.
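For context on what the new kernels and the quantization pass compute: HQQ-style quantization stores each weight group as low-bit integers plus a per-group scale and zero-point, and the dequant kernels invert that affine map. Below is a minimal sketch of the round trip; the group layout, the min/max choice of scale and zero-point, and the nibble packing order are illustrative assumptions, not the mistralrs-quant implementation (which additionally refines the zero-point with the proximal optimizer mentioned in the commit messages).

```rust
/// Quantize one group of weights to `bits` (e.g. 4 or 8) with an affine
/// scale/zero-point, then dequantize. Purely a sketch: HQQ proper refines the
/// zero-point with a proximal solver to reduce reconstruction error.
fn quant_dequant_group(w: &[f32], bits: u32) -> (Vec<u8>, Vec<f32>) {
    let qmax = ((1u32 << bits) - 1) as f32; // 15 for 4-bit, 255 for 8-bit
    let (min, max) = w
        .iter()
        .fold((f32::INFINITY, f32::NEG_INFINITY), |(lo, hi), &x| (lo.min(x), hi.max(x)));
    let scale = ((max - min) / qmax).max(1e-8); // guard against a constant group
    let zero = -min / scale; // maps `min` to quantization level 0
    let q: Vec<u8> = w
        .iter()
        .map(|&x| (x / scale + zero).round().clamp(0.0, qmax) as u8)
        .collect();
    let dq: Vec<f32> = q.iter().map(|&v| (v as f32 - zero) * scale).collect();
    (q, dq)
}

/// 4-bit values are typically packed two per byte; dequantization then just
/// unpacks the nibbles and applies the same affine map (low nibble first is an
/// assumption of this sketch).
fn dequant_4bit_packed(packed: &[u8], scale: f32, zero: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &b in packed {
        out.push(((b & 0x0F) as f32 - zero) * scale);
        out.push(((b >> 4) as f32 - zero) * scale);
    }
    out
}
```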
README.md (11 additions, 9 deletions)
@@ -64,23 +64,26 @@ Mistral.rs supports several model categories:
## Description
**Fast**:
- Quantized model support: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit for faster inference and optimized memory usage.
- Apple silicon support with the Metal framework.
- CPU inference with `mkl`, `accelerate` support and optimized backend.
- CUDA support with flash attention and cuDNN.
- Continuous batching and PagedAttention support.
- Prefix caching.
- [Device mapping](docs/DEVICE_MAPPING.md): load and run some layers on the device and the rest on the CPU.
**Accelerator support**:
- Apple silicon support with the Metal framework.
- CPU inference with `mkl`, `accelerate` support and optimized backend.
- CUDA support with flash attention and cuDNN.
**Quantization**:
- [Details](docs/QUANTS.md)
- GGML: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit, with ISQ support.
- GPTQ: 2-bit, 3-bit, 4-bit and 8-bit
- HQQ: 4-bit and 8-bit, with ISQ support
- [ISQ](docs/ISQ.md) (In situ quantization): run `.safetensors` models directly from Hugging Face Hub by quantizing them after loading instead of creating a GGUF file.
- This loads the ISQ-able weights on CPU before quantizing with ISQ and then moving to the device to avoid memory spikes.
- Extremely fast due to working in parallel
**Easy**:
- Lightweight OpenAI API compatible HTTP server.
- Python API.
- Grammar support with Regex and Yacc.
- [ISQ](docs/ISQ.md) (In situ quantization): run `.safetensors` models directly from Hugging Face Hub by quantizing them after loading instead of creating a GGUF file.
- This loads the ISQ-able weights on CPU before quantizing with ISQ and then moving to the device to avoid memory spikes.
- Extremely fast due to working in parallel
**Powerful**:
- Fast LoRA support with weight merging.
@@ -98,7 +101,6 @@ Mistral.rs supports several model categories:
- Please suggest more by raising an issue!
- Tool calling: [docs](docs/TOOL_CALLING.md)
- Prompt chunking (only without PagedAttention for now): handle larger prompts where the activation size would cause an OOM by sending chunks
- Various quantizations (GGUF, GPTQ, ISQ): [docs](docs/QUANTS.md)
This is a demo of interactive mode with streaming, running Phi 3 128k mini quantized via ISQ to Q4K.
docs/ISQ.md (3 additions, 0 deletions)
@@ -15,9 +15,12 @@ Possible values for ISQ quantization:
- Q5K
- Q6K
- Q8K
- HQQ4
- HQQ8

When using ISQ, the ISQ-able weights are automatically loaded into CPU memory before ISQ is applied; applying ISQ then moves the quantized weights to device memory. This avoids the memory spikes that loading the model in full precision on the device would cause.
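A rough sketch of one such per-tensor step, assuming candle's `QTensor::quantize` API (the parallel orchestration across tensors and the final transfer to the device are only indicated in comments):

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};

// Hedged sketch: the weight is quantized while still resident on the CPU, so
// the full-precision copy never has to live in device memory. Q4K is used
// purely as an example target type.
fn isq_one_tensor(weight: &Tensor) -> Result<QTensor> {
    debug_assert!(matches!(weight.device(), Device::Cpu)); // loaded on CPU first
    // mistral.rs runs this step over many tensors in parallel, then moves the
    // (much smaller) quantized tensors to the target device.
    QTensor::quantize(weight, GgmlDType::Q4K)
}
```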

**Fallback rules for GGUF quantization**
If a tensor cannot be quantized, the fallback process is as follows:
1) If using a `K` quant, fallback to a similar `Q` quant.
2) If that is not possible, use `F32` as the data type.
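As a minimal sketch of that chain (the stand-in enum and the specific pairing of `K` quants with non-`K` quants are assumptions for illustration, not the exact mapping used by mistral.rs):

```rust
// Illustrative stand-in for the crate's quant types plus the documented
// two-step fallback; the K -> Q pairing below is an assumption of this sketch.
#[derive(Debug, Clone, Copy)]
enum Quant {
    Q4_0, Q4_1, Q8_0,
    Q2K, Q3K, Q4K, Q5K, Q6K, Q8K,
    F32,
}

fn fallback(requested: Quant) -> Quant {
    use Quant::*;
    match requested {
        // 1) A `K` quant falls back to a similar non-K `Q` quant.
        Q2K | Q3K => Q4_0,
        Q4K | Q5K => Q4_1,
        Q6K | Q8K => Q8_0,
        // 2) Otherwise keep the tensor in full precision.
        _ => F32,
    }
}
```

The fallback applies per tensor, so a single tensor that cannot be quantized does not prevent the rest of the model from being quantized.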
docs/QUANTS.md (8 additions, 2 deletions)
@@ -4,16 +4,22 @@ Mistral.rs supports the following quantization:
- GGUF/GGML
- Q, K type
- Supported in GGUF/GGML and GGUF/GGML adapter models
- Supported in all plain and adapter models
- I quants coming!
- CPU, CUDA, Metal (all supported devices)
- 2, 3, 4, 5, 6, 8 bit
- GPTQ
- Supported in all plain and adapter models
- CUDA only
- 2, 3, 4, 8 bit
- HQQ
- Supported in all plain and adapter models via ISQ
- CUDA and CPU only
- 4, 8 bit
- ISQ
- Q, K type GGUF quants
- Supported in all plain and adapter models
- I quants coming!
- GPTQ quants coming!
- HQQ quants
- CPU, CUDA, Metal (all supported devices)

## Using a GGUF quantized model
mistralrs-core/src/pipeline/isq.rs (11 additions, 1 deletion)
@@ -23,6 +23,11 @@ use crate::device_map::DeviceMapper;
/// - `Q5K`
/// - `Q6K`
/// - `Q8K`
/// - `HQQ1`
/// - `HQQ2`
/// - `HQQ3`
/// - `HQQ4`
/// - `HQQ8`
pub fn parse_isq_value(s: &str) -> Result<IsqType, String> {
match s.to_lowercase().as_str() {
"q4_0" => Ok(IsqType::Q4_0),
@@ -37,7 +42,12 @@ pub fn parse_isq_value(s: &str) -> Result<IsqType, String> {
"q5k" => Ok(IsqType::Q5K),
"q6k" => Ok(IsqType::Q6K),
"q8k" => Ok(IsqType::Q8K),
_ => Err(format!("GGML type {s} unknown, choose one of `Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q8_0`, `Q8_1`, `Q2K`, `Q3K`, `Q4K`, `Q5K`, `Q6K`, `Q8K`.")),
"hqq8" => Ok(IsqType::HQQ8),
"hqq4" => Ok(IsqType::HQQ4),
// "hqq3" => Ok(IsqType::HQQ3),
// "hqq2" => Ok(IsqType::HQQ2),
// "hqq1" => Ok(IsqType::HQQ1),
_ => Err(format!("GGML type {s} unknown, choose one of `Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q8_0`, `Q8_1`, `Q2K`, `Q3K`, `Q4K`, `Q5K`, `Q6K`, `Q8K`, `HQQ8`, `HQQ4`.")),
}
}
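A small usage sketch of the parser (assuming `parse_isq_value` and `IsqType` are in scope; `matches!` avoids relying on any extra derives):

```rust
fn demo() {
    // Input is lowercased first, so the names are case-insensitive.
    assert!(matches!(parse_isq_value("HQQ4"), Ok(IsqType::HQQ4)));
    assert!(matches!(parse_isq_value("q4k"), Ok(IsqType::Q4K)));
    // The 3/2/1-bit HQQ arms are commented out above, so these report an error.
    assert!(parse_isq_value("hqq3").is_err());
}
```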

mistralrs-quant/build.rs (5 additions, 1 deletion)
@@ -7,7 +7,11 @@ fn main() {
use std::{path::PathBuf, vec};
println!("cargo:rerun-if-changed=build.rs");
let build_dir = PathBuf::from(std::env::var("OUT_DIR").unwrap());
let lib_files = vec!["kernels/gptq/q_gemm.cu"];
let lib_files = vec![
"kernels/gptq/q_gemm.cu",
"kernels/hqq/hqq.cu",
"kernels/ops/ops.cu",
];
for lib_file in lib_files.iter() {
println!("cargo:rerun-if-changed={lib_file}");
}