Implement HQQ quantization (#677)
* Add dequant kernels and sketch forward for hqq

* Implement dequant for the rest

* Comments

* Add cpu kernels, still need simd

* Automatically use cpu impl otherwise

* Add quantization, need to add optimizer

* Add proximal legacy optimizer

* Clippy

* Refactor

* Connect everything up

* Refactor isq

* Create the IsqType

* Add isq for hqq

* Clippy

* WIP

* Works for HQQ8

* WIP

* Works for q4k

* Improve it

* Apply isq to lm head

* Fix cpu

* Handle minimum max threads

* Bump candle

* Complete merge

* Complete merge

* Disable hqq 3,2,1 for now

* Clippy

* Update docs

* Typo
EricLBuehler authored Aug 16, 2024
1 parent a09a68a commit fd90d9a
Showing 19 changed files with 2,098 additions and 18 deletions.
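For context on what the new kernels and the quantization pass compute: HQQ-style quantization stores each weight group as low-bit integers plus a per-group scale and zero-point, and the dequant kernels invert that affine map. Below is a minimal sketch of the round trip; the group layout, the min/max choice of scale and zero-point, and the nibble packing order are illustrative assumptions, not the mistralrs-quant implementation (which additionally refines the zero-point with the proximal optimizer mentioned in the commit messages).

```rust
/// Quantize one group of weights to `bits` (e.g. 4 or 8) with an affine
/// scale/zero-point, then dequantize. Purely a sketch: HQQ proper refines the
/// zero-point with a proximal solver to reduce reconstruction error.
fn quant_dequant_group(w: &[f32], bits: u32) -> (Vec<u8>, Vec<f32>) {
    let qmax = ((1u32 << bits) - 1) as f32; // 15 for 4-bit, 255 for 8-bit
    let (min, max) = w
        .iter()
        .fold((f32::INFINITY, f32::NEG_INFINITY), |(lo, hi), &x| (lo.min(x), hi.max(x)));
    let scale = ((max - min) / qmax).max(1e-8); // guard against a constant group
    let zero = -min / scale; // maps `min` to quantization level 0
    let q: Vec<u8> = w
        .iter()
        .map(|&x| (x / scale + zero).round().clamp(0.0, qmax) as u8)
        .collect();
    let dq: Vec<f32> = q.iter().map(|&v| (v as f32 - zero) * scale).collect();
    (q, dq)
}

/// 4-bit values are typically packed two per byte; dequantization then just
/// unpacks the nibbles and applies the same affine map (low nibble first is an
/// assumption of this sketch).
fn dequant_4bit_packed(packed: &[u8], scale: f32, zero: f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &b in packed {
        out.push(((b & 0x0F) as f32 - zero) * scale);
        out.push(((b >> 4) as f32 - zero) * scale);
    }
    out
}
```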
README.md (11 additions, 9 deletions)
@@ -64,23 +64,26 @@ Mistral.rs supports several model categories:
## Description
**Fast**:
- Quantized model support: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit for faster inference and optimized memory usage.
- Apple silicon support with the Metal framework.
- CPU inference with `mkl`, `accelerate` support and optimized backend.
- CUDA support with flash attention and cuDNN.
- Continuous batching and PagedAttention support.
- Prefix caching.
- [Device mapping](docs/DEVICE_MAPPING.md): load and run some layers on the device and the rest on the CPU.
**Accelerator support**:
- Apple silicon support with the Metal framework.
- CPU inference with `mkl`, `accelerate` support and optimized backend.
- CUDA support with flash attention and cuDNN.
**Quantization**:
- [Details](docs/QUANTS.md)
- GGML: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit, with ISQ support.
- GPTQ: 2-bit, 3-bit, 4-bit and 8-bit
- HQQ: 4-bit and 8-bit, with ISQ support
- [ISQ](docs/ISQ.md) (In situ quantization): run `.safetensors` models directly from Hugging Face Hub by quantizing them after loading instead of creating a GGUF file.
- This loads the ISQ-able weights on CPU before quantizing with ISQ and then moving to the device to avoid memory spikes.
- Extremely fast due to working in parallel
**Easy**:
- Lightweight OpenAI API compatible HTTP server.
- Python API.
- Grammar support with Regex and Yacc.
- [ISQ](docs/ISQ.md) (In situ quantization): run `.safetensors` models directly from Hugging Face Hub by quantizing them after loading instead of creating a GGUF file.
- This loads the ISQ-able weights on CPU before quantizing with ISQ and then moving to the device to avoid memory spikes.
- Extremely fast due to working in parallel
**Powerful**:
- Fast LoRA support with weight merging.
@@ -98,7 +101,6 @@ Mistral.rs supports several model categories:
- Please suggest more by raising an issue!
- Tool calling: [docs](docs/TOOL_CALLING.md)
- Prompt chunking (only without PagedAttention for now): handle larger prompts where the activation size would cause an OOM by sending chunks
- Various quantizations (GGUF, GPTQ, ISQ): [docs](docs/QUANTS.md)
This is a demo of interactive mode with streaming, running Phi 3 128k mini quantized via ISQ to Q4K.
docs/ISQ.md (3 additions, 0 deletions)
@@ -15,9 +15,12 @@ Possible values for ISQ quantization:
- Q5K
- Q6K
- Q8K
- HQQ4
- HQQ8

When using ISQ, the ISQ-able weights are automatically loaded into CPU memory before ISQ is applied; applying ISQ then moves the quantized weights to device memory. This avoids the memory spikes that loading the model in full precision on the device would cause.
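A rough sketch of one such per-tensor step, assuming candle's `QTensor::quantize` API (the parallel orchestration across tensors and the final transfer to the device are only indicated in comments):

```rust
use candle_core::quantized::{GgmlDType, QTensor};
use candle_core::{Device, Result, Tensor};

// Hedged sketch: the weight is quantized while still resident on the CPU, so
// the full-precision copy never has to live in device memory. Q4K is used
// purely as an example target type.
fn isq_one_tensor(weight: &Tensor) -> Result<QTensor> {
    debug_assert!(matches!(weight.device(), Device::Cpu)); // loaded on CPU first
    // mistral.rs runs this step over many tensors in parallel, then moves the
    // (much smaller) quantized tensors to the target device.
    QTensor::quantize(weight, GgmlDType::Q4K)
}
```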

**Fallback rules for GGUF quantization**
If a tensor cannot be quantized, the fallback process is as follows:
1) If using a `K` quant, fallback to a similar `Q` quant.
2) If that is not possible, use `F32` as the data type.
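As a minimal sketch of that chain (the stand-in enum and the specific pairing of `K` quants with non-`K` quants are assumptions for illustration, not the exact mapping used by mistral.rs):

```rust
// Illustrative stand-in for the crate's quant types plus the documented
// two-step fallback; the K -> Q pairing below is an assumption of this sketch.
#[derive(Debug, Clone, Copy)]
enum Quant {
    Q4_0, Q4_1, Q8_0,
    Q2K, Q3K, Q4K, Q5K, Q6K, Q8K,
    F32,
}

fn fallback(requested: Quant) -> Quant {
    use Quant::*;
    match requested {
        // 1) A `K` quant falls back to a similar non-K `Q` quant.
        Q2K | Q3K => Q4_0,
        Q4K | Q5K => Q4_1,
        Q6K | Q8K => Q8_0,
        // 2) Otherwise keep the tensor in full precision.
        _ => F32,
    }
}
```

The fallback applies per tensor, so a single tensor that cannot be quantized does not prevent the rest of the model from being quantized.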
docs/QUANTS.md (8 additions, 2 deletions)
@@ -4,16 +4,22 @@ Mistral.rs supports the following quantization:
- GGUF/GGML
- Q, K type
- Supported in GGUF/GGML and GGUF/GGML adapter models
- Supported in all plain and adapter models
- I quants coming!
- CPU, CUDA, Metal (all supported devices)
- 2, 3, 4, 5, 6, 8 bit
- GPTQ
- Supported in all plain and adapter models
- CUDA only
- 2, 3, 4, 8 bit
- HQQ
- Supported in all plain and adapter models via ISQ
- CUDA and CPU only
- 4, 8 bit
- ISQ
- Q, K type GGUF quants
- Supported in all plain and adapter models
- I quants coming!
- GPTQ quants coming!
- HQQ quants
- CPU, CUDA, Metal (all supported devices)

## Using a GGUF quantized model
mistralrs-core/src/pipeline/isq.rs (11 additions, 1 deletion)
@@ -23,6 +23,11 @@ use crate::device_map::DeviceMapper;
/// - `Q5K`
/// - `Q6K`
/// - `Q8K`
/// - `HQQ1`
/// - `HQQ2`
/// - `HQQ3`
/// - `HQQ4`
/// - `HQQ8`
pub fn parse_isq_value(s: &str) -> Result<IsqType, String> {
match s.to_lowercase().as_str() {
"q4_0" => Ok(IsqType::Q4_0),
@@ -37,7 +42,12 @@ pub fn parse_isq_value(s: &str) -> Result<IsqType, String> {
"q5k" => Ok(IsqType::Q5K),
"q6k" => Ok(IsqType::Q6K),
"q8k" => Ok(IsqType::Q8K),
_ => Err(format!("GGML type {s} unknown, choose one of `Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q8_0`, `Q8_1`, `Q2K`, `Q3K`, `Q4K`, `Q5K`, `Q6K`, `Q8K`.")),
"hqq8" => Ok(IsqType::HQQ8),
"hqq4" => Ok(IsqType::HQQ4),
// "hqq3" => Ok(IsqType::HQQ3),
// "hqq2" => Ok(IsqType::HQQ2),
// "hqq1" => Ok(IsqType::HQQ1),
_ => Err(format!("GGML type {s} unknown, choose one of `Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q8_0`, `Q8_1`, `Q2K`, `Q3K`, `Q4K`, `Q5K`, `Q6K`, `Q8K`, `HQQ8`, `HQQ4`.")),
}
}
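A small usage sketch of the parser (assuming `parse_isq_value` and `IsqType` are in scope; `matches!` avoids relying on any extra derives):

```rust
fn demo() {
    // Input is lowercased first, so the names are case-insensitive.
    assert!(matches!(parse_isq_value("HQQ4"), Ok(IsqType::HQQ4)));
    assert!(matches!(parse_isq_value("q4k"), Ok(IsqType::Q4K)));
    // The 3/2/1-bit HQQ arms are commented out above, so these report an error.
    assert!(parse_isq_value("hqq3").is_err());
}
```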

mistralrs-quant/build.rs (5 additions, 1 deletion)
@@ -7,7 +7,11 @@ fn main() {
use std::{path::PathBuf, vec};
println!("cargo:rerun-if-changed=build.rs");
let build_dir = PathBuf::from(std::env::var("OUT_DIR").unwrap());
let lib_files = vec!["kernels/gptq/q_gemm.cu"];
let lib_files = vec![
"kernels/gptq/q_gemm.cu",
"kernels/hqq/hqq.cu",
"kernels/ops/ops.cu",
];
for lib_file in lib_files.iter() {
println!("cargo:rerun-if-changed={lib_file}");
}