LlamaFamily · Rayrtfr · Aug 17, 2023 · Aug 16, 2023 · Aug 16, 2023 · Aug 17, 2023
diff --git a/README.md b/README.md
@@ -371,20 +371,25 @@ print(text)
 ```
 
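 ## 🚀 Inference Acceleration
-As the parameter counts of large models keep growing, speeding up inference under limited compute resources has become an important research direction. Commonly used inference acceleration frameworks include FasterTransformer and vLLM.
+As the parameter counts of large models keep growing, speeding up inference under limited compute resources has become an important research direction. Commonly used inference acceleration frameworks include lmdeploy, FasterTransformer, and vLLM.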
+
+### lmdeploy
+[lmdeploy](https://github.com/InternLM/lmdeploy/) is developed by Shanghai AI Laboratory. Inference runs in C++/CUDA, and it exposes Python/gRPC/HTTP interfaces and a WebUI. It supports tensor-parallel distributed inference as well as fp16, weight int4, and kv cache int8 quantization.
+
+For detailed inference documentation, see: [inference-speed/GPU/lmdeploy_example](https://github.com/FlagAlpha/Llama2-Chinese/tree/main/inference-speed/GPU/lmdeploy_example)
+
 ### FasterTransformer
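 [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) is developed by NVIDIA and written in C++/CUDA. It supports distributed inference and can accelerate both transformer encoders and decoders.
 Llama2 model inference is accelerated with FasterTransformer and [Triton](https://github.com/openai/triton). FP16 and Int8 inference are currently supported; Int4 is not yet supported.
 
 For detailed inference documentation, see: [inference-speed/GPU/FasterTransformer_example](https://github.com/FlagAlpha/Llama2-Chinese/tree/main/inference-speed/GPU/FasterTransformer_example)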
+
 ### vLLM
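 [vLLM](https://github.com/vllm-project/vllm) is developed at the University of California, Berkeley. Its core technique is PagedAttention, and its throughput is up to 24 times higher than HuggingFace Transformers. Compared with FasterTransformer, vLLM is simpler to use, requires no additional model conversion, and supports fp16 inference.
 
 For detailed inference documentation, see: [inference-speed/GPU/vllm_example](https://github.com/FlagAlpha/Llama2-Chinese/blob/main/inference-speed/GPU/vllm_example/README.md)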
 
 
-
-
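 ## 🥇 Model Evaluation
 To get a clearer picture of the Llama2 models' Chinese question-answering ability, we selected a set of representative Chinese questions and posed them to the Llama2 models. The models tested are the two versions released by Meta, Llama2-7B-Chat and Llama2-13B-Chat, without any fine-tuning or training. The test questions were selected from [AtomBulb](https://github.com/AtomEcho/AtomBulb), 95 in total, covering eight major categories: general knowledge, language understanding, creative writing, logical reasoning, code programming, work skills, tool use, and personality traits.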
 

diff --git a/docs/inference_speed_guide.md b/docs/inference_speed_guide.md
@@ -14,6 +14,9 @@
 
 [Usage instructions](../inference-speed/GPU/FasterTransformer_example/README.md)
 
+### Option 3: lmdeploy
+
+[Usage instructions](../inference-speed/GPU/lmdeploy_example/README.md)
 
 
 ## 2. CPU Inference Options

diff --git a/inference-speed/GPU/lmdeploy_example/README.md b/inference-speed/GPU/lmdeploy_example/README.md
@@ -0,0 +1,89 @@
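+# lmdeploy Installation and Usage
+
+lmdeploy supports transformer-architecture models (for example LLaMA, LLaMA2, InternLM, and Vicuna) and currently supports fp16, int8, and int4.
+
+## 1. Installation
+
+Install the prebuilt Python package: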
+```
+$ python3 -m pip install lmdeploy
+```
+
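+## 2. fp16 Inference
+
+Convert the model to lmdeploy's inference format. Assuming the huggingface version of the LLaMA2 model has been downloaded to `/models/llama-2-7b-chat`, the result is stored in the `workspace` folder: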
+
+```shell
+$ python3 -m lmdeploy.serve.turbomind.deploy llama2 /models/llama-2-7b-chat
+```
+
+Test the chat from the command line:
+
+```shell
+$ python3 -m lmdeploy.turbomind.chat ./workspace
+..
+double enter to end input >>> who are you
+
+..
+Hello! I'm just an AI assistant ..
+```
+
+You can also launch a WebUI with gradio to chat:
+```shell
+python3 -m lmdeploy.serve.gradio.app ./workspace
+```
+
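+lmdeploy also supports models in the original facebook format and distributed inference for the 70B model; see the [official lmdeploy documentation](https://github.com/internlm/lmdeploy) for usage.
+
+## 3. kv cache int8 Quantization
+
+lmdeploy implements kv cache int8 quantization, so the same amount of GPU memory can serve more concurrent users.
+
+First obtain the quantization parameters. The results are saved under `workspace/triton_models/weights`, the directory produced by the fp16 conversion. The 7B model does not need tensor parallelism.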
+
+```shell
+$ python3 -m lmdeploy.lite.apis.kv_qparams \
+  --work_dir /models/llama-2-7b-chat \
+  --turbomind_dir ./workspace/triton_models/weights \
+  --kv_sym False \
+  --num_tp 1
+```
+
+Then modify the inference configuration to enable kv cache int8. Edit `workspace/triton_models/weights/config.ini`:
+* Change `use_context_fmha` to 0, which turns off flash attention
+* Set `quant_policy` to 4, which turns on kv cache quantization
+
+Finally, run the test:
+```shell
+$ python3 -m lmdeploy.turbomind.chat ./workspace
+```
+
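+[Click here](https://github.com/InternLM/lmdeploy/blob/main/docs/zh_cn/quantization.md) for the accuracy and GPU memory test report on kv cache int8 quantization.
+
+## 4. weight int4 Quantization
+
+lmdeploy implements weight int4 quantization based on the [AWQ algorithm](https://arxiv.org/abs/2306.00978). Compared with the fp16 version, it runs 3.16 times as fast and reduces GPU memory usage from 16G to 6.3G.
+
+A model already optimized with the AWQ algorithm is available for direct download: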
+
+```shell
+$ git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
+```
+
+Run the following commands to chat with the model in the terminal:
+
+```shell
+## Convert the model layout; the output is stored in the default path ./workspace
+python3 -m lmdeploy.serve.turbomind.deploy \
+    --model-name llama2 \
+    --model-path ./llama2-chat-7b-w4 \
+    --model-format awq \
+    --group-size 128
+
+## Inference
+python3 -m lmdeploy.turbomind.chat ./workspace
+```
+
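+[Click here](https://github.com/InternLM/lmdeploy/blob/main/docs/zh_cn/w4a16.md) for the GPU memory and speed test results of weight int4 quantization.
+
+As an additional note, weight int4 and kv cache int8 do not conflict; both can be enabled at the same time to save more GPU memory.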
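
The `config.ini` edit described in the kv cache int8 section (setting `use_context_fmha` to 0 and `quant_policy` to 4) can also be scripted, which may help when enabling kv cache int8 on more than one workspace. The following is a minimal sketch using Python's standard `configparser`, not part of lmdeploy. The function name `enable_kv_cache_int8` is hypothetical, and because the README does not name the INI section, the sketch assumes the two keys belong in whichever section already defines `use_context_fmha`. Note that `configparser` drops comments when it rewrites the file.

```python
import configparser


def enable_kv_cache_int8(config_path):
    """Set use_context_fmha=0 and quant_policy=4 in a turbomind config.ini.

    Hypothetical helper: the section name is not stated in the README, so
    every section that already defines use_context_fmha is updated.
    Returns the list of section names that were changed.
    """
    parser = configparser.ConfigParser()
    with open(config_path, encoding="utf-8") as f:
        parser.read_file(f)

    # Assumption: quant_policy lives next to use_context_fmha.
    sections = [
        name for name in parser.sections()
        if parser.has_option(name, "use_context_fmha")
    ]
    if not sections:
        raise ValueError(f"use_context_fmha not found in {config_path}")

    for name in sections:
        parser.set(name, "use_context_fmha", "0")  # turn off flash attention
        parser.set(name, "quant_policy", "4")      # turn on kv cache quantization

    # Rewrite the file in place; comments in the original file are lost.
    with open(config_path, "w", encoding="utf-8") as f:
        parser.write(f)
    return sections
```

Under these assumptions, calling `enable_kv_cache_int8("workspace/triton_models/weights/config.ini")` after the `kv_qparams` step would apply both changes; keeping a backup of the original file is advisable since it is rewritten in place.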