Merge pull request #3 from SamuraiBUPT/dev-llama-backend
fix the int8_mode and decoupled mode backend support
void-main authored Jul 10, 2023
2 parents b7ba3dc + 4d3292c commit 9fbf442
Showing 2 changed files with 15 additions and 7 deletions.
docs/llama_guide.md: 15 changes (11 additions, 4 deletions)
@@ -5,7 +5,8 @@ We have deployed LLaMa on triton inference server with faster transformer backen
+ Ubuntu 20.04
+ docker: 24.0.2
+ cmake
+ python
+ python: 3.10.6
+ pip: 23.1.2

Hardware:
+ RTX 3090 (24G VMEM) * 2
@@ -26,12 +27,14 @@ We will expand our work in `llama_deploy` directory.
## 1. build docker image
To make the following steps easier to reproduce, everything runs inside a Docker container, so we first need to build a Triton docker image.

We chose the image tag 23.04 because this version supports decoupled mode. See this [issue](https://github.com/triton-inference-server/server/issues/6002#issuecomment-1617106369) for more info.

```bash
git clone https://github.com/void-main/fastertransformer_backend.git

cd fastertransformer_backend

sudo docker build --rm --build-arg TRITON_VERSION=22.12 -t triton_ft_backend:22.12 -f docker/Dockerfile .
sudo docker build --rm --build-arg TRITON_VERSION=23.04 -t triton_ft_backend:23.04 -f docker/Dockerfile .
```
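
If you want to confirm the build succeeded before moving on, listing the image is a quick check (plain docker CLI, nothing specific to this repository):

```bash
# The triton_ft_backend repository should now show a 23.04 tag.
sudo docker images triton_ft_backend
```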

The build process may take more than five minutes, depending on your hardware.
@@ -41,7 +44,7 @@ When finished, launch the container:
```bash
cd ../

sudo docker run -it --rm --gpus=all --net=host --shm-size=4G -v $(pwd):/ft_workspace -p8888:8888 -p8000:8000 -p8001:8001 -p8002:8002 triton_ft_backend:22.12 bash
sudo docker run -it --rm --gpus=all --net=host --shm-size=4G -v $(pwd):/ft_workspace -p8888:8888 -p8000:8000 -p8001:8001 -p8002:8002 triton_ft_backend:23.04 bash
```

We have mapped the `llama_deploy` directory to `/ft_workspace` inside the container.
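
As a quick sanity check (assuming you are now at a shell inside the container), the mount should expose the repository contents:

```bash
# Run inside the container; this should list the host's llama_deploy files.
ls /ft_workspace
```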
@@ -194,4 +197,8 @@ I0628 02:59:06.177982 11650 http_server.cc:3477] Started HTTPService at 0.0.0.0:
I0628 02:59:06.219577 11650 http_server.cc:184] Started Metrics Service at 0.0.0.0:8002
```

That means the program was launched successfully.
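
To double-check from the host, Triton's standard readiness endpoint can be polled on the HTTP port shown in the log (this endpoint is part of Triton's KServe v2 API, not something added by this commit):

```bash
# Returns HTTP 200 once the server and its loaded models are ready.
curl -v localhost:8000/v2/health/ready
```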

# Update
+ Offer `int8_mode` support in `libfastertransformer.cc`, so the constructor call matches a function signature the compiler can find.
+ Fix `decoupled mode` support; decoupled mode becomes available with a newer tritonserver base image (23.04 tested). An illustrative config sketch follows below.
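
For orientation, both switches live in the model's `config.pbtxt`. The fragment below is only an illustrative sketch: it assumes the backend reads `int8_mode` from the model parameters (as the `param_get_int(param, "int8_mode")` call in the code change below suggests) and that decoupled mode is enabled through Triton's standard `model_transaction_policy` block; keep the rest of your config as generated for your model.

```pbtxt
# Illustrative fragment only; merge into your existing fastertransformer model config.
model_transaction_policy {
  decoupled: True
}
parameters {
  key: "int8_mode"
  value: {
    string_value: "1"
  }
}
```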

CN-COTER commented on Sep 20, 2023:

Hi, did you test inference with both int8_mode and decoupled mode? @samiur

src/libfastertransformer.cc: 7 changes (4 additions, 3 deletions)
@@ -333,13 +333,14 @@ std::shared_ptr<AbstractTransformerModel> ModelState::ModelFactory(
         }
     } else if (model_type == "Llama") {
         if (data_type == "fp16") {
-            ft_model = std::make_shared<LlamaTritonModel<half>>(tp, pp, custom_ar, model_dir);
+            const int int8_mode = param_get_int(param, "int8_mode");
+            ft_model = std::make_shared<LlamaTritonModel<half>>(tp, pp, custom_ar, model_dir, int8_mode);
 #ifdef ENABLE_BF16
         } else if (data_type == "bf16") {
-            ft_model = std::make_shared<LlamaTritonModel<__nv_bfloat16>>(tp, pp, custom_ar, model_dir);
+            ft_model = std::make_shared<LlamaTritonModel<__nv_bfloat16>>(tp, pp, custom_ar, model_dir, int8_mode);
 #endif
         } else if (data_type == "fp32") {
-            ft_model = std::make_shared<LlamaTritonModel<float>>(tp, pp, custom_ar, model_dir);
+            ft_model = std::make_shared<LlamaTritonModel<float>>(tp, pp, custom_ar, model_dir, int8_mode);
         } else {
             LOG_MESSAGE(TRITONSERVER_LOG_ERROR, dt_message.c_str());
         }
