Merge pull request #3 from SamuraiBUPT/dev-llama-backend
fix the int8_mode and decoupled mode backend support
void-main authored Jul 10, 2023
2 parents b7ba3dc + 4d3292c commit 9fbf442
Showing 2 changed files with 15 additions and 7 deletions.
docs/llama_guide.md: 15 changes (11 additions, 4 deletions)
@@ -5,7 +5,8 @@ We have deployed LLaMa on triton inference server with faster transformer backen
+ Ubuntu 20.04
+ docker: 24.0.2
+ cmake
+ python
+ python: 3.10.6
+ pip: 23.1.2

Hardware:
+ RTX 3090 (24G VMEM) * 2
@@ -26,12 +27,14 @@ We will expand our work in `llama_deploy` directory.
## 1. build docker image
To make the following steps easier to reproduce, everything runs inside a Docker container, so we first need to build a Triton docker image.

We chose the image tag 23.04 because this version supports decoupled mode. See this [issue](https://github.com/triton-inference-server/server/issues/6002#issuecomment-1617106369) for more info.

```bash
git clone https://github.com/void-main/fastertransformer_backend.git

cd fastertransformer_backend

sudo docker build --rm --build-arg TRITON_VERSION=22.12 -t triton_ft_backend:22.12 -f docker/Dockerfile .
sudo docker build --rm --build-arg TRITON_VERSION=23.04 -t triton_ft_backend:23.04 -f docker/Dockerfile .
```
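
If you want to confirm the build succeeded before moving on, listing the image is a quick check (plain docker CLI, nothing specific to this repository):

```bash
# The triton_ft_backend repository should now show a 23.04 tag.
sudo docker images triton_ft_backend
```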

The build process may take more than five minutes, depending on your hardware.
@@ -41,7 +44,7 @@ When finished, launch the container:
```bash
cd ../

sudo docker run -it --rm --gpus=all --net=host --shm-size=4G -v $(pwd):/ft_workspace -p8888:8888 -p8000:8000 -p8001:8001 -p8002:8002 triton_ft_backend:22.12 bash
sudo docker run -it --rm --gpus=all --net=host --shm-size=4G -v $(pwd):/ft_workspace -p8888:8888 -p8000:8000 -p8001:8001 -p8002:8002 triton_ft_backend:23.04 bash
```

We have mapped the `llama_deploy` directory to `/ft_workspace` inside the container.
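
As a quick sanity check (assuming you are now at a shell inside the container), the mount should expose the repository contents:

```bash
# Run inside the container; this should list the host's llama_deploy files.
ls /ft_workspace
```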
@@ -194,4 +197,8 @@ I0628 02:59:06.177982 11650 http_server.cc:3477] Started HTTPService at 0.0.0.0:
I0628 02:59:06.219577 11650 http_server.cc:184] Started Metrics Service at 0.0.0.0:8002
```

That means the program was launched successfully.
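
To double-check from the host, Triton's standard readiness endpoint can be polled on the HTTP port shown in the log (this endpoint is part of Triton's KServe v2 API, not something added by this commit):

```bash
# Returns HTTP 200 once the server and its loaded models are ready.
curl -v localhost:8000/v2/health/ready
```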

# Update
+ Offer `int8_mode` support in `libfastertransformer.cc`, so the constructor call matches a function signature the compiler can find.
+ Fix `decoupled mode` support; decoupled mode becomes available with a newer tritonserver base image (23.04 tested). An illustrative config sketch follows below.
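
For orientation, both switches live in the model's `config.pbtxt`. The fragment below is only an illustrative sketch: it assumes the backend reads `int8_mode` from the model parameters (as the `param_get_int(param, "int8_mode")` call in the code change below suggests) and that decoupled mode is enabled through Triton's standard `model_transaction_policy` block; keep the rest of your config as generated for your model.

```pbtxt
# Illustrative fragment only; merge into your existing fastertransformer model config.
model_transaction_policy {
  decoupled: True
}
parameters {
  key: "int8_mode"
  value: {
    string_value: "1"
  }
}
```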

CN-COTER commented on Sep 20, 2023:

Hi, did you test inference with both int8_mode and decoupled mode? @samiur

src/libfastertransformer.cc: 7 changes (4 additions, 3 deletions)
@@ -333,13 +333,14 @@ std::shared_ptr<AbstractTransformerModel> ModelState::ModelFactory(
         }
     } else if (model_type == "Llama") {
         if (data_type == "fp16") {
-            ft_model = std::make_shared<LlamaTritonModel<half>>(tp, pp, custom_ar, model_dir);
+            const int int8_mode = param_get_int(param, "int8_mode");
+            ft_model = std::make_shared<LlamaTritonModel<half>>(tp, pp, custom_ar, model_dir, int8_mode);
 #ifdef ENABLE_BF16
         } else if (data_type == "bf16") {
-            ft_model = std::make_shared<LlamaTritonModel<__nv_bfloat16>>(tp, pp, custom_ar, model_dir);
+            ft_model = std::make_shared<LlamaTritonModel<__nv_bfloat16>>(tp, pp, custom_ar, model_dir, int8_mode);
 #endif
         } else if (data_type == "fp32") {
-            ft_model = std::make_shared<LlamaTritonModel<float>>(tp, pp, custom_ar, model_dir);
+            ft_model = std::make_shared<LlamaTritonModel<float>>(tp, pp, custom_ar, model_dir, int8_mode);
         } else {
             LOG_MESSAGE(TRITONSERVER_LOG_ERROR, dt_message.c_str());
         }
