
Commit

Update README.md
maryyufei21 committed Sep 21, 2024
1 parent 70dc060 commit fd2cab6
Showing 4 changed files with 27 additions and 13 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -36,4 +36,5 @@ Anaconda3-2024.02-1-Linux-x86_64.sh
*.log
*.csv
*.req_*
*.schedule
!figures/*
37 changes: 25 additions & 12 deletions README.md
@@ -7,32 +7,41 @@
</p>


# Introduction

NanoFlow is a throughput-oriented, high-performance serving framework for LLMs. NanoFlow consistently delivers superior throughput compared to vLLM, DeepSpeed-FastGen, and TensorRT-LLM. **NanoFlow achieves up to a 1.91x throughput boost over TensorRT-LLM.** The key features of NanoFlow include:

- **Intra-device parallelism**: Maximizes hardware utilization by exploiting nano-batching and execution unit scheduling to overlap different resource demands inside a single device.
- **Asynchronous CPU scheduling**: Achieves highly efficient CPU scheduling by adopting asynchronous control flow for GPU execution, CPU batch formation, and KV-cache management.



## News
- [2024/09] 🚀 We added support for the Llama2 70B, Llama3 70B, Llama3.1 70B, Llama3 8B, Llama3.1 8B, and Qwen2 72B models, and released experiment scripts for reproducing the evaluation results.

## Introduction



<p align="center">
<img src="./figures/SystemDesign.png" alt="system design" width="90%">
</p>
<p align="center"><em>Overview of NanoFlow's key components</em></p>

The key insight behind NanoFlow is that the traditional pipeline design of existing frameworks under-utilizes hardware resources due to the sequential execution of operations. NanoFlow therefore proposes intra-device parallelism (illustrated in the following GIF), which uses nano-batches to schedule compute-, memory-, and network-bound operations for simultaneous execution. This overlapping keeps compute-bound operations on the critical path and boosts resource utilization.

<p align="center">
<img src="./figures/pipeline.gif" alt="system design" width="90%">
</p>
<p align="center"><em>Illustration of intra-device parallelism</em></p>

With the GPU highly utilized, the CPU overhead, which consists of KV-cache management, batch formation, and retired-request selection, takes a significant part ($>10\%$) of the inference time. Therefore, NanoFlow adopts an asynchronous control flow, as shown in the following figure. At any iteration $i$, NanoFlow makes batching decisions and allocates the KV-cache entries for the next iteration before the end of the current iteration. NanoFlow directly launches iteration $i + 1$ without detecting the end-of-sequence (EOS) tokens generated in iteration $i$, and retires completed requests at iteration $i+2$.


<p align="center">
<img src="./figures/async-schedule.png" alt="system design" width="90%">
</p>
<p align="center"><em>Explanation of asyncronous control flow scheduling</em></p>

To avoid recomputation and to reuse the KV-cache across multi-round conversations, NanoFlow eagerly offloads the KV-cache of finished requests to SSDs. Within one iteration, NanoFlow selects the KV-cache of the retired requests and copies it to the host layer by layer, in parallel with the on-the-fly inference operations. Our calculation shows that serving LLaMA2-70B requires only 5 GB/s of offloading bandwidth, while a single SSD can reach 3 GB/s.
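
As a rough illustration (assumptions only, not NanoFlow's backend; the function and tensor names below are made up), the layer-by-layer offload pattern can be sketched in PyTorch by issuing the device-to-host copies on a dedicated CUDA stream into pinned host buffers, so they use the GPU's copy engine and overlap with the compute stream that keeps running the current iteration:

```python
import torch

offload_stream = torch.cuda.Stream()  # dedicated stream for device-to-host copies

def offload_retired_kv(kv_cache_per_layer, retired_slots):
    """kv_cache_per_layer: one GPU tensor of shape [num_slots, ...] per layer.
    retired_slots: 1-D LongTensor (on the same GPU) of finished requests' slots."""
    host_copies = []
    for layer_kv in kv_cache_per_layer:
        with torch.cuda.stream(offload_stream):
            selected = layer_kv.index_select(0, retired_slots)   # gather retired entries
            host_buf = torch.empty(selected.shape, dtype=selected.dtype,
                                   device="cpu", pin_memory=True)
            host_buf.copy_(selected, non_blocking=True)          # async D2H copy
        host_copies.append(host_buf)
    # The caller synchronizes offload_stream (e.g. at the end of the iteration)
    # before flushing host_copies to SSD.
    return host_copies
```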

With all of the above techniques implemented, we open-source NanoFlow, which consists of a C++-based backend and a Python-based demo frontend in ~4K lines of code. NanoFlow integrates state-of-the-art kernel libraries including [CUTLASS](https://github.com/NVIDIA/cutlass) for GEMM, [FlashInfer](https://github.com/flashinfer-ai/flashinfer) for attention, and [MSCCL++](https://github.com/microsoft/mscclpp) for network communication. The codebase also contains the scripts needed for environment setup and experiment reproduction.

@@ -91,19 +100,23 @@ chmod +x ./installAnaconda.sh
```bash
yes | ./setup.sh
```

### Serve different models
```bash
./serve.sh
```
![Nanoflow](./figures/serve.png)

![Nanoflow](./figures/SampleOutput.png)


## Evaluation

```bash
./perf.sh
```
Result figures can be found in `Nanoflow/pipeline/eval`.


## Evaluation Results
![Nanoflow](./figures/OfflineThroughput.png)

## Citation
Binary file modified figures/feasibility.png
Binary file added figures/serve.png
