The throughput is measured with a batch size of 1 to directly mirror the user experience.
| GPU | Model | Draft | Throughput, stochastic (tokens/sec) | Throughput, greedy (tokens/sec) |
|---|---|---|---|---|
| RTX 4090 | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 7.2 | 8.6 |
| RTX 4090 | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 7.0 | 7.4 |
| RTX 4090 | Llama3.1-8B-Instruct | Llama3.2-1B-Instruct | 100.7 | 108.1 |
| RTX 4080 SUPER | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 7.4 | 8.4 |
| RTX 4080 SUPER | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 6.7 | 7.2 |
| RTX 4070 Ti | Llama3.1-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 5.5 | 6.1 |
| RTX 4070 Ti | Llama3.3-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 5.2 | 5.5 |
| L40 | Llama3.1-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 37.0 | 38.5 |
| L40 | Llama3.3-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 36.3 | 37.1 |
Evaluated on `ananyarn/Algorithm_and_Python_Source_Code`:

| GPU | Model | Draft | Throughput (tokens/sec) |
|---|---|---|---|
| RTX 4090 | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 11.4 |
| RTX 4090 | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 11.2 |
| RTX 4090 | Llama3.1-8B-Instruct | CodeDrafter-500M | 174.8 |
| RTX 4080 SUPER | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 12.2 |
| RTX 4080 SUPER | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 12.1 |
| RTX 4080 SUPER | Llama3.1-8B-Instruct-AWQ | CodeDrafter-500M | 195.3 |
| RTX 4070 Ti | Llama3.1-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 9.7 |
| RTX 4070 Ti | Llama3.3-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 9.6 |
| RTX 4070 Ti | Llama3.1-8B-Instruct-AWQ | CodeDrafter-500M | 162.3 |
| L40 | Llama3.1-70B-Instruct-AWQ | CodeDrafter-500M | 45.6 |
| L40 | Llama3.3-70B-Instruct-AWQ | CodeDrafter-500M | 45.0 |
Offloading results depend heavily on PCIe bandwidth and may vary across machines.
❌ UMbreLLa is not designed for large-scale LLM serving.
To install UMbreLLa:

```bash
conda create -n umbrella python=3.10
bash install.sh
```
To launch the command-line chatbot:

```bash
cd app
python chatbot.py --configuration ../configs/chat_config_24gb.json
```

Then you can chat with the LLM specified in `chat_config_24gb.json`.
To launch the Gradio chat interface:

```bash
cd app
python gradio_chat.py --configuration ../configs/chat_config_24gb.json
```

Then you can chat, through Gradio, with the LLM specified in `chat_config_24gb.json`.
To start the API server:

```bash
cd app
python api.py --configuration ../configs/chat_config_24gb.json --max_client 1 --port 65432
```

- `--configuration` specifies the LLM and speculative decoding details.
- `--max_client` is the maximum number of clients that can connect to the server.
- `--port` is the port the server listens on.
After the server has started, a client can be started and connected to it as follows:

```python
from umbrella.api.client import APIClient

client = APIClient(port=port)  # port must match the server's port
client.run()
```
To get the LLM output:

```python
input1 = {"context": text1, "max_new_tokens": 512, "temperature": 0.0}
output1 = client.get_output(**input1)
```
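Below is a small sketch of querying the server with several prompts, reusing the request format shown above. The prompts are hypothetical, and the exact structure of the returned output depends on the server, so it is simply printed here.

```python
# Hypothetical prompts; the request format follows the example above.
prompts = [
    "Explain speculative decoding in two sentences.",
    "Write a Python function that reverses a string.",
]
for text in prompts:
    request = {"context": text, "max_new_tokens": 256, "temperature": 0.0}
    print(client.get_output(**request))  # print the raw response
```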
An example configuration:

```json
{
    "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
    "offload": true,
    "cuda_graph": false,
    "max_length": 4096,
    "num_cache_layers": 0,
    "generation_length": 256,
    "max_turns": 12,
    "topk": 32,
    "temperature": 0.6,
    "topp": 0.9,
    "repetition_penalty": 1.05,
    "growmap_path": "../umbrella/trees/sequoia_tree-3x4.json",
    "width": 16,
    "num_beams": 24,
    "depth": 16,
    "engine": "dynamic",
    "template": "meta-llama3"
}
```
- `model`: Specifies the target LLM to serve, e.g., `"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"`.
- `draft_model`: Lightweight draft model, e.g., `"meta-llama/Llama-3.2-1B-Instruct"`.
- `offload`: Enables offloading of the target model to host memory or disk (`true` or `false`).
- `cuda_graph`: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).
- `max_length`: The maximum token length for input and output combined.
- `num_cache_layers`: Sets the number of layers cached during inference (e.g., for memory optimization).
- `generation_length`: Maximum length of generated responses in tokens.
- `max_turns`: Limits the number of conversational turns retained in memory.
- `topk`: Limits token selection during generation to the top k most likely tokens.
- `temperature`: Controls randomness in token selection (lower values = more deterministic outputs).
- `topp`: Enables nucleus sampling by limiting token selection to those with cumulative probability ≤ p (see the sampling sketch after this list).
- `repetition_penalty`: Penalizes repetitive text generation (values > 1 discourage repetition).
- `growmap_path`: Path to the speculative decoding tree used by the static engine (e.g., `"../umbrella/trees/sequoia_tree-3x4.json"`).
- `engine`: Defines the decoding strategy. Choose between:
  - `"static"`: Optimized for on-device execution.
  - `"dynamic"`: Designed for offloading scenarios.
- `width`, `num_beams`, `depth`: Hyperparameters for speculative decoding with the dynamic engine.
- `template`: Defines the structure of input prompts. Supported values include:
  - `"llama3-code"`: Optimized for code-related tasks.
  - `"meta-llama3"`: General-purpose instruction-following template.
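To illustrate how `topk`, `temperature`, and `topp` interact, here is a minimal sketch of standard top-k plus nucleus (top-p) sampling over a single logit vector. This is only an illustration of the general technique, not UMbreLLa's internal implementation.

```python
import torch

def sample_token(logits: torch.Tensor, topk: int, temperature: float, topp: float) -> int:
    """Illustrative top-k / nucleus sampling over a 1-D logit vector."""
    # Temperature scaling: lower temperature sharpens the distribution.
    scaled = logits / max(temperature, 1e-5)
    # Keep only the topk most likely tokens (sorted by descending probability).
    values, indices = torch.topk(scaled, k=min(topk, scaled.numel()))
    probs = torch.softmax(values, dim=-1)
    # Nucleus filtering: keep the smallest prefix with cumulative probability <= topp.
    keep = torch.cumsum(probs, dim=-1) <= topp
    keep[0] = True  # always keep the single most likely token
    probs = probs[keep] / probs[keep].sum()
    # Sample one token id from the filtered, renormalized distribution.
    choice = torch.multinomial(probs, num_samples=1)
    return int(indices[keep][choice])
```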
More example configurations and speculative decoding trees are provided in `./configs` and `./umbrella/trees`.
UMbreLLa engines can also be used directly in Python. Construct an engine from a configuration dict with `AutoEngine`:

```python
import json
from umbrella.speculation.auto_engine import AutoEngine

# Load one of the provided configuration files (path relative to the repo root).
with open("configs/chat_config_24gb.json") as f:
    config = json.load(f)

DEVICE = "cuda:0"
engine = AutoEngine.from_config(device=DEVICE, **config)
engine.initialize()

GEN_LEN = 512
text1 = "Tell me what you know about Reinforcement Learning in 100 words."
text2 = "Tell me what you know about LSH in 100 words."

engine.prefill(text1)  # the first operation must be prefill
engine.speculative_decoding(max_new_tokens=GEN_LEN)
engine.append(text2)   # append a follow-up prompt to the same conversation
engine.speculative_decoding(max_new_tokens=GEN_LEN)
```
`engine.generate` produces a complete response in one call:

```python
output = engine.generate(
    context=prompt,
    max_new_tokens=max_new_tokens,
    temperature=temperature,
    top_p=top_p,
    repetition_penalty=repetition_penalty,
)
# returns a dict containing token ids and detokenized text
# context=prompt (str) can be replaced by input_ids=tokens (list[int])
```
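As the comment notes, the prompt can also be supplied pre-tokenized via `input_ids`. A minimal sketch, assuming the target model's tokenizer can be loaded with Hugging Face `transformers` (the model name is taken from the example configuration above):

```python
from transformers import AutoTokenizer

# Tokenize the prompt ourselves and pass token ids instead of a raw string.
# The tokenizer must correspond to the target model in the configuration.
tokenizer = AutoTokenizer.from_pretrained(
    "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"
)
tokens = tokenizer.encode(prompt)  # list[int]

output = engine.generate(
    input_ids=tokens,
    max_new_tokens=max_new_tokens,
    temperature=temperature,
    top_p=top_p,
    repetition_penalty=repetition_penalty,
)
```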
For incremental output, `engine.generate_stream` yields the response as it is produced:

```python
stream = engine.generate_stream(
    context=prompt,
    max_new_tokens=max_new_tokens,
    temperature=temperature,
    top_p=top_p,
    repetition_penalty=repetition_penalty,
)
# returns a stream of detokenized text
# context=prompt (str) can be replaced by input_ids=tokens (list[int])
```
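A minimal way to consume the stream, assuming each yielded item is a chunk of detokenized text (as the comment above indicates):

```python
# Print the response as it is generated.
for chunk in stream:
    print(chunk, end="", flush=True)
print()
```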
```bibtex
@article{chen2024sequoia,
  title={Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding},
  author={Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
  journal={arXiv preprint arXiv:2402.12374},
  year={2024}
}

@article{svirschevski2024specexec,
  title={SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices},
  author={Svirschevski, Ruslan and May, Avner and Chen, Zhuoming and Chen, Beidi and Jia, Zhihao and Ryabinin, Max},
  journal={arXiv preprint arXiv:2406.02532},
  year={2024}
}
```