The throughput is measured with a batch size of 1 to directly mirror the user experience.
| GPU | Model | Draft | Throughput, stochastic (tokens/sec) | Throughput, greedy (tokens/sec) |
|---|---|---|---|---|
| RTX 4090 | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 7.2 | 8.6 |
| RTX 4090 | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 7.0 | 7.4 |
| RTX 4090 | Llama3.1-8B-Instruct | Llama3.2-1B-Instruct | 100.7 | 108.1 |
| RTX 4080 SUPER | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 7.4 | 8.4 |
| RTX 4080 SUPER | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 6.7 | 7.2 |
| RTX 4070 Ti | Llama3.1-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 5.5 | 6.1 |
| RTX 4070 Ti | Llama3.3-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 5.2 | 5.5 |
| L40 | Llama3.1-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 37.0 | 38.5 |
| L40 | Llama3.3-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 36.3 | 37.1 |
Evaluated on `ananyarn/Algorithm_and_Python_Source_Code`:

| GPU | Model | Draft | Throughput (tokens/sec) |
|---|---|---|---|
| RTX 4090 | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 11.4 |
| RTX 4090 | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 11.2 |
| RTX 4090 | Llama3.1-8B-Instruct | CodeDrafter-500M | 174.8 |
| RTX 4080 SUPER | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 12.2 |
| RTX 4080 SUPER | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 12.1 |
| RTX 4080 SUPER | Llama3.1-8B-Instruct-AWQ | CodeDrafter-500M | 195.3 |
| RTX 4070 Ti | Llama3.1-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 9.7 |
| RTX 4070 Ti | Llama3.3-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 9.6 |
| RTX 4070 Ti | Llama3.1-8B-Instruct-AWQ | CodeDrafter-500M | 162.3 |
| L40 | Llama3.1-70B-Instruct-AWQ | CodeDrafter-500M | 45.6 |
| L40 | Llama3.3-70B-Instruct-AWQ | CodeDrafter-500M | 45.0 |
Offloading results depend heavily on PCIe bandwidth and may vary across machines.
❌ UMbreLLa is not designed for large-scale LLM serving.
To install UMbreLLa:

```bash
conda create -n umbrella python=3.10
bash install.sh
```
To launch the command-line chatbot:

```bash
cd app
python chatbot.py --configuration ../configs/chat_config_24gb.json
```

Then you can chat with the LLM specified in `chat_config_24gb.json`.
To launch the Gradio chat interface:

```bash
cd app
python gradio_chat.py --configuration ../configs/chat_config_24gb.json
```

Then you can chat, through Gradio, with the LLM specified in `chat_config_24gb.json`.
To start the API server:

```bash
cd app
python api.py --configuration ../configs/chat_config_24gb.json --max_client 1 --port 65432
```

- `--configuration` specifies the LLM and speculative decoding details.
- `--max_client` is the maximum number of clients that can connect to the server.
- `--port` is the port the server listens on.
After the server has started, a client can be started and connected to it as follows:

```python
from umbrella.api.client import APIClient

client = APIClient(port=port)  # port must match the server's port
client.run()
```
To get the LLM output:

```python
input1 = {"context": text1, "max_new_tokens": 512, "temperature": 0.0}
output1 = client.get_output(**input1)
```
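Below is a small sketch of querying the server with several prompts, reusing the request format shown above. The prompts are hypothetical, and the exact structure of the returned output depends on the server, so it is simply printed here.

```python
# Hypothetical prompts; the request format follows the example above.
prompts = [
    "Explain speculative decoding in two sentences.",
    "Write a Python function that reverses a string.",
]
for text in prompts:
    request = {"context": text, "max_new_tokens": 256, "temperature": 0.0}
    print(client.get_output(**request))  # print the raw response
```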
An example configuration:

```json
{
    "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
    "offload": true,
    "cuda_graph": false,
    "max_length": 4096,
    "num_cache_layers": 0,
    "generation_length": 256,
    "max_turns": 12,
    "topk": 32,
    "temperature": 0.6,
    "topp": 0.9,
    "repetition_penalty": 1.05,
    "growmap_path": "../umbrella/trees/sequoia_tree-3x4.json",
    "width": 16,
    "num_beams": 24,
    "depth": 16,
    "engine": "dynamic",
    "template": "meta-llama3"
}
```
- `model`: Specifies the target LLM to serve, e.g., `"hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"`.
- `draft_model`: Lightweight draft model, e.g., `"meta-llama/Llama-3.2-1B-Instruct"`.
- `offload`: Enables offloading of the target model to host memory or disk (`true` or `false`).
- `cuda_graph`: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).
- `max_length`: The maximum token length for input and output combined.
- `num_cache_layers`: Sets the number of layers cached during inference (e.g., for memory optimization).
- `generation_length`: Maximum length of generated responses in tokens.
- `max_turns`: Limits the number of conversational turns retained in memory.
- `topk`: Limits token selection during generation to the top k most likely tokens.
- `temperature`: Controls randomness in token selection (lower values = more deterministic outputs).
- `topp`: Enables nucleus sampling by limiting token selection to those with cumulative probability ≤ p (see the sampling sketch after this list).
- `repetition_penalty`: Penalizes repetitive text generation (values > 1 discourage repetition).
- `growmap_path`: Path to the speculative decoding tree used by the static engine (e.g., `"../umbrella/trees/sequoia_tree-3x4.json"`).
- `engine`: Defines the decoding strategy. Choose between:
  - `"static"`: Optimized for on-device execution.
  - `"dynamic"`: Designed for offloading scenarios.
- `width`, `num_beams`, `depth`: Hyperparameters for speculative decoding with the dynamic engine.
- `template`: Defines the structure of input prompts. Supported values include:
  - `"llama3-code"`: Optimized for code-related tasks.
  - `"meta-llama3"`: General-purpose instruction-following template.
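To illustrate how `topk`, `temperature`, and `topp` interact, here is a minimal sketch of standard top-k plus nucleus (top-p) sampling over a single logit vector. This is only an illustration of the general technique, not UMbreLLa's internal implementation.

```python
import torch

def sample_token(logits: torch.Tensor, topk: int, temperature: float, topp: float) -> int:
    """Illustrative top-k / nucleus sampling over a 1-D logit vector."""
    # Temperature scaling: lower temperature sharpens the distribution.
    scaled = logits / max(temperature, 1e-5)
    # Keep only the topk most likely tokens (sorted by descending probability).
    values, indices = torch.topk(scaled, k=min(topk, scaled.numel()))
    probs = torch.softmax(values, dim=-1)
    # Nucleus filtering: keep the smallest prefix with cumulative probability <= topp.
    keep = torch.cumsum(probs, dim=-1) <= topp
    keep[0] = True  # always keep the single most likely token
    probs = probs[keep] / probs[keep].sum()
    # Sample one token id from the filtered, renormalized distribution.
    choice = torch.multinomial(probs, num_samples=1)
    return int(indices[keep][choice])
```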
More example configurations and speculative decoding trees are provided in `./configs` and `./umbrella/trees`.
UMbreLLa engines can also be used directly in Python. Construct an engine from a configuration dict with `AutoEngine`:

```python
import json
from umbrella.speculation.auto_engine import AutoEngine

# Load one of the provided configuration files (path relative to the repo root).
with open("configs/chat_config_24gb.json") as f:
    config = json.load(f)

DEVICE = "cuda:0"
engine = AutoEngine.from_config(device=DEVICE, **config)
engine.initialize()

GEN_LEN = 512
text1 = "Tell me what you know about Reinforcement Learning in 100 words."
text2 = "Tell me what you know about LSH in 100 words."

engine.prefill(text1)  # the first operation must be prefill
engine.speculative_decoding(max_new_tokens=GEN_LEN)
engine.append(text2)   # append a follow-up prompt to the same conversation
engine.speculative_decoding(max_new_tokens=GEN_LEN)
```
`engine.generate` produces a complete response in one call:

```python
output = engine.generate(
    context=prompt,
    max_new_tokens=max_new_tokens,
    temperature=temperature,
    top_p=top_p,
    repetition_penalty=repetition_penalty,
)
# returns a dict containing token ids and detokenized text
# context=prompt (str) can be replaced by input_ids=tokens (list[int])
```
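As the comment notes, the prompt can also be supplied pre-tokenized via `input_ids`. A minimal sketch, assuming the target model's tokenizer can be loaded with Hugging Face `transformers` (the model name is taken from the example configuration above):

```python
from transformers import AutoTokenizer

# Tokenize the prompt ourselves and pass token ids instead of a raw string.
# The tokenizer must correspond to the target model in the configuration.
tokenizer = AutoTokenizer.from_pretrained(
    "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"
)
tokens = tokenizer.encode(prompt)  # list[int]

output = engine.generate(
    input_ids=tokens,
    max_new_tokens=max_new_tokens,
    temperature=temperature,
    top_p=top_p,
    repetition_penalty=repetition_penalty,
)
```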
For incremental output, `engine.generate_stream` yields the response as it is produced:

```python
stream = engine.generate_stream(
    context=prompt,
    max_new_tokens=max_new_tokens,
    temperature=temperature,
    top_p=top_p,
    repetition_penalty=repetition_penalty,
)
# returns a stream of detokenized text
# context=prompt (str) can be replaced by input_ids=tokens (list[int])
```
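A minimal way to consume the stream, assuming each yielded item is a chunk of detokenized text (as the comment above indicates):

```python
# Print the response as it is generated.
for chunk in stream:
    print(chunk, end="", flush=True)
print()
```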
```bibtex
@article{chen2024sequoia,
  title={Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding},
  author={Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
  journal={arXiv preprint arXiv:2402.12374},
  year={2024}
}

@article{svirschevski2024specexec,
  title={SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices},
  author={Svirschevski, Ruslan and May, Avner and Chen, Zhuoming and Chen, Beidi and Jia, Zhihao and Ryabinin, Max},
  journal={arXiv preprint arXiv:2406.02532},
  year={2024}
}
```