UMbreLLa: Deploying LLMs for Personal Agents

UMbreLLa combines offloading, speculative decoding, and quantization, tailored to single-user LLM deployment scenarios. With UMbreLLa, 70B-level models can achieve throughput comparable to human reading speed on an RTX 4070 Ti, delivering exceptional efficiency and responsiveness, and it is especially strong on coding tasks.

Demo: deploying the 4-bit Llama3.1-70B model on an RTX 4070 Ti with UMbreLLa.

1. Models Supported and Benchmarks

The throughput is measured with a batch size of 1 to directly mirror the user experience.

1.1 MT Bench

| GPU | Model | Draft | Stochastic (tokens/sec) | Greedy (tokens/sec) |
|---|---|---|---|---|
| RTX 4090 | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 7.2 | 8.6 |
| RTX 4090 | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 7.0 | 7.4 |
| RTX 4090 | Llama3.1-8B-Instruct | Llama3.2-1B-Instruct | 100.7 | 108.1 |
| RTX 4080 SUPER | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 7.4 | 8.4 |
| RTX 4080 SUPER | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 6.7 | 7.2 |
| RTX 4070 Ti | Llama3.1-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 5.5 | 6.1 |
| RTX 4070 Ti | Llama3.3-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 5.2 | 5.5 |
| L40 | Llama3.1-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 37.0 | 38.5 |
| L40 | Llama3.3-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 36.3 | 37.1 |

1.2 Code Completion

Evaluated on ananyarn/Algorithm_and_Python_Source_Code.

| GPU | Model | Draft | Throughput (tokens/sec) |
|---|---|---|---|
| RTX 4090 | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 11.4 |
| RTX 4090 | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 11.2 |
| RTX 4090 | Llama3.1-8B-Instruct | CodeDrafter-500M | 174.8 |
| RTX 4080 SUPER | Llama3.1-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 12.2 |
| RTX 4080 SUPER | Llama3.3-70B-Instruct-AWQ | Llama3.1-8B-Instruct-AWQ | 12.1 |
| RTX 4080 SUPER | Llama3.1-8B-Instruct-AWQ | CodeDrafter-500M | 195.3 |
| RTX 4070 Ti | Llama3.1-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 9.7 |
| RTX 4070 Ti | Llama3.3-70B-Instruct-AWQ | Llama3.2-1B-Instruct | 9.6 |
| RTX 4070 Ti | Llama3.1-8B-Instruct-AWQ | CodeDrafter-500M | 162.3 |
| L40 | Llama3.1-70B-Instruct-AWQ | CodeDrafter-500M | 45.6 |
| L40 | Llama3.3-70B-Instruct-AWQ | CodeDrafter-500M | 45.0 |

Offloading throughput depends heavily on PCIe bandwidth and may vary across machines.

❌ UMbreLLa is not designed for large-scale LLM serving.

2. Deploying Your LLMs with UMbreLLa

2.1 Install

conda create -n umbrella python=3.10
bash install.sh

2.2 CLI Chatbot

cd app
python chatbot.py --configuration ../configs/chat_config_24gb.json

Then you can chat with the LLM specified in chat_config_24gb.json.

2.3 Gradio Chatbot

cd app
python gradio_chat.py --configuration ../configs/chat_config_24gb.json

Then you can chat with the LLM specified in chat_config_24gb.json in Gradio.

2.4 API Server/Client

2.4.1 Server

cd app
python api.py --configuration ../configs/chat_config_24gb.json --max_client 1 --port 65432

configuration specifies the LLM and the speculative decoding settings.

max_client is the maximum number of clients that can connect to the server.

port is the port the server listens on.

2.4.2 Client

Once the server is running, a client can be started and connected to it as follows:

from umbrella.api.client import APIClient
client = APIClient(port=port)  # port must be the same as the server's port
client.run()

To get the LLM output:

input1 = {"context": text1, "max_new_tokens": 512, "temperature": 0.0}
output1 = client.get_output(**input1)
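
Putting the pieces together, here is a minimal end-to-end client sketch. The port and prompt are illustrative; the port must match the one the server was started with.

from umbrella.api.client import APIClient

client = APIClient(port=65432)  # same port as passed to api.py
client.run()                    # connect to the running server

request = {
    "context": "Write a Python function that reverses a string.",
    "max_new_tokens": 512,
    "temperature": 0.0,
}
output = client.get_output(**request)
print(output)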

3. Configuring the LLM Engine

{
    "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4", 
    "draft_model": "meta-llama/Llama-3.2-1B-Instruct",
    "offload": true,
    "cuda_graph": false,
    "max_length": 4096,
    "num_cache_layers": 0,
    "generation_length": 256,
    "max_turns": 12,
    "topk": 32,
    "temperature": 0.6,
    "topp": 0.9,
    "repetition_penalty": 1.05,
    "growmap_path": "../umbrella/trees/sequoia_tree-3x4.json",
    "width": 16,
    "num_beams": 24,
    "depth": 16,
    "engine": "dynamic",
    "template": "meta-llama3"
}

Key Configuration Options

  • model: Specifies the target LLM to serve, e.g., "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4".
  • draft_model: Lightweight draft model, e.g., "meta-llama/Llama-3.2-1B-Instruct".
  • offload: Enables offloading of the target model to host memory or disk (true or false).
  • cuda_graph: Toggles CUDA graph optimization for the draft model (currently unsupported for AWQ models).
  • max_length: The maximum token length for input and output combined.
  • num_cache_layers: Sets the number of layers cached during inference (e.g., for memory optimization).
  • generation_length: Maximum length of generated responses in tokens.
  • max_turns: Limits the number of conversational turns retained in memory.
  • topk: Limits token selection during generation to the top k most likely tokens.
  • temperature: Controls randomness in token selection (lower values = more deterministic outputs).
  • topp: Enables nucleus sampling by limiting token selection to those with cumulative probability ≤ p.
  • repetition_penalty: Penalizes repetitive text generation (values > 1 discourage repetition).
  • growmap_path: Path to the speculative decoding tree used by the static engine (e.g., "../umbrella/trees/sequoia_tree-3x4.json").

Dynamic Engine-Specific Hyperparameters

  • engine: Defines the decoding strategy. Choose between:
    • "static": Optimized for on-device execution.
    • "dynamic": Designed for offloading scenarios.
  • width, num_beams, depth: Hyperparameters for speculative decoding in dynamic engines.

Prompt Template

  • template: Defines the structure for input prompts. Supported values include:
    • "llama3-code": Optimized for code-related tasks.
    • "meta-llama3": General-purpose instruction-following template.

⚠️ Notice: width, num_beams, depth, and growmap_path need to be tuned for your GPU. Example configurations are provided in ./configs and ./umbrella/trees.
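
Since the configuration is a plain JSON dictionary, it can be loaded with the standard json module and passed directly to the engine constructor described in Section 4. A minimal sketch, using one of the provided config files (path relative to ./app):

import json
from umbrella.speculation.auto_engine import AutoEngine

# Load one of the provided configuration files.
with open("../configs/chat_config_24gb.json") as f:
    config = json.load(f)

# Build and initialize the engine from the loaded configuration.
engine = AutoEngine.from_config(device="cuda:0", **config)
engine.initialize()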

4. Basic Usage

4.1 Initialize a Speculation Engine

from umbrella.speculation.auto_engine import AutoEngine
DEVICE = "cuda:0"
engine = AutoEngine.from_config(device=DEVICE, **config)
engine.initialize()

4.2 Prefill, Append and Decode

GEN_LEN = 512
text1 = "Tell me what you know about Reinforcement Learning in 100 words."
text2 = "Tell me what you know about LSH in 100 words."

engine.prefill(text1) # The first operation must be prefilling
engine.speculative_decoding(max_new_tokens=GEN_LEN)

engine.append(text2)
engine.speculative_decoding(max_new_tokens=GEN_LEN)

4.3 Other Functions for API and Gradio

output = engine.generate(
        context=prompt, 
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
    )
# returns a dict containing token ids and detokenized text
# context=prompt (str) can be replaced by input_ids=tokens (list[int])

stream = engine.generate_stream(
        context=prompt, 
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
    )
# returns a stream of detokenized texts
# context=prompt (str) can be replaced by input_ids=tokens (list[int])
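
To consume the streaming output, iterate over it and print each chunk as it arrives. A minimal sketch, assuming the returned stream is an iterable of detokenized text chunks as described above (the prompt and sampling values are illustrative):

for chunk in engine.generate_stream(
        context="Explain speculative decoding in two sentences.",
        max_new_tokens=128,
        temperature=0.6,
        top_p=0.9,
        repetition_penalty=1.05,
):
    print(chunk, end="", flush=True)  # emit text to stdout as it is generated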

Reference

@article{chen2024sequoia,
  title={Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding},
  author={Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
  journal={arXiv preprint arXiv:2402.12374},
  year={2024}
}
@article{svirschevski2024specexec,
  title={SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices},
  author={Svirschevski, Ruslan and May, Avner and Chen, Zhuoming and Chen, Beidi and Jia, Zhihao and Ryabinin, Max},
  journal={arXiv preprint arXiv:2406.02532},
  year={2024}
}