From a1077848a3188db6236f836957eb40074dc6660f Mon Sep 17 00:00:00 2001
From: Tom Stesco
Date: Thu, 6 Feb 2025 23:16:45 -0500
Subject: [PATCH 1/3] tstescp/mark-experimental-models (#102)

* update documentation to show all supported models and give correct links

* setup.sh supports all base and Instruct models, marks experimental models as preview
---
 README.md                      |  45 ++++++++-----
 setup.sh                       | 112 +++++++++++++--------------------
 vllm-tt-metal-llama3/README.md |   8 ++-
 3 files changed, 79 insertions(+), 86 deletions(-)

diff --git a/README.md b/README.md
index b3ac2c8..25debf0 100644
--- a/README.md
+++ b/README.md
@@ -8,17 +8,34 @@ Tenstorrent Inference Server (`tt-inference-server`) is the repo of available mo
 
 ## Getting Started
 
-Please follow setup instructions found in each model folder's README.md doc
-
---------------------------------------------------------------------------------------------------------------
-
-## Model Implementations
-| Model | Hardware |
-|----------------|-----------------------------|
-| [Qwen 2.5 72B](vllm-tt-metal-llama3/README.md) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) |
-| [LLaMa 3.3 70B](vllm-tt-metal-llama3/README.md) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) |
-| [LLaMa 3.2 11B Vision](vllm-tt-metal-llama3/README.md) | [n300](https://tenstorrent.com/hardware/wormhole) |
-| [LLaMa 3.2 3B](vllm-tt-metal-llama3/README.md) | [n150](https://tenstorrent.com/hardware/wormhole) |
-| [LLaMa 3.2 1B](vllm-tt-metal-llama3/README.md) | [n150](https://tenstorrent.com/hardware/wormhole) |
-| [LLaMa 3.1 70B](vllm-tt-metal-llama3/README.md) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) |
-| [LLaMa 3.1 8B](vllm-tt-metal-llama3/README.md) | [n150](https://tenstorrent.com/hardware/wormhole) |
+Please follow the setup instructions for the model you want to serve; the `Model Name` column in the tables below links to the corresponding implementation.
+
+Note: models with Status [🔍 preview] are under active development. If you encounter setup or stability problems, please [file an issue](https://github.com/tenstorrent/tt-inference-server/issues/new?template=Blank+issue) and our team will address it.
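For orientation, serving any model in the tables that follow starts with the same step (a sketch only; the linked per-model README is the authoritative guide):

```bash
# from the tt-inference-server repo root: prepare env vars and weights for the chosen model,
# then follow the Docker run instructions in that model's README
chmod +x setup.sh
./setup.sh Llama-3.3-70B-Instruct
```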
+ +## LLMs + +| Model Name | Model URL | Hardware | Status | Minimum Release Version | +| ----------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------ | ----------- | -------------------------------------------------------------------------------- | +| [Qwen2.5-72B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) | 🔍 preview | [v0.0.2](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.2) | +| [Qwen2.5-72B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/Qwen/Qwen2.5-72B) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) | 🔍 preview | [v0.0.2](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.2) | +| [Qwen2.5-7B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | [n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.2](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.2) | +| [Qwen2.5-7B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/Qwen/Qwen2.5-7B) | [n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.2](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.2) | +| [Llama-3.3-70B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) | ✅ supported | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.3-70B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.3-70B) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) | ✅ supported | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.2-11B-Vision-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) | [n300](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.2-11B-Vision](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) | [n300](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.2-3B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.2-3B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.2-3B) | [n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.2-1B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) | [n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.2-1B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.2-1B) | 
[n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.1-70B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) | ✅ supported | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.1-70B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.1-70B) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) | ✅ supported | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.1-8B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [n150](https://tenstorrent.com/hardware/wormhole) | ✅ supported | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.1-8B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.1-8B) | [n150](https://tenstorrent.com/hardware/wormhole) | ✅ supported | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | + +# CNNs + +| Model Name | Model URL | Hardware | Status | Minimum Release Version | +| ----------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------ | ----------- | -------------------------------------------------------------------------------- | +| [YOLOv4](tt-metal-yolov4/README.md) | [GH Repo](https://github.com/AlexeyAB/darknet) | [n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | + diff --git a/setup.sh b/setup.sh index a1be09a..68c6f38 100755 --- a/setup.sh +++ b/setup.sh @@ -9,13 +9,19 @@ set -euo pipefail # Exit on error, print commands, unset variables treated as e usage() { echo "Usage: $0 " echo "Available model types:" - echo " Qwen2.5-72B-Instruct" - echo " Qwen2.5-7B-Instruct" - echo " DeepSeek-R1-Distill-Llama-70B" + echo " Qwen2.5-72B-Instruct (preview)" + echo " Qwen2.5-72B (preview)" + echo " Qwen2.5-7B-Instruct (preview)" + echo " Qwen2.5-7B (preview)" + echo " DeepSeek-R1-Distill-Llama-70B (preview)" echo " Llama-3.3-70B-Instruct" - echo " Llama-3.2-11B-Vision-Instruct" - echo " Llama-3.2-3B-Instruct" - echo " Llama-3.2-1B-Instruct" + echo " Llama-3.3-70B" + echo " Llama-3.2-11B-Vision-Instruct (preview)" + echo " Llama-3.2-11B-Vision (preview)" + echo " Llama-3.2-3B-Instruct (preview)" + echo " Llama-3.2-3B (preview)" + echo " Llama-3.2-1B-Instruct (preview)" + echo " Llama-3.2-1B (preview)" echo " Llama-3.1-70B-Instruct" echo " Llama-3.1-70B" echo " Llama-3.1-8B-Instruct" @@ -163,18 +169,18 @@ setup_model_environment() { # Set environment variables based on the model selection # note: MODEL_NAME is the directory name for the model weights case "$1" in - "Qwen2.5-72B-Instruct") + "Qwen2.5-72B"|"Qwen2.5-72B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Qwen2.5-72B-Instruct" - HF_MODEL_REPO_ID="Qwen/Qwen2.5-72B-Instruct" + MODEL_NAME="Qwen2.5-72B${1#Qwen2.5-72B}" + HF_MODEL_REPO_ID="Qwen/Qwen2.5-72B${1#Qwen2.5-72B}" META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 ;; - "Qwen2.5-7B-Instruct") + "Qwen2.5-7B"|"Qwen2.5-7B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Qwen2.5-7B-Instruct" - 
HF_MODEL_REPO_ID="Qwen/Qwen2.5-7B-Instruct" + MODEL_NAME="Qwen2.5-7B${1#Qwen2.5-7B}" + HF_MODEL_REPO_ID="Qwen/Qwen2.5-7B${1#Qwen2.5-7B}" META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 @@ -187,10 +193,10 @@ setup_model_environment() { META_DIR_FILTER="" REPACKED=0 ;; - "Llama-3.3-70B-Instruct") + "Llama-3.3-70B"|"Llama-3.3-70B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.3-70B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3.3-70B-Instruct" + MODEL_NAME="Llama-3.3-70B${1#Llama-3.3-70B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3.3-70B${1#Llama-3.3-70B}" META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=1 @@ -203,83 +209,51 @@ setup_model_environment() { META_DIR_FILTER="" REPACKED=0 ;; - "Llama-3.2-3B-Instruct") + "Llama-3.2-3B"|"Llama-3.2-3B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.2-3B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3.2-3B-Instruct" + MODEL_NAME="Llama-3.2-3B${1#Llama-3.2-3B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3.2-3B${1#Llama-3.2-3B}" META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 ;; - "Llama-3.2-1B-Instruct") + "Llama-3.2-1B"|"Llama-3.2-1B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.2-1B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3.2-1B-Instruct" + MODEL_NAME="Llama-3.2-1B${1#Llama-3.2-1B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3.2-1B${1#Llama-3.2-1B}" META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 ;; - "Llama-3.1-70B-Instruct") + "Llama-3.1-70B"|"Llama-3.1-70B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.1-70B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3.1-70B-Instruct" - META_MODEL_NAME="Meta-Llama-3.1-70B-Instruct" + MODEL_NAME="Llama-3.1-70B${1#Llama-3.1-70B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3.1-70B${1#Llama-3.1-70B}" + META_MODEL_NAME="Meta-Llama-3.1-70B${1#Llama-3.1-70B}" META_DIR_FILTER="llama3_1" REPACKED=1 ;; - "Llama-3.1-70B") + "Llama-3.1-8B"|"Llama-3.1-8B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.1-70B" - HF_MODEL_REPO_ID="meta-llama/Llama-3.1-70B" - META_MODEL_NAME="Meta-Llama-3.1-70B" - META_DIR_FILTER="llama3_1" - REPACKED=1 - ;; - "Llama-3.1-8B-Instruct") - IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.1-8B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3.1-8B-Instruct" - META_MODEL_NAME="Meta-Llama-3.1-8B-Instruct" - META_DIR_FILTER="llama3_1" - REPACKED=0 - ;; - "Llama-3.1-8B") - IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.1-8B" - HF_MODEL_REPO_ID="meta-llama/Llama-3.1-8B" - META_MODEL_NAME="Meta-Llama-3.1-8B" + MODEL_NAME="Llama-3.1-8B${1#Llama-3.1-8B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3.1-8B${1#Llama-3.1-8B}" + META_MODEL_NAME="Meta-Llama-3.1-8B${1#Llama-3.1-8B}" META_DIR_FILTER="llama3_1" REPACKED=0 ;; - "Llama-3-70B-Instruct") + "Llama-3-70B"|"Llama-3-70B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Llama-3-70B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3-70B-Instruct" - META_MODEL_NAME="Meta-Llama-3-70B-Instruct" + MODEL_NAME="Llama-3-70B${1#Llama-3-70B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3-70B${1#Llama-3-70B}" + META_MODEL_NAME="Meta-Llama-3-70B${1#Llama-3-70B}" META_DIR_FILTER="llama3" REPACKED=1 ;; - "Llama-3-70B") - IMPL_ID="tt-metal" - MODEL_NAME="Llama-3-70B" - HF_MODEL_REPO_ID="meta-llama/Llama-3-70B" - META_MODEL_NAME="Meta-Llama-3-70B" - META_DIR_FILTER="llama3" - REPACKED=1 - ;; - "Llama-3-8B-Instruct") - IMPL_ID="tt-metal" - MODEL_NAME="Llama-3-8B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3-8B-Instruct" - META_MODEL_NAME="Meta-Llama-3-8B-Instruct" - META_DIR_FILTER="llama3" - REPACKED=0 - ;; - "Llama-3-8B") + "Llama-3-8B"|"Llama-3-8B-Instruct") IMPL_ID="tt-metal" - 
MODEL_NAME="Llama-3-8B" - HF_MODEL_REPO_ID="meta-llama/Llama-3-8B" - META_MODEL_NAME="Meta-Llama-3-8B" + MODEL_NAME="Llama-3-8B${1#Llama-3-8B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3-8B${1#Llama-3-8B}" + META_MODEL_NAME="Meta-Llama-3-8B${1#Llama-3-8B}" META_DIR_FILTER="llama3" REPACKED=0 ;; diff --git a/vllm-tt-metal-llama3/README.md b/vllm-tt-metal-llama3/README.md index e726184..bebfd4b 100644 --- a/vllm-tt-metal-llama3/README.md +++ b/vllm-tt-metal-llama3/README.md @@ -1,6 +1,8 @@ -# vLLM TT Metalium Llama 3.3 70B Inference API +# vLLM TT Metalium TT-Transformer Inference API -This implementation supports Llama 3.1 70B with vLLM at https://github.com/tenstorrent/vllm/tree/dev +This implementation supports the following models in the [LLM model list](../README.md#llms) with vLLM at https://github.com/tenstorrent/vllm/tree/dev + +The examples below are using `MODEL_NAME=Llama-3.3-70B-Instruct`. It is recommended to use Instruct fine-tuned models for interactive use. Start with this if you're unsure. ## Table of Contents @@ -18,7 +20,7 @@ This implementation supports Llama 3.1 70B with vLLM at https://github.com/tenst If first run setup has already been completed, start here. If first run setup has not been run please see the instructions below for [First run setup](#first-run-setup). -### Docker Run - vLLM llama3 inference server +### Docker Run - vLLM inference server Run the container from the project root at `tt-inference-server`: ```bash From ea7d18840b724136c676d07a9c79fb7ea223047f Mon Sep 17 00:00:00 2001 From: Tom Stesco Date: Tue, 18 Feb 2025 22:41:39 -0500 Subject: [PATCH 2/3] tstesco/fix-setup-script (#106) * removing preview marker in setup.sh from preview models * adding check_disk_space and check_ram to setup.sh --- setup.sh | 78 ++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 67 insertions(+), 11 deletions(-) diff --git a/setup.sh b/setup.sh index 68c6f38..125560e 100755 --- a/setup.sh +++ b/setup.sh @@ -9,19 +9,19 @@ set -euo pipefail # Exit on error, print commands, unset variables treated as e usage() { echo "Usage: $0 " echo "Available model types:" - echo " Qwen2.5-72B-Instruct (preview)" - echo " Qwen2.5-72B (preview)" - echo " Qwen2.5-7B-Instruct (preview)" - echo " Qwen2.5-7B (preview)" - echo " DeepSeek-R1-Distill-Llama-70B (preview)" + echo " Qwen2.5-72B-Instruct" + echo " Qwen2.5-72B" + echo " Qwen2.5-7B-Instruct" + echo " Qwen2.5-7B" + echo " DeepSeek-R1-Distill-Llama-70B" echo " Llama-3.3-70B-Instruct" echo " Llama-3.3-70B" - echo " Llama-3.2-11B-Vision-Instruct (preview)" - echo " Llama-3.2-11B-Vision (preview)" - echo " Llama-3.2-3B-Instruct (preview)" - echo " Llama-3.2-3B (preview)" - echo " Llama-3.2-1B-Instruct (preview)" - echo " Llama-3.2-1B (preview)" + echo " Llama-3.2-11B-Vision-Instruct" + echo " Llama-3.2-11B-Vision" + echo " Llama-3.2-3B-Instruct" + echo " Llama-3.2-3B" + echo " Llama-3.2-1B-Instruct" + echo " Llama-3.2-1B" echo " Llama-3.1-70B-Instruct" echo " Llama-3.1-70B" echo " Llama-3.1-8B-Instruct" @@ -127,6 +127,34 @@ check_hf_access() { return 0 } +# Function to check available disk space +check_disk_space() { + local min_disk=$1 + local available_disk + available_disk=$(df --block-size=1G / | awk 'NR==2 {print $4}') # Get available disk space in GB + if (( available_disk >= min_disk )); then + echo "✅ Sufficient disk space available: ${available_disk}GB, Required: ${min_disk}GB" + return 0 + else + echo "❌ ERROR: Insufficient disk space! 
Available: ${available_disk}GB, Required: ${min_disk}GB" + return 1 + fi +} + +# Function to check available RAM +check_ram() { + local min_ram=$1 + local available_ram + available_ram=$(free -g | awk '/^Mem:/ {print $7}') # Get available RAM in GB + if (( available_ram >= min_ram )); then + echo "✅ Sufficient RAM available: ${available_ram}GB, Required: ${min_ram}GB" + return 0 + else + echo "❌ ERROR: Insufficient RAM! Available: ${available_ram}GB, Required: ${min_ram}GB" + return 1 + fi +} + get_hf_env_vars() { # get HF_TOKEN if [ -z "${HF_TOKEN:-}" ]; then @@ -168,6 +196,8 @@ get_hf_env_vars() { setup_model_environment() { # Set environment variables based on the model selection # note: MODEL_NAME is the directory name for the model weights + # MIN_DISK: safe lower bound on available disk (based on 2 bytes per parameter and 2.5 copies: HF cache, model weights, tt-metal cache) + # MIN_RAM: safe lower bound on RAM needed (based on repacking 70B models) case "$1" in "Qwen2.5-72B"|"Qwen2.5-72B-Instruct") IMPL_ID="tt-metal" @@ -176,6 +206,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 + MIN_DISK=360 + MIN_RAM=360 ;; "Qwen2.5-7B"|"Qwen2.5-7B-Instruct") IMPL_ID="tt-metal" @@ -184,6 +216,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 + MIN_DISK=28 + MIN_RAM=35 ;; "DeepSeek-R1-Distill-Llama-70B") IMPL_ID="tt-metal" @@ -192,6 +226,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 + MIN_DISK=350 + MIN_RAM=350 ;; "Llama-3.3-70B"|"Llama-3.3-70B-Instruct") IMPL_ID="tt-metal" @@ -200,6 +236,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=1 + MIN_DISK=350 + MIN_RAM=350 ;; "Llama-3.2-11B-Vision-Instruct") IMPL_ID="tt-metal" @@ -208,6 +246,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 + MIN_DISK=44 + MIN_RAM=55 ;; "Llama-3.2-3B"|"Llama-3.2-3B-Instruct") IMPL_ID="tt-metal" @@ -216,6 +256,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 + MIN_DISK=12 + MIN_RAM=15 ;; "Llama-3.2-1B"|"Llama-3.2-1B-Instruct") IMPL_ID="tt-metal" @@ -224,6 +266,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 + MIN_DISK=4 + MIN_RAM=5 ;; "Llama-3.1-70B"|"Llama-3.1-70B-Instruct") IMPL_ID="tt-metal" @@ -232,6 +276,8 @@ setup_model_environment() { META_MODEL_NAME="Meta-Llama-3.1-70B${1#Llama-3.1-70B}" META_DIR_FILTER="llama3_1" REPACKED=1 + MIN_DISK=350 + MIN_RAM=350 ;; "Llama-3.1-8B"|"Llama-3.1-8B-Instruct") IMPL_ID="tt-metal" @@ -240,6 +286,8 @@ setup_model_environment() { META_MODEL_NAME="Meta-Llama-3.1-8B${1#Llama-3.1-8B}" META_DIR_FILTER="llama3_1" REPACKED=0 + MIN_DISK=32 + MIN_RAM=40 ;; "Llama-3-70B"|"Llama-3-70B-Instruct") IMPL_ID="tt-metal" @@ -248,6 +296,8 @@ setup_model_environment() { META_MODEL_NAME="Meta-Llama-3-70B${1#Llama-3-70B}" META_DIR_FILTER="llama3" REPACKED=1 + MIN_DISK=350 + MIN_RAM=350 ;; "Llama-3-8B"|"Llama-3-8B-Instruct") IMPL_ID="tt-metal" @@ -256,6 +306,8 @@ setup_model_environment() { META_MODEL_NAME="Meta-Llama-3-8B${1#Llama-3-8B}" META_DIR_FILTER="llama3" REPACKED=0 + MIN_DISK=32 + MIN_RAM=40 ;; *) echo "⛔ Invalid model choice." 
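Two details of the patch above may not be obvious at a glance: the merged case arms rely on shell prefix-stripping (`${1#PATTERN}`) so a single arm serves both the base and `-Instruct` variants, and the new `MIN_DISK` values follow the sizing rule given in the comment. A minimal sketch with hypothetical values (not part of the patch):

```bash
#!/usr/bin/env bash
# ${1#PATTERN} removes the shortest leading match of PATTERN from $1, so for
# "Llama-3.1-70B-Instruct" the remainder is "-Instruct" (empty for the base model).
model_arg="Llama-3.1-70B-Instruct"
suffix="${model_arg#Llama-3.1-70B}"
echo "MODEL_NAME=Llama-3.1-70B${suffix}"                  # Llama-3.1-70B-Instruct
echo "HF_MODEL_REPO_ID=meta-llama/Llama-3.1-70B${suffix}"

# rough origin of MIN_DISK=350 for 70B models: ~2 bytes per parameter
# times ~2.5 on-disk copies (HF cache, model weights, tt-metal cache)
params_billion=70
echo "MIN_DISK ~= $(( params_billion * 2 * 25 / 10 ))GB"  # 350GB
```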
@@ -264,6 +316,10 @@ setup_model_environment() { ;; esac + # fail fast if host has insufficient resources + check_disk_space "$MIN_DISK" || exit 1 + check_ram "$MIN_RAM" || exit 1 + # Set default values for environment variables DEFAULT_PERSISTENT_VOLUME_ROOT=${REPO_ROOT}/persistent_volume # Safely handle potentially unset environment variables using default values From f61d97fe7ebe39b94873b73ab9a384f6a0f2c0b2 Mon Sep 17 00:00:00 2001 From: Tom Stesco Date: Tue, 18 Feb 2025 23:12:48 -0500 Subject: [PATCH 3/3] put setup and installation instructions first (#107) --- vllm-tt-metal-llama3/README.md | 117 +++++++++++++++++---------------- 1 file changed, 60 insertions(+), 57 deletions(-) diff --git a/vllm-tt-metal-llama3/README.md b/vllm-tt-metal-llama3/README.md index bebfd4b..ae1c612 100644 --- a/vllm-tt-metal-llama3/README.md +++ b/vllm-tt-metal-llama3/README.md @@ -2,72 +2,24 @@ This implementation supports the following models in the [LLM model list](../README.md#llms) with vLLM at https://github.com/tenstorrent/vllm/tree/dev -The examples below are using `MODEL_NAME=Llama-3.3-70B-Instruct`. It is recommended to use Instruct fine-tuned models for interactive use. Start with this if you're unsure. +You can setup the model being deployed using the `setup.sh` script and `MODEL_NAME` environment variable to point to it as shown below. The examples below are using `MODEL_NAME=Llama-3.3-70B-Instruct`. It is recommended to use Instruct fine-tuned models for interactive use. Start with this if you're unsure. ## Table of Contents -- [Quick run](#quick-run) - - [Docker Run - vLLM llama3 inference server](#docker-run---vllm-llama3-inference-server) -- [First run setup](#first-run-setup) +- [Setup and installation](#setup-and-installation) - [1. Docker install](#1-docker-install) - [2. Ensure system dependencies installed](#2-ensure-system-dependencies-installed) - [3. CPU performance setting](#3-cpu-performance-setting) - [4. Docker image](#4-docker-image) - [5. Automated Setup: environment variables and weights files](#5-automated-setup-environment-variables-and-weights-files) +- [Quick run](#quick-run) + - [Docker Run - vLLM inference server](#docker-run---vllm-inference-server) - [Additional Documentation](#additional-documentation) -## Quick run - -If first run setup has already been completed, start here. If first run setup has not been run please see the instructions below for [First run setup](#first-run-setup). - -### Docker Run - vLLM inference server - -Run the container from the project root at `tt-inference-server`: -```bash -cd tt-inference-server -# make sure if you already set up the model weights and cache you use the correct persistent volume -export MODEL_NAME=Llama-3.3-70B-Instruct -export MODEL_VOLUME=$PWD/persistent_volume/volume_id_tt-metal-${MODEL_NAME}-v0.0.1/ -docker run \ - --rm \ - -it \ - --env-file persistent_volume/model_envs/${MODEL_NAME}.env \ - --cap-add ALL \ - --device /dev/tenstorrent:/dev/tenstorrent \ - --volume /dev/hugepages-1G:/dev/hugepages-1G:rw \ - --volume ${MODEL_VOLUME?ERROR env var MODEL_VOLUME must be set}:/home/container_app_user/cache_root:rw \ - --shm-size 32G \ - --publish 7000:7000 \ - ghcr.io/tenstorrent/tt-inference-server/vllm-llama3-src-dev-ubuntu-20.04-amd64:v0.0.1-b6ecf68e706b-b9564bf364e9 -``` - -By default the Docker container will start running the entrypoint command wrapped in `src/run_vllm_api_server.py`. -This can be run manually if you override the the container default command with an interactive shell via `bash`. 
-In an interactive shell you can start the vLLM API server via: -```bash -# run server manually -python run_vllm_api_server.py -``` - -The vLLM inference API server takes 3-5 minutes to start up (~40-60 minutes on first run when generating caches) then will start serving requests. To send HTTP requests to the inference server run the example scripts in a separate bash shell. - -### Example clients - -You can use `docker exec --user 1000 -it bash` (--user uid must match container you are using, default is 1000) to create a shell in the docker container or run the client scripts on the host (ensuring the correct port mappings and python dependencies): - -#### Run example clients from within Docker container: -```bash -# oneliner to enter interactive shell on most recently ran container -docker exec -it $(docker ps -q | head -n1) bash - -# inside interactive shell, run example clients script to send prompt request to vLLM server: -cd ~/app/src -python example_requests_client.py -``` - -## First run setup +## Setup and installation -Tested starting condition is from a fresh installation of Ubuntu 20.04 with Tenstorrent system dependencies installed. +This guide was tested starting condition is from a fresh installation of Ubuntu 20.04 with Tenstorrent system dependencies installed. +Ubuntu 22.04 should also work for most if not all models. ### 1. Docker install @@ -77,12 +29,12 @@ Recommended to follow postinstall guide to allow $USER to run docker without sud ### 2. Ensure system dependencies installed -Follow TT guide software installation at: https://docs.tenstorrent.com/quickstart.html +Follow TT guide software installation at: https://docs.tenstorrent.com/getting-started/README Ensure all set up: - firmware: tt-firmware (https://github.com/tenstorrent/tt-firmware) - drivers: tt-kmd (https://github.com/tenstorrent/tt-kmd) -- hugepages: see https://docs.tenstorrent.com/quickstart.html#step-4-setup-hugepages and https://github.com/tenstorrent/tt-system-tools +- hugepages: see https://docs.tenstorrent.com/getting-started/README#step-4-set-up-hugepages - tt-smi: https://github.com/tenstorrent/tt-smi If running on a TT-LoudBox or TT-QuietBox, you will also need: @@ -113,6 +65,8 @@ Either download the Docker image from GitHub Container Registry (recommended for docker pull ghcr.io/tenstorrent/tt-inference-server/vllm-llama3-src-dev-ubuntu-20.04-amd64:v0.0.1-b6ecf68e706b-b9564bf364e9 ``` +Note: as the docker image is downloading you can continue to the next step and download the model weights in parallel. + #### Option B: Build Docker Image For instructions on building the Docker imagem locally see: [vllm-tt-metal-llama3/docs/development](../vllm-tt-metal-llama3/docs/development.md#step-1-build-docker-image) @@ -132,6 +86,55 @@ chmod +x setup.sh ./setup.sh Llama-3.3-70B-Instruct ``` +## Quick run + +If first run setup above has already been completed, start here. If first run setup has not been completed, complete [Setup and installation](#setup-and-installation). 
+
+### Docker Run - vLLM inference server
+
+Run the container from the project root at `tt-inference-server`:
+```bash
+cd tt-inference-server
+# make sure, if you already set up the model weights and cache, that you use the correct persistent volume
+export MODEL_NAME=Llama-3.3-70B-Instruct
+export MODEL_VOLUME=$PWD/persistent_volume/volume_id_tt-metal-${MODEL_NAME}-v0.0.1/
+docker run \
+  --rm \
+  -it \
+  --env-file persistent_volume/model_envs/${MODEL_NAME}.env \
+  --cap-add ALL \
+  --device /dev/tenstorrent:/dev/tenstorrent \
+  --volume /dev/hugepages-1G:/dev/hugepages-1G:rw \
+  --volume ${MODEL_VOLUME?ERROR env var MODEL_VOLUME must be set}:/home/container_app_user/cache_root:rw \
+  --shm-size 32G \
+  --publish 7000:7000 \
+  ghcr.io/tenstorrent/tt-inference-server/vllm-llama3-src-dev-ubuntu-20.04-amd64:v0.0.1-b6ecf68e706b-b9564bf364e9
+```
+
+By default the Docker container will start running the entrypoint command wrapped in `src/run_vllm_api_server.py`.
+This can be run manually if you override the container's default command with an interactive shell via `bash`.
+In an interactive shell you can start the vLLM API server via:
+```bash
+# run server manually
+python run_vllm_api_server.py
+```
+
+The vLLM inference API server takes 3-5 minutes to start up (~40-60 minutes on first run while caches are generated), then starts serving requests. To send HTTP requests to the inference server, run the example scripts in a separate bash shell.
+
+### Example clients
+
+You can use `docker exec --user 1000 -it <container-name> bash` (the --user uid must match the container you are using; the default is 1000) to create a shell in the Docker container, or run the client scripts on the host (ensuring the correct port mappings and Python dependencies):
+
+#### Run example clients from within Docker container:
+```bash
+# one-liner to enter an interactive shell on the most recently started container
+docker exec -it $(docker ps -q | head -n1) bash
+
+# inside the interactive shell, run the example client script to send a prompt request to the vLLM server:
+cd ~/app/src
+python example_requests_client.py
+```
+
 # Additional Documentation
 
 - [Development](docs/development.md)
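If you want to poke the server with raw HTTP instead of the bundled example scripts, a hedged `curl` sketch follows. It assumes `run_vllm_api_server.py` exposes the standard vLLM OpenAI-compatible routes on the published port 7000; the model id shown is a guess and should be replaced with whatever `/v1/models` reports:

```bash
# list the served model id(s)
curl -s http://localhost:7000/v1/models

# send a completion request (substitute the model id returned above)
curl -s http://localhost:7000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.3-70B-Instruct", "prompt": "What is Tenstorrent?", "max_tokens": 64}'
```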