From a1077848a3188db6236f836957eb40074dc6660f Mon Sep 17 00:00:00 2001
From: Tom Stesco
Date: Thu, 6 Feb 2025 23:16:45 -0500
Subject: [PATCH 1/3] tstescp/mark-experimental-models (#102)

* update documentation to show all supported models and give correct links

* setup.sh supports all base and Instruct models, marks experimental models as preview
---
 README.md                      |  45 ++++++++-----
 setup.sh                       | 112 +++++++++++++--------------------
 vllm-tt-metal-llama3/README.md |   8 ++-
 3 files changed, 79 insertions(+), 86 deletions(-)

diff --git a/README.md b/README.md
index b3ac2c8..25debf0 100644
--- a/README.md
+++ b/README.md
@@ -8,17 +8,34 @@ Tenstorrent Inference Server (`tt-inference-server`) is the repo of available mo
 
 ## Getting Started
 
-Please follow setup instructions found in each model folder's README.md doc
-
---------------------------------------------------------------------------------------------------------------
-
-## Model Implementations
-| Model | Hardware |
-|----------------|-----------------------------|
-| [Qwen 2.5 72B](vllm-tt-metal-llama3/README.md) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) |
-| [LLaMa 3.3 70B](vllm-tt-metal-llama3/README.md) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) |
-| [LLaMa 3.2 11B Vision](vllm-tt-metal-llama3/README.md) | [n300](https://tenstorrent.com/hardware/wormhole) |
-| [LLaMa 3.2 3B](vllm-tt-metal-llama3/README.md) | [n150](https://tenstorrent.com/hardware/wormhole) |
-| [LLaMa 3.2 1B](vllm-tt-metal-llama3/README.md) | [n150](https://tenstorrent.com/hardware/wormhole) |
-| [LLaMa 3.1 70B](vllm-tt-metal-llama3/README.md) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) |
-| [LLaMa 3.1 8B](vllm-tt-metal-llama3/README.md) | [n150](https://tenstorrent.com/hardware/wormhole) |
+Please follow the setup instructions for the model you want to serve; the `Model Name` column in the tables below links to the corresponding implementation.
+
+Note: models with Status [🔍 preview] are under active development. If you encounter setup or stability problems, please [file an issue](https://github.com/tenstorrent/tt-inference-server/issues/new?template=Blank+issue) and our team will address it.
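For orientation, serving any model in the tables that follow starts with the same step (a sketch only; the linked per-model README is the authoritative guide):

```bash
# from the tt-inference-server repo root: prepare env vars and weights for the chosen model,
# then follow the Docker run instructions in that model's README
chmod +x setup.sh
./setup.sh Llama-3.3-70B-Instruct
```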
+ +## LLMs + +| Model Name | Model URL | Hardware | Status | Minimum Release Version | +| ----------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------ | ----------- | -------------------------------------------------------------------------------- | +| [Qwen2.5-72B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) | 🔍 preview | [v0.0.2](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.2) | +| [Qwen2.5-72B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/Qwen/Qwen2.5-72B) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) | 🔍 preview | [v0.0.2](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.2) | +| [Qwen2.5-7B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | [n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.2](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.2) | +| [Qwen2.5-7B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/Qwen/Qwen2.5-7B) | [n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.2](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.2) | +| [Llama-3.3-70B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) | ✅ supported | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.3-70B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.3-70B) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) | ✅ supported | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.2-11B-Vision-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) | [n300](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.2-11B-Vision](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision) | [n300](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.2-3B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.2-3B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.2-3B) | [n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.2-1B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) | [n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.2-1B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.2-1B) | 
[n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.1-70B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) | ✅ supported | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.1-70B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.1-70B) | [TT-QuietBox & TT-LoudBox](https://tenstorrent.com/hardware/tt-quietbox) | ✅ supported | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.1-8B-Instruct](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [n150](https://tenstorrent.com/hardware/wormhole) | ✅ supported | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | +| [Llama-3.1-8B](vllm-tt-metal-llama3/README.md) | [HF Repo](https://huggingface.co/meta-llama/Llama-3.1-8B) | [n150](https://tenstorrent.com/hardware/wormhole) | ✅ supported | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | + +# CNNs + +| Model Name | Model URL | Hardware | Status | Minimum Release Version | +| ----------------------------- | --------------------------------------------------------------------- | ------------------------------------------------------------------------ | ----------- | -------------------------------------------------------------------------------- | +| [YOLOv4](tt-metal-yolov4/README.md) | [GH Repo](https://github.com/AlexeyAB/darknet) | [n150](https://tenstorrent.com/hardware/wormhole) | 🔍 preview | [v0.0.1](https://github.com/tenstorrent/tt-inference-server/releases/tag/v0.0.1) | + diff --git a/setup.sh b/setup.sh index a1be09a..68c6f38 100755 --- a/setup.sh +++ b/setup.sh @@ -9,13 +9,19 @@ set -euo pipefail # Exit on error, print commands, unset variables treated as e usage() { echo "Usage: $0 " echo "Available model types:" - echo " Qwen2.5-72B-Instruct" - echo " Qwen2.5-7B-Instruct" - echo " DeepSeek-R1-Distill-Llama-70B" + echo " Qwen2.5-72B-Instruct (preview)" + echo " Qwen2.5-72B (preview)" + echo " Qwen2.5-7B-Instruct (preview)" + echo " Qwen2.5-7B (preview)" + echo " DeepSeek-R1-Distill-Llama-70B (preview)" echo " Llama-3.3-70B-Instruct" - echo " Llama-3.2-11B-Vision-Instruct" - echo " Llama-3.2-3B-Instruct" - echo " Llama-3.2-1B-Instruct" + echo " Llama-3.3-70B" + echo " Llama-3.2-11B-Vision-Instruct (preview)" + echo " Llama-3.2-11B-Vision (preview)" + echo " Llama-3.2-3B-Instruct (preview)" + echo " Llama-3.2-3B (preview)" + echo " Llama-3.2-1B-Instruct (preview)" + echo " Llama-3.2-1B (preview)" echo " Llama-3.1-70B-Instruct" echo " Llama-3.1-70B" echo " Llama-3.1-8B-Instruct" @@ -163,18 +169,18 @@ setup_model_environment() { # Set environment variables based on the model selection # note: MODEL_NAME is the directory name for the model weights case "$1" in - "Qwen2.5-72B-Instruct") + "Qwen2.5-72B"|"Qwen2.5-72B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Qwen2.5-72B-Instruct" - HF_MODEL_REPO_ID="Qwen/Qwen2.5-72B-Instruct" + MODEL_NAME="Qwen2.5-72B${1#Qwen2.5-72B}" + HF_MODEL_REPO_ID="Qwen/Qwen2.5-72B${1#Qwen2.5-72B}" META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 ;; - "Qwen2.5-7B-Instruct") + "Qwen2.5-7B"|"Qwen2.5-7B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Qwen2.5-7B-Instruct" - 
HF_MODEL_REPO_ID="Qwen/Qwen2.5-7B-Instruct" + MODEL_NAME="Qwen2.5-7B${1#Qwen2.5-7B}" + HF_MODEL_REPO_ID="Qwen/Qwen2.5-7B${1#Qwen2.5-7B}" META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 @@ -187,10 +193,10 @@ setup_model_environment() { META_DIR_FILTER="" REPACKED=0 ;; - "Llama-3.3-70B-Instruct") + "Llama-3.3-70B"|"Llama-3.3-70B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.3-70B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3.3-70B-Instruct" + MODEL_NAME="Llama-3.3-70B${1#Llama-3.3-70B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3.3-70B${1#Llama-3.3-70B}" META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=1 @@ -203,83 +209,51 @@ setup_model_environment() { META_DIR_FILTER="" REPACKED=0 ;; - "Llama-3.2-3B-Instruct") + "Llama-3.2-3B"|"Llama-3.2-3B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.2-3B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3.2-3B-Instruct" + MODEL_NAME="Llama-3.2-3B${1#Llama-3.2-3B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3.2-3B${1#Llama-3.2-3B}" META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 ;; - "Llama-3.2-1B-Instruct") + "Llama-3.2-1B"|"Llama-3.2-1B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.2-1B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3.2-1B-Instruct" + MODEL_NAME="Llama-3.2-1B${1#Llama-3.2-1B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3.2-1B${1#Llama-3.2-1B}" META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 ;; - "Llama-3.1-70B-Instruct") + "Llama-3.1-70B"|"Llama-3.1-70B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.1-70B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3.1-70B-Instruct" - META_MODEL_NAME="Meta-Llama-3.1-70B-Instruct" + MODEL_NAME="Llama-3.1-70B${1#Llama-3.1-70B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3.1-70B${1#Llama-3.1-70B}" + META_MODEL_NAME="Meta-Llama-3.1-70B${1#Llama-3.1-70B}" META_DIR_FILTER="llama3_1" REPACKED=1 ;; - "Llama-3.1-70B") + "Llama-3.1-8B"|"Llama-3.1-8B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.1-70B" - HF_MODEL_REPO_ID="meta-llama/Llama-3.1-70B" - META_MODEL_NAME="Meta-Llama-3.1-70B" - META_DIR_FILTER="llama3_1" - REPACKED=1 - ;; - "Llama-3.1-8B-Instruct") - IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.1-8B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3.1-8B-Instruct" - META_MODEL_NAME="Meta-Llama-3.1-8B-Instruct" - META_DIR_FILTER="llama3_1" - REPACKED=0 - ;; - "Llama-3.1-8B") - IMPL_ID="tt-metal" - MODEL_NAME="Llama-3.1-8B" - HF_MODEL_REPO_ID="meta-llama/Llama-3.1-8B" - META_MODEL_NAME="Meta-Llama-3.1-8B" + MODEL_NAME="Llama-3.1-8B${1#Llama-3.1-8B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3.1-8B${1#Llama-3.1-8B}" + META_MODEL_NAME="Meta-Llama-3.1-8B${1#Llama-3.1-8B}" META_DIR_FILTER="llama3_1" REPACKED=0 ;; - "Llama-3-70B-Instruct") + "Llama-3-70B"|"Llama-3-70B-Instruct") IMPL_ID="tt-metal" - MODEL_NAME="Llama-3-70B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3-70B-Instruct" - META_MODEL_NAME="Meta-Llama-3-70B-Instruct" + MODEL_NAME="Llama-3-70B${1#Llama-3-70B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3-70B${1#Llama-3-70B}" + META_MODEL_NAME="Meta-Llama-3-70B${1#Llama-3-70B}" META_DIR_FILTER="llama3" REPACKED=1 ;; - "Llama-3-70B") - IMPL_ID="tt-metal" - MODEL_NAME="Llama-3-70B" - HF_MODEL_REPO_ID="meta-llama/Llama-3-70B" - META_MODEL_NAME="Meta-Llama-3-70B" - META_DIR_FILTER="llama3" - REPACKED=1 - ;; - "Llama-3-8B-Instruct") - IMPL_ID="tt-metal" - MODEL_NAME="Llama-3-8B-Instruct" - HF_MODEL_REPO_ID="meta-llama/Llama-3-8B-Instruct" - META_MODEL_NAME="Meta-Llama-3-8B-Instruct" - META_DIR_FILTER="llama3" - REPACKED=0 - ;; - "Llama-3-8B") + "Llama-3-8B"|"Llama-3-8B-Instruct") IMPL_ID="tt-metal" - 
MODEL_NAME="Llama-3-8B" - HF_MODEL_REPO_ID="meta-llama/Llama-3-8B" - META_MODEL_NAME="Meta-Llama-3-8B" + MODEL_NAME="Llama-3-8B${1#Llama-3-8B}" + HF_MODEL_REPO_ID="meta-llama/Llama-3-8B${1#Llama-3-8B}" + META_MODEL_NAME="Meta-Llama-3-8B${1#Llama-3-8B}" META_DIR_FILTER="llama3" REPACKED=0 ;; diff --git a/vllm-tt-metal-llama3/README.md b/vllm-tt-metal-llama3/README.md index e726184..bebfd4b 100644 --- a/vllm-tt-metal-llama3/README.md +++ b/vllm-tt-metal-llama3/README.md @@ -1,6 +1,8 @@ -# vLLM TT Metalium Llama 3.3 70B Inference API +# vLLM TT Metalium TT-Transformer Inference API -This implementation supports Llama 3.1 70B with vLLM at https://github.com/tenstorrent/vllm/tree/dev +This implementation supports the following models in the [LLM model list](../README.md#llms) with vLLM at https://github.com/tenstorrent/vllm/tree/dev + +The examples below are using `MODEL_NAME=Llama-3.3-70B-Instruct`. It is recommended to use Instruct fine-tuned models for interactive use. Start with this if you're unsure. ## Table of Contents @@ -18,7 +20,7 @@ This implementation supports Llama 3.1 70B with vLLM at https://github.com/tenst If first run setup has already been completed, start here. If first run setup has not been run please see the instructions below for [First run setup](#first-run-setup). -### Docker Run - vLLM llama3 inference server +### Docker Run - vLLM inference server Run the container from the project root at `tt-inference-server`: ```bash From ea7d18840b724136c676d07a9c79fb7ea223047f Mon Sep 17 00:00:00 2001 From: Tom Stesco Date: Tue, 18 Feb 2025 22:41:39 -0500 Subject: [PATCH 2/3] tstesco/fix-setup-script (#106) * removing preview marker in setup.sh from preview models * adding check_disk_space and check_ram to setup.sh --- setup.sh | 78 ++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 67 insertions(+), 11 deletions(-) diff --git a/setup.sh b/setup.sh index 68c6f38..125560e 100755 --- a/setup.sh +++ b/setup.sh @@ -9,19 +9,19 @@ set -euo pipefail # Exit on error, print commands, unset variables treated as e usage() { echo "Usage: $0 " echo "Available model types:" - echo " Qwen2.5-72B-Instruct (preview)" - echo " Qwen2.5-72B (preview)" - echo " Qwen2.5-7B-Instruct (preview)" - echo " Qwen2.5-7B (preview)" - echo " DeepSeek-R1-Distill-Llama-70B (preview)" + echo " Qwen2.5-72B-Instruct" + echo " Qwen2.5-72B" + echo " Qwen2.5-7B-Instruct" + echo " Qwen2.5-7B" + echo " DeepSeek-R1-Distill-Llama-70B" echo " Llama-3.3-70B-Instruct" echo " Llama-3.3-70B" - echo " Llama-3.2-11B-Vision-Instruct (preview)" - echo " Llama-3.2-11B-Vision (preview)" - echo " Llama-3.2-3B-Instruct (preview)" - echo " Llama-3.2-3B (preview)" - echo " Llama-3.2-1B-Instruct (preview)" - echo " Llama-3.2-1B (preview)" + echo " Llama-3.2-11B-Vision-Instruct" + echo " Llama-3.2-11B-Vision" + echo " Llama-3.2-3B-Instruct" + echo " Llama-3.2-3B" + echo " Llama-3.2-1B-Instruct" + echo " Llama-3.2-1B" echo " Llama-3.1-70B-Instruct" echo " Llama-3.1-70B" echo " Llama-3.1-8B-Instruct" @@ -127,6 +127,34 @@ check_hf_access() { return 0 } +# Function to check available disk space +check_disk_space() { + local min_disk=$1 + local available_disk + available_disk=$(df --block-size=1G / | awk 'NR==2 {print $4}') # Get available disk space in GB + if (( available_disk >= min_disk )); then + echo "✅ Sufficient disk space available: ${available_disk}GB, Required: ${min_disk}GB" + return 0 + else + echo "❌ ERROR: Insufficient disk space! 
Available: ${available_disk}GB, Required: ${min_disk}GB" + return 1 + fi +} + +# Function to check available RAM +check_ram() { + local min_ram=$1 + local available_ram + available_ram=$(free -g | awk '/^Mem:/ {print $7}') # Get available RAM in GB + if (( available_ram >= min_ram )); then + echo "✅ Sufficient RAM available: ${available_ram}GB, Required: ${min_ram}GB" + return 0 + else + echo "❌ ERROR: Insufficient RAM! Available: ${available_ram}GB, Required: ${min_ram}GB" + return 1 + fi +} + get_hf_env_vars() { # get HF_TOKEN if [ -z "${HF_TOKEN:-}" ]; then @@ -168,6 +196,8 @@ get_hf_env_vars() { setup_model_environment() { # Set environment variables based on the model selection # note: MODEL_NAME is the directory name for the model weights + # MIN_DISK: safe lower bound on available disk (based on 2 bytes per parameter and 2.5 copies: HF cache, model weights, tt-metal cache) + # MIN_RAM: safe lower bound on RAM needed (based on repacking 70B models) case "$1" in "Qwen2.5-72B"|"Qwen2.5-72B-Instruct") IMPL_ID="tt-metal" @@ -176,6 +206,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 + MIN_DISK=360 + MIN_RAM=360 ;; "Qwen2.5-7B"|"Qwen2.5-7B-Instruct") IMPL_ID="tt-metal" @@ -184,6 +216,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 + MIN_DISK=28 + MIN_RAM=35 ;; "DeepSeek-R1-Distill-Llama-70B") IMPL_ID="tt-metal" @@ -192,6 +226,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 + MIN_DISK=350 + MIN_RAM=350 ;; "Llama-3.3-70B"|"Llama-3.3-70B-Instruct") IMPL_ID="tt-metal" @@ -200,6 +236,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=1 + MIN_DISK=350 + MIN_RAM=350 ;; "Llama-3.2-11B-Vision-Instruct") IMPL_ID="tt-metal" @@ -208,6 +246,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 + MIN_DISK=44 + MIN_RAM=55 ;; "Llama-3.2-3B"|"Llama-3.2-3B-Instruct") IMPL_ID="tt-metal" @@ -216,6 +256,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 + MIN_DISK=12 + MIN_RAM=15 ;; "Llama-3.2-1B"|"Llama-3.2-1B-Instruct") IMPL_ID="tt-metal" @@ -224,6 +266,8 @@ setup_model_environment() { META_MODEL_NAME="" META_DIR_FILTER="" REPACKED=0 + MIN_DISK=4 + MIN_RAM=5 ;; "Llama-3.1-70B"|"Llama-3.1-70B-Instruct") IMPL_ID="tt-metal" @@ -232,6 +276,8 @@ setup_model_environment() { META_MODEL_NAME="Meta-Llama-3.1-70B${1#Llama-3.1-70B}" META_DIR_FILTER="llama3_1" REPACKED=1 + MIN_DISK=350 + MIN_RAM=350 ;; "Llama-3.1-8B"|"Llama-3.1-8B-Instruct") IMPL_ID="tt-metal" @@ -240,6 +286,8 @@ setup_model_environment() { META_MODEL_NAME="Meta-Llama-3.1-8B${1#Llama-3.1-8B}" META_DIR_FILTER="llama3_1" REPACKED=0 + MIN_DISK=32 + MIN_RAM=40 ;; "Llama-3-70B"|"Llama-3-70B-Instruct") IMPL_ID="tt-metal" @@ -248,6 +296,8 @@ setup_model_environment() { META_MODEL_NAME="Meta-Llama-3-70B${1#Llama-3-70B}" META_DIR_FILTER="llama3" REPACKED=1 + MIN_DISK=350 + MIN_RAM=350 ;; "Llama-3-8B"|"Llama-3-8B-Instruct") IMPL_ID="tt-metal" @@ -256,6 +306,8 @@ setup_model_environment() { META_MODEL_NAME="Meta-Llama-3-8B${1#Llama-3-8B}" META_DIR_FILTER="llama3" REPACKED=0 + MIN_DISK=32 + MIN_RAM=40 ;; *) echo "⛔ Invalid model choice." 
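Two details of the patch above may not be obvious at a glance: the merged case arms rely on shell prefix-stripping (`${1#PATTERN}`) so a single arm serves both the base and `-Instruct` variants, and the new `MIN_DISK` values follow the sizing rule given in the comment. A minimal sketch with hypothetical values (not part of the patch):

```bash
#!/usr/bin/env bash
# ${1#PATTERN} removes the shortest leading match of PATTERN from $1, so for
# "Llama-3.1-70B-Instruct" the remainder is "-Instruct" (empty for the base model).
model_arg="Llama-3.1-70B-Instruct"
suffix="${model_arg#Llama-3.1-70B}"
echo "MODEL_NAME=Llama-3.1-70B${suffix}"                  # Llama-3.1-70B-Instruct
echo "HF_MODEL_REPO_ID=meta-llama/Llama-3.1-70B${suffix}"

# rough origin of MIN_DISK=350 for 70B models: ~2 bytes per parameter
# times ~2.5 on-disk copies (HF cache, model weights, tt-metal cache)
params_billion=70
echo "MIN_DISK ~= $(( params_billion * 2 * 25 / 10 ))GB"  # 350GB
```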
@@ -264,6 +316,10 @@ setup_model_environment() { ;; esac + # fail fast if host has insufficient resources + check_disk_space "$MIN_DISK" || exit 1 + check_ram "$MIN_RAM" || exit 1 + # Set default values for environment variables DEFAULT_PERSISTENT_VOLUME_ROOT=${REPO_ROOT}/persistent_volume # Safely handle potentially unset environment variables using default values From f61d97fe7ebe39b94873b73ab9a384f6a0f2c0b2 Mon Sep 17 00:00:00 2001 From: Tom Stesco Date: Tue, 18 Feb 2025 23:12:48 -0500 Subject: [PATCH 3/3] put setup and installation instructions first (#107) --- vllm-tt-metal-llama3/README.md | 117 +++++++++++++++++---------------- 1 file changed, 60 insertions(+), 57 deletions(-) diff --git a/vllm-tt-metal-llama3/README.md b/vllm-tt-metal-llama3/README.md index bebfd4b..ae1c612 100644 --- a/vllm-tt-metal-llama3/README.md +++ b/vllm-tt-metal-llama3/README.md @@ -2,72 +2,24 @@ This implementation supports the following models in the [LLM model list](../README.md#llms) with vLLM at https://github.com/tenstorrent/vllm/tree/dev -The examples below are using `MODEL_NAME=Llama-3.3-70B-Instruct`. It is recommended to use Instruct fine-tuned models for interactive use. Start with this if you're unsure. +You can setup the model being deployed using the `setup.sh` script and `MODEL_NAME` environment variable to point to it as shown below. The examples below are using `MODEL_NAME=Llama-3.3-70B-Instruct`. It is recommended to use Instruct fine-tuned models for interactive use. Start with this if you're unsure. ## Table of Contents -- [Quick run](#quick-run) - - [Docker Run - vLLM llama3 inference server](#docker-run---vllm-llama3-inference-server) -- [First run setup](#first-run-setup) +- [Setup and installation](#setup-and-installation) - [1. Docker install](#1-docker-install) - [2. Ensure system dependencies installed](#2-ensure-system-dependencies-installed) - [3. CPU performance setting](#3-cpu-performance-setting) - [4. Docker image](#4-docker-image) - [5. Automated Setup: environment variables and weights files](#5-automated-setup-environment-variables-and-weights-files) +- [Quick run](#quick-run) + - [Docker Run - vLLM inference server](#docker-run---vllm-inference-server) - [Additional Documentation](#additional-documentation) -## Quick run - -If first run setup has already been completed, start here. If first run setup has not been run please see the instructions below for [First run setup](#first-run-setup). - -### Docker Run - vLLM inference server - -Run the container from the project root at `tt-inference-server`: -```bash -cd tt-inference-server -# make sure if you already set up the model weights and cache you use the correct persistent volume -export MODEL_NAME=Llama-3.3-70B-Instruct -export MODEL_VOLUME=$PWD/persistent_volume/volume_id_tt-metal-${MODEL_NAME}-v0.0.1/ -docker run \ - --rm \ - -it \ - --env-file persistent_volume/model_envs/${MODEL_NAME}.env \ - --cap-add ALL \ - --device /dev/tenstorrent:/dev/tenstorrent \ - --volume /dev/hugepages-1G:/dev/hugepages-1G:rw \ - --volume ${MODEL_VOLUME?ERROR env var MODEL_VOLUME must be set}:/home/container_app_user/cache_root:rw \ - --shm-size 32G \ - --publish 7000:7000 \ - ghcr.io/tenstorrent/tt-inference-server/vllm-llama3-src-dev-ubuntu-20.04-amd64:v0.0.1-b6ecf68e706b-b9564bf364e9 -``` - -By default the Docker container will start running the entrypoint command wrapped in `src/run_vllm_api_server.py`. -This can be run manually if you override the the container default command with an interactive shell via `bash`. 
-In an interactive shell you can start the vLLM API server via: -```bash -# run server manually -python run_vllm_api_server.py -``` - -The vLLM inference API server takes 3-5 minutes to start up (~40-60 minutes on first run when generating caches) then will start serving requests. To send HTTP requests to the inference server run the example scripts in a separate bash shell. - -### Example clients - -You can use `docker exec --user 1000 -it bash` (--user uid must match container you are using, default is 1000) to create a shell in the docker container or run the client scripts on the host (ensuring the correct port mappings and python dependencies): - -#### Run example clients from within Docker container: -```bash -# oneliner to enter interactive shell on most recently ran container -docker exec -it $(docker ps -q | head -n1) bash - -# inside interactive shell, run example clients script to send prompt request to vLLM server: -cd ~/app/src -python example_requests_client.py -``` - -## First run setup +## Setup and installation -Tested starting condition is from a fresh installation of Ubuntu 20.04 with Tenstorrent system dependencies installed. +This guide was tested starting condition is from a fresh installation of Ubuntu 20.04 with Tenstorrent system dependencies installed. +Ubuntu 22.04 should also work for most if not all models. ### 1. Docker install @@ -77,12 +29,12 @@ Recommended to follow postinstall guide to allow $USER to run docker without sud ### 2. Ensure system dependencies installed -Follow TT guide software installation at: https://docs.tenstorrent.com/quickstart.html +Follow TT guide software installation at: https://docs.tenstorrent.com/getting-started/README Ensure all set up: - firmware: tt-firmware (https://github.com/tenstorrent/tt-firmware) - drivers: tt-kmd (https://github.com/tenstorrent/tt-kmd) -- hugepages: see https://docs.tenstorrent.com/quickstart.html#step-4-setup-hugepages and https://github.com/tenstorrent/tt-system-tools +- hugepages: see https://docs.tenstorrent.com/getting-started/README#step-4-set-up-hugepages - tt-smi: https://github.com/tenstorrent/tt-smi If running on a TT-LoudBox or TT-QuietBox, you will also need: @@ -113,6 +65,8 @@ Either download the Docker image from GitHub Container Registry (recommended for docker pull ghcr.io/tenstorrent/tt-inference-server/vllm-llama3-src-dev-ubuntu-20.04-amd64:v0.0.1-b6ecf68e706b-b9564bf364e9 ``` +Note: as the docker image is downloading you can continue to the next step and download the model weights in parallel. + #### Option B: Build Docker Image For instructions on building the Docker imagem locally see: [vllm-tt-metal-llama3/docs/development](../vllm-tt-metal-llama3/docs/development.md#step-1-build-docker-image) @@ -132,6 +86,55 @@ chmod +x setup.sh ./setup.sh Llama-3.3-70B-Instruct ``` +## Quick run + +If first run setup above has already been completed, start here. If first run setup has not been completed, complete [Setup and installation](#setup-and-installation). 
+
+### Docker Run - vLLM inference server
+
+Run the container from the project root at `tt-inference-server`:
+```bash
+cd tt-inference-server
+# make sure, if you already set up the model weights and cache, that you use the correct persistent volume
+export MODEL_NAME=Llama-3.3-70B-Instruct
+export MODEL_VOLUME=$PWD/persistent_volume/volume_id_tt-metal-${MODEL_NAME}-v0.0.1/
+docker run \
+  --rm \
+  -it \
+  --env-file persistent_volume/model_envs/${MODEL_NAME}.env \
+  --cap-add ALL \
+  --device /dev/tenstorrent:/dev/tenstorrent \
+  --volume /dev/hugepages-1G:/dev/hugepages-1G:rw \
+  --volume ${MODEL_VOLUME?ERROR env var MODEL_VOLUME must be set}:/home/container_app_user/cache_root:rw \
+  --shm-size 32G \
+  --publish 7000:7000 \
+  ghcr.io/tenstorrent/tt-inference-server/vllm-llama3-src-dev-ubuntu-20.04-amd64:v0.0.1-b6ecf68e706b-b9564bf364e9
+```
+
+By default the Docker container will start running the entrypoint command wrapped in `src/run_vllm_api_server.py`.
+This can be run manually if you override the container's default command with an interactive shell via `bash`.
+In an interactive shell you can start the vLLM API server via:
+```bash
+# run server manually
+python run_vllm_api_server.py
+```
+
+The vLLM inference API server takes 3-5 minutes to start up (~40-60 minutes on first run while caches are generated), then starts serving requests. To send HTTP requests to the inference server, run the example scripts in a separate bash shell.
+
+### Example clients
+
+You can use `docker exec --user 1000 -it <container-name> bash` (the --user uid must match the container you are using; the default is 1000) to create a shell in the Docker container, or run the client scripts on the host (ensuring the correct port mappings and Python dependencies):
+
+#### Run example clients from within Docker container:
+```bash
+# one-liner to enter an interactive shell on the most recently started container
+docker exec -it $(docker ps -q | head -n1) bash
+
+# inside the interactive shell, run the example client script to send a prompt request to the vLLM server:
+cd ~/app/src
+python example_requests_client.py
+```
+
 # Additional Documentation
 
 - [Development](docs/development.md)
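If you want to poke the server with raw HTTP instead of the bundled example scripts, a hedged `curl` sketch follows. It assumes `run_vllm_api_server.py` exposes the standard vLLM OpenAI-compatible routes on the published port 7000; the model id shown is a guess and should be replaced with whatever `/v1/models` reports:

```bash
# list the served model id(s)
curl -s http://localhost:7000/v1/models

# send a completion request (substitute the model id returned above)
curl -s http://localhost:7000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.3-70B-Instruct", "prompt": "What is Tenstorrent?", "max_tokens": 64}'
```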