Supports Linux, macOS, and Windows. Optimized for ARM and x86_64 AVX2 CPUs.

Python 3 and a C++ compiler are required. The command will download the model and the tokenizer.

| Model | Size | Command |
| --------------------------------- | -------- | ---------------------------------------------------- |
| Llama 3.1 8B Instruct Q40 | 6.32 GB | `python launch.py llama3_1_8b_instruct_q40` |
| Llama 3.1 405B Instruct Q40        | 238 GB   | `python launch.py llama3_1_405b_instruct_q40`         |
| Llama 3.2 1B Instruct Q40 | 1.7 GB | `python launch.py llama3_2_1b_instruct_q40` |
| Llama 3.2 3B Instruct Q40 | 3.4 GB | `python launch.py llama3_2_3b_instruct_q40` |
| Llama 3.3 70B Instruct Q40 | 40 GB | `python launch.py llama3_3_70b_instruct_q40` |
| DeepSeek R1 Distill Llama 8B Q40 | 6.32 GB | `python launch.py deepseek_r1_distill_llama_8b_q40` |
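
For example, the smallest model from the table can be fetched like this (a minimal sketch; the prerequisite check is just illustrative):

```sh
# Make sure Python 3 and a C++ compiler are available.
python3 --version && g++ --version

# Download Llama 3.2 1B Instruct (Q40) together with its tokenizer (~1.7 GB).
python launch.py llama3_2_1b_instruct_q40
```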

### 🛠️ Convert Model Manually

You always need the root node, and you can add 2^n - 1 worker nodes to speed up the inference (the total number of nodes must be a power of two, so with 1, 3, or 7 workers you run on 2, 4, or 8 nodes).
* `dllama worker` - run the worker node,
* `dllama-api` - run the API server.

<details>

<summary>🎹 Supported Arguments</summary>

<br />Inference, Chat, API

| Argument | Description | Example |
| ---------------------------- | ---------------------------------------------------------------- | -------------------------------------- |
Inference

| Argument                     | Description                                                       | Example                                 |
| ---------------------------- | ----------------------------------------------------------------- | --------------------------------------- |
| `--prompt <prompt>` | Initial prompt. | `"Hello World"` |
| `--steps <steps>` | Number of tokens to generate. | `256` |

</details>
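
For example, once a model is downloaded you can start the API server and query it over HTTP. This is a sketch, not the project's documented workflow: the flag set is assumed to mirror the `dllama inference` example in the Setup section, and the OpenAI-style endpoint path and the port are assumptions.

```sh
# Start the API server on the root node (flags assumed to mirror `dllama inference`).
./dllama-api --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 --nthreads 4 --port 9990

# Query it from another terminal; the endpoint path is an assumption.
curl http://localhost:9990/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'
```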

## 📊 Measurements

Please check the [discussions](https://github.com/b4rtaz/distributed-llama/discussions) section, where measurements for many different configurations have been published.

## 🚀 Setup

Select and expand one of the sections below:

<details>

<summary>💻 macOS, Linux, or Windows</summary>

<br />You need x86_64 CPUs with AVX2 support or ARM CPUs. Devices in the cluster may have different CPUs.

#### macOS or Linux

The instructions below are for Debian-based distributions, but you can easily adapt them to other distributions or macOS.

1. Install Git and GCC:
```sh
sudo apt install git build-essential
```
2. Clone this repository and compile Distributed Llama on all computers:
```sh
git clone https://github.com/b4rtaz/distributed-llama.git
make dllama
make dllama-api
```

Continue to step 3.
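
Tip: if you have several Linux machines, you can run step 2 on all of them over SSH, as in this sketch (the hostnames are assumptions):

```sh
for host in worker1 worker2 worker3; do
  ssh "$host" "git clone https://github.com/b4rtaz/distributed-llama.git && cd distributed-llama && make dllama && make dllama-api"
done
```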

#### Windows

1. Install Git and MinGW (via [Chocolatey](https://chocolatey.org/install)):
```powershell
choco install mingw
```
2. Clone this repository and compile Distributed Llama on all computers:
```sh
git clone https://github.com/b4rtaz/distributed-llama.git
make dllama
make dllama-api
```

Continue to step 3.

#### Run Cluster

3. Transfer the model weights and the tokenizer file to the root computer.
4. Run worker nodes on worker computers:
```sh
./dllama worker --port 9998 --nthreads 4
```
5. Run root node on the root computer:
```sh
./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
```

To add more worker nodes, just add more addresses to the `--workers` argument.

```
./dllama inference ... --workers 192.168.0.1:9998 192.168.0.2:9998 192.168.0.3:9998
```
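
You can also start an interactive chat on the root node instead of a one-shot inference. This is a sketch: the chat mode is implied by the argument groups above, but its flag set is assumed to mirror the inference example.

```sh
./dllama chat --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 --nthreads 4 --workers 192.168.0.1:9998
```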

</details>

<details>

<summary>📟 Raspberry Pi</summary>

<br />

1. Install `Raspberry Pi OS Lite (64 bit)` on your Raspberry Pi devices. This OS doesn't have a desktop environment.
2. Connect all devices to your switch or router.
To add more worker nodes, just add more addresses to the `--workers` argument.

```
./dllama inference ... --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998
```
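
To start all workers in one go, you can loop over them from the root device, as in this sketch (the IP addresses and the `pi` user are assumptions):

```sh
for ip in 10.0.0.2 10.0.0.3 10.0.0.4; do
  ssh "pi@$ip" "cd distributed-llama && nohup ./dllama worker --port 9998 --nthreads 4 > worker.log 2>&1 &"
done
```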

</details>

## ✋ Contribution

