Supports Linux, macOS, and Windows. Optimized for ARM and x86_64 AVX2 CPUs.

Python 3 and a C++ compiler are required. The command will download the model and the tokenizer.

| Model | Size | Command |
| --------------------------------- | -------- | ---------------------------------------------------- |
| Llama 3.1 8B Instruct Q40 | 6.32 GB | `python launch.py llama3_1_8b_instruct_q40` |
| Llama 3.1 405B Instruct Q40        | 238 GB   | `python launch.py llama3_1_405b_instruct_q40`         |
| Llama 3.2 1B Instruct Q40 | 1.7 GB | `python launch.py llama3_2_1b_instruct_q40` |
| Llama 3.2 3B Instruct Q40 | 3.4 GB | `python launch.py llama3_2_3b_instruct_q40` |
| Llama 3.3 70B Instruct Q40 | 40 GB | `python launch.py llama3_3_70b_instruct_q40` |
| DeepSeek R1 Distill Llama 8B Q40 | 6.32 GB | `python launch.py deepseek_r1_distill_llama_8b_q40` |
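
For example, the smallest model from the table can be fetched like this (a minimal sketch; the prerequisite check is just illustrative):

```sh
# Make sure Python 3 and a C++ compiler are available.
python3 --version && g++ --version

# Download Llama 3.2 1B Instruct (Q40) together with its tokenizer (~1.7 GB).
python launch.py llama3_2_1b_instruct_q40
```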

### 🛠️ Convert Model Manually

You always need the root node, and you can add 2^n - 1 worker nodes to speed up the inference (the total number of nodes must be a power of two, so with 1, 3, or 7 workers you run on 2, 4, or 8 nodes).
* `dllama worker` - run the worker node,
* `dllama-api` - run the API server.

<details>

<summary>🎹 Supported Arguments</summary>

<br />Inference, Chat, API

| Argument | Description | Example |
| ---------------------------- | ---------------------------------------------------------------- | -------------------------------------- |
Inference

| Argument                     | Description                                                       | Example                                 |
| ---------------------------- | ----------------------------------------------------------------- | --------------------------------------- |
| `--prompt <prompt>` | Initial prompt. | `"Hello World"` |
| `--steps <steps>` | Number of tokens to generate. | `256` |

</details>
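
For example, once a model is downloaded you can start the API server and query it over HTTP. This is a sketch, not the project's documented workflow: the flag set is assumed to mirror the `dllama inference` example in the Setup section, and the OpenAI-style endpoint path and the port are assumptions.

```sh
# Start the API server on the root node (flags assumed to mirror `dllama inference`).
./dllama-api --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 --nthreads 4 --port 9990

# Query it from another terminal; the endpoint path is an assumption.
curl http://localhost:9990/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}'
```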

## 📊 Measurements

Please check the [discussions](https://github.com/b4rtaz/distributed-llama/discussions) section, where measurements for many different configurations have been published.

## 🚀 Setup

Select and expand one of the sections below:

<details>

<summary>💻 macOS, Linux, or Windows</summary>

<br />You need x86_64 CPUs with AVX2 support or ARM CPUs. Devices in the cluster may have different CPUs.

#### macOS or Linux

The instructions below are for Debian-based distributions, but you can easily adapt them to other distributions or macOS.

1. Install Git and GCC:
```sh
sudo apt install git build-essential
```
2. Clone this repository and compile Distributed Llama on all computers:
```sh
git clone https://github.com/b4rtaz/distributed-llama.git
make dllama
make dllama-api
```

Continue to step 3.
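
Tip: if you have several Linux machines, you can run step 2 on all of them over SSH, as in this sketch (the hostnames are assumptions):

```sh
for host in worker1 worker2 worker3; do
  ssh "$host" "git clone https://github.com/b4rtaz/distributed-llama.git && cd distributed-llama && make dllama && make dllama-api"
done
```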

#### Windows

1. Install Git and MinGW (via [Chocolatey](https://chocolatey.org/install)):
```powershell
choco install mingw
```
2. Clone this repository and compile Distributed Llama on all computers:
```sh
git clone https://github.com/b4rtaz/distributed-llama.git
make dllama
make dllama-api
```

Continue to step 3.

#### Run Cluster

3. Transfer the model weights and the tokenizer file to the root computer.
4. Run worker nodes on worker computers:
```sh
./dllama worker --port 9998 --nthreads 4
```
5. Run root node on the root computer:
```sh
./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
```

To add more worker nodes, just add more addresses to the `--workers` argument.

```
./dllama inference ... --workers 192.168.0.1:9998 192.168.0.2:9998 192.168.0.3:9998
```
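
You can also start an interactive chat on the root node instead of a one-shot inference. This is a sketch: the chat mode is implied by the argument groups above, but its flag set is assumed to mirror the inference example.

```sh
./dllama chat --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 --nthreads 4 --workers 192.168.0.1:9998
```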

</details>

<details>

<summary>📟 Raspberry Pi</summary>

<br />

1. Install `Raspberry Pi OS Lite (64 bit)` on your Raspberry Pi devices. This OS doesn't have a desktop environment.
2. Connect all devices to your switch or router.
To add more worker nodes, just add more addresses to the `--workers` argument.

```
./dllama inference ... --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998
```
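
To start all workers in one go, you can loop over them from the root device, as in this sketch (the IP addresses and the `pi` user are assumptions):

```sh
for ip in 10.0.0.2 10.0.0.3 10.0.0.4; do
  ssh "pi@$ip" "cd distributed-llama && nohup ./dllama worker --port 9998 --nthreads 4 > worker.log 2>&1 &"
done
```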

</details>

## ✋ Contribution

