Tracking: Metal performance vs. MLX, llama.cpp #903
mistral.rs commit: 7b8ca9a
llama.cpp commit 6423c65
mlx_lm (PyPI version 0.19.2)
Run benchmark:
mistral.rs commit: 9d647a9
llama.cpp commit 6423c65
mlx_lm (PyPI version 0.19.2)
Run benchmark:
wow, really cool to track this 😍
Thank you! More benchmarks below for some smaller models:
Llama 3.2 3b
mistral.rs commit: 9d647a9
llama.cpp commit 6423c65
mlx_lm (PyPI version 0.19.2)
Run benchmark:
Llama 3.2 1b
mistral.rs commit: 9d647a9
llama.cpp commit 6423c65
mlx_lm (PyPI version 0.19.2)
Run benchmark:
Qwen 2.5 0.5b
mistral.rs commit: 9d647a9
llama.cpp commit 6423c65
mlx_lm (PyPI version 0.19.2)
Run benchmark:
New benchmarks! #916 introduced a preallocated KV cache for better decoding efficiency (a sketch of the idea follows the benchmark commands below). We can already see good results for some of the smaller models:
Platform | TG (256) T/s | PP (22) T/s |
---|---|---|
mistral.rs | 70.93 (+12%) | 428.05 (-1.6%) |
llama.cpp | 75.31 | 533.13 |
mlx | 93.25 | 212.30 |
mistral.rs commit: 9d647a9
Run benchmark:
cargo run --features metal --release --bin mistralrs-bench '--' --n-gen 256 --n-prompt 22 --isq q8_0 plain -m meta-llama/Llama-3.2-3B-Instruct
llama.cpp commit 6423c65
Run benchmark:
./llama-cli -m ../gguf_models/Llama-3.2-3B-Instruct-Q8_0.gguf -n 256 -p "What is an LLM? Please write an essay about LLMs and how they can be used."
mlx_lm (PyPI version 0.19.2)
Prequantize the model:
python3 -m mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct -q --q-bits 8 --mlx-path ../mlx_models/llama3.2_3b_8bit
Run benchmark:
python3 -m mlx_lm.generate --model ../mlx_models/llama3.2_3b_8bit --prompt "What is an LLM? Please write an essay about LLMs and how they can be used." -m 256 --ignore-chat-template
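To make the #916 change above concrete: the gist of a preallocated KV cache is to allocate the full-capacity key/value buffers once and copy each new token's K/V into them, rather than growing the cache on every decode step. The sketch below is illustrative only, with hypothetical names and plain Rust vectors; the actual mistral.rs implementation operates on Metal tensors.
```rust
// Minimal sketch of a preallocated KV cache (illustrative, not the #916 code).
struct PreallocatedKvCache {
    keys: Vec<f32>,   // flat buffer of shape [capacity, head_dim]
    values: Vec<f32>, // same layout as `keys`
    head_dim: usize,
    capacity: usize, // maximum number of cached token positions
    len: usize,      // number of positions written so far
}

impl PreallocatedKvCache {
    fn new(capacity: usize, head_dim: usize) -> Self {
        Self {
            // One allocation up front: each decode step only copies into the
            // existing buffer instead of reallocating and growing it.
            keys: vec![0.0; capacity * head_dim],
            values: vec![0.0; capacity * head_dim],
            head_dim,
            capacity,
            len: 0,
        }
    }

    /// Append the K/V vectors for one new token position.
    fn append(&mut self, k: &[f32], v: &[f32]) {
        assert!(self.len < self.capacity, "KV cache capacity exceeded");
        assert_eq!(k.len(), self.head_dim);
        assert_eq!(v.len(), self.head_dim);
        let start = self.len * self.head_dim;
        self.keys[start..start + self.head_dim].copy_from_slice(k);
        self.values[start..start + self.head_dim].copy_from_slice(v);
        self.len += 1;
    }

    /// View over the populated portion of the key cache, as attention would use it.
    fn keys(&self) -> &[f32] {
        &self.keys[..self.len * self.head_dim]
    }

    /// View over the populated portion of the value cache.
    fn values(&self) -> &[f32] {
        &self.values[..self.len * self.head_dim]
    }
}

fn main() {
    let mut cache = PreallocatedKvCache::new(256, 4);
    cache.append(&[1.0, 2.0, 3.0, 4.0], &[0.1, 0.2, 0.3, 0.4]);
    assert_eq!(cache.keys().len(), 4);
    assert_eq!(cache.values().len(), 4);
    println!("cached positions: {}", cache.len);
}
```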
New benchmarks with improved performance for long-context generation! #933 was just merged with some optimizations based on ml-explore/mlx#1597! Current benchmarks with Llama 3.1 8b show roughly a 3.3% gain from the change at TG 5000.
Platform | TG (5000) T/s | PP (22) T/s |
---|---|---|
mistral.rs | 35.38 | 250.59 |
llama.cpp | 33.05 | 304.07 |
mlx | 41.45 | 109.04 |
mistral.rs commit: 4e6dd61
Run benchmark:
cargo run --features metal --release --bin mistralrs-bench '--' --n-gen 5000 --n-prompt 22 --isq q8_0 plain -m meta-llama/Llama-3.1-8B-Instruct
llama.cpp commit ab96610
Run benchmark:
./llama-cli -m ../gguf_models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -n 5000 -p "What is an LLM? Please write an essay about LLMs and how they can be used."
mlx_lm (PyPI version 0.21.0)
Prequantize the model:
python3 -m mlx_lm.convert --hf-path meta-llama/Meta-Llama-3.1-8B-Instruct -q --q-bits 8 --mlx-path llama3.1_8b_8bit
Run benchmark:
python3 -m mlx_lm.generate --model ../mlx_models/llama3.1_8b_8bit --prompt "What is an LLM? Please write an essay about LLMs and how they can be used." -m 5000 --ignore-chat-template
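For anyone comparing the tables across tools: the T/s columns are simply tokens divided by wall-clock seconds for that phase (TG = token generation, PP = prompt processing). A minimal illustration of that arithmetic follows; this is not the mistralrs-bench source, and the elapsed time is back-computed from the reported rate.
```rust
/// Throughput as reported in the tables above: tokens per wall-clock second.
fn tokens_per_second(n_tokens: u64, elapsed_secs: f64) -> f64 {
    n_tokens as f64 / elapsed_secs
}

fn main() {
    // 5000 generated tokens in ~141.3 s corresponds to the 35.38 T/s
    // mistral.rs entry in the TG (5000) table above.
    println!("{:.2} T/s", tokens_per_second(5000, 141.32));
}
```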
This issue serves to track performance on Metal hardware versus MLX and llama.cpp.