Tracking: Metal performance vs. MLX, llama.cpp #903

Open
EricLBuehler opened this issue Nov 10, 2024 · 6 comments

@EricLBuehler (Owner)

This issue serves to track performance on Metal hardware versus MLX and llama.cpp.

@EricLBuehler (Owner, Author) commented Nov 10, 2024

| Platform   | TG (256) T/s | PP (22) T/s |
|------------|--------------|-------------|
| mistral.rs | 37.39        | 49.78       |
| llama.cpp  | 38.94        | 304.07      |
| mlx        | 43.43        | 109.84      |

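In these tables, TG is text-generation (decode) throughput over 256 generated tokens and PP is prompt-processing (prefill) throughput over the 22-token prompt, both in tokens per second. As a minimal illustration of how such a figure is derived (purely illustrative; the elapsed times below are back-calculated from the table rather than measured):

```rust
// Tokens-per-second as reported in the tables: tokens processed in a phase
// divided by the wall-clock time spent in that phase.
fn tokens_per_second(tokens: u32, elapsed_secs: f64) -> f64 {
    tokens as f64 / elapsed_secs
}

fn main() {
    // 256 generated tokens in ~6.85 s ≈ 37.39 T/s (mistral.rs TG row above)
    println!("TG: {:.2} T/s", tokens_per_second(256, 6.847));
    // 22 prompt tokens in ~0.44 s ≈ 49.78 T/s (mistral.rs PP row above)
    println!("PP: {:.2} T/s", tokens_per_second(22, 0.442));
}
```
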
mistral.rs commit: 7b8ca9a
Run benchmark:

cargo run --features metal --release --bin mistralrs-bench '--' --n-gen 256 --n-prompt 22 --isq q8_0 plain -m meta-llama/Llama-3.1-8B-Instruct

llama.cpp commit 6423c65
Run benchmark:

./llama-cli -m ../gguf_models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -n 256 -p "What is an LLM? Please write an essay about LLMs and how they can be used."

mlx_lm (PyPI version 0.19.2)
Prequantize the model:

python3 -m mlx_lm.convert --hf-path meta-llama/Meta-Llama-3.1-8B-Instruct -q --q-bits 8 --mlx-path llama3.1_8b_8bit

Run benchmark:

python3 -m mlx_lm.generate --model ../mlx_models/llama3.1_8b_8bit --prompt "What is an LLM? Please write an essay about LLMs and how they can be used." -m 256 --ignore-chat-template

@EricLBuehler (Owner, Author) commented Nov 14, 2024

| Platform   | TG (256) T/s | PP (22) T/s |
|------------|--------------|-------------|
| mistral.rs | 37.39        | 274.32      |
| llama.cpp  | 38.94        | 304.07      |
| mlx        | 43.43        | 109.84      |

mistral.rs commit: 9d647a9
Run benchmark:

cargo run --features metal --release --bin mistralrs-bench '--' --n-gen 256 --n-prompt 22 --isq q8_0 plain -m meta-llama/Llama-3.1-8B-Instruct

llama.cpp commit 6423c65
Run benchmark:

./llama-cli -m ../gguf_models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -n 256 -p "What is an LLM? Please write an essay about LLMs and how they can be used."

mlx_lm (PyPI version 0.19.2)
Prequantize the model:

python3 -m mlx_lm.convert --hf-path meta-llama/Llama-3.1-8B-Instruct -q --q-bits 8 --mlx-path llama3.1_8b_8bit

Run benchmark:

python3 -m mlx_lm.generate --model ../mlx_models/llama3.1_8b_8bit --prompt "What is an LLM? Please write an essay about LLMs and how they can be used." -m 256 --ignore-chat-template

@julien-c

wow, really cool to track this 😍

@EricLBuehler (Owner, Author)

Thank you!

More benchmarks below for some smaller models:

Llama 3.2 3b

| Platform   | TG (256) T/s | PP (22) T/s |
|------------|--------------|-------------|
| mistral.rs | 63.33        | 434.72      |
| llama.cpp  | 75.31        | 533.13      |
| mlx        | 93.25        | 212.30      |

mistral.rs commit: 9d647a9
Run benchmark:

cargo run --features metal --release --bin mistralrs-bench '--' --n-gen 256 --n-prompt 22 --isq q8_0 plain -m meta-llama/Llama-3.2-3B-Instruct

llama.cpp commit 6423c65
Run benchmark:

./llama-cli -m ../gguf_models/Llama-3.2-3B-Instruct-Q8_0.gguf -n 256 -p "What is an LLM? Please write an essay about LLMs and how they can be used."

mlx_lm (PyPI version 0.19.2)
Prequantize the model:

python3 -m mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct -q --q-bits 8 --mlx-path ../mlx_models/llama3.2_3b_8bit

Run benchmark:

python3 -m mlx_lm.generate --model ../mlx_models/llama3.2_3b_8bit --prompt "What is an LLM? Please write an essay about LLMs and how they can be used." -m 256 --ignore-chat-template

Llama 3.2 1b

| Platform   | TG (256) T/s | PP (22) T/s |
|------------|--------------|-------------|
| mistral.rs | 132.9        | 825.73      |
| llama.cpp  | 166.84       | 1283.62     |
| mlx        | 223.12       | 455.34      |

mistral.rs commit: 9d647a9
Run benchmark:

cargo run --features metal --release --bin mistralrs-bench '--' --n-gen 256 --n-prompt 22 --isq q8_0 plain -m meta-llama/Llama-3.2-1B-Instruct

llama.cpp commit 6423c65
Run benchmark:

./llama-cli -m ../gguf_models/Llama-3.2-1B-Instruct-Q8_0.gguf -n 256 -p "What is an LLM? Please write an essay about LLMs and how they can be used."

mlx_lm (PyPI version 0.19.2)
Prequantize the model:

python3 -m mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct -q --q-bits 8 --mlx-path ../mlx_models/llama3.2_1b_8bit

Run benchmark:

python3 -m mlx_lm.generate --model ../mlx_models/llama3.2_1b_8bit --prompt "What is an LLM? Please write an essay about LLMs and how they can be used." -m 256 --ignore-chat-template

Qwen 2.5 0.5b

| Platform   | TG (256) T/s | PP (22) T/s |
|------------|--------------|-------------|
| mistral.rs | 83.00        | 823.8       |
| llama.cpp  | 203.55       | 1503.22     |
| mlx        | 231.014      | 297.12      |

mistral.rs commit: 9d647a9
Run benchmark:

cargo run --features metal --release --bin mistralrs-bench '--' --n-gen 256 --n-prompt 22 --isq q8_0 plain -m Qwen/Qwen2.5-0.5B-Instruct

llama.cpp commit 6423c65
Run benchmark:

./llama-cli -m ../gguf_models/qwen2.5-0.5b-instruct-q8_0.gguf -n 256 -p "What is an LLM? Please write an essay about LLMs and how they can be used."

mlx_lm (PyPI version 0.19.2)
Prequantize the model:

python3 -m mlx_lm.convert --hf-path Qwen/Qwen2.5-0.5B-Instruct -q --q-bits 8 --mlx-path ../mlx_models/qwen2.5_0.5b_8bit

Run benchmark:

python3 -m mlx_lm.generate --model ../mlx_models/qwen2.5_0.5b_8bit --prompt "What is an LLM? Please write an essay about LLMs and how they can be used." -m 256 --ignore-chat-template

@EricLBuehler (Owner, Author)

New benchmarks! #916 introduced a preallocated KV cache for better decoding efficiency (a rough sketch of the idea is included at the end of this comment). We can already see good results for some of the smaller models:

Llama 3.2 3b

| Platform   | TG (256) T/s | PP (22) T/s    |
|------------|--------------|----------------|
| mistral.rs | 70.93 (+12%) | 428.05 (-1.6%) |
| llama.cpp  | 75.31        | 533.13         |
| mlx        | 93.25        | 212.30         |

mistral.rs commit: 9d647a9
Run benchmark:

cargo run --features metal --release --bin mistralrs-bench '--' --n-gen 256 --n-prompt 22 --isq q8_0 plain -m meta-llama/Llama-3.2-3B-Instruct

llama.cpp commit 6423c65
Run benchmark:

./llama-cli -m ../gguf_models/Llama-3.2-3B-Instruct-Q8_0.gguf -n 256 -p "What is an LLM? Please write an essay about LLMs and how they can be used."

mlx_lm (PyPI version 0.19.2)
Prequantize the model:

python3 -m mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct -q --q-bits 8 --mlx-path ../mlx_models/llama3.2_3b_8bit

Run benchmark:

python3 -m mlx_lm.generate --model ../mlx_models/llama3.2_3b_8bit --prompt "What is an LLM? Please write an essay about LLMs and how they can be used." -m 256 --ignore-chat-template
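
Since the PR is only described at a high level here, below is a minimal Rust sketch of the general idea behind a preallocated KV cache. It is not mistral.rs's actual implementation, and the type and method names are made up for illustration: the point is that the full-capacity K/V buffers are reserved once, so the hot decode loop only copies into the next slot instead of reallocating and concatenating tensors on every step.

```rust
// Illustrative sketch only: hypothetical names, flattened Vec<f32> "tensors".
// One cache like this would exist per layer (and per KV-head group).
struct PreallocatedKvCache {
    keys: Vec<f32>,   // capacity * head_dim, reserved up front
    values: Vec<f32>, // capacity * head_dim, reserved up front
    head_dim: usize,
    len: usize,      // positions currently cached
    capacity: usize, // maximum sequence length, fixed at construction
}

impl PreallocatedKvCache {
    fn new(capacity: usize, head_dim: usize) -> Self {
        Self {
            keys: vec![0.0; capacity * head_dim],
            values: vec![0.0; capacity * head_dim],
            head_dim,
            len: 0,
            capacity,
        }
    }

    /// Append one position's K and V: no allocation on the decode path,
    /// just a copy into the already-reserved slot.
    fn append(&mut self, k: &[f32], v: &[f32]) {
        assert!(self.len < self.capacity);
        assert_eq!(k.len(), self.head_dim);
        assert_eq!(v.len(), self.head_dim);
        let start = self.len * self.head_dim;
        self.keys[start..start + self.head_dim].copy_from_slice(k);
        self.values[start..start + self.head_dim].copy_from_slice(v);
        self.len += 1;
    }

    /// Attention only reads the valid prefix.
    fn keys(&self) -> &[f32] {
        &self.keys[..self.len * self.head_dim]
    }

    fn values(&self) -> &[f32] {
        &self.values[..self.len * self.head_dim]
    }
}

fn main() {
    let mut cache = PreallocatedKvCache::new(4096, 128);
    cache.append(&vec![0.1; 128], &vec![0.2; 128]);
    assert_eq!(cache.keys().len(), 128);
    assert_eq!(cache.values().len(), 128);
}
```

The real code presumably operates on device (Metal) tensors and also has to handle batching, but the core saving is the same: avoiding per-token reallocation during decoding.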

@EricLBuehler (Owner, Author) commented Nov 26, 2024

New benchmarks with improved performance for long-context generation!

#933 was just merged with some optimizations based on ml-explore/mlx#1597! Current benchmarks with Llama 3.1 8b show an improvement from the change of about 3.3% at TG 5000.

Llama 3.1 8b

| Platform   | TG (5000) T/s | PP (22) T/s |
|------------|---------------|-------------|
| mistral.rs | 35.38         | 250.59      |
| llama.cpp  | 33.05         | 304.07      |
| mlx        | 41.45         | 109.04      |

mistral.rs commit: 4e6dd61
Run benchmark:

cargo run --features metal --release --bin mistralrs-bench '--' --n-gen 5000 --n-prompt 22 --isq q8_0 plain -m meta-llama/Llama-3.1-8B-Instruct

llama.cpp commit ab96610
Run benchmark:

./llama-cli -m ../gguf_models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -n 5000 -p "What is an LLM? Please write an essay about LLMs and how they can be used."

mlx_lm (PyPI version 0.21.0)
Prequantize the model:

python3 -m mlx_lm.convert --hf-path meta-llama/Meta-Llama-3.1-8B-Instruct -q --q-bits 8 --mlx-path llama3.1_8b_8bit

Run benchmark:

python3 -m mlx_lm.generate --model ../mlx_models/llama3.1_8b_8bit --prompt "What is an LLM? Please write an essay about LLMs and how they can be used." -m 5000 --ignore-chat-template
