Tracking: Metal performance vs. MLX, llama.cpp #903
mistral.rs commit: 7b8ca9a
llama.cpp commit 6423c65
mlx_lm (PyPI version 0.19.2)
Run benchmark:
mistral.rs commit: 9d647a9
llama.cpp commit 6423c65
mlx_lm (PyPI version 0.19.2)
Run benchmark:
wow, really cool to track this 😍
Thank you! More benchmarks below for some smaller models:
Llama 3.2 3b
mistral.rs commit: 9d647a9
llama.cpp commit 6423c65
mlx_lm (PyPI version 0.19.2)
Run benchmark:
Llama 3.2 1b
mistral.rs commit: 9d647a9
llama.cpp commit 6423c65
mlx_lm (PyPI version 0.19.2)
Run benchmark:
Qwen 2.5 0.5b
mistral.rs commit: 9d647a9
llama.cpp commit 6423c65
mlx_lm (PyPI version 0.19.2)
Run benchmark:
New benchmarks! #916 introduced a preallocated KV cache for better decoding efficiency (a sketch of the idea follows the benchmark commands below). We can already see good results for some of the smaller models:
Platform | TG (256) T/s | PP (22) T/s |
---|---|---|
mistral.rs | 70.93 (+12%) | 428.05 (-1.6%) |
llama.cpp | 75.31 | 533.13 |
mlx | 93.25 | 212.30 |
mistral.rs commit: 9d647a9
Run benchmark:
cargo run --features metal --release --bin mistralrs-bench '--' --n-gen 256 --n-prompt 22 --isq q8_0 plain -m meta-llama/Llama-3.2-3B-Instruct
llama.cpp commit 6423c65
Run benchmark:
./llama-cli -m ../gguf_models/Llama-3.2-3B-Instruct-Q8_0.gguf -n 256 -p "What is an LLM? Please write an essay about LLMs and how they can be used."
mlx_lm (PyPI version 0.19.2)
Prequantize the model:
python3 -m mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct -q --q-bits 8 --mlx-path ../mlx_models/llama3.2_3b_8bit
Run benchmark:
python3 -m mlx_lm.generate --model ../mlx_models/llama3.2_3b_8bit --prompt "What is an LLM? Please write an essay about LLMs and how they can be used." -m 256 --ignore-chat-template
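To make the #916 change above concrete: the gist of a preallocated KV cache is to allocate the full-capacity key/value buffers once and copy each new token's K/V into them, rather than growing the cache on every decode step. The sketch below is illustrative only, with hypothetical names and plain Rust vectors; the actual mistral.rs implementation operates on Metal tensors.
```rust
// Minimal sketch of a preallocated KV cache (illustrative, not the #916 code).
struct PreallocatedKvCache {
    keys: Vec<f32>,   // flat buffer of shape [capacity, head_dim]
    values: Vec<f32>, // same layout as `keys`
    head_dim: usize,
    capacity: usize, // maximum number of cached token positions
    len: usize,      // number of positions written so far
}

impl PreallocatedKvCache {
    fn new(capacity: usize, head_dim: usize) -> Self {
        Self {
            // One allocation up front: each decode step only copies into the
            // existing buffer instead of reallocating and growing it.
            keys: vec![0.0; capacity * head_dim],
            values: vec![0.0; capacity * head_dim],
            head_dim,
            capacity,
            len: 0,
        }
    }

    /// Append the K/V vectors for one new token position.
    fn append(&mut self, k: &[f32], v: &[f32]) {
        assert!(self.len < self.capacity, "KV cache capacity exceeded");
        assert_eq!(k.len(), self.head_dim);
        assert_eq!(v.len(), self.head_dim);
        let start = self.len * self.head_dim;
        self.keys[start..start + self.head_dim].copy_from_slice(k);
        self.values[start..start + self.head_dim].copy_from_slice(v);
        self.len += 1;
    }

    /// View over the populated portion of the key cache, as attention would use it.
    fn keys(&self) -> &[f32] {
        &self.keys[..self.len * self.head_dim]
    }

    /// View over the populated portion of the value cache.
    fn values(&self) -> &[f32] {
        &self.values[..self.len * self.head_dim]
    }
}

fn main() {
    let mut cache = PreallocatedKvCache::new(256, 4);
    cache.append(&[1.0, 2.0, 3.0, 4.0], &[0.1, 0.2, 0.3, 0.4]);
    assert_eq!(cache.keys().len(), 4);
    assert_eq!(cache.values().len(), 4);
    println!("cached positions: {}", cache.len);
}
```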
New benchmarks with improved performance for long-context generation! #933 was just merged with some optimizations based on ml-explore/mlx#1597! Current benchmarks with Llama 3.1 8b show roughly a 3.3% gain from the change at TG 5000.
Platform | TG (5000) T/s | PP (22) T/s |
---|---|---|
mistral.rs | 35.38 | 250.59 |
llama.cpp | 33.05 | 304.07 |
mlx | 41.45 | 109.04 |
mistral.rs commit: 4e6dd61
Run benchmark:
cargo run --features metal --release --bin mistralrs-bench '--' --n-gen 5000 --n-prompt 22 --isq q8_0 plain -m meta-llama/Llama-3.1-8B-Instruct
llama.cpp commit ab96610
Run benchmark:
./llama-cli -m ../gguf_models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -n 5000 -p "What is an LLM? Please write an essay about LLMs and how they can be used."
mlx_lm (PyPI version 0.21.0)
Prequantize the model:
python3 -m mlx_lm.convert --hf-path meta-llama/Meta-Llama-3.1-8B-Instruct -q --q-bits 8 --mlx-path llama3.1_8b_8bit
Run benchmark:
python3 -m mlx_lm.generate --model ../mlx_models/llama3.1_8b_8bit --prompt "What is an LLM? Please write an essay about LLMs and how they can be used." -m 5000 --ignore-chat-template
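For anyone comparing the tables across tools: the T/s columns are simply tokens divided by wall-clock seconds for that phase (TG = token generation, PP = prompt processing). A minimal illustration of that arithmetic follows; this is not the mistralrs-bench source, and the elapsed time is back-computed from the reported rate.
```rust
/// Throughput as reported in the tables above: tokens per wall-clock second.
fn tokens_per_second(n_tokens: u64, elapsed_secs: f64) -> f64 {
    n_tokens as f64 / elapsed_secs
}

fn main() {
    // 5000 generated tokens in ~141.3 s corresponds to the 35.38 T/s
    // mistral.rs entry in the TG (5000) table above.
    println!("{:.2} T/s", tokens_per_second(5000, 141.32));
}
```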
This issue serves to track performance on Metal hardware versus MLX and llama.cpp.