Admittedly, I am pretty new to LLMs and have a lot to learn, so this could be a basic mistake in observation or interpretation. I tried ollama and llamafile on the same Ubuntu MATE 24.04.1 desktop with an Intel i5-8440, 32GB of DDR4 (single-channel) RAM, and no discrete GPU -- which is the main reason I was hoping to see a faster tokens/sec rate with llamafile, based on the performance claims I had seen. My goal is a local LLM setup serving the Qwen-2.5-Coder-7B model (4-bit K-M quantization) in FIM mode through the OpenAI-compatible web interface, to be used by the Continue.dev plugin running in VS Code.
First, I tried ollama with the model above; its terminal output reported the following metrics:
Then I ran llamafile with a freshly downloaded Hugging Face GGUF of the same model and observed the metrics reported below the web-access chat UI page:
400 tokens predicted, 379 ms per token, 2.64 tokens per second
prompt evaluation speed is 0.00 prompt tokens evaluated per second
So, if my interpretation is correct, ollama achieved 3.38 t/s while llamafile achieved 2.64 t/s, which seems odd and doesn't match what I expected (llamafile being faster). Am I missing something?
The prompt used: "Write a python program to create a csv file with 5 columns having 10 rows, where first column has firstname, second column has age between 11 & 16, third column has a random number between 40 & 100, fourth column has a random zip code, fifth column has a city name."
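In case it helps anyone reproduce this, below is a rough sketch (not part of my original test) of how the same prompt could be timed against both servers' OpenAI-compatible chat endpoints, so the tokens/sec numbers are measured the same way on both sides. The ports are the usual defaults (ollama on 11434, llamafile's built-in server on 8080) and the model name is an assumption; both would need to be adjusted to match your setup.

```python
# Hedged sketch: time the same prompt against ollama and llamafile via their
# OpenAI-compatible APIs and compare completion tokens per second.
import time
import requests

PROMPT = ("Write a python program to create a csv file with 5 columns having "
          "10 rows, where first column has firstname, second column has age "
          "between 11 & 16, third column has a random number between 40 & 100, "
          "fourth column has a random zip code, fifth column has a city name.")

# Assumed default ports; change if your servers listen elsewhere.
SERVERS = {
    "ollama":    "http://localhost:11434/v1/chat/completions",
    "llamafile": "http://localhost:8080/v1/chat/completions",
}

for name, url in SERVERS.items():
    body = {
        "model": "qwen2.5-coder:7b",  # assumed model name as registered with each server
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 400,
        "temperature": 0,             # deterministic output for a fairer comparison
    }
    start = time.time()
    resp = requests.post(url, json=body, timeout=600).json()
    elapsed = time.time() - start
    completion_tokens = resp["usage"]["completion_tokens"]
    print(f"{name}: {completion_tokens} tokens in {elapsed:.1f}s "
          f"({completion_tokens / elapsed:.2f} tok/s, including prompt processing)")
```

Note this measures wall-clock time for the whole request, so prompt evaluation is folded into the rate; it is meant only to put both servers on the same footing, not to reproduce their internal metrics exactly.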