Admittedly, I am pretty new to LLMs and have a lot to learn, so this could be a basic mistake in observation or interpretation. I tried ollama and llamafile on the same Ubuntu MATE 24.04.1 desktop with an Intel i5-8440, 32GB of DDR4 (single-channel) RAM, and no discrete GPU -- which is the main reason I was hoping to see a faster tokens/sec rate with llamafile, based on the performance claims I had seen. My goal is a local LLM setup serving the Qwen-2.5-Coder-7B model (4-bit K-M quantization) in FIM mode through the OpenAI-compatible web interface, to be used by the Continue.dev plugin running in VS Code.
First, I tried ollama with the model above; its terminal output reported the following metrics:
Then I ran llamafile with a freshly downloaded Hugging Face GGUF of the same model and observed the metrics reported below the web-access chat UI page:
400 tokens predicted, 379 ms per token, 2.64 tokens per second
prompt evaluation speed is 0.00 prompt tokens evaluated per second
So, if my interpretation is correct, ollama achieved 3.38 t/s while llamafile achieved 2.64 t/s, which seems odd and doesn't match what I expected (llamafile being faster). Am I missing something?
The prompt used: "Write a python program to create a csv file with 5 columns having 10 rows, where first column has firstname, second column has age between 11 & 16, third column has a random number between 40 & 100, fourth column has a random zip code, fifth column has a city name."
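In case it helps anyone reproduce this, below is a rough sketch (not part of my original test) of how the same prompt could be timed against both servers' OpenAI-compatible chat endpoints, so the tokens/sec numbers are measured the same way on both sides. The ports are the usual defaults (ollama on 11434, llamafile's built-in server on 8080) and the model name is an assumption; both would need to be adjusted to match your setup.

```python
# Hedged sketch: time the same prompt against ollama and llamafile via their
# OpenAI-compatible APIs and compare completion tokens per second.
import time
import requests

PROMPT = ("Write a python program to create a csv file with 5 columns having "
          "10 rows, where first column has firstname, second column has age "
          "between 11 & 16, third column has a random number between 40 & 100, "
          "fourth column has a random zip code, fifth column has a city name.")

# Assumed default ports; change if your servers listen elsewhere.
SERVERS = {
    "ollama":    "http://localhost:11434/v1/chat/completions",
    "llamafile": "http://localhost:8080/v1/chat/completions",
}

for name, url in SERVERS.items():
    body = {
        "model": "qwen2.5-coder:7b",  # assumed model name as registered with each server
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 400,
        "temperature": 0,             # deterministic output for a fairer comparison
    }
    start = time.time()
    resp = requests.post(url, json=body, timeout=600).json()
    elapsed = time.time() - start
    completion_tokens = resp["usage"]["completion_tokens"]
    print(f"{name}: {completion_tokens} tokens in {elapsed:.1f}s "
          f"({completion_tokens / elapsed:.2f} tok/s, including prompt processing)")
```

Note this measures wall-clock time for the whole request, so prompt evaluation is folded into the rate; it is meant only to put both servers on the same footing, not to reproduce their internal metrics exactly.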