Can I get a little clarification over my little doubts about GGUF? #732
Unanswered
AayushSameerShah
asked this question in Q&A
Replies: 1 comment
-
Late response, but maybe it will still be helpful.
-
Hello, community!
Recently I have witnessed the rise of llama.cpp and how it has let anyone run LLMs on their personal computer. I have some basic gaps in knowledge when trying to wrap my mind around this surge.

1️⃣ Running on CPU
I want to use the llama-2-chat model in quantized format, and the only hardware I have is my CPU. It is true that it can run on a CPU, but is the speed drastically reduced? And, especially, is there anything to accelerate the 4-bit quantized model while running it on a CPU?
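
To make the question concrete, this is roughly the kind of setup I mean, sketched with llama-cpp-python (the model path and thread count below are just placeholders, not recommendations):

```python
# Rough sketch of CPU-only inference with a 4-bit quantized GGUF model,
# using the llama-cpp-python binding. Paths and settings are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder 4-bit GGUF file
    n_ctx=2048,     # context window
    n_threads=8,    # roughly the number of physical cores; the main CPU speed knob
)

output = llm(
    "Q: What does GGUF stand for? A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```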
2️⃣ What is the use of BLAS and all the other jargon?
When I went through installing llama.cpp, there were many, I mean many, steps to go through. BLAS looked like it could accelerate inference. And what is mpirun? There are a lot of these things... The question is: will BLAS still work while using only the CPU? Is it required?
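
For reference, here is a sketch of what rebuilding the Python binding with OpenBLAS enabled might look like. The CMake option names here are assumptions and have changed across llama.cpp releases, so check the current build docs:

```python
# Sketch: reinstall llama-cpp-python with an OpenBLAS-backed build.
# The CMake options (LLAMA_BLAS, LLAMA_BLAS_VENDOR) are assumptions and have
# varied across llama.cpp releases; verify against the version you build.
import os
import subprocess
import sys

env = os.environ.copy()
env["CMAKE_ARGS"] = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
env["FORCE_CMAKE"] = "1"  # force llama-cpp-python to rebuild from source

subprocess.run(
    [sys.executable, "-m", "pip", "install",
     "--force-reinstall", "--no-cache-dir", "llama-cpp-python"],
    env=env,
    check=True,
)
```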
3️⃣ How is this different from CTransformers?
Okay, there is llama.cpp, and there are other implementations: in Python, llama-cpp-python; in Java, java-llama.cpp... but what is this CTransformers? How is it different from llama.cpp?
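
For comparison, both libraries can load the same GGUF file; the visible difference is mostly the Python-facing API (CTransformers exposes a Transformers-style interface over its own GGML-based backend). A rough sketch, reusing the placeholder path from above:

```python
# Sketch: the same placeholder GGUF file loaded through two Python bindings.

# llama-cpp-python: a direct binding to llama.cpp
from llama_cpp import Llama
llm_a = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_threads=8)
print(llm_a("Hello, llama!", max_tokens=16)["choices"][0]["text"])

# ctransformers: a Transformers-style interface over a GGML-based backend
from ctransformers import AutoModelForCausalLM
llm_b = AutoModelForCausalLM.from_pretrained(
    "./models/llama-2-7b-chat.Q4_K_M.gguf",  # same placeholder file
    model_type="llama",
)
print(llm_b("Hello, llama!", max_new_tokens=16))
```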
4️⃣ Can I get a filtered, step-by-step guide for installation on Windows?
The README of llama.cpp is pretty clear, but it has the instructions for every OS scattered across a single page, which makes it hard to navigate for your specific purpose.
My purpose:
What should I install to get the maximum inference speed? Will you please guide me through that?
I know I am asking a lot, but if you can provide a simple and straightforward guide to get the maximum speed for my requirements, it would be amazing!
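
Once something is installed, a small sanity check like the sketch below (same placeholder path as above; the OpenAI-style "usage" field in the returned dict is an assumption) can be used to compare generation speed across different build options:

```python
# Sketch: rough tokens-per-second measurement, for comparing builds/settings.
# Assumes the completion dict includes an OpenAI-style "usage" field.
import time
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_threads=8)

start = time.perf_counter()
out = llm("Write one sentence about llamas.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```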
Apologies for the noobie questions,
Thanks! 🙏🏻