Can I get a little clarification over my little doubts about GGUF? #732
Unanswered
AayushSameerShah
asked this question in Q&A
Replies: 1 comment
-
Late response, but maybe it will still be helpful.
-
Hello, community!
Recently I have witnessed the rise of llama.cpp and how it has let anyone run LLMs on their personal computer. I have some basic gaps in knowledge when trying to wrap my mind around this surge.

1️⃣ Running on CPU
I want to use the llama-2-chat model in quantized format, and the only hardware I have is my CPU. It is true that it can run on a CPU, but is the speed drastically reduced? And, especially, is there anything to accelerate the 4-bit quantized model while running it on a CPU?
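
To make the question concrete, this is roughly the kind of setup I mean, sketched with llama-cpp-python (the model path and thread count below are just placeholders, not recommendations):

```python
# Rough sketch of CPU-only inference with a 4-bit quantized GGUF model,
# using the llama-cpp-python binding. Paths and settings are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder 4-bit GGUF file
    n_ctx=2048,     # context window
    n_threads=8,    # roughly the number of physical cores; the main CPU speed knob
)

output = llm(
    "Q: What does GGUF stand for? A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```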
2️⃣ What is the use of BLAS and all the other jargon?
When I went through installing llama.cpp, there were many, I mean many, steps to go through. BLAS looked like it could accelerate inference. And what is mpirun? There are a lot of these things... The question is: will BLAS still work while using only the CPU? Is it required?
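
For reference, here is a sketch of what rebuilding the Python binding with OpenBLAS enabled might look like. The CMake option names here are assumptions and have changed across llama.cpp releases, so check the current build docs:

```python
# Sketch: reinstall llama-cpp-python with an OpenBLAS-backed build.
# The CMake options (LLAMA_BLAS, LLAMA_BLAS_VENDOR) are assumptions and have
# varied across llama.cpp releases; verify against the version you build.
import os
import subprocess
import sys

env = os.environ.copy()
env["CMAKE_ARGS"] = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
env["FORCE_CMAKE"] = "1"  # force llama-cpp-python to rebuild from source

subprocess.run(
    [sys.executable, "-m", "pip", "install",
     "--force-reinstall", "--no-cache-dir", "llama-cpp-python"],
    env=env,
    check=True,
)
```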
3️⃣ How is this different from CTransformers?
Okay, there is llama.cpp, and there are other implementations: in Python, llama-cpp-python; in Java, java-llama.cpp... but what is this CTransformers? How is it different from llama.cpp?
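
For comparison, both libraries can load the same GGUF file; the visible difference is mostly the Python-facing API (CTransformers exposes a Transformers-style interface over its own GGML-based backend). A rough sketch, reusing the placeholder path from above:

```python
# Sketch: the same placeholder GGUF file loaded through two Python bindings.

# llama-cpp-python: a direct binding to llama.cpp
from llama_cpp import Llama
llm_a = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_threads=8)
print(llm_a("Hello, llama!", max_tokens=16)["choices"][0]["text"])

# ctransformers: a Transformers-style interface over a GGML-based backend
from ctransformers import AutoModelForCausalLM
llm_b = AutoModelForCausalLM.from_pretrained(
    "./models/llama-2-7b-chat.Q4_K_M.gguf",  # same placeholder file
    model_type="llama",
)
print(llm_b("Hello, llama!", max_new_tokens=16))
```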
4️⃣ Can I get a filtered, step-by-step guide for installation on Windows?
The README of llama.cpp is pretty clear, but it has the instructions for every OS scattered across a single page, which makes it hard to navigate for your specific purpose.
My purpose:
What should I install to get the maximum inference speed? Will you please guide me through that?
I know I am asking a lot, but if you can provide a simple and straightforward guide to get the maximum speed for my requirements, it would be amazing!
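
Once something is installed, a small sanity check like the sketch below (same placeholder path as above; the OpenAI-style "usage" field in the returned dict is an assumption) can be used to compare generation speed across different build options:

```python
# Sketch: rough tokens-per-second measurement, for comparing builds/settings.
# Assumes the completion dict includes an OpenAI-style "usage" field.
import time
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_threads=8)

start = time.perf_counter()
out = llm("Write one sentence about llamas.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```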
Apologies for the noobie questions,
Thanks! 🙏🏻