3-bit or 6-bit quantization #702
Comments
We do not have a solution for storing weights in 3 or 6 bits, nor do we know how to run inference on them just yet. I'm open to PRs on this.
Thanks for the response @casper-hansen. I see -- I am still finding my way around this, but I have come across other quantization methods that support 2- or 3-bit quantization. Are you aware, at a high level, of how they do that? If you have any pointers, that would be helpful. And when I do end up figuring something out, I will be more than happy to contribute.
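From what I can tell so far (I may be off on the details), methods that store 2-, 3-, or 8-bit weights seem to pack several k-bit integers into a single 32-bit word and have the kernel unpack them with shifts and masks at inference time. A rough sketch of the packing idea, with illustrative function names of my own (not any library's API):

```python
import numpy as np

def pack_kbit(q: np.ndarray, k: int) -> np.ndarray:
    """Pack unsigned k-bit values (each in [0, 2**k)) into 32-bit words."""
    per_word = 32 // k                               # e.g. 10 values/word for k=3, 5 for k=6
    pad = (-len(q)) % per_word                       # pad so the last word is full
    q = np.concatenate([q.astype(np.uint32), np.zeros(pad, dtype=np.uint32)])
    words = np.zeros(len(q) // per_word, dtype=np.uint32)
    for i in range(per_word):
        words |= q[i::per_word] << np.uint32(k * i)  # value i of each group goes at bit offset k*i
    return words

def unpack_kbit(words: np.ndarray, k: int, n: int) -> np.ndarray:
    """Recover the first n k-bit values packed by pack_kbit."""
    per_word = 32 // k
    mask = np.uint32((1 << k) - 1)
    out = np.empty(len(words) * per_word, dtype=np.uint32)
    for i in range(per_word):
        out[i::per_word] = (words >> np.uint32(k * i)) & mask
    return out[:n]

# round trip for 3-bit values
vals = np.random.randint(0, 8, size=37)
assert np.array_equal(unpack_kbit(pack_kbit(vals, 3), 3, len(vals)), vals)
```

For k = 3 ten values fit per word (two bits are wasted) and for k = 6 five values fit; the hard part, as I understand it, is the matching GPU kernel that unpacks and dequantizes these on the fly.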
To add to the above discussion, I believe it is still possible to quantize the weights and keep them in float16 or float32 for a bare-minimum performance evaluation, right? Of course, the memory and inference-speed gains won't be achieved; it's just a crude way to see what such a quantization level would do in terms of accuracy.
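Something along these lines is what I had in mind: quantize and immediately dequantize, so the weights stay in floating point. A minimal sketch, assuming AWQ-style asymmetric per-group rounding; the name fake_quantize, the default group size, and the 3-bit default are just illustrative:

```python
import torch

def fake_quantize(w: torch.Tensor, n_bit: int = 3, group_size: int = 128) -> torch.Tensor:
    """Asymmetric per-group quantize-dequantize; output stays in the original dtype."""
    orig_shape, orig_dtype = w.shape, w.dtype
    w = w.float().reshape(-1, group_size)            # assumes in_features % group_size == 0
    max_val = w.amax(dim=1, keepdim=True)
    min_val = w.amin(dim=1, keepdim=True)
    max_int = 2 ** n_bit - 1
    scales = (max_val - min_val).clamp(min=1e-5) / max_int
    zeros = (-torch.round(min_val / scales)).clamp(0, max_int)
    w_q = torch.clamp(torch.round(w / scales) + zeros, 0, max_int)  # the n-bit integer codes
    w_dq = (w_q - zeros) * scales                                   # immediately dequantize
    return w_dq.reshape(orig_shape).to(orig_dtype)
```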
Hello! I have been playing around with AutoAWQ for a couple of weeks now. I have managed to run it on LLaVA and evaluate the quantized version using the lmms-eval library, and I have now gotten to the point where I want to test the performance of 3-bit and 6-bit quantization.

I understand that the current kernels (I was using GEMM from the default example) only support 4-bit. However, I could not find any such limitation in the quantization process itself. To put it more concretely, the computation of the scales and the clipping is not limited by the kernel and could in theory work for any quantization resolution, which means I can run the quantization at whatever resolution I want. The problem appears only at the very last step of the quantization process, when the GEMM kernel (or any of the other kernels) is called, or when you try to load the quantized model using the "from_quantized" method (assuming the last step of the quantization process is somehow taken care of). Those are the only two places where the kernel is invoked, and the error about 4-bit-only support breaks the execution.

My question: is there a simple way to run the quantized model without using the kernels, just for the purpose of performance evaluation? My immediate goal is to compare performance at different quantization levels, and an easy way to do this would be a good starting point. Further down the line, I would be interested in running the quantized models more efficiently, which is where optimized kernels would come in (AFAIK). If someone can point me in the direction of how one would do that, or if there are any recipes for it, that would be useful as well. Thanks!
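For the evaluation-only part of my question, what I was picturing is something like the following: skip the packing/kernel step entirely, keep the ordinary fp16 nn.Linear layers, overwrite their weights with the quantize-dequantize result (the fake_quantize helper sketched above), and then run lmms-eval as usual. This is only a sketch, not an AutoAWQ API; apply_fake_quant and the skip list are hypothetical:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def apply_fake_quant(model: nn.Module, n_bit: int = 3, group_size: int = 128,
                     skip=("lm_head",)) -> nn.Module:
    """Replace every nn.Linear weight with its quantize-dequantize version (fake_quantize above)."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and not any(s in name for s in skip):
            module.weight.data = fake_quantize(module.weight.data, n_bit, group_size)
    return model

# Hypothetical usage:
# model = ...load the scaled/clipped fp16 model...
# model = apply_fake_quant(model, n_bit=3)
# ...then evaluate `model` with lmms-eval as usual
```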