3-bit or 6-bit quantization #702

Open
khurramusman-10xe opened this issue Jan 21, 2025 · 3 comments

khurramusman-10xe commented Jan 21, 2025

Hello! I have been playing around with AutoAWQ for a couple of weeks now and have managed to run it on LLaVA and then evaluate the quantized version using the lmms-eval library. I have now gotten to the point where I want to test the performance of 3-bit and 6-bit quantization. I understand that the current kernels (I was using GEMM from the default example) only support 4-bit, but I could not find any such limitation in the quantization process itself. To put it more concretely, the computation of the scales and the clipping is not limited by the kernel, so in theory it could work at any bit width. That means I can run the quantization at whatever resolution I want.

The problem only appears at the very last step of the quantization process, when the GEMM kernel (or any of the other kernels) is called, or when loading the quantized model with the "from_quantized" method (assuming that last step is somehow taken care of). Those are the only two places where the kernel is invoked, and the error about 4-bit-only support breaks the execution.

My question is: is there a simple way to run the quantized model without using the kernels, just for the purpose of accuracy evaluation? My immediate goal is to compare performance at different quantization levels, and an easy way to do this would be a good starting point. Further down the line, I would be interested in running the quantized models more efficiently, which is where optimized kernels start coming in (AFAIK). If someone can point me in the direction of how one would do that, or if there are any recipes for it, that would be useful as well. Thanks!
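To make the "the math itself is not bit-limited" point concrete, here is a minimal sketch of per-group, asymmetric (min/max) quantize-dequantize where the bit width is just a parameter. This is my own illustration under those assumptions, not AutoAWQ's actual code, but it shows why only the packing/kernel step cares about the value of n_bit:

```python
import torch

def pseudo_quantize_tensor(w: torch.Tensor, n_bit: int = 3, group_size: int = 128) -> torch.Tensor:
    """Per-group asymmetric quantize-dequantize of a weight matrix.
    Illustration only: AutoAWQ's real implementation differs in details."""
    assert w.numel() % group_size == 0, "weight size must be divisible by group_size"
    orig_shape = w.shape
    w = w.reshape(-1, group_size)                     # [n_groups, group_size]
    max_val = w.amax(dim=1, keepdim=True)
    min_val = w.amin(dim=1, keepdim=True)
    max_int = 2 ** n_bit - 1                          # the bit width only enters here
    scales = (max_val - min_val).clamp(min=1e-5) / max_int
    zeros = (-min_val / scales).round()
    q = ((w / scales).round() + zeros).clamp(0, max_int)
    w_dq = (q - zeros) * scales                       # dequantize back to float
    return w_dq.reshape(orig_shape)
```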

casper-hansen (Owner) commented

We do not have a solution for storing weights in 3 or 6 bits, nor do we know how to run inference with them just yet. I'm open to PRs on this.

khurramusman-10xe (Author) commented Jan 21, 2025

Thanks for the response @casper-hansen.

I see -- I am still finding my way around this, but I have come across other quantization methods that support 2- or 3-bit quantization. Are you (at a high level) aware of how they do that? If you have any pointers, that would be helpful. And when I do figure something out, I will be more than happy to contribute.
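For what it's worth, my (possibly incomplete) understanding of the storage side is that methods such as GPTQ, which support 2/3-bit, bit-pack the quantized integers into 32-bit words and unpack them inside the kernel. Conceptually something like the sketch below, although real layouts are more elaborate and kernel-friendly:

```python
import numpy as np

def pack_3bit(values: np.ndarray) -> np.ndarray:
    """Pack 3-bit integers (0..7) into uint32 words, 10 values per word (2 bits wasted).
    Conceptual sketch only; real 3-bit kernels use more involved packing schemes."""
    assert values.min() >= 0 and values.max() < 8
    packed = np.zeros((len(values) + 9) // 10, dtype=np.uint32)
    for i, v in enumerate(values.astype(np.uint32)):
        word, slot = divmod(i, 10)
        packed[word] |= v << np.uint32(3 * slot)   # place value at bits [3*slot, 3*slot+3)
    return packed
```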

khurramusman-10xe (Author) commented Jan 21, 2025

To add to the above discussion: I believe it is still possible to quantize the weights but keep them in float16 or float32 for a bare-minimum accuracy evaluation, right? Of course, the memory and inference speedup gains won't be realized; it's just a crude way to see what such a quantization level would do in terms of accuracy.
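As a rough illustration of what I mean (the names here and the pseudo_quantize_tensor helper from my earlier sketch are my own, not AutoAWQ's API), one could fake-quantize every Linear weight in place, keep the result in the original float dtype, and then run the usual evaluation path:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fake_quantize_model(model: nn.Module, n_bit: int = 3, group_size: int = 128) -> nn.Module:
    """Quantize-dequantize all Linear weights in place, keeping them in float.
    No memory or speed benefit -- intended purely for accuracy evaluation."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data.float()
            if w.shape[1] % group_size != 0:   # skip layers the grouping doesn't divide
                continue
            w_dq = pseudo_quantize_tensor(w, n_bit=n_bit, group_size=group_size)
            module.weight.data = w_dq.to(module.weight.dtype)
    return model
```

Since this never packs the weights, the 4-bit-only kernels (and from_quantized) are never touched, which is exactly the crude evaluation path I had in mind.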
