Replies: 5 comments
-
Hi @iSuslov,
-
You can see a performance profile for Phi3 here:
-
@iSuslov Furthermore, with a shift to 4-bit we will get another 2x performance boost, so ~24 tok/s on your machine.
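Rough intuition for where the 2x comes from: single-batch decode is bounded by how fast the weights can be streamed through memory each step, so halving the weight bytes roughly doubles the ceiling. A back-of-the-envelope sketch in TypeScript; the bandwidth and model-size figures below are illustrative assumptions, not measurements:

```typescript
// Back-of-the-envelope roofline for single-batch decode: every generated token
// must stream (at least) all weight bytes through memory once, so
//   tok/s ceiling ≈ effective memory bandwidth / weight bytes.
function decodeCeiling(effectiveBandwidthGBs: number, weightsGB: number): number {
  return effectiveBandwidthGBs / weightsGB;
}

const bw = 100;          // assumed effective bandwidth in GB/s (real machines reach a fraction of peak)
const q8WeightsGB = 3.8; // ~3.8B params at ~1 byte/weight
const q4WeightsGB = 1.9; // same params at ~0.5 byte/weight

console.log("q8 ceiling:", decodeCeiling(bw, q8WeightsGB).toFixed(1), "tok/s");
console.log("q4 ceiling:", decodeCeiling(bw, q4WeightsGB).toFixed(1), "tok/s");
// The absolute numbers depend on the bandwidth assumption; the ratio is the point:
// halving the weight bytes roughly doubles the ceiling, hence the ~2x expectation.
```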
-
@FL33TW00D, that's just awesome! I think 4-bit will be a breakthrough in terms of browser inference.

If I understand correctly, solutions like Xenova rely on the ONNX framework and use models with FP16 activations and 4-bit weight quantization due to a lack of support for full 4-bit quantization. Theoretically, that means they shouldn't benefit from lower latency, since the weights are transformed back to FP16 at inference time, but I may be wrong here. It seems like full 4-bit support is something unique. It's also interesting to know what really stops others from supporting it; it would be a bummer if WebGPU itself were the reason. The announcement of DP4a built-in function support in WGSL (Chrome 123) may be relevant.

Again, I want to express my appreciation for the way you keep the development up and running. When I first stumbled upon your project, I immediately said to myself: he knows what he is doing. 😁 Let me know if you need any help with TypeScript/frontend tasks, I'll be happy to assist. Also, if you feel this issue has the potential to remain open forever, feel free to convert it to a discussion.
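Re: the DP4a announcement mentioned above, availability can be checked at runtime before picking a kernel path. A small TypeScript sketch, assuming the WGSL language feature is reported as `packed_4x8_integer_dot_product` via `navigator.gpu.wgslLanguageFeatures` (worth verifying against the browsers you target):

```typescript
// Detect whether the WGSL packed 8-bit dot-product built-ins (DP4a) can be used,
// so a W8A8 kernel path can be selected at runtime.
// Assumption: the feature name "packed_4x8_integer_dot_product" matches what the
// browser reports via navigator.gpu.wgslLanguageFeatures; verify for your targets.
function supportsDp4a(): boolean {
  const gpu = (navigator as any).gpu;
  return Boolean(gpu?.wgslLanguageFeatures?.has("packed_4x8_integer_dot_product"));
}

console.log(
  supportsDp4a()
    ? "DP4a built-ins available: W8A8 kernels are an option"
    : "No DP4a: stick to fp16/fp32 accumulation kernels"
);
```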
-
We would do the same thing as ONNX here: weights in 4-bit and activations in fp16/fp32. This should roughly 2x inference speed, because model inference is limited by memory bandwidth, not computation. Smaller weights mean faster inference!

The DP4a support you linked is a really interesting one. In that case you may want to do W8A8 (8-bit weights, 8-bit activations). This would not accelerate inference on Apple devices, but it would on Windows & Vulkan.

I'm glad you liked the message! There is lots of work to do on the frontend side. It's hard to convey Ratchet's value without good demo applications, and the less time I need to spend on demos, the more I can work on performance. Join the Discord, there are lots of fun demos to be made! https://discord.gg/XFe33KQTG4
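To make the W4A16 point concrete: weights stay packed at 4 bits in memory and are expanded to floats only inside the kernel, so memory traffic shrinks while the math runs in fp16/fp32. A minimal TypeScript sketch of per-block dequantization; the block layout and scale handling are simplified assumptions, not Ratchet's or GGUF's exact format:

```typescript
// Simplified W4A16-style dequantization: two 4-bit weights are packed per byte,
// and each block shares one float scale. Real Q4 formats differ in details
// (block size, zero points, nibble order); this is only an illustration.
function dequantizeQ4Block(packed: Uint8Array, scale: number): Float32Array {
  const out = new Float32Array(packed.length * 2);
  for (let i = 0; i < packed.length; i++) {
    const lo = packed[i] & 0x0f;        // low nibble
    const hi = (packed[i] >> 4) & 0x0f; // high nibble
    // Center the 4-bit value around zero (0..15 -> -8..7) and apply the scale.
    out[2 * i] = (lo - 8) * scale;
    out[2 * i + 1] = (hi - 8) * scale;
  }
  return out;
}

// The dot product with activations then runs in full float precision:
function dotW4AFloat(packed: Uint8Array, scale: number, activations: Float32Array): number {
  const w = dequantizeQ4Block(packed, scale);
  let acc = 0;
  for (let i = 0; i < w.length; i++) acc += w[i] * activations[i];
  return acc;
}
```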
-
Hi,
First and foremost, I want to express my gratitude for the idea behind this project and the way you keep up with the development.
3 days ago I created a small issue regarding the PHI-3 demo here, and while that issue most probably comes down to the "stop words" config, I decided to run some tests comparing the same model's performance in the browser and as a standalone solution using LMStudio (llama.cpp).
The results deserve a discussion. Tests were conducted on an M1 Pro with 16GB of RAM. The model is FL33TW00D-HF/phi3/phi3-mini-4k_q8_0.gguf, except for Xenova. A small ~200-token prompt was given as input and output speeds were measured. The table contains average values. This is not a precise benchmark by any means, but the results differ so much that there is no need for precise benchmarking yet. Xenova results were measured with the smaller PHI-3_4q model.
It seems llama.cpp is ~3x faster using CPU-only inference and ~5.5x faster using the GPU, compared to WebGPU.
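For anyone reproducing the browser-side numbers, a minimal sketch of one way to measure output tok/s; `generateStream` is a hypothetical streaming API, not the interface actually used for these measurements:

```typescript
// Measure decode speed as tokens emitted per wall-clock second.
// `generateStream` is a hypothetical stand-in for a streaming generation API
// that calls `onToken` once per generated token.
async function measureTokensPerSecond(
  generateStream: (prompt: string, onToken: (token: string) => void) => Promise<void>,
  prompt: string
): Promise<number> {
  let tokenCount = 0;
  let firstTokenAt = -1; // timestamp of the first token, so prefill time is excluded
  await generateStream(prompt, () => {
    if (firstTokenAt < 0) firstTokenAt = performance.now();
    tokenCount += 1;
  });
  if (firstTokenAt < 0 || tokenCount < 2) return 0;
  const elapsedSec = (performance.now() - firstTokenAt) / 1000;
  return (tokenCount - 1) / elapsedSec; // token intervals divided by elapsed decode time
}
```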
Questions:
My apologies in advance if some of these questions were answered before or make little sense.