Replies: 5 comments
-
Hi @iSuslov,
-
You can see a performance profile for Phi3 here:
-
@iSuslov Furthermore, with a shift to 4-bit we will get another 2x performance boost, so ~24 tok/s on your machine.
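Rough intuition for where the 2x comes from: single-batch decode is bounded by how fast the weights can be streamed through memory each step, so halving the weight bytes roughly doubles the ceiling. A back-of-the-envelope sketch in TypeScript; the bandwidth and model-size figures below are illustrative assumptions, not measurements:

```typescript
// Back-of-the-envelope roofline for single-batch decode: every generated token
// must stream (at least) all weight bytes through memory once, so
//   tok/s ceiling ≈ effective memory bandwidth / weight bytes.
function decodeCeiling(effectiveBandwidthGBs: number, weightsGB: number): number {
  return effectiveBandwidthGBs / weightsGB;
}

const bw = 100;          // assumed effective bandwidth in GB/s (real machines reach a fraction of peak)
const q8WeightsGB = 3.8; // ~3.8B params at ~1 byte/weight
const q4WeightsGB = 1.9; // same params at ~0.5 byte/weight

console.log("q8 ceiling:", decodeCeiling(bw, q8WeightsGB).toFixed(1), "tok/s");
console.log("q4 ceiling:", decodeCeiling(bw, q4WeightsGB).toFixed(1), "tok/s");
// The absolute numbers depend on the bandwidth assumption; the ratio is the point:
// halving the weight bytes roughly doubles the ceiling, hence the ~2x expectation.
```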
-
@FL33TW00D, that's just awesome! I think 4-bit will be a breakthrough in terms of browser inference.

If I understand correctly, solutions like Xenova rely on the ONNX framework and use models with FP16 activations and 4-bit weight quantization due to a lack of support for full 4-bit quantization. Theoretically, that means they shouldn't benefit from lower latency, since the weights are transformed back to FP16 at inference time, but I may be wrong here. It seems like full 4-bit support is something unique. It's also interesting to know what really stops others from supporting it; it would be a bummer if WebGPU itself were the reason. The announcement of DP4a built-in function support in WGSL (Chrome 123) may be relevant.

Again, I want to express my appreciation for the way you keep the development up and running. When I first stumbled upon your project, I immediately said to myself: he knows what he is doing. 😁 Let me know if you need any help with TypeScript/frontend tasks, I'll be happy to assist. Also, if you feel this issue has the potential to remain open forever, feel free to convert it to a discussion.
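Re: the DP4a announcement mentioned above, availability can be checked at runtime before picking a kernel path. A small TypeScript sketch, assuming the WGSL language feature is reported as `packed_4x8_integer_dot_product` via `navigator.gpu.wgslLanguageFeatures` (worth verifying against the browsers you target):

```typescript
// Detect whether the WGSL packed 8-bit dot-product built-ins (DP4a) can be used,
// so a W8A8 kernel path can be selected at runtime.
// Assumption: the feature name "packed_4x8_integer_dot_product" matches what the
// browser reports via navigator.gpu.wgslLanguageFeatures; verify for your targets.
function supportsDp4a(): boolean {
  const gpu = (navigator as any).gpu;
  return Boolean(gpu?.wgslLanguageFeatures?.has("packed_4x8_integer_dot_product"));
}

console.log(
  supportsDp4a()
    ? "DP4a built-ins available: W8A8 kernels are an option"
    : "No DP4a: stick to fp16/fp32 accumulation kernels"
);
```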
-
We would do the same thing as ONNX here: weights in 4-bit and activations in fp16/fp32. This should roughly 2x inference speed, because model inference is limited by memory bandwidth, not computation. Smaller weights mean faster inference!

The DP4a support you linked is a really interesting one. In that case you may want to do W8A8 (8-bit weights, 8-bit activations). This would not accelerate inference on Apple devices, but it would on Windows & Vulkan.

I'm glad you liked the message! There is lots of work to do on the frontend side. It's hard to convey Ratchet's value without good demo applications, and the less time I need to spend on demos, the more I can work on performance. Join the Discord, there are lots of fun demos to be made! https://discord.gg/XFe33KQTG4
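To make the W4A16 point concrete: weights stay packed at 4 bits in memory and are expanded to floats only inside the kernel, so memory traffic shrinks while the math runs in fp16/fp32. A minimal TypeScript sketch of per-block dequantization; the block layout and scale handling are simplified assumptions, not Ratchet's or GGUF's exact format:

```typescript
// Simplified W4A16-style dequantization: two 4-bit weights are packed per byte,
// and each block shares one float scale. Real Q4 formats differ in details
// (block size, zero points, nibble order); this is only an illustration.
function dequantizeQ4Block(packed: Uint8Array, scale: number): Float32Array {
  const out = new Float32Array(packed.length * 2);
  for (let i = 0; i < packed.length; i++) {
    const lo = packed[i] & 0x0f;        // low nibble
    const hi = (packed[i] >> 4) & 0x0f; // high nibble
    // Center the 4-bit value around zero (0..15 -> -8..7) and apply the scale.
    out[2 * i] = (lo - 8) * scale;
    out[2 * i + 1] = (hi - 8) * scale;
  }
  return out;
}

// The dot product with activations then runs in full float precision:
function dotW4AFloat(packed: Uint8Array, scale: number, activations: Float32Array): number {
  const w = dequantizeQ4Block(packed, scale);
  let acc = 0;
  for (let i = 0; i < w.length; i++) acc += w[i] * activations[i];
  return acc;
}
```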
-
Hi,
First and foremost, I want to express my gratitude for the idea behind this project and the way you keep up with the development.
3 days ago I created a small issue regarding the PHI-3 demo here, and while that issue most probably comes down to the "stop words" config, I decided to run some tests comparing the same model's performance in the browser and as a standalone solution using LMStudio (llama.cpp).
The results deserve a discussion. Tests were conducted on an M1 Pro with 16GB of RAM. The model is FL33TW00D-HF/phi3/phi3-mini-4k_q8_0.gguf, except for Xenova. A small ~200-token prompt was given as input and output speeds were measured. The table contains average values. This is not a precise benchmark by any means, but the results differ so much that there is no need for precise benchmarking yet. Xenova results were measured with the smaller PHI-3_4q model.
It seems llama.cpp is ~3x faster using CPU-only inference and ~5.5x faster using the GPU, compared to WebGPU.
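For anyone reproducing the browser-side numbers, a minimal sketch of one way to measure output tok/s; `generateStream` is a hypothetical streaming API, not the interface actually used for these measurements:

```typescript
// Measure decode speed as tokens emitted per wall-clock second.
// `generateStream` is a hypothetical stand-in for a streaming generation API
// that calls `onToken` once per generated token.
async function measureTokensPerSecond(
  generateStream: (prompt: string, onToken: (token: string) => void) => Promise<void>,
  prompt: string
): Promise<number> {
  let tokenCount = 0;
  let firstTokenAt = -1; // timestamp of the first token, so prefill time is excluded
  await generateStream(prompt, () => {
    if (firstTokenAt < 0) firstTokenAt = performance.now();
    tokenCount += 1;
  });
  if (firstTokenAt < 0 || tokenCount < 2) return 0;
  const elapsedSec = (performance.now() - firstTokenAt) / 1000;
  return (tokenCount - 1) / elapsedSec; // token intervals divided by elapsed decode time
}
```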
Questions:
My apologies in advance if some of these questions were answered before or make little sense.