[webgpu] Use subgroup for matmulnbits #23224
In this version, local_id.x = 8 and local_id.y = 4. To load the A data, each thread needs to access memory twice, so the A tile size is 64 x 4. To get the correct shuffled A value when accumulating inter_results, we have to unconditionally fetch both a_data low and a_data high, then use select to decide which one the current thread uses. So this method doubles the number of shuffle commands.
Use local_id.x = 4, local_id.y = 8 so that we only need to shuffle once, compared with the previous method.
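The difference between the two layouts can be illustrated with a toy model in Python. This is only a sketch under stated assumptions, not the shader from this PR: the subgroup size of 32, the mapping of tile elements to lanes and registers, and the helper names `subgroup_shuffle`, `gather_two_registers`, and `gather_one_register` are all hypothetical. It only shows why holding two values per lane forces two shuffles plus a select when the wanted index varies per lane, while holding one value per lane needs a single shuffle.

```python
SUBGROUP_SIZE = 32  # assumed subgroup size for this toy model


def subgroup_shuffle(values, src_lanes):
    """Toy model of a subgroup shuffle: lane i reads values[src_lanes[i]]."""
    return [values[src] for src in src_lanes]


def gather_two_registers(a_lo, a_hi, wanted):
    """Each lane holds two A values (low and high register).

    Assumed layout: tile element e lives in lane e % 32, in the low
    register if e < 32 and in the high register otherwise. Because the
    wanted index differs per lane, both registers are shuffled
    unconditionally and a per-lane select picks the right result.
    Returns (values, number_of_shuffles).
    """
    src = [e % SUBGROUP_SIZE for e in wanted]
    from_lo = subgroup_shuffle(a_lo, src)   # shuffle 1
    from_hi = subgroup_shuffle(a_hi, src)   # shuffle 2
    picked = [hi if e >= SUBGROUP_SIZE else lo
              for lo, hi, e in zip(from_lo, from_hi, wanted)]
    return picked, 2


def gather_one_register(a, wanted):
    """Each lane holds one A value, so one shuffle is enough."""
    return subgroup_shuffle(a, wanted), 1


if __name__ == "__main__":
    tile = list(range(64))                  # element e has value e
    wanted = [(3 * i + 5) % 64 for i in range(SUBGROUP_SIZE)]
    values, shuffles = gather_two_registers(tile[:32], tile[32:], wanted)
    print(values == wanted, shuffles)

    small = list(range(100, 132))
    wanted_small = [(7 * i) % 32 for i in range(SUBGROUP_SIZE)]
    values, shuffles = gather_one_register(small, wanted_small)
    print(values == [100 + w for w in wanted_small], shuffles)
```

In this model the shuffle count per gathered value drops from two to one, which matches the motivation stated for switching the workgroup shape; the real kernel's data layout may differ.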
When the index is a variable
This reverts commit 128cd7d.
This reverts commit c0dd1db.
This reverts commit 3b32acc.
@guschmue @fs-eire Please take a look. Currently, this PR is only enabled on Intel devices, since I don't see a perf improvement on the NV device I have on hand. I need some time to investigate the reason. Please also let me know whether it works on your Xe devices.
/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline
Azure Pipelines successfully started running 2 pipeline(s).
/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models
/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Azure Pipelines successfully started running 3 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
/azp run Win_TRT_Minimal_CUDA_Test_CI
Azure Pipelines successfully started running 1 pipeline(s).
Description
This PR uses subgroups to implement matmulnbits when tile_m > 1 on Intel devices.
With this PR, prefill for a 500-token prompt on phi3 drops from 8.5s to 3.5s on Intel Meteor Lake.