[webgpu] Use subgroup for matmulnbits #23224

Merged (17 commits) on Jan 13, 2025

Conversation

qjia7
Contributor

@qjia7 qjia7 commented Dec 30, 2024

Description

This PR uses subgroup operations to implement MatMulNBits when tile_m > 1 on Intel devices.
With this PR, prefill time for a 500-token prompt on phi3 drops from 8.5s to 3.5s on Intel Meteor Lake (about 2.4x faster).

qjia7 added 17 commits December 27, 2024 11:42
In this version, local_id.x = 8 and local_id.y = 4.
To load the A data, each thread needs to access memory twice so that the
A tile size is 64 x 4.

To get the correct shuffled A when accumulating inter_results,
we have to unconditionally fetch both a_data low and a_data high, then use
select to decide which one the current thread uses.

So this method doubles the number of shuffle commands.

Use local_id.x = 4 and local_id.y = 8 so that we only need to shuffle once,
compared with the previous method.
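
The trade-off described in the two commit messages above can be modelled on the CPU. This is a minimal sketch, not the PR's shader code: `kLanes`, `Shuffle`, `ShuffleCounter`, `Ramp`, and the two accumulate helpers are hypothetical names, and the lane count of 8 is an assumption chosen only for illustration. It counts how many shuffle operations each layout needs per A element.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical CPU model of one subgroup: kLanes lanes, one register each.
constexpr std::size_t kLanes = 8;  // assumed lane count, for illustration only
using Lanes = std::array<float, kLanes>;

struct ShuffleCounter {
  int count = 0;  // number of simulated shuffle operations issued
};

// Fills a register file with start, start + 1, ... (test data helper).
Lanes Ramp(float start) {
  Lanes r{};
  for (std::size_t i = 0; i < kLanes; ++i) r[i] = start + static_cast<float>(i);
  return r;
}

// Models a subgroup shuffle: read the value held by lane src_lane.
float Shuffle(const Lanes& reg, std::size_t src_lane, ShuffleCounter& counter) {
  assert(src_lane < kLanes);
  ++counter.count;
  return reg[src_lane];
}

// Layout 1: every lane holds two A values (a low and a high register).
// Each element needs both shuffles, followed by a select between the results.
float AccumulateTwoPerLane(const Lanes& lo, const Lanes& hi, ShuffleCounter& counter) {
  float sum = 0.0f;
  for (std::size_t k = 0; k < 2 * kLanes; ++k) {
    const std::size_t src = k % kLanes;
    const float from_lo = Shuffle(lo, src, counter);
    const float from_hi = Shuffle(hi, src, counter);
    sum += (k < kLanes) ? from_lo : from_hi;  // plays the role of select()
  }
  return sum;
}

// Layout 2: every lane holds one A value, so one shuffle per element suffices.
float AccumulateOnePerLane(const Lanes& a, ShuffleCounter& counter) {
  float sum = 0.0f;
  for (std::size_t k = 0; k < kLanes; ++k) sum += Shuffle(a, k, counter);
  return sum;
}
```

With eight lanes, the two-values-per-lane layout issues 32 shuffles to read 16 elements, while the one-value-per-lane layout issues 8 shuffles to read 8 elements: two shuffles per element versus one.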
This reverts commit 128cd7d.
@qjia7 qjia7 changed the title [Not for Review] [webgpu] Test subgroup for matmulnbits [webgpu] Use subgroup for matmulnbits Jan 2, 2025
@qjia7 qjia7 marked this pull request as ready for review January 2, 2025 04:47
@qjia7
Contributor Author

qjia7 commented Jan 2, 2025

@guschmue @fs-eire Please take a look. Currently, this PR is only enabled on Intel devices since I don't see a perf improvement on the NVIDIA device I have on hand. I need some time to investigate the reason.

Please also let me know whether it works on your Xe devices.
If you want to check other GPUs, such as a Mac, you can just change

const bool use_subgroup = context.Device().HasFeature(wgpu::FeatureName::Subgroups) && context.AdapterInfo().vendor == std::string_view{"intel"} && components_a == 4 && block_size == 32;

to

  const bool use_subgroup = context.Device().HasFeature(wgpu::FeatureName::Subgroups) && components_a == 4 && block_size == 32;
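
As a rough illustration of the two conditions above, the check can be written as a small standalone function. This is only a sketch: `SubgroupInputs`, `UseSubgroup`, and the `intel_only` flag are hypothetical names that do not appear in the PR, and the struct fields stand in for the values the real code reads from `context.Device()` and `context.AdapterInfo()`.

```cpp
#include <string_view>

// Hypothetical stand-in for the values the PR's condition reads.
struct SubgroupInputs {
  bool has_subgroups_feature;  // result of HasFeature(wgpu::FeatureName::Subgroups)
  std::string_view vendor;     // adapter vendor string
  int components_a;
  int block_size;
};

// intel_only = true mirrors the first line; false mirrors the relaxed line
// suggested for trying other GPUs.
bool UseSubgroup(const SubgroupInputs& in, bool intel_only = true) {
  const bool shape_ok = in.components_a == 4 && in.block_size == 32;
  const bool vendor_ok = !intel_only || in.vendor == std::string_view{"intel"};
  return in.has_subgroups_feature && vendor_ok && shape_ok;
}
```

Calling `UseSubgroup(inputs, false)` corresponds to the second form, which keeps the feature, `components_a`, and `block_size` checks but drops the vendor restriction.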

cc @sushraja-msft

@guschmue
Contributor

guschmue commented Jan 2, 2025

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

@guschmue
Contributor

guschmue commented Jan 2, 2025

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

Azure Pipelines successfully started running 2 pipeline(s).

@guschmue
Contributor

guschmue commented Jan 2, 2025

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

@guschmue
Contributor

guschmue commented Jan 2, 2025

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).

Azure Pipelines successfully started running 3 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Jan 2, 2025
@guschmue
Contributor

/azp run Win_TRT_Minimal_CUDA_Test_CI

Azure Pipelines successfully started running 1 pipeline(s).

@guschmue guschmue merged commit 80d8931 into microsoft:main Jan 13, 2025
76 checks passed