[QST] Gemm results are different with tile_description? #1769

Open · hxdtest opened this issue Sep 2, 2024 · 3 comments
hxdtest commented Sep 2, 2024

What is your question?
It seems that adding a tile_description makes the GEMM result different. The assertion assert (tensor_D_numpy - tensor_D).max() == 0.0 would pass only if I add the tile_description (commented out in the code below).

import numpy as np
import random
import torch
import cutlass

# This controls whether the C++ GEMM declaration will be printed at each step. 
# Set to `False` to omit this information.
print_module = True

m = 8192
n = 8192
k = 8192
dtype = torch.float16
tensor_A = torch.rand(m, k, device=0, dtype=dtype)
tensor_B = torch.rand(k, n, device=0, dtype=dtype)
tensor_C = torch.zeros(m, n, device=0, dtype=dtype)
tensor_D = torch.zeros(m, n, device=0, dtype=dtype)

alpha = 1.0
beta = 0.0

# We specify `element_accumulator` here so as to match the reference computation below.
# Specifying `element_accumulator` is not required if it is the same as `element`.
plan = cutlass.Gemm(element=dtype, layout=cutlass.LayoutType.RowMajor, element_accumulator=torch.float32)
"""
tile_description = {
    "threadblock_shape":  [128, 256, 32],   # Threadblock shape
    "stages": 3,                # Number of stages
    "wrap_count" : [2, 4, 1],        # Number of warps within each dimension of the threadblock shape
    "instruction_shape":  [16, 8 , 16] ,
    "cluster_shape":  [1, 1 , 1]
}
plan.tile_description = tile_description
"""

plan.run(tensor_A, tensor_B, tensor_C, tensor_D, print_module=print_module)

# Reference result (computed with torch, despite the variable name)
tensor_D_numpy = (alpha * (tensor_A @ tensor_B)) + (beta * tensor_C)

assert (tensor_D_numpy - tensor_D).max() == 0.0
print(tensor_D)
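
For what it's worth, exact bitwise equality between two fp16 GEMMs is generally not guaranteed when the kernel configuration changes, since different threadblock/warp shapes change the floating-point accumulation order. Below is a minimal sketch of the same comparison with a floating-point tolerance instead of exact equality; tensor_D_tiled and the rtol/atol values are illustrative assumptions, not validated for k = 8192.

# Sketch (hypothetical): enable the tile description and compare against the
# reference with a tolerance. `tensor_D_tiled` is a fresh output buffer
# introduced for illustration; rtol/atol are assumed values.
plan.tile_description = {
    "threadblock_shape": [128, 256, 32],
    "stages": 3,
    "warp_count": [2, 4, 1],
    "instruction_shape": [16, 8, 16],
    "cluster_shape": [1, 1, 1],
}
tensor_D_tiled = torch.zeros(m, n, device=0, dtype=dtype)
plan.run(tensor_A, tensor_B, tensor_C, tensor_D_tiled, print_module=print_module)

reference = (alpha * (tensor_A @ tensor_B)) + (beta * tensor_C)
print("max abs diff:", (tensor_D_tiled - reference).abs().max().item())
assert torch.allclose(tensor_D_tiled, reference, rtol=1e-2, atol=1e-1)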

jackkosaian (Contributor) commented

Can you please list the C++ output you see after the call to plan.run() with print_module=True?


github-actions bot commented Oct 3, 2024

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.


github-actions bot commented Jan 1, 2025

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
