[Bug Report] ttnn.mean op - Data Mismatch #13621
Comments
@ntarafdar @sjameelTT can you please help me find owners for this issue?
hey @sdjordjevicTT, asking around; it's a reduction op that doesn't have an owner. I'll ask the ttnn people and get back to you.
@sdjordjevicTT I asked around, and since there is no other owner for this, the TMs team will have to take it.
Thanks @ntarafdar for picking this up. Great, I believe that should work for us.
Moving to a P1 issue. @sdjordjevicTT please comment if you believe P0 is justified.
@nvukobratTT can comment more about priority, but I think this issue blocks Llama 3B bring-up on the Forge side.
Confirming what @sdjordjevicTT mentioned, this one is a blocker for the Open Llama 3B model. Additional details can be found on the MLIR issue as well:
Spoke to Jasmine, and @bbradelTT is taking over reductions for now. I'm reassigning this to him.
I tried to find out if there's any point at which there's a big drop-off. It seemed like it might be somewhere between 1200 and 1400, but the PCC value goes up and down a fair amount:
Hi @bbradelTT, do we have any updates regarding this mismatch problem?
@sdjordjevicTT Unfortunately we need to overhaul reduce. I won't have concrete updates for a while.
@bbradelTT thanks for the details. Can you clarify the following:
To be certain that this issue is properly tracked, I'm re-adding the P0 label. Please correct me if I'm missing some context as to why this should still be a P1 issue. Thanks!
I tried running the reduce mean test for shape (1,32,3200) with dim = -1, and it also gives a data mismatch.
However, the ttnn output is consistently less than 0.5 for all 32 values:
Interestingly, I tried this for shapes (1,32,x) where x goes from 10 to 400 (exclusive) in steps of 10. These shapes fail with a data mismatch:
For shapes where x is less than 210 there is no data mismatch.
Since the inputs come from torch.rand (a uniform distribution over [0, 1)), it is really odd that the mean over 3200 values is 0.46-0.48: the expected mean is 0.5 with a standard error of roughly 1/sqrt(12 * 3200) ≈ 0.005, so the observed values are far outside normal sampling variation. It seems that mean doesn't work on larger dimensions at all...
Today's update:
Hey @bbradelTT, for the first phase we need ttnn.mean to work for shapes (1,12,3200), (1,128,3200), and (1,1,3200). Fixing this is P0. For the second phase, we will need ttnn.mean to work for all shapes (1,x,3200) for x in [1,2048]; only then can we get Llama inference for any number of input tokens. This second phase can be P1, so after you finish the first phase we will move this issue to P1 and go from there.
Update:
I created #15656 to look at improving accuracy further.
Update:
As a temporary workaround, could you try using a compute kernel with fp32 acc enabled and let me know if there is a change in PCC? Something like
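(a minimal sketch, assuming the ttnn.WormholeComputeKernelConfig API and that ttnn.mean accepts a compute_kernel_config keyword; the exact parameters shown are illustrative):

```python
import ttnn

# Hedged sketch: parameter names follow the commonly used WH compute kernel config;
# fp32_dest_acc_en=True is the key change (accumulate in fp32 instead of bfloat16).
compute_kernel_config = ttnn.WormholeComputeKernelConfig(
    math_fidelity=ttnn.MathFidelity.HiFi4,
    math_approx_mode=False,
    fp32_dest_acc_en=True,
    packer_l1_acc=True,
)

# ttnn_input is assumed to be a tiled bfloat16 tensor already on device.
ttnn_output = ttnn.mean(ttnn_input, dim=-1, compute_kernel_config=compute_kernel_config)
```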
Update:
@bbradelTT is this workaround only for shape (1,12,3200), or should it make all of the above shapes work?
It should improve PCC for all shapes. However, I don't know by how much, and right now I would just like to know about (1,12,3200). Unfortunately I have a lot of edits on my branch, which means I can't test this change in isolation. Since this is just a Python change, I'm hoping you'll be able to test it easily on your end.
@bbradelTT I tried this out and it fully fixes the PCC. We have the option to call ttnn.mean with this config.
It has perf repercussions, but they should be small, and I'm going to enable it by default when no compute_kernel_config is specified. That means unless you specify a compute kernel config with fp32 acc turned off, you'll see the perf impact. It also has far fewer perf repercussions than what I'm working on, which is probably still necessary since GS doesn't have fp32.
Subscribing
Update:
produces
I'm leaning towards cutting tensors at every 64 elements (2 tiles), but that will have a larger perf impact.
For larger tensors, the tree add algorithm improves accuracy on GS from 0.7 to 0.85-0.95. However, there is going to be a pretty big performance hit for mean/sum/std/var. The options are:
@davorchap
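For context on the tree add algorithm mentioned above, here is a minimal NumPy sketch of pairwise (tree) summation. It only illustrates why balanced partial sums lose less precision than a sequential running sum in a low-precision accumulator; float16 stands in for bfloat16 (NumPy has no bfloat16), and the element count is exaggerated to make the rounding loss obvious. This is not the actual reduce kernel.

```python
import numpy as np

def tree_sum(vals: np.ndarray) -> np.floating:
    """Sum by adding adjacent pairs level by level, keeping partial sums balanced."""
    while vals.size > 1:
        if vals.size % 2:                       # carry an odd trailing element forward
            vals = np.append(vals, vals.dtype.type(0))
        vals = vals[0::2] + vals[1::2]          # one level of the addition tree
    return vals[0]

x = np.random.rand(20000).astype(np.float16)    # uniform [0, 1), like torch.rand

running = np.float16(0)
for v in x:                 # naive accumulation: once the running sum is large,
    running = running + v   # each ~0.5 addend rounds away and the sum stalls

print("sequential mean:", float(running) / x.size)       # well below the true mean
print("tree mean:      ", float(tree_sum(x)) / x.size)    # stays close to 0.5
print("float64 mean:   ", float(x.astype(np.float64).mean()))  # reference (~0.5)
```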
### Ticket
Link to Github Issue #13621

### Problem description
reduce sum is not very accurate because fp32 acc for reduce was not enabled by default

### What's changed
enable fp32 acc for reduce by default

### Checklist
- [x] Post commit CI passes. Between two runs, all jobs passed: https://github.com/tenstorrent/tt-metal/actions/runs/12147218201 and https://github.com/tenstorrent/tt-metal/actions/runs/12160961521
- [x] Blackhole Post commit (if applicable): https://github.com/tenstorrent/tt-metal/actions/runs/12187630112
- [x] Model regression CI testing passes (if applicable): https://github.com/tenstorrent/tt-metal/actions/runs/12187623632/job/33999246179 fails the same as main, except for another random tt-smi reset not working. main: https://github.com/tenstorrent/tt-metal/actions/runs/12189517366
- [x] Device performance regression CI testing passes (if applicable): https://github.com/tenstorrent/tt-metal/actions/runs/12187626769 passes for WH; GS is not affected and fails as it does on main: https://github.com/tenstorrent/tt-metal/actions/runs/12189542166
- [x] New/Existing tests provide coverage for changes
Update:
Update on one of the underlying issues: pad last dim.
Are the PCCs sufficiently good now for your purposes on the newest tt-metal main? I talked to @davorchap and he said it's better not to worry about GS, and it would be good if the fp32 dest acc change is sufficient.
I've rebased onto tt-metal main and rerun the tests for shape (1,x,3200) where x goes from 1 to 2048. PCCs are sufficiently good without the workaround (compute_kernel_config). Thanks for prioritizing this, guys! Closing this one.
Describe the bug
The ttnn.mean op throws an assertion error because of a data mismatch between the PyTorch and TTNN outputs: the PCC drops to 0.72 when an input tensor of shape (1, 12, 3200) with dim = -1 is passed to the ttnn.mean op.
For more context, here is the exact error message
To Reproduce
Run the following test:
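(a minimal reproduction sketch, assuming a standard single-device ttnn setup; the exact test from the report is not shown here, and the PCC computation below is illustrative):

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Same shape and dim as in the report; inputs drawn from torch.rand over [0, 1).
torch_input = torch.rand(1, 12, 3200, dtype=torch.bfloat16)
torch_output = torch.mean(torch_input, dim=-1, keepdim=True)

ttnn_input = ttnn.from_torch(torch_input, layout=ttnn.TILE_LAYOUT, device=device)
ttnn_output = ttnn.to_torch(ttnn.mean(ttnn_input, dim=-1))

# Pearson correlation coefficient between the flattened outputs; the slice guards
# against any tile padding in the returned shape.
a = torch_output.flatten().float()
b = ttnn_output.flatten().float()[: a.numel()]
pcc = torch.corrcoef(torch.stack([a, b]))[0, 1].item()
print(f"PCC: {pcc:.4f}")  # reported as ~0.72 for this shape

ttnn.close_device(device)
```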
Expected behavior
The data mismatch between the PyTorch and TTNN outputs should be resolved.