
[Bug Report] ttnn.mean op - Data Mismatch #13621

Closed
chandrasekaranpradeep opened this issue Oct 9, 2024 · 50 comments
@chandrasekaranpradeep

Describe the bug
ttnn.mean throws an assertion error because of a data mismatch between the PyTorch and TTNN outputs: the PCC drops to 0.72 when an input tensor of shape (1, 12, 3200) with dim = -1 is passed to the ttnn.mean op.
For more context, here is the exact error message:

def assert_with_pcc(expected_pytorch_result, actual_pytorch_result, pcc=0.9999):
        assert list(expected_pytorch_result.shape) == list(
            actual_pytorch_result.shape
        ), f"list(expected_pytorch_result.shape)={list(expected_pytorch_result.shape)} vs list(actual_pytorch_result.shape)={list(actual_pytorch_result.shape)}"
        pcc_passed, pcc_message = comp_pcc(expected_pytorch_result, actual_pytorch_result, pcc)
>       assert pcc_passed, construct_pcc_assert_message(pcc_message, expected_pytorch_result, actual_pytorch_result)
E       AssertionError: 0.7203957195745748
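
For reference, the PCC here is the Pearson correlation coefficient between the flattened expected and actual tensors. A rough sketch of what the check computes (illustrative only, not the actual comp_pcc source):

import torch

def pcc(expected: torch.Tensor, actual: torch.Tensor) -> float:
    # Pearson correlation of the flattened tensors, computed in float64
    e = expected.flatten().to(torch.float64)
    a = actual.flatten().to(torch.float64)
    e = e - e.mean()
    a = a - a.mean()
    return float((e @ a) / (e.norm() * a.norm()))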

To Reproduce

Run the following test:

import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc
from models.utility_functions import torch_random
def test_mean_pcc_issue(device):
    torch.manual_seed(0)

    input_shape = (1, 12, 3200)
    reduce_dim = -1
    
    torch_input_tensor = torch.rand(input_shape, dtype=torch.float32)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=reduce_dim, keepdim=True, dtype=torch.float32)

    input_tensor = ttnn.from_torch(torch_input_tensor, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)
    
    output_tensor = ttnn.mean(input_tensor, dim=reduce_dim)
    output_tensor = ttnn.to_torch(output_tensor)
    
    assert_with_pcc(torch_output_tensor, output_tensor)

Expected behavior
The data mismatch between the PyTorch and TTNN outputs should be resolved.

@sdjordjevicTT
Contributor

@ntarafdar @sjameelTT can you please help me find an owner for this issue?

@ntarafdar
Contributor

Hey @sdjordjevicTT, I'm asking around; it's a reduction op that doesn't have an owner. I'll ask the TTNN folks and get back to you.

@ntarafdar
Contributor

@sdjordjevicTT I asked around, and since there is no other owner for this, the TMs team will have to take it.
We cannot get to it until the end of next week.

@ntarafdar ntarafdar assigned yugi957 and unassigned sjameelTT and ntarafdar Oct 29, 2024
@sdjordjevicTT
Contributor

Thanks @ntarafdar for picking this up. Great, I believe that should work for us.

@jvasilje
Collaborator

Moving to a P1 issue. @sdjordjevicTT please comment if you believe P0 is justified.

@jvasilje jvasilje added P1 and removed P0 labels Oct 30, 2024
@sdjordjevicTT
Contributor

@nvukobratTT can comment more about priority, but I think this issue blocks Llama 3B bring-up on the Forge side.

@nvukobratTT
Contributor

> Moving to a P1 issue. @sdjordjevicTT please comment if you believe P0 is justified.

Confirming what @sdjordjevicTT mentioned, this one is a blocker for the Open Llama 3B model.

Additional details can be found on the MLIR issue as well:

@ntarafdar
Contributor

Spoke to Jasmine, and @bbradelTT is taking over reductions for now. I'm reassigning this to him.

@ntarafdar ntarafdar assigned bbradelTT and unassigned yugi957 Nov 5, 2024
@bbradelTT
Contributor

I tried to find out if there's any point at which there's a big drop-off. It seemed like it might be somewhere between 1200 and 1400, but the PCC value goes up and down a fair amount:

    #input_shape = (1, 12, 3200) # .72
    #input_shape = (1, 12, 1600) # .76
    #input_shape = (1, 12, 1400) # .78
    #input_shape = (1, 12, 1376) # .81
    #input_shape = (1, 12, 1363) # .93
    #input_shape = (1, 12, 1369) # .83
    #input_shape = (1, 12, 1368) # .85
    #input_shape = (1, 12, 1367) # .71
    #input_shape = (1, 12, 1366) # .96
    #input_shape = (1, 12, 1350) # .95
    #input_shape = (1, 12, 1344) # .87
    #input_shape = (1, 12, 1300) # .91
    #input_shape = (1, 12, 1200) # .93
    #input_shape = (1, 12, 800) # .92
    #input_shape = (1, 12, 320) # .99
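
A sweep like the one above can be scripted rather than edited by hand; a hypothetical helper, assuming the same pytest device fixture used by the repro test (PCC computed inline for simplicity):

import torch
import ttnn

def sweep_last_dim(device, sizes=range(320, 3264, 64)):
    # Print the PCC of ttnn.mean vs torch.mean for a range of last-dim sizes.
    torch.manual_seed(0)
    for n in sizes:
        torch_input = torch.rand((1, 12, n), dtype=torch.float32)
        expected = torch.mean(torch_input, dim=-1, keepdim=True)
        tt_input = ttnn.from_torch(torch_input, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)
        actual = ttnn.to_torch(ttnn.mean(tt_input, dim=-1))
        e = expected.flatten().double() - expected.double().mean()
        a = actual.flatten().double() - actual.double().mean()
        print(f"last dim {n}: pcc {float((e @ a) / (e.norm() * a.norm())):.4f}")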

@sdjordjevicTT
Contributor

Hi @bbradelTT, do we have any updates regarding this mismatch problem?

@bbradelTT
Contributor

@sdjordjevicTT Unfortunately we need to overhaul reduce. I won't have concrete updates for a while.

@nvukobratTT
Contributor

> @sdjordjevicTT Unfortunately we need to overhaul reduce. I won't have concrete updates for a while.

@bbradelTT thanks for the details. Can you clarify the following:

  1. What are the core issues with reduce mean and the related PCC problems?
    • Having this info, we might be able to work around it until a fix is in place
  2. Details around lowering the priority on this issue.
    • From the current standpoint, this issue should be treated as P0 as it blocks one of the Forge core models, Llama.

To be certain that this issue is properly tracked, I'm re-adding the P0 label once again. Please correct me if I'm missing some context as to why this one should still be a P1 issue.

Thanks!

@nvukobratTT nvukobratTT removed the P1 label Nov 13, 2024
@dgolubovicTT

I tried running the reduce mean test for shape (1,32,3200) with dim = -1, and it also gives a data mismatch.
Since I generated the input using torch.rand, the mean should be around 0.5 for every row. That is the case in the torch output:

fw_out is  tensor([[[0.4979],
         [0.4969],
         [0.5080],
         [0.5029],
         [0.5012],
         [0.5046],
         [0.4993],
         [0.5034],
         [0.5109],
         [0.4984],
         [0.4972],
         [0.4963],
         [0.5007],
         [0.5020],
         [0.4910],
         [0.4976],
         [0.5074],
         [0.4978],
         [0.4947],
         [0.5001],
         [0.4986],
         [0.4956],
         [0.5075],
         [0.4962],
         [0.4956],
         [0.5050],
         [0.5009],
         [0.5055],
         [0.5023],
         [0.5025],
         [0.4936],
         [0.4976]]])

However, the TTNN output is consistently less than 0.5 for all 32 rows:

co_out is  tensor([[[0.4648],
         [0.4707],
         [0.4844],
         [0.4727],
         [0.4570],
         [0.4766],
         [0.4727],
         [0.4785],
         [0.4883],
         [0.4707],
         [0.4766],
         [0.4648],
         [0.4668],
         [0.4668],
         [0.4629],
         [0.4648],
         [0.4688],
         [0.4688],
         [0.4590],
         [0.4805],
         [0.4766],
         [0.4668],
         [0.4824],
         [0.4648],
         [0.4668],
         [0.4844],
         [0.4746],
         [0.4766],
         [0.4668],
         [0.4707],
         [0.4668],
         [0.4766]]])

Interestingly, I tried this for shapes (1,32,x) where x goes from 10 to 400 (exclusive) with a step of 10.

These shapes are failing with data mismatch:

(1,32,210)
(1,32,220)
(1,32,250)
(1,32,330)
(1,32,340)
(1,32,350)
(1,32,360)
(1,32,370)
(1,32,380)
(1,32,390)

For x less than 210 there is no data mismatch.

@dgolubovicTT

Since the inputs come from torch.rand (a uniform distribution over [0,1)), it is really odd that the mean over 3200 values is 0.46-0.48. It seems that mean doesn't work on larger dimensions at all...

@bbradelTT
Contributor

Today's update:

@dgolubovicTT

Hey @bbradelTT,
I've talked to @nvukobratTT offline to scope requirements for this issue.

For the first phase we need ttnn.mean to work for shapes (1,12,3200), (1,128,3200) and (1,1,3200). Fixing this is P0.

For the second phase, we will need ttnn.mean to work for all shapes (1,x,3200) for x in [1,2048]. Only then can we get Llama inference for any number of input tokens. This second phase can be P1.

So after you finish this first phase, we will move this issue to P1, and go from there.

@bbradelTT
Contributor

Update:

  • still blocked

I created #15656 to look at improving accuracy further.

@bbradelTT
Contributor

Update:

  • accumulating into FP32 brought the PCC high enough that the tree add algorithm is accurate enough.
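
For context, a pairwise ("tree") reduction adds neighbouring elements recursively, so each value passes through roughly log2(n) additions instead of up to n. A minimal PyTorch sketch of the idea (illustrative only, not the TTNN kernel; function names are hypothetical):

import torch

def naive_sum_bf16(x: torch.Tensor) -> torch.Tensor:
    # Left-to-right accumulation in bfloat16; rounding error grows with the number of adds.
    acc = torch.tensor(0.0, dtype=torch.bfloat16)
    for v in x.to(torch.bfloat16):
        acc = acc + v
    return acc

def tree_sum_bf16(x: torch.Tensor) -> torch.Tensor:
    # Repeatedly add neighbouring pairs; the accumulation chain is ~log2(n) deep.
    vals = x.to(torch.bfloat16)
    while vals.numel() > 1:
        if vals.numel() % 2:
            vals = torch.cat([vals, torch.zeros(1, dtype=vals.dtype)])
        vals = vals[0::2] + vals[1::2]
    return vals[0]

x = torch.rand(3200)
print(x.mean(), naive_sum_bf16(x) / 3200, tree_sum_bf16(x) / 3200)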

@bbradelTT
Contributor

@dgolubovicTT @nvukobratTT

As a temporary workaround, could you try using a compute kernel with fp32 acc enabled and let me know if there is a change in PCC?

Something like

compute_kernel_config = ttnn.WormholeComputeKernelConfig(
    math_fidelity=ttnn.MathFidelity.HiFi4,
    math_approx_mode=False,
    fp32_dest_acc_en=True,
    packer_l1_acc=False,  # or True, doesn't matter
)
output_tensor = ttnn.mean(input_tensor, dim=reduce_dim, compute_kernel_config=compute_kernel_config)
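
For convenience, here is the repro test from the issue description with this workaround applied (assembled from the snippets above; the test name is made up and the combination is untested as a whole):

import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc

def test_mean_fp32_acc_workaround(device):
    torch.manual_seed(0)
    torch_input_tensor = torch.rand((1, 12, 3200), dtype=torch.float32)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=-1, keepdim=True)

    input_tensor = ttnn.from_torch(torch_input_tensor, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)
    compute_kernel_config = ttnn.WormholeComputeKernelConfig(
        math_fidelity=ttnn.MathFidelity.HiFi4,
        math_approx_mode=False,
        fp32_dest_acc_en=True,
        packer_l1_acc=False,
    )
    output_tensor = ttnn.mean(input_tensor, dim=-1, compute_kernel_config=compute_kernel_config)
    assert_with_pcc(torch_output_tensor, ttnn.to_torch(output_tensor))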

Update:

  • still blocked
  • to help with PCC, I tried to enable fp32_dest_acc_en by default when compute_kernel_config is not specified, but couldn't because of an assert due to an issue on BH that hasn't been fixed yet

@dgolubovicTT

dgolubovicTT commented Dec 4, 2024

@bbradelTT is this workaround for shape (1,12,3200), or should it make all of the above shapes work?

@bbradelTT
Contributor

> @bbradelTT is this workaround for shape (1,12,3200), or should it make all of the above shapes work?

It should improve PCC for all shapes. However, I don't know by how much, and right now I would just like to know about (1,12,3200).

Unfortunately I have a lot of edits on my branch, which means I can't test this change in isolation. I'm hoping that, since this is just a Python change, you will be able to test it easily on your end.

@dgolubovicTT

@bbradelTT I tried this out and it fixes the PCC fully. We have an option to call ttnn.mean with this compute_kernel_config at runtime. Does this compute_kernel_config have any perf repercussions? We would use it for every call of ttnn.mean, so it is important to know.

@bbradelTT
Contributor

It has perf repercussions, but they should be small, and I'm going to need to turn this on by default when no compute_kernel_config is specified.

Which means unless you specify a compute kernel config with it off, you'll see the perf impact.

Also, it will have far smaller perf repercussions than what I'm working on, which is probably still necessary since GS doesn't have fp32.

@ayerofieiev-tt
Member

Subscribing

@bbradelTT
Contributor

Update:

  • still blocked on underlying issues
  • I looked at the behaviour of my tree add change against the baseline on GS (where fp32 is not an option) and got the following:
import pytest
import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc

@pytest.mark.parametrize("input_shape", [
    (1, 12, 3200),
    (1, 1, 3200),
    (1, 128, 3200),
    (1, 2048, 3200),
    ])
def test_large_tensor_mean(device, input_shape):
    reduce_dim = -1
    torch_input_tensor = torch.rand(input_shape, dtype=torch.float32)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=reduce_dim, keepdim=True, dtype=torch.float32)

    input_tensor = ttnn.from_torch(torch_input_tensor, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    output_tensor = ttnn.mean(input_tensor, dim=reduce_dim)
    output_tensor = ttnn.to_torch(output_tensor)
    assert_with_pcc(torch_output_tensor, output_tensor)

produces

on GS: 
Base:
PASSED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 1, 3200)]
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 12, 3200)] - AssertionError: 0.7769184620303734
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 128, 3200)] - AssertionError: 0.7137656700429718
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 2048, 3200)] - AssertionError: 0.6815292219338761

with cut at 64:
PASSED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 1, 3200)]
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 12, 3200)] - AssertionError: 0.9772472040616764
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 128, 3200)] - AssertionError: 0.9666487764829472
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 2048, 3200)] - AssertionError: 0.9527421326781462

with cut at 128:
PASSED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 1, 3200)]
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 12, 3200)] - AssertionError: 0.9335399543286304
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 128, 3200)] - AssertionError: 0.9003435192493461
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 2048, 3200)] - AssertionError: 0.8903827326146612

with cut at 256:
PASSED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 1, 3200)]
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 12, 3200)] - AssertionError: 0.9522780737237543
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 128, 3200)] - AssertionError: 0.9058280173941267
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 2048, 3200)] - AssertionError: 0.8969139544477592

with cut at 512:
PASSED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 1, 3200)]
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 12, 3200)] - AssertionError: 0.9544566782083125
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 128, 3200)] - AssertionError: 0.8692186521807318
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 2048, 3200)] - AssertionError: 0.8555394079497131

I'm leaning towards cutting tensors at every 64 elements (2 tiles), but that will have a larger perf impact.
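
To illustrate what "cutting" at a fixed granularity means, a hypothetical PyTorch sketch: the reduction dim is split into fixed-size chunks, each chunk is summed, and the partial sums are then combined (chunk size and names are assumptions, not the actual kernel):

import torch

def chunked_mean(x: torch.Tensor, chunk: int = 64) -> torch.Tensor:
    # Sum each chunk of the last dim separately, then combine the partial sums;
    # shorter accumulation chains keep the intermediate rounding error smaller.
    n = x.shape[-1]
    pad = (-n) % chunk
    if pad:
        x = torch.nn.functional.pad(x, (0, pad))
    partial = x.reshape(*x.shape[:-1], -1, chunk).sum(dim=-1)
    return partial.sum(dim=-1, keepdim=True) / n

print(chunked_mean(torch.rand(1, 12, 3200)).shape)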

@bbradelTT
Contributor

For larger tensors, the tree add algorithm improves PCC on GS from ~0.7 to 0.85-0.95. However, there is going to be a pretty big performance hit for mean/sum/std/var.

The options are:

  1. Just enable fp32 dest acc by default, which will only help WH onwards
  2. Enable fp32 dest acc by default and add in the tree add algorithm while dividing the tensors at a fine granularity
  3. Enable fp32 dest acc by default and add in the tree add algorithm while dividing the tensors at a coarse granularity

@davorchap
How much do we care about accuracy on GS?
Do you have a preference for which option I go with?

bbradelTT added a commit that referenced this issue Dec 6, 2024
### Ticket
Link to Github Issue #13621

### Problem description
reduce sum is not very accurate because fp32 acc for reduce was not
enabled by default

### What's changed
enable fp32 acc for reduce by default

### Checklist
- [x] Post commit CI passes. Between two runs, all jobs passed:
https://github.com/tenstorrent/tt-metal/actions/runs/12147218201 and
https://github.com/tenstorrent/tt-metal/actions/runs/12160961521
- [x] Blackhole Post commit (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12187630112
- [x] Model regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12187623632/job/33999246179
fails the same as main, except for another random tt-smi reset not working. main:
https://github.com/tenstorrent/tt-metal/actions/runs/12189517366
- [x] Device performance regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12187626769 passes
for WH; GS is not affected and fails as it does on main:
https://github.com/tenstorrent/tt-metal/actions/runs/12189542166
- [x] New/Existing tests provide coverage for changes
@bbradelTT
Contributor

Update:

  • still waiting for underlying issues to be fixed
  • enabled fp32 dest acc by default, which will hopefully help with PCC on WH at least

@ntarafdar
Contributor

Update on one of the underlying issues (pad last dim):
@jaykru-tt is doing a good job rewriting the kernel from scratch to handle misaligned reads/writes. We expect a solution early to mid next week.

@bbradelTT
Contributor

@dgolubovicTT @nvukobratTT

Are the PCCs sufficiently good now for your purposes on the newest tt-metal main?

I talked to @davorchap and he said it's better not to worry about GS, and it would be good if the fp32 dest acc change is sufficient.

@dgolubovicTT

I've rebased onto tt-metal main and rerun the tests for shape (1,x,3200) where x goes from 1 to 2048. PCCs are sufficiently good without the workaround (compute_kernel_config). Thanks for prioritizing this, guys! Closing this one.
