
[Bug Report] ttnn.mean op - Data Mismatch #13621

Closed
chandrasekaranpradeep opened this issue Oct 9, 2024 · 50 comments
@chandrasekaranpradeep

Describe the bug
ttnn.mean throws an assertion error because of a data mismatch between the PyTorch and TTNN outputs: the PCC drops to 0.72 when an input tensor of shape (1, 12, 3200) with dim = -1 is passed to the ttnn.mean op.
For more context, here is the exact error message:

def assert_with_pcc(expected_pytorch_result, actual_pytorch_result, pcc=0.9999):
        assert list(expected_pytorch_result.shape) == list(
            actual_pytorch_result.shape
        ), f"list(expected_pytorch_result.shape)={list(expected_pytorch_result.shape)} vs list(actual_pytorch_result.shape)={list(actual_pytorch_result.shape)}"
        pcc_passed, pcc_message = comp_pcc(expected_pytorch_result, actual_pytorch_result, pcc)
>       assert pcc_passed, construct_pcc_assert_message(pcc_message, expected_pytorch_result, actual_pytorch_result)
E       AssertionError: 0.7203957195745748
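
For reference, the PCC here is the Pearson correlation coefficient between the flattened expected and actual tensors. A rough sketch of what the check computes (illustrative only, not the actual comp_pcc source):

import torch

def pcc(expected: torch.Tensor, actual: torch.Tensor) -> float:
    # Pearson correlation of the flattened tensors, computed in float64
    e = expected.flatten().to(torch.float64)
    a = actual.flatten().to(torch.float64)
    e = e - e.mean()
    a = a - a.mean()
    return float((e @ a) / (e.norm() * a.norm()))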

To Reproduce

Run the following test:

import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc
from models.utility_functions import torch_random
def test_mean_pcc_issue(device):
    torch.manual_seed(0)

    input_shape = (1, 12, 3200)
    reduce_dim = -1
    
    torch_input_tensor = torch.rand(input_shape, dtype=torch.float32)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=reduce_dim, keepdim=True, dtype=torch.float32)

    input_tensor = ttnn.from_torch(torch_input_tensor, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)
    
    output_tensor = ttnn.mean(input_tensor, dim=reduce_dim)
    output_tensor = ttnn.to_torch(output_tensor)
    
    assert_with_pcc(torch_output_tensor, output_tensor)

Expected behavior
The data mismatch between the PyTorch and TTNN outputs should be resolved.

@sdjordjevicTT
Contributor

@ntarafdar @sjameelTT can you please help me find an owner for this issue?

@ntarafdar
Contributor

Hey @sdjordjevicTT, I'm asking around; it's a reduction op that doesn't have an owner. I'll ask the TTNN folks and get back to you.

@ntarafdar
Contributor

@sdjordjevicTT I asked around, and since there is no other owner for this, the TMs team will have to take it.
We cannot get to it until the end of next week.

@ntarafdar ntarafdar assigned yugi957 and unassigned sjameelTT and ntarafdar Oct 29, 2024
@sdjordjevicTT
Contributor

Thanks @ntarafdar for picking this up. Great, I believe that should work for us.

@jvasilje
Collaborator

Moving to a P1 issue. @sdjordjevicTT please comment if you believe P0 is justified.

@jvasilje jvasilje added P1 and removed P0 labels Oct 30, 2024
@sdjordjevicTT
Contributor

@nvukobratTT can comment more about priority, but I think this issue blocks Llama 3B bring-up on the Forge side.

@nvukobratTT
Contributor

> Moving to a P1 issue. @sdjordjevicTT please comment if you believe P0 is justified.

Confirming what @sdjordjevicTT mentioned, this one is a blocker for the Open Llama 3B model.

Additional details can be found on the MLIR issue as well:

@ntarafdar
Contributor

Spoke to Jasmine, and @bbradelTT is taking over reductions for now. I'm reassigning this to him.

@ntarafdar ntarafdar assigned bbradelTT and unassigned yugi957 Nov 5, 2024
@bbradelTT
Contributor

I tried to find out if there's any point at which there's a big drop-off. It seemed like it might be somewhere between 1200 and 1400, but the PCC value goes up and down a fair amount:

    #input_shape = (1, 12, 3200) # .72
    #input_shape = (1, 12, 1600) # .76
    #input_shape = (1, 12, 1400) # .78
    #input_shape = (1, 12, 1376) # .81
    #input_shape = (1, 12, 1363) # .93
    #input_shape = (1, 12, 1369) # .83
    #input_shape = (1, 12, 1368) # .85
    #input_shape = (1, 12, 1367) # .71
    #input_shape = (1, 12, 1366) # .96
    #input_shape = (1, 12, 1350) # .95
    #input_shape = (1, 12, 1344) # .87
    #input_shape = (1, 12, 1300) # .91
    #input_shape = (1, 12, 1200) # .93
    #input_shape = (1, 12, 800) # .92
    #input_shape = (1, 12, 320) # .99
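
A sweep like the one above can be scripted rather than edited by hand; a hypothetical helper, assuming the same pytest device fixture used by the repro test (PCC computed inline for simplicity):

import torch
import ttnn

def sweep_last_dim(device, sizes=range(320, 3264, 64)):
    # Print the PCC of ttnn.mean vs torch.mean for a range of last-dim sizes.
    torch.manual_seed(0)
    for n in sizes:
        torch_input = torch.rand((1, 12, n), dtype=torch.float32)
        expected = torch.mean(torch_input, dim=-1, keepdim=True)
        tt_input = ttnn.from_torch(torch_input, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)
        actual = ttnn.to_torch(ttnn.mean(tt_input, dim=-1))
        e = expected.flatten().double() - expected.double().mean()
        a = actual.flatten().double() - actual.double().mean()
        print(f"last dim {n}: pcc {float((e @ a) / (e.norm() * a.norm())):.4f}")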

@sdjordjevicTT
Contributor

Hi @bbradelTT, do we have any updates regarding this mismatch problem?

@bbradelTT
Contributor

@sdjordjevicTT Unfortunately we need to overhaul reduce. I won't have concrete updates for a while.

@nvukobratTT
Contributor

> @sdjordjevicTT Unfortunately we need to overhaul reduce. I won't have concrete updates for a while.

@bbradelTT thanks for the details. Can you clarify the following:

  1. What are the core issues with reduce mean and the related PCC problems?
    • Having this info, we might be able to work around it until a fix is in place
  2. Details around lowering the priority on this issue.
    • From the current standpoint, this issue should be treated as P0 as it blocks one of the Forge core models, Llama.

To be certain that this issue is properly tracked, I'm re-adding the P0 label once again. Please correct me if I'm missing some context as to why this one should still be a P1 issue.

Thanks!

@nvukobratTT nvukobratTT removed the P1 label Nov 13, 2024
@dgolubovicTT

I tried running the reduce mean test for shape (1,32,3200) with dim = -1, and it also gives a data mismatch.
Since I generated the input using torch.rand, the mean should be around 0.5 for every row. That is the case in the torch output:

fw_out is  tensor([[[0.4979],
         [0.4969],
         [0.5080],
         [0.5029],
         [0.5012],
         [0.5046],
         [0.4993],
         [0.5034],
         [0.5109],
         [0.4984],
         [0.4972],
         [0.4963],
         [0.5007],
         [0.5020],
         [0.4910],
         [0.4976],
         [0.5074],
         [0.4978],
         [0.4947],
         [0.5001],
         [0.4986],
         [0.4956],
         [0.5075],
         [0.4962],
         [0.4956],
         [0.5050],
         [0.5009],
         [0.5055],
         [0.5023],
         [0.5025],
         [0.4936],
         [0.4976]]])

However, the TTNN output is consistently less than 0.5 for all 32 rows:

co_out is  tensor([[[0.4648],
         [0.4707],
         [0.4844],
         [0.4727],
         [0.4570],
         [0.4766],
         [0.4727],
         [0.4785],
         [0.4883],
         [0.4707],
         [0.4766],
         [0.4648],
         [0.4668],
         [0.4668],
         [0.4629],
         [0.4648],
         [0.4688],
         [0.4688],
         [0.4590],
         [0.4805],
         [0.4766],
         [0.4668],
         [0.4824],
         [0.4648],
         [0.4668],
         [0.4844],
         [0.4746],
         [0.4766],
         [0.4668],
         [0.4707],
         [0.4668],
         [0.4766]]])

Interestingly, I tried this for shapes (1,32,x) where x goes from 10 to 400 (exclusive) with a step of 10.

These shapes are failing with data mismatch:

(1,32,210)
(1,32,220)
(1,32,250)
(1,32,330)
(1,32,340)
(1,32,350)
(1,32,360)
(1,32,370)
(1,32,380)
(1,32,390)

For x less than 210 there is no data mismatch.

@dgolubovicTT

Since the inputs come from torch.rand (a uniform distribution over [0,1)), it is really odd that the mean over 3200 values is 0.46-0.48. It seems that mean doesn't work on larger dimensions at all...

@bbradelTT
Contributor

Today's update:

@dgolubovicTT

Hey @bbradelTT,
I've talked to @nvukobratTT offline to scope requirements for this issue.

For the first phase we need ttnn.mean to work for shapes (1,12,3200), (1,128,3200) and (1,1,3200). Fixing this is P0.

For the second phase, we will need ttnn.mean to work for all shapes (1,x,3200) for x in [1,2048]. Only then can we get Llama inference for any number of input tokens. This second phase can be P1.

So after you finish this first phase, we will move this issue to P1, and go from there.

@bbradelTT
Contributor

Update:

  • still blocked

I created #15656 to look at improving accuracy further.

@bbradelTT
Contributor

Update:

  • accumulating into FP32 brought the PCC high enough that the tree add algorithm is accurate enough.
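
For context, a pairwise ("tree") reduction adds neighbouring elements recursively, so each value passes through roughly log2(n) additions instead of up to n. A minimal PyTorch sketch of the idea (illustrative only, not the TTNN kernel; function names are hypothetical):

import torch

def naive_sum_bf16(x: torch.Tensor) -> torch.Tensor:
    # Left-to-right accumulation in bfloat16; rounding error grows with the number of adds.
    acc = torch.tensor(0.0, dtype=torch.bfloat16)
    for v in x.to(torch.bfloat16):
        acc = acc + v
    return acc

def tree_sum_bf16(x: torch.Tensor) -> torch.Tensor:
    # Repeatedly add neighbouring pairs; the accumulation chain is ~log2(n) deep.
    vals = x.to(torch.bfloat16)
    while vals.numel() > 1:
        if vals.numel() % 2:
            vals = torch.cat([vals, torch.zeros(1, dtype=vals.dtype)])
        vals = vals[0::2] + vals[1::2]
    return vals[0]

x = torch.rand(3200)
print(x.mean(), naive_sum_bf16(x) / 3200, tree_sum_bf16(x) / 3200)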

@bbradelTT
Contributor

@dgolubovicTT @nvukobratTT

As a temporary workaround, could you try using a compute kernel with fp32 acc enabled and let me know if there is a change in PCC?

Something like

compute_kernel_config = ttnn.WormholeComputeKernelConfig(
    math_fidelity=ttnn.MathFidelity.HiFi4,
    math_approx_mode=False,
    fp32_dest_acc_en=True,
    packer_l1_acc=False,  # or True, doesn't matter
)
output_tensor = ttnn.mean(input_tensor, dim=reduce_dim, compute_kernel_config=compute_kernel_config)
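
For convenience, here is the repro test from the issue description with this workaround applied (assembled from the snippets above; the test name is made up and the combination is untested as a whole):

import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc

def test_mean_fp32_acc_workaround(device):
    torch.manual_seed(0)
    torch_input_tensor = torch.rand((1, 12, 3200), dtype=torch.float32)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=-1, keepdim=True)

    input_tensor = ttnn.from_torch(torch_input_tensor, dtype=ttnn.float32, layout=ttnn.TILE_LAYOUT, device=device)
    compute_kernel_config = ttnn.WormholeComputeKernelConfig(
        math_fidelity=ttnn.MathFidelity.HiFi4,
        math_approx_mode=False,
        fp32_dest_acc_en=True,
        packer_l1_acc=False,
    )
    output_tensor = ttnn.mean(input_tensor, dim=-1, compute_kernel_config=compute_kernel_config)
    assert_with_pcc(torch_output_tensor, ttnn.to_torch(output_tensor))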

Update:

  • still blocked
  • to help with PCC, I tried to enable fp32_dest_acc_en by default when compute_kernel_config is not specified, but couldn't because of an assert due to an issue on BH that hasn't been fixed yet

@dgolubovicTT

dgolubovicTT commented Dec 4, 2024

@bbradelTT is this workaround for shape (1,12,3200), or should it make all of the above shapes work?

@bbradelTT
Contributor

> @bbradelTT is this workaround for shape (1,12,3200), or should it make all of the above shapes work?

It should improve PCC for all shapes. However, I don't know by how much, and right now I would just like to know about (1,12,3200).

Unfortunately I have a lot of edits on my branch, which means I can't test this change in isolation. I'm hoping that, since this is just a Python change, you will be able to test it easily on your end.

@dgolubovicTT

@bbradelTT I tried this out and it fixes the PCC fully. We have an option to call ttnn.mean with this compute_kernel_config at runtime. Does this compute_kernel_config have any perf repercussions? We would use it for every call of ttnn.mean, so it is important to know.

@bbradelTT
Contributor

It has perf repercussions, but they should be small, and I'm going to need to turn this on by default when no compute_kernel_config is specified.

Which means unless you specify a compute kernel config with it off, you'll see the perf impact.

Also, it will have far smaller perf repercussions than what I'm working on, which is probably still necessary since GS doesn't have fp32.

@ayerofieiev-tt
Member

Subscribing

@bbradelTT
Contributor

Update:

  • still blocked on underlying issues
  • I looked at the behaviour of my tree add change against the baseline on GS (where fp32 is not an option) and got the following:
import pytest
import torch
import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc

@pytest.mark.parametrize("input_shape", [
    (1, 12, 3200),
    (1, 1, 3200),
    (1, 128, 3200),
    (1, 2048, 3200),
    ])
def test_large_tensor_mean(device, input_shape):
    reduce_dim = -1
    torch_input_tensor = torch.rand(input_shape, dtype=torch.float32)
    torch_output_tensor = torch.mean(torch_input_tensor, dim=reduce_dim, keepdim=True, dtype=torch.float32)

    input_tensor = ttnn.from_torch(torch_input_tensor, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    output_tensor = ttnn.mean(input_tensor, dim=reduce_dim)
    output_tensor = ttnn.to_torch(output_tensor)
    assert_with_pcc(torch_output_tensor, output_tensor)

produces

on GS: 
Base:
PASSED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 1, 3200)]
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 12, 3200)] - AssertionError: 0.7769184620303734
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 128, 3200)] - AssertionError: 0.7137656700429718
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 2048, 3200)] - AssertionError: 0.6815292219338761

with cut at 64:
PASSED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 1, 3200)]
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 12, 3200)] - AssertionError: 0.9772472040616764
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 128, 3200)] - AssertionError: 0.9666487764829472
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 2048, 3200)] - AssertionError: 0.9527421326781462

with cut at 128:
PASSED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 1, 3200)]
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 12, 3200)] - AssertionError: 0.9335399543286304
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 128, 3200)] - AssertionError: 0.9003435192493461
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 2048, 3200)] - AssertionError: 0.8903827326146612

with cut at 256:
PASSED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 1, 3200)]
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 12, 3200)] - AssertionError: 0.9522780737237543
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 128, 3200)] - AssertionError: 0.9058280173941267
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 2048, 3200)] - AssertionError: 0.8969139544477592

with cut at 512:
PASSED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 1, 3200)]
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 12, 3200)] - AssertionError: 0.9544566782083125
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 128, 3200)] - AssertionError: 0.8692186521807318
FAILED tests/ttnn/unit_tests/operations/test_reduction_mean.py::test_large_tensor_mean[input_shape=(1, 2048, 3200)] - AssertionError: 0.8555394079497131

I'm leaning towards cutting tensors at every 64 elements (2 tiles), but that will have a larger perf impact.
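
To illustrate what "cutting" at a fixed granularity means, a hypothetical PyTorch sketch: the reduction dim is split into fixed-size chunks, each chunk is summed, and the partial sums are then combined (chunk size and names are assumptions, not the actual kernel):

import torch

def chunked_mean(x: torch.Tensor, chunk: int = 64) -> torch.Tensor:
    # Sum each chunk of the last dim separately, then combine the partial sums;
    # shorter accumulation chains keep the intermediate rounding error smaller.
    n = x.shape[-1]
    pad = (-n) % chunk
    if pad:
        x = torch.nn.functional.pad(x, (0, pad))
    partial = x.reshape(*x.shape[:-1], -1, chunk).sum(dim=-1)
    return partial.sum(dim=-1, keepdim=True) / n

print(chunked_mean(torch.rand(1, 12, 3200)).shape)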

@bbradelTT
Contributor

For larger tensors, the tree add algorithm improves PCC on GS from ~0.7 to 0.85-0.95. However, there is going to be a pretty big performance hit for mean/sum/std/var.

The options are:

  1. Just enable fp32 dest acc by default, which will only help WH onwards
  2. Enable fp32 dest acc by default and add in the tree add algorithm while dividing the tensors at a fine granularity
  3. Enable fp32 dest acc by default and add in the tree add algorithm while dividing the tensors at a coarse granularity

@davorchap
How much do we care about accuracy on GS?
Do you have a preference for which option I go with?

bbradelTT added a commit that referenced this issue Dec 6, 2024
### Ticket
Link to Github Issue #13621

### Problem description
reduce sum is not very accurate because fp32 acc for reduce was not
enabled by default

### What's changed
enable fp32 acc for reduce by default

### Checklist
- [x] Post commit CI passes. Between two runs, all jobs passed:
https://github.com/tenstorrent/tt-metal/actions/runs/12147218201 and
https://github.com/tenstorrent/tt-metal/actions/runs/12160961521
- [x] Blackhole Post commit (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12187630112
- [x] Model regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12187623632/job/33999246179
fails the same as main, except for another random tt-smi reset not working. main:
https://github.com/tenstorrent/tt-metal/actions/runs/12189517366
- [x] Device performance regression CI testing passes (if applicable)
https://github.com/tenstorrent/tt-metal/actions/runs/12187626769 passes
for WH; GS is not affected and fails as it does on main:
https://github.com/tenstorrent/tt-metal/actions/runs/12189542166
- [x] New/Existing tests provide coverage for changes
@bbradelTT
Contributor

Update:

  • still waiting for underlying issues to be fixed
  • enabled fp32 dest acc by default, which will hopefully help with PCC on WH at least

@ntarafdar
Contributor

Update on one of the underlying issues (pad last dim):
@jaykru-tt is doing a good job rewriting the kernel from scratch to handle misaligned reads/writes. We expect a solution early to mid next week.

@bbradelTT
Contributor

@dgolubovicTT @nvukobratTT

Are the PCCs sufficiently good now for your purposes on the newest tt-metal main?

I talked to @davorchap and he said it's better not to worry about GS, and it would be good if the fp32 dest acc change is sufficient.

@dgolubovicTT

I've rebased onto tt-metal main and rerun the tests for shape (1,x,3200) where x goes from 1 to 2048. PCCs are sufficiently good without the workaround (compute_kernel_config). Thanks for prioritizing this, guys! Closing this one.
