
ttnn.moreh_cumsum causes low accuracy on Bloom model #17594

Open
amalbasaTT opened this issue Feb 5, 2025 · 6 comments
@amalbasaTT
Contributor

Ticket

Link to Github Issue

Describe the bug
ttnn.moreh_cumsum causes low PCC (0.8703882797784891) on Bloom model:
- input: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
- shape: [1, 32, 1, 1]
- dtype: ttnn.uint32
- layout: Layout::TILE
The output tensor contains 8388608 at a position where the input tensor has 1.
The issue occurs at ttnn_moreh_cumsum = ttnn.moreh_cumsum(ttnn_to_device, 1, )
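
For reference, a minimal torch-only sketch (not part of the original report) of what the cumsum over dim 1 should produce for this input:

import torch

# the reported input: 25 zeros followed by 7 ones, shape [1, 32, 1, 1]
x = torch.tensor([0] * 25 + [1] * 7, dtype=torch.int32).reshape(1, 32, 1, 1)
print(torch.cumsum(x, dim=1).flatten().tolist())
# expected: 25 zeros followed by [1, 2, 3, 4, 5, 6, 7]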

To Reproduce
Steps to reproduce the behavior:

  1. Install torch-ttnn as explained in https://github.com/tenstorrent/pytorch2.0_ttnn?tab=readme-ov-file
  2. Change to the pytorch2.0_ttnn directory
  3. Generate Bloom_code.py and Bloom_inputs.pickle by running:
pytest tests/models/bloom/test_bloom.py --gen_op_accuracy_tests
  4. In Bloom_code.py, change line 109 from _tensor_constant0 = 0.7071067690849304 to _tensor_constant0 = torch.tensor(0.7071067690849304) (see pytorch2.0_ttnn#743: "Bloom_code.py forward method fails at aten.lift_fresh_copy.default(_tensor_constant0, ) because _tensor_constant0 is not a Tensor"); the edit is sketched after these steps
  5. Run Bloom_code.py:
python3 tests/autogen_accuracy_tests/Bloom_code.py

A low accuracy error will be reported at test_accuracy(cumsum, ttnn_reshape_1).
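
The edit from step 4, sketched in place (only the affected line of the generated Bloom_code.py is shown):

# Bloom_code.py, line 109
# before:
_tensor_constant0 = 0.7071067690849304
# after: wrap the constant in a tensor so aten.lift_fresh_copy.default accepts it
_tensor_constant0 = torch.tensor(0.7071067690849304)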

@ayerofieiev-tt
Member

@amalbasaTT can we create a simple TT-NN test that fails?

@ayerofieiev-tt added accuracy and removed accuracy, bug labels Feb 12, 2025
@ayerofieiev-tt
Member

Also, now that we know about this issue, can someone from the team start looking into it more deeply? Is there expertise in the team?

@amalbasaTT
Contributor Author

Also, now that we know about this issue, can someone from the team start looking into it more deeply? Is there expertise in the team?

We have some experience debugging composite ops and sharding in C++, but not as much with kernels per se. We can certainly look deeper into it, but we would appreciate some docs that explain in more detail how the lower-level functions that appear in the kernel implementations work.

@amalbasaTT
Contributor Author

Unit test:

# SPDX-FileCopyrightText: © 2025 Tenstorrent Inc.

# SPDX-License-Identifier: Apache-2.0

from loguru import logger
import random
import pytest
import torch
import ttnn
import traceback

from tests.ttnn.utils_for_testing import assert_with_pcc, check_with_pcc

aten = torch.ops.aten


def run_moreh_cumsum_tests(
    x,
    device,
):
    x = torch.tensor(x).to(torch.int32)
    
    try:
        ref_value = aten.cumsum.default(x, -1, )
        
        tt_x = ttnn.from_torch(
            x,
            dtype=ttnn.uint32,
            layout=ttnn.ROW_MAJOR_LAYOUT,
        )
        tt_x = ttnn.reshape(tt_x, (1, 32, 1, 1), )
        passed, message = check_with_pcc(x.reshape(1, 32, 1, 1), ttnn.to_torch(tt_x))
        assert passed, message
        tt_x = ttnn.from_device(tt_x, )
        passed, message = check_with_pcc(x.reshape(1, 32, 1, 1), ttnn.to_torch(tt_x))
        assert passed, message
        tt_x = ttnn.to_layout(tt_x, ttnn.TILE_LAYOUT, )
        passed, message = check_with_pcc(x.reshape(1, 32, 1, 1), ttnn.to_torch(tt_x))
        assert passed, message
        tt_x = ttnn.to_device(tt_x, device=device)
        passed, message = check_with_pcc(x.reshape(1, 32, 1, 1), ttnn.to_torch(tt_x))
        assert passed, message
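        # The step under investigation: moreh_cumsum over dim 1 on the uint32 tensor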
        tt_result = ttnn.moreh_cumsum(tt_x, 1, )
        passed, message = check_with_pcc(ref_value.reshape(1, 32, 1, 1), ttnn.to_torch(tt_result))
        assert passed, f"{message, tt_x}"
        tt_result = ttnn.to_layout(tt_result, ttnn.ROW_MAJOR_LAYOUT, )
        tt_result = ttnn.reshape(tt_result, (1, 32), )
        
        tt_result = ttnn.to_torch(tt_result)
        
    except Exception as e:
        logger.warning(f"Test execution crashed: {e}")
        print(traceback.format_exc())
        raise e

    #assert len(tt_result.shape) == len(ref_value.shape)
    assert tt_result.shape == ref_value.shape
    assert_with_pcc(ref_value, tt_result, 0.999)


test_sweep_args = [
    (
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 1, 1, 1, 1, 1, 1],
    ),
]


@pytest.mark.parametrize(
    "x",
    (test_sweep_args),
)
def test_moreh_cumsum(x, device):
    run_moreh_cumsum_tests(x, device)
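
A hypothetical standalone runner for the test above (assumption: appended to the same file, with Tenstorrent device 0 available; under pytest the device fixture provides device instead):

if __name__ == "__main__":
    device = ttnn.open_device(device_id=0)
    try:
        run_moreh_cumsum_tests(test_sweep_args[0], device)
    finally:
        ttnn.close_device(device)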

@bfilipovicTT
Contributor

@ayerofieiev-tt I looked into it and found that we hit a similar issue while testing the "add" operation when the second operand was an int scalar (#17019 (comment)).
Since this op also involves summing integers, and since Andrija's unit test passes when I change int to float, I suspect the two issues are related.
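
A minimal sketch of the int-to-float substitution mentioned above (assumption: these are the two lines changed in the unit test, with bfloat16 standing in for the float dtype):

x = torch.tensor(x).to(torch.float32)  # was torch.int32

tt_x = ttnn.from_torch(
    x,
    dtype=ttnn.bfloat16,  # was ttnn.uint32
    layout=ttnn.ROW_MAJOR_LAYOUT,
)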

@nemanjagrujic
Contributor

@ayerofieiev-tt

We looked into this a bit deeper. It turns out that ttnn.add and ttnn.experimental.add also don't work with uint32.

A workaround for now is to change the lowering to include a typecast to and from bfloat16. That will improve the model's accuracy so we can continue; it is sketched below.

As for moreh_cumsum, it uses ckernel::add_tiles and I'm not sure how to make that work with uint32. There is an SFPU operation for adding integers, and we are looking into that now.
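
A minimal sketch of that workaround (assumption: ttnn.typecast handles the round trip through bfloat16; this is not the actual lowering code):

tt_x_bf16 = ttnn.typecast(tt_x, ttnn.bfloat16)        # uint32 -> bfloat16
tt_sum_bf16 = ttnn.moreh_cumsum(tt_x_bf16, 1)         # cumsum in bfloat16
tt_result = ttnn.typecast(tt_sum_bf16, ttnn.uint32)   # back to uint32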

@umadevimcw any insight?
