#15572: Rewrite of Reshape OP from scratch
Conversation
…nly ever called on 3D shapes
Sweep tests are showing 98.57% with no hangs on this branch for reshape.
namespace tt::data_movement::common {

template <bool guaranteed_16B_alligned, bool copy_async>
FORCE_INLINE
void tt_memmove (
This is great. Can you announce this utility function to TT developers? It might be useful for others.
ttnn::pybind_overload_t{
[](const data_movement_operation_t& self,
const ttnn::Tensor& input_tensor,
const ttnn::SmallVector<int32_t> shape) -> ttnn::Tensor { return self(input_tensor, shape); },
const ttnn::SmallVector<int32_t> shape,
const std::optional<MemoryConfig> &memory_config,
Amazing! Is there a condition for the 0-cost view that this is left as null? In the case where we want to go from L1 to DRAM or vice versa, or sharded to interleaved or vice versa, it will require a data movement.
Oh true, just added it
…t-metal into jvega/reshape_rm_on_device
To highlight the prevalence of host fallback: according to sweeps, 11% of all sweep tests from the models team required fallback on host in the implementation currently on main. This is after the previously merged view operation eliminated many of them.
# issue 15048
def test_broken_reshape(device):
this works now, right? "test_broken_reshape" is such a confusing name :D
I just copied the code from the issue; I am going to rename the test to test_previously_failing_test.
py::kw_only(),
py::arg("memory_config") = std::nullopt,
py::arg("queue_id") = 0,
py::arg("pad_value") = std::nullopt
how do you choose between 0 and nullptr here?
I set the default directly when it is an integer and there is no type ambiguity. In the case of pad_value, it actually accepts either a float or an int and interprets it based on the datatype of the tensor: if it is a bfloat16 tensor, the default is 0.0, but in a uint32 tensor the default is the integer 0. In the case of memory_config, I read in the configuration of the source tensor and match it to the output tensor if it is not specified.
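A minimal sketch of the default-resolution behavior described above, using simplified stand-in types (DataType, MemoryConfig, TensorLike) rather than the actual ttnn classes; the helper names are assumptions for illustration, not the PR's code:

```cpp
#include <optional>
#include <variant>

// Stand-in types; the real ttnn Tensor / MemoryConfig / DataType are richer.
enum class DataType { BFLOAT16, FLOAT32, UINT32, INT32 };
struct MemoryConfig { /* layout, buffer type, ... */ };
struct TensorLike {
    DataType dtype;
    MemoryConfig memory_config;
};

using PadValue = std::variant<uint32_t, float>;

// If pad_value is not supplied, pick 0.0f for floating-point tensors and 0u for integer tensors.
PadValue resolve_pad_value(const TensorLike& input, const std::optional<PadValue>& pad_value) {
    if (pad_value.has_value()) {
        return *pad_value;
    }
    const bool is_float = input.dtype == DataType::BFLOAT16 || input.dtype == DataType::FLOAT32;
    return is_float ? PadValue{0.0f} : PadValue{0u};
}

// If memory_config is not supplied, mirror the input tensor's configuration.
MemoryConfig resolve_memory_config(const TensorLike& input, const std::optional<MemoryConfig>& memory_config) {
    return memory_config.value_or(input.memory_config);
}
```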
//
// SPDX-License-Identifier: Apache-2.0
typedef std::variant<uint32_t, float> PadValue; |
using PadValue = std::variant<uint32_t, float>;
} // namespace operations::data_movement
constexpr auto reshape =
    ttnn::register_operation<"ttnn::reshape", ttnn::operations::data_movement::ReshapeViewOperation>();
constexpr auto reshape = ttnn::register_operation<"ttnn::reshape", ttnn::operations::data_movement::ReshapeViewOperation>();
Want to double-check: I did not see launch_op, should this be register_operation_with_auto_launch_op?
I have operation::run in reshape.cpp line 246. I don't always run the device part of the operation, since metadata-only changes do not go on device; I want to be able to control whether the operation will be run or not. Maybe my understanding of how operation registering works is wrong?
input_volume = input_volume * input_shape[i];
uint32_t inferred_dim = -1;
for (uint32_t i=0; i< shape.rank(); i++) {
if (((int)(shape[i])) == -1) {
Please avoid C-style casts; prefer C++ casts like static_cast, dynamic_cast, reinterpret_cast.
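For illustration, a self-contained version of the quoted check with the C-style cast replaced by static_cast (std::vector stands in for the real shape type here):

```cpp
#include <cstdint>
#include <vector>

// Returns the index of the first -1 entry (the dimension to infer), or -1 if none is present.
int32_t find_inferred_dim(const std::vector<uint32_t>& shape) {
    for (uint32_t i = 0; i < shape.size(); i++) {
        if (static_cast<int32_t>(shape[i]) == -1) {
            return static_cast<int32_t>(i);
        }
    }
    return -1;
}
```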
for (int i = 0; i < input_shape.rank(); i++) {
input_volume = input_volume * input_shape[i];
uint32_t inferred_dim = -1;
for (uint32_t i=0; i< shape.rank(); i++) {
Looks like clang-format was not run here.
}
// Perform a reshape (view)
//Perform a reshape (view)
return tensor.reshape(shape);
How soon will we actually have separate view and reshape calls, so it's really clear when the cost is guaranteed to be ~0?
There is a separate PR right now for metadata-only reshapes that will cause PCC issues if incorrectly used. However, I would like to consolidate the guidance to always use ttnn.reshape, which does the view when it is safe to do so. Maybe I can add a can_do_view operation which will tell the model writer in advance whether it is going to be a view or not? I have also provided clear guidance in the pybind to state when a 0-cost view is happening.
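A hypothetical sketch of what such a can_do_view helper could look like; this is not part of the PR, and the real conditions for a zero-cost view (layout, padding, sharding, memory config) are more involved than this illustration. It also folds in the memory-config condition discussed earlier in the thread (L1 vs DRAM, sharded vs interleaved):

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <optional>
#include <vector>

// Simplified stand-ins for the real tensor metadata.
struct MemoryConfigLike {
    bool is_sharded = false;
    bool is_l1 = false;
    bool operator==(const MemoryConfigLike& other) const {
        return is_sharded == other.is_sharded && is_l1 == other.is_l1;
    }
};

struct TensorMeta {
    std::vector<uint32_t> shape;
    MemoryConfigLike memory_config;
};

static uint64_t volume(const std::vector<uint32_t>& shape) {
    return std::accumulate(shape.begin(), shape.end(), uint64_t{1}, std::multiplies<uint64_t>());
}

// True when the reshape could plausibly be a metadata-only view: same volume,
// unchanged innermost dimension, and no change of memory configuration.
bool can_do_view(
    const TensorMeta& input,
    const std::vector<uint32_t>& new_shape,
    const std::optional<MemoryConfigLike>& requested_config) {
    const bool same_volume = volume(input.shape) == volume(new_shape);
    const bool same_inner_dim = !input.shape.empty() && !new_shape.empty() &&
                                input.shape.back() == new_shape.back();
    const bool same_config = !requested_config.has_value() || *requested_config == input.memory_config;
    return same_volume && same_inner_dim && same_config;
}
```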
}
}
}
auto override_runtime_args_callback = [reader_kernel_id, compute_with_storage_grid_size](
move to a separate function?
Almost every OP does it this way though; I wanted to maintain consistency.
That style is deprecated now. For existing ops it's fine, but we do prefer this variant:
https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/adding_new_ttnn_operation.html
#define max_packet_size 8192

template <uint32_t max_transfer_size, bool only_reads>
FORCE_INLINE void enhanced_noc_async_read(
    const uint64_t src_noc_addr, const uint32_t dst_l1_addr, const uint32_t bytes) {
    // If you do not know the max_transfer_size at compile time write 0 to it.
    // only_reads is true if we ONLY use noc_async_read and all calls to tt_memmove have use_read_datamover as True
    if constexpr (((max_transfer_size < max_packet_size) && (max_transfer_size != 0)) || only_reads) {
        noc_async_read_one_packet(src_noc_addr, dst_l1_addr, bytes);
    } else {
        noc_async_read(src_noc_addr, dst_l1_addr, bytes);
    }
}

template <uint32_t max_transfer_size, bool only_writes>
FORCE_INLINE void enhanced_noc_async_write(
    const uint32_t src_l1_addr, const uint64_t dst_noc_addr, const uint32_t bytes) {
    // If you do not know the max_transfer_size at compile time write 0 to it.
    // only_writes is true if we ONLY use noc_async_write and all calls to tt_memmove have use_read_datamover as False
    if constexpr (((max_transfer_size < max_packet_size) && (max_transfer_size != 0)) || only_writes) {
        noc_async_write_one_packet(src_l1_addr, dst_noc_addr, bytes);
    } else {
        noc_async_write(src_l1_addr, dst_noc_addr, bytes);
    }
}
Please do not add a new hardcoded define for max_packet_size. We already have NOC_MAX_BURST_SIZE.
We also already updated some APIs to do this sort of thing, e.g. noc_async_write is already templated to support using the one-packet API if you specify a size less than the max transfer size. You could instead copy how noc_async_write was templated for noc_async_read.
It has the max packet size support but not the only_writes support, which gives us just a bit more perf, as some of the data mover OPs only have a reader or a writer in a kernel. I think we should keep it as is but will migrate when the only-writes optimization is also there. I am changing the max packet size to NOC_MAX_BURST_SIZE though.
We didn't expose only-read/only-write flags in the original NOC APIs because initial testing seemed to show it was mostly functional, but there were some kernels that seemed to hang even though they should have only been issuing reads/writes, so it was never merged, since it risks users running into issues. It's fine if you want to use/expose it outside of dataflow_api, but you should be careful about exposing a common API for users that hasn't been fully tested. E.g. GS does not support > 8K transactions at all, but that is not handled correctly here to ignore the flag for GS.
My suggestion here is to remove your logic regarding max transfer size, since it is already handled in noc_async_write, and instead pass the value down.
Ex:
template <uint32_t max_transfer_size, bool only_writes>
FORCE_INLINE void enhanced_noc_async_write(
    const uint32_t src_l1_addr, const uint64_t dst_noc_addr, const uint32_t bytes) {
    // If you do not know the max_transfer_size at compile time write 0 to it.
    // only_writes is true if we ONLY use noc_async_write and all calls to tt_memmove have use_read_datamover as False
    if constexpr (only_writes) {
        noc_async_write_one_packet(src_l1_addr, dst_noc_addr, bytes);
    } else {
        noc_async_write<max_transfer_size>(src_l1_addr, dst_noc_addr, bytes);
    }
}
}
}

template <bool guaranteed_16B_alligned, bool copy_async, bool use_read_datamover, uint32_t max_transfer_size>
A number of typos: "alligned".
Lol, English not being my first language is showing
FORCE_INLINE void tt_memmove(const uint32_t dst_l1_addr, const uint32_t src_l1_addr, const uint32_t bytes) {
    //Function performs a memory copy between two l1 addresses in the local core
    //Uses noc_async_read when possible to copy the data over
    //Set guaranteed 16B alligned to true if the source and destination are externally guaranteed to be 16B alligned (dangerous)
You should also not be hardcoding alignment values like 16 and 64, since there's no guarantee they're always that value across arches; these can be queried and passed from host as compile-time args, or accessed in kernels, e.g. L1_ALIGNMENT.
This is a bit annoying to fix, as I am also using MASK and OFFSET values that are hardcoded to match alignments of 16 and 64 respectively, and they are also used in other files since this is the common code for all data mover OPs. I can't take the value as a compile-time arg because this is in common code. However, because it is in common code, I think updating it should a new arch use a new alignment would not be very difficult.
Come to think of it, I think the cleanest fix would be to define a DDR_ALIGNMENT at the same location where L1_ALIGNMENT is defined, set automatically at compile time depending on the ARCH type. The OPs would then use that constant value instead, which is auto-updated to match the architecture. However, that would not be the solution for this PR.
Would it be acceptable to let this merge in as is for now? I will then post an issue so I can fix it in a cleaner way. This is just in the interest of getting this in as soon as possible, since the change would alter multiple files and require a new function, and I am not comfortable doing that without re-running the pipelines, which take hours.
DRAM_ALIGNMENT is also already defined (is this what you meant by DDR_ALIGNMENT?). It's fine if you want to address it later; I'm not blocking the PR on any of these comments, but I wanted to make sure this was considered, as there is currently work being done regarding which alignments are used.
Oh cool, good to know. Yeah, I will do a subsequent PR, since I do want to avoid having to revisit alignment when the next arch comes in.
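A small sketch of the direction discussed in this thread: deriving the mask/offset from an architecture-provided alignment constant instead of hardcoding 16. The constant names L1_ALIGNMENT / DRAM_ALIGNMENT come from the discussion above; the helper below is illustrative only, not the PR's code:

```cpp
#include <cstdint>

// Stand-in for the architecture-provided constant (L1_ALIGNMENT / DRAM_ALIGNMENT in kernels).
constexpr uint32_t kAlignment = 16;

// Mask selecting the offset of an address within one alignment window.
constexpr uint32_t kAlignmentMask = kAlignment - 1;

// True when both addresses share the same offset within an alignment window,
// i.e. when an aligned bulk-copy path (such as the noc_async_read-based memmove) can be used.
constexpr bool same_alignment_offset(uint32_t src_l1_addr, uint32_t dst_l1_addr) {
    return (src_l1_addr & kAlignmentMask) == (dst_l1_addr & kAlignmentMask);
}
```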
@@ -119,7 +119,7 @@ def test_perf_device_bare_metal(batch_size, expected_perf):
    margin = 0.03

    if is_wormhole_b0():
        expected_perf = 4155.25
        expected_perf = 3990.0
is this a perf drop or increase (lower or higher is better)?
Perf increased, but the nature of the test makes it report a decrease. The device test currently uses Tracy to record how long the model runs on the device. However, we are now making the really expensive on-host reshapes (which waste time moving data in and out of the device) run on device instead, which gives a significant end-to-end perf improvement. This also means device run-time goes up, since the device is doing more and the host is doing less, and the very expensive host memory copy is no longer reflected in the measurement. That is why we also ran full-model experiments and saw that we significantly improved perf when taken holistically. We do, however, need to update these targets to reflect the new device runtime.
freaking excellent, great job @jvegaTT !
37 commits :), squash to 1 commit? e.g.,
Yup, doing a squash and merge now
This would be an amazing commit message
Ticket
Solves the following issues (and potentially others, still going through and testing opened issues):
#13889
#12153
#14676
#15048
#15269
#14515
#14513
#14972
#15524
Problem description
The Reshape OP was re-written from scratch and has no host fallback for legal uses.
Previous issues were:
A lack of higher-dimensionality support. I built a dimensionality reducer which shows that for any arbitrary ND shape there exists a 3D shape in tile layout, or a 2D shape in row major, where conversion between that shape and the original does not require data movement (metadata only). We now do an ND->MD reshape as ND->3D->reshape->3D->MD in tiled layout and ND->2D->reshape->2D->MD in row major, where we select the two intermediate 2/3D shapes such that ND->2/3D and 2/3D->MD are metadata-only changes (see the sketch after this list). This allows us to only have to support 3D->3D reshapes in tile and 2D->2D in row major in the kernel, and we get generality for free.
A PCC issue caused by an invalid optimization which would produce a metadata-only reshape even when some of the data should have been moved. In this case the data corruption was small but not zero, so it often ended up hidden in most invalid calls. This has been removed, which hurt perf a bit, but it is a necessary change.
The reshape function has been generalized to work in every case, removing the need for the reshape-on-host fallback for valid cases.
The old implementation unnecessarily needed a CB large enough to hold the entire tensor. The new version only needs to allocate CBs whose total size is page_size_of_input + page_size_of_output + a small constant. This enables the reshape of very large tensors that almost fully occupy L1. It also allows us to parallelize reading and writing by only adding barriers when needed, rather than a big barrier between the reader and the writer kernels.
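A minimal sketch of the ND->3D collapse for tiled tensors described in the first item above; the collapse rule and the helper name are illustrative assumptions rather than the PR's actual code. The idea: fold every dimension above the last two into one, which is a metadata-only change for a tile-ordered tensor, so the kernel only ever sees 3D->3D.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Collapse an ND shape to 3D by folding all leading dims into one, keeping the last two dims.
std::vector<uint32_t> collapse_to_3d(const std::vector<uint32_t>& shape) {
    if (shape.size() <= 3) {
        return shape;
    }
    const uint64_t folded = std::accumulate(
        shape.begin(), shape.end() - 2, uint64_t{1}, std::multiplies<uint64_t>());
    return {static_cast<uint32_t>(folded), shape[shape.size() - 2], shape.back()};
}

// Example: {2, 3, 4, 32, 32} -> {24, 32, 32}; the ND<->3D conversions on either side
// of the actual reshape are then metadata-only.
```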
What's changed
Complete re-write of reshape OP - Sweep pass rate is 98.57%
Checklist
- [x] Post commit CI passes: https://github.com/tenstorrent/tt-metal/actions/runs/12164049672
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable): https://github.com/tenstorrent/tt-metal/actions/runs/12166706962 https://github.com/tenstorrent/tt-metal/actions/runs/12157327133/job/33903077970
- [ ] New/Existing tests provide coverage for changes