#15572: Rewrite of Reshape OP from scratch

Merged 37 commits into main from jvega/reshape_rm_on_device on Dec 5, 2024
Conversation

@jvegaTT (Contributor) commented Nov 29, 2024

Ticket

Solves the following issues (and potentially others, still going through and testing opened issues):
#13889
#12153
#14676
#15048
#15269
#14515
#14513
#14972
#15524

Problem description

The reshape OP was rewritten from scratch and has no host fallback for legal uses.

Previous issues were:

A lack of higher-dimensionality support. I built a dimensionality reducer based on the observation that for any arbitrary ND shape there exists a 3D shape (in tile layout) or a 2D shape (in row-major) such that converting between it and the original shape requires no data movement (metadata only). An ND->MD reshape is now performed as ND->3D->reshape->3D->MD in tile layout and ND->2D->reshape->2D->MD in row-major, where the two intermediate 2D/3D shapes are chosen so that the ND->2/3D and 2/3D->MD steps are metadata-only changes. The kernel therefore only has to support 3D->3D reshapes in tile and 2D->2D in row-major, and we get generality for free.
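As a rough sketch of that folding idea (illustrative helpers, not the PR's actual dimensionality reducer; it assumes the usual tile and row-major page layouts described above):

#include <array>
#include <cstdint>
#include <vector>

// Fold every dim above the last two into one: in tile layout this 3D shape maps to
// the same sequence of tiles as the original ND shape, so converting between the two
// is a metadata-only change.
std::array<uint32_t, 3> collapse_to_3d_tiled(const std::vector<uint32_t>& shape) {
    uint32_t folded = 1;
    for (size_t i = 0; i + 2 < shape.size(); ++i) {
        folded *= shape[i];
    }
    const uint32_t h = shape.size() >= 2 ? shape[shape.size() - 2] : 1u;
    const uint32_t w = shape.empty() ? 1u : shape.back();
    return {folded, h, w};
}

// Fold every dim above the last one: in row-major the last dim is the page, so this
// 2D shape has exactly the same pages as the original ND shape.
std::array<uint32_t, 2> collapse_to_2d_row_major(const std::vector<uint32_t>& shape) {
    uint32_t folded = 1;
    for (size_t i = 0; i + 1 < shape.size(); ++i) {
        folded *= shape[i];
    }
    return {folded, shape.empty() ? 1u : shape.back()};
}

With helpers along these lines, an ND->MD reshape reduces to collapsing both endpoints and running only the 3D->3D (tile) or 2D->2D (row-major) kernel in between.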

A PCC issue caused by an invalid optimization that produced a metadata-only reshape even when some of the data should have been moved. The resulting data corruption was small but non-zero, so it often went unnoticed in invalid calls. That optimization has been removed; this hurts perf a bit, but it is a necessary change.

The reshape function has been generalized to work in every case, removing the need for the host-fallback reshape for valid cases.

The old implementation needed a CB large enough to hold the entire tensor. The new version only allocates CBs whose total size is page_size_of_input + page_size_of_output plus a small constant, which enables reshaping very large tensors that almost fully occupy L1. It also lets us overlap reading and writing by adding barriers only where needed, rather than placing one big barrier between the reader and writer kernels.
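For concreteness, a small sketch of that CB budget (the page sizes and the constant below are illustrative assumptions, not values from the kernel):

#include <cstdint>

// The new kernel's total circular-buffer allocation is bounded by one input page plus
// one output page plus a small constant, independent of the tensor's volume.
constexpr uint32_t cb_budget_bytes(uint32_t input_page_size, uint32_t output_page_size) {
    constexpr uint32_t small_constant = 64;  // illustrative placeholder
    return input_page_size + output_page_size + small_constant;
}

// e.g. with 4 KB pages the CBs need roughly 8 KB of L1, whether the tensor itself is
// 16 KB or nearly all of L1.
static_assert(cb_budget_bytes(4096, 4096) == 8256, "CB footprint is independent of tensor size");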

What's changed

Complete re-write of reshape OP - Sweep pass rate is 98.57%

Checklist

- [x] Post commit CI passes
  https://github.com/tenstorrent/tt-metal/actions/runs/12164049672
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
  https://github.com/tenstorrent/tt-metal/actions/runs/12166706962
  https://github.com/tenstorrent/tt-metal/actions/runs/12157327133/job/33903077970
- [ ] New/Existing tests provide coverage for changes

@jvegaTT (Contributor Author) commented Nov 29, 2024

Sweep tests are showing 98.57% with no hangs on this branch for reshape


namespace tt::data_movement::common {

template <bool guaranteed_16B_alligned, bool copy_async>
FORCE_INLINE
void tt_memmove (
Contributor

This is great. Can you announce this utility function in TT developers? It might be useful for others.

ttnn::pybind_overload_t{
[](const data_movement_operation_t& self,
const ttnn::Tensor& input_tensor,
const ttnn::SmallVector<int32_t> shape) -> ttnn::Tensor { return self(input_tensor, shape); },
const ttnn::SmallVector<int32_t> shape,
const std::optional<MemoryConfig> &memory_config,
Contributor

Amazing! Is there a condition for the 0-cost view that this is left as null? Because in the case where we want to go from L1 to DRAM or vice versa, or from sharded to interleaved or vice versa, it will require data movement.

Contributor Author

Oh true, just added it

@jvegaTT requested a review from rfurko-tt December 4, 2024 20:58
@jvegaTT (Contributor Author) commented Dec 4, 2024

To highlight the prevalence of host fallback: according to sweeps, 11% of all sweep tests from the models team required host fallback with the implementation currently on main. This is after the previously merged view operation had already eliminated many of them.



# issue 15048
def test_broken_reshape(device):
Member

this works now, right? "test_broken_reshape" is such a confusing name :D

Contributor Author

I just copied the code from the issue; I am going to rename the test to test_previously_failing_test.

py::kw_only(),
py::arg("memory_config") = std::nullopt,
py::arg("queue_id") = 0,
py::arg("pad_value") = std::nullopt
Member

how do you choose between 0 and nullptr here?

Contributor Author

I set the default directly when it is an integer and there is no type ambiguity. pad_value actually accepts both float and int and interprets the value based on the datatype of the tensor: for a bfloat16 tensor the default is 0.0, but for a uint32 tensor the default is the integer 0. For memory_config, I read the configuration of the source tensor and match it on the output tensor if it is not specified.
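A minimal sketch of that default-resolution logic (the enum and function names are illustrative, not the actual binding code):

#include <cstdint>
#include <optional>
#include <variant>

using PadValue = std::variant<uint32_t, float>;

enum class DataType { BFLOAT16, UINT32 };  // illustrative subset of dtypes

// When no pad_value is supplied, pick the default based on the tensor's datatype:
// integer 0 for integer tensors, 0.0 for floating-point tensors.
PadValue resolve_pad_value(const std::optional<PadValue>& pad_value, DataType dtype) {
    if (pad_value.has_value()) {
        return *pad_value;
    }
    return dtype == DataType::UINT32 ? PadValue{uint32_t{0}} : PadValue{0.0f};
}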

//
// SPDX-License-Identifier: Apache-2.0

typedef std::variant<uint32_t, float> PadValue;
Member

using PadValue = std::variant<uint32_t, float>;

} // namespace operations::data_movement

constexpr auto reshape =
ttnn::register_operation<"ttnn::reshape", ttnn::operations::data_movement::ReshapeViewOperation>();
constexpr auto reshape = ttnn::register_operation<"ttnn::reshape", ttnn::operations::data_movement::ReshapeViewOperation>();
Member

Want to double-check: I did not see launch_op, should this be register_operation_with_auto_launch_op?

Contributor Author

I have operation::run in reshape.cpp line 246. I don't always run the device part of the operation, since metadata-only changes do not go on device, so I want to be able to control whether the operation runs or not. Maybe my understanding of how operation registration works is wrong?

input_volume = input_volume * input_shape[i];
uint32_t inferred_dim = -1;
for (uint32_t i=0; i< shape.rank(); i++) {
if (((int)(shape[i])) == -1) {
Member

please avoid c-style casts. prefer c++ casts like static_cast, dynamic_cast, reinterpret_cast
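For example, the inferred-dimension check above could be written with a C++ cast (a sketch of the suggested style, not the PR's final code):

#include <cstdint>
#include <vector>

// Return the index of the inferred (-1) dimension, or -1 if there is none.
int find_inferred_dim(const std::vector<uint32_t>& shape) {
    for (uint32_t i = 0; i < shape.size(); i++) {
        if (static_cast<int32_t>(shape[i]) == -1) {
            return static_cast<int>(i);
        }
    }
    return -1;
}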

for (int i = 0; i < input_shape.rank(); i++) {
input_volume = input_volume * input_shape[i];
uint32_t inferred_dim = -1;
for (uint32_t i=0; i< shape.rank(); i++) {
Member

looks like clang format was not run here

}
// Perform a reshape (view)
//Perform a reshape (view)
return tensor.reshape(shape);
Member

How soon will we actually have separate view and reshape calls so it's really clear when the cost is guaranteed to be ~0?

Contributor Author

There is a separate PR right now for metadata-only reshapes, which will cause PCC issues if used incorrectly. However, I would like to consolidate the guidance to always use ttnn.reshape, which does the view when it is safe to do so. Maybe I can add a can_do_view operation that tells the model writer in advance whether it is going to be a view? I have also provided clear guidance in the pybind stating when a 0-cost view is happening.
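A hypothetical sketch of what such a can_do_view check could look like for row-major tensors (the name and the criteria are assumptions drawn from this thread, not an existing ttnn API):

#include <cstdint>
#include <vector>

// Hypothetical predicate: true when a reshape could plausibly be a metadata-only view.
// Assumes row-major layout; a change of memory config (L1<->DRAM, sharded<->interleaved)
// always requires data movement, as discussed above.
bool can_do_view_row_major(const std::vector<uint32_t>& old_shape,
                           const std::vector<uint32_t>& new_shape,
                           bool same_memory_config) {
    auto volume = [](const std::vector<uint32_t>& s) {
        uint64_t v = 1;
        for (uint32_t d : s) v *= d;
        return v;
    };
    if (!same_memory_config) return false;                     // moving between memories
    if (volume(old_shape) != volume(new_shape)) return false;  // not a valid reshape at all
    // Keeping the last (page) dimension unchanged means pages map one-to-one.
    return !old_shape.empty() && !new_shape.empty() && old_shape.back() == new_shape.back();
}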

}
}
}
auto override_runtime_args_callback = [reader_kernel_id, compute_with_storage_grid_size](
@ayerofieiev-tt (Member) Dec 4, 2024

move to a separate function?

Contributor Author

Almost every OP does it this way though, I wanted to maintain consistency.

Contributor

That style is deprecated now. For existing ops it's fine, but we do prefer this variant:

https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/adding_new_ttnn_operation.html

Comment on lines 18 to 42
#define max_packet_size 8192

template <uint32_t max_transfer_size, bool only_reads>
FORCE_INLINE void enhanced_noc_async_read(
const uint64_t src_noc_addr, const uint32_t dst_l1_addr, const uint32_t bytes) {
// If you do not know the max_transfer_size at compile time write 0 to it.
// only reads is true if we ONLY use noc_async_read and all calls to tt_memmove have use_read_datamover as True
if constexpr (((max_transfer_size < max_packet_size) && (max_transfer_size != 0)) || only_reads) {
noc_async_read_one_packet(src_noc_addr, dst_l1_addr, bytes);
} else {
noc_async_read(src_noc_addr, dst_l1_addr, bytes);
}
}

template <uint32_t max_transfer_size, bool only_writes>
FORCE_INLINE void enhanced_noc_async_write(
const uint32_t src_l1_addr, const uint64_t dst_noc_addr, const uint32_t bytes) {
// If you do not know the max_transfer_size at compile time write 0 to it.
// only writes is true if we ONLY use noc_async_read and all calls to tt_memmove have use_read_datamover as False
if constexpr (((max_transfer_size < max_packet_size) && (max_transfer_size != 0)) || only_writes) {
noc_async_write_one_packet(src_l1_addr, dst_noc_addr, bytes);
} else {
noc_async_write(src_l1_addr, dst_noc_addr, bytes);
}
}
Contributor

Please do not add a new hardcoded define for max_packet_size. We already have NOC_MAX_BURST_SIZE.

We also already updated some apis to do this sort of thing, ex noc_async_write is already templated to support using the one packet api if you specify a size less than the max transfer size. You could instead copy how noc_async_write was templated for noc_async_read.

Contributor Author

It has the max-packet-size support but not the only_writes support, which gives us a bit more perf since some of the data-mover ops only have a reader or a writer in a kernel. I think we should keep it as is and migrate when the only-writes optimization is also there. I am changing the max packet size to NOC_MAX_BURST_SIZE though.

Contributor

We didn't expose only-read/only-write flags in the original noc APIs because initial testing seemed to show it was mostly functional, but there were some kernels that appeared to hang even though they should have been issuing only reads/writes, so it was never merged, since it risks users running into issues. It's fine if you want to use/expose it outside of dataflow_api, but you should be careful about exposing a common API for users that hasn't been fully tested. E.g., GS does not support > 8K txns at all, but that is not handled correctly here to ignore the flag for GS.

My suggestion here is to remove your logic regarding max transfer size, since it is already handled in noc_async_write, and instead pass the value down.

Ex

template <uint32_t max_transfer_size, bool only_writes>
FORCE_INLINE void enhanced_noc_async_write(
    const uint32_t src_l1_addr, const uint64_t dst_noc_addr, const uint32_t bytes) {
    // If you do not know the max_transfer_size at compile time write 0 to it.
    // only writes is true if we ONLY use noc_async_read and all calls to tt_memmove have use_read_datamover as False
    if constexpr (only_writes) {
        noc_async_write_one_packet(src_l1_addr, dst_noc_addr, bytes);
    } else {
        noc_async_write<max_transfer_size>(src_l1_addr, dst_noc_addr, bytes);
    }
}

}
}

template <bool guaranteed_16B_alligned, bool copy_async, bool use_read_datamover, uint32_t max_transfer_size>
Contributor

A number of typos: "alligned".

Contributor Author

Lol, English not being my first language is showing

FORCE_INLINE void tt_memmove(const uint32_t dst_l1_addr, const uint32_t src_l1_addr, const uint32_t bytes) {
//Function performs a memory copy between two l1 addresses in the local core
//Uses noc_async_read when possible to copy the data over
//Set guaranteed 16B alligned to true if the source and destination are externally guaranteed to be 16B alligned (dangerous)
Contributor

You should also not be hardcoding alignment values like 16 and 64, since there's no guarantee they're always those values across arches; these can be queried and passed from the host as compile-time args, or accessed in kernels, e.g. L1_ALIGNMENT.
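For illustration, a hedged sketch of rounding with the kernel-side constant instead of a hardcoded 16 (the fallback define only keeps this snippet self-contained; in a real kernel L1_ALIGNMENT comes from the tt-metal headers and is arch-dependent):

#include <cstdint>

#ifndef L1_ALIGNMENT
#define L1_ALIGNMENT 16  // fallback for this sketch only
#endif

// Round a byte count up to the L1 alignment boundary.
constexpr uint32_t align_up_to_l1(uint32_t bytes) {
    return (bytes + L1_ALIGNMENT - 1) & ~static_cast<uint32_t>(L1_ALIGNMENT - 1);
}

static_assert(align_up_to_l1(1) == L1_ALIGNMENT, "rounds up to the alignment boundary");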

Contributor Author

This is a bit annoying to fix, as I am also using a MASK and an OFFSET that are hardcoded to match the 16 and 64 values respectively, and they are used in other files as well, since this is the common code for all data-mover OPs. I can't take the value as a compile-time arg because this is common code. However, because it is in the common code, updating it if a new arch uses a new alignment should not be very difficult.

Contributor Author

Come to think of it, I think the cleanest fix would be to define a DDR_ALIGNMENT at the same location where L1_ALIGNMENT is defined, set automatically at compile time depending on the ARCH type. The OPs would then use that constant instead, and it would be updated automatically to match the architecture. However, that would not be the solution for this PR.

@jvegaTT (Contributor Author) Dec 4, 2024

Would it be acceptable to let this merge in as is for now? I will then post an issue so I can fix it in a cleaner way. This is just in the interest of getting this in as soon as possible, since the change would alter multiple files and require a new function, and I am not comfortable doing that without re-running the pipelines, which take hours.

Contributor

DRAM_ALIGNMENT is also already defined (is this what you meant by DDR_ALIGNMENT?). It's fine if you want to address it later; I'm not blocking the PR on any of these comments, but I wanted to make sure this was considered, as there is currently work being done regarding which alignments are used.

Contributor Author

Oh cool, good to know. Yeah, I will do a subsequent PR, since I do want to avoid having to revisit alignment when the next arch comes in.

@jvegaTT requested a review from tt-aho December 4, 2024 22:37
@@ -119,7 +119,7 @@ def test_perf_device_bare_metal(batch_size, expected_perf):
margin = 0.03

if is_wormhole_b0():
expected_perf = 4155.25
expected_perf = 3990.0
@davorchap (Collaborator) Dec 4, 2024

is this a perf drop or increase (lower or higher is better)?

@jvegaTT (Contributor Author) Dec 4, 2024

Perf increased, but the nature of the test makes it report a decrease. The device test uses tracy to record how long the model runs on the device. We are now running the really expensive on-host reshapes (which waste time moving data in and out of the device) on device instead, which is a significant end-to-end perf improvement. However, that also means device runtime goes up, since the device is doing more and the host is doing less, and the very expensive host-side memory copy that was removed is not reflected in this metric. That is why we also ran full-model experiments and saw that we significantly improved perf when taken holistically. We do need to update these targets to reflect the new device runtime though.

Collaborator

freaking excellent, great job @jvegaTT !

@davorchap (Collaborator) commented Dec 5, 2024

37 commits :) , squash to 1 commit?

e.g.,
"#issue : beast mode re-shape re-write for perf improvements and 98.57% pass rate"

@jvegaTT (Contributor Author) commented Dec 5, 2024

37 commits :) , squash to 1 commit?

e.g., "#issue : beast mode re-shape re-write for perf improvements and 98.57% pass rate"

Yup, doing a squash and merge now

@jvegaTT merged commit 3389120 into main Dec 5, 2024
9 checks passed
@jvegaTT deleted the jvega/reshape_rm_on_device branch December 5, 2024 02:37
@ntarafdar (Contributor)

37 commits :) , squash to 1 commit?

e.g.,

"#issue : beast mode re-shape re-write for perf improvements and 98.57% pass rate"

This would be an amazing commit message

yieldthought pushed a commit that referenced this pull request Dec 13, 2024