#15572: Rewrite of Reshape OP from scratch
Conversation
…nly ever called on 3D shapes
Sweep tests are showing 98.57% with no hangs on this branch for reshape.
namespace tt::data_movement::common {

template <bool guaranteed_16B_alligned, bool copy_async>
FORCE_INLINE
void tt_memmove (
This is great. Can you announce this utility function to TT developers? It might be useful for others.
ttnn::pybind_overload_t{
[](const data_movement_operation_t& self,
const ttnn::Tensor& input_tensor,
const ttnn::SmallVector<int32_t> shape) -> ttnn::Tensor { return self(input_tensor, shape); },
const ttnn::SmallVector<int32_t> shape,
const std::optional<MemoryConfig> &memory_config,
Amazing! Is there a condition for the 0-cost view that this is left as null? In the case where we want to go from L1 to DRAM or vice versa, or sharded to interleaved or vice versa, it will require a data movement.
Oh true, just added it
…t-metal into jvega/reshape_rm_on_device
To highlight the prevalence of host fallback: according to sweeps, 11% of all sweep tests from the models team required fallback on host in the implementation currently on main. This is after the previously merged view operation eliminated many of them.
# issue 15048
def test_broken_reshape(device):
this works now, right? "test_broken_reshape" is such a confusing name :D
I just copied the code from the issue; I am going to rename the test to test_previously_failing_test.
py::kw_only(),
py::arg("memory_config") = std::nullopt,
py::arg("queue_id") = 0,
py::arg("pad_value") = std::nullopt
how do you choose between 0 and nullptr here?
I set the default directly when it is an integer and there is no type ambiguity. In the case of pad_value, it actually accepts either a float or an int and interprets it based on the datatype of the tensor: if it is a bfloat16 tensor, the default is 0.0, but in a uint32 tensor the default is the integer 0. In the case of memory_config, I read in the configuration of the source tensor and match it to the output tensor if it is not specified.
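A minimal sketch of the default-resolution behavior described above, using simplified stand-in types (DataType, MemoryConfig, TensorLike) rather than the actual ttnn classes; the helper names are assumptions for illustration, not the PR's code:

```cpp
#include <optional>
#include <variant>

// Stand-in types; the real ttnn Tensor / MemoryConfig / DataType are richer.
enum class DataType { BFLOAT16, FLOAT32, UINT32, INT32 };
struct MemoryConfig { /* layout, buffer type, ... */ };
struct TensorLike {
    DataType dtype;
    MemoryConfig memory_config;
};

using PadValue = std::variant<uint32_t, float>;

// If pad_value is not supplied, pick 0.0f for floating-point tensors and 0u for integer tensors.
PadValue resolve_pad_value(const TensorLike& input, const std::optional<PadValue>& pad_value) {
    if (pad_value.has_value()) {
        return *pad_value;
    }
    const bool is_float = input.dtype == DataType::BFLOAT16 || input.dtype == DataType::FLOAT32;
    return is_float ? PadValue{0.0f} : PadValue{0u};
}

// If memory_config is not supplied, mirror the input tensor's configuration.
MemoryConfig resolve_memory_config(const TensorLike& input, const std::optional<MemoryConfig>& memory_config) {
    return memory_config.value_or(input.memory_config);
}
```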
//
// SPDX-License-Identifier: Apache-2.0
typedef std::variant<uint32_t, float> PadValue; |
using PadValue = std::variant<uint32_t, float>;
} // namespace operations::data_movement
constexpr auto reshape =
    ttnn::register_operation<"ttnn::reshape", ttnn::operations::data_movement::ReshapeViewOperation>();
constexpr auto reshape = ttnn::register_operation<"ttnn::reshape", ttnn::operations::data_movement::ReshapeViewOperation>();
Want to double-check: I did not see launch_op, should this be register_operation_with_auto_launch_op?
I have operation::run in reshape.cpp line 246. I don't always run the device part of the operation, since metadata-only changes do not go on device; I want to be able to control whether the operation will be run or not. Maybe my understanding of how operation registering works is wrong?
input_volume = input_volume * input_shape[i];
uint32_t inferred_dim = -1;
for (uint32_t i=0; i< shape.rank(); i++) {
if (((int)(shape[i])) == -1) {
Please avoid C-style casts; prefer C++ casts like static_cast, dynamic_cast, reinterpret_cast.
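For illustration, a self-contained version of the quoted check with the C-style cast replaced by static_cast (std::vector stands in for the real shape type here):

```cpp
#include <cstdint>
#include <vector>

// Returns the index of the first -1 entry (the dimension to infer), or -1 if none is present.
int32_t find_inferred_dim(const std::vector<uint32_t>& shape) {
    for (uint32_t i = 0; i < shape.size(); i++) {
        if (static_cast<int32_t>(shape[i]) == -1) {
            return static_cast<int32_t>(i);
        }
    }
    return -1;
}
```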
for (int i = 0; i < input_shape.rank(); i++) {
input_volume = input_volume * input_shape[i];
uint32_t inferred_dim = -1;
for (uint32_t i=0; i< shape.rank(); i++) {
Looks like clang-format was not run here.
}
// Perform a reshape (view)
//Perform a reshape (view)
return tensor.reshape(shape);
How soon will we actually have separate view and reshape calls, so it's really clear when the cost is guaranteed to be ~0?
There is a separate PR right now for metadata-only reshapes that will cause PCC issues if incorrectly used. However, I would like to consolidate the guidance to always use ttnn.reshape, which does the view when it is safe to do so. Maybe I can add a can_do_view operation which will tell the model writer in advance whether it is going to be a view or not? I have also provided clear guidance in the pybind to state when a 0-cost view is happening.
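A hypothetical sketch of what such a can_do_view helper could look like; this is not part of the PR, and the real conditions for a zero-cost view (layout, padding, sharding, memory config) are more involved than this illustration. It also folds in the memory-config condition discussed earlier in the thread (L1 vs DRAM, sharded vs interleaved):

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <optional>
#include <vector>

// Simplified stand-ins for the real tensor metadata.
struct MemoryConfigLike {
    bool is_sharded = false;
    bool is_l1 = false;
    bool operator==(const MemoryConfigLike& other) const {
        return is_sharded == other.is_sharded && is_l1 == other.is_l1;
    }
};

struct TensorMeta {
    std::vector<uint32_t> shape;
    MemoryConfigLike memory_config;
};

static uint64_t volume(const std::vector<uint32_t>& shape) {
    return std::accumulate(shape.begin(), shape.end(), uint64_t{1}, std::multiplies<uint64_t>());
}

// True when the reshape could plausibly be a metadata-only view: same volume,
// unchanged innermost dimension, and no change of memory configuration.
bool can_do_view(
    const TensorMeta& input,
    const std::vector<uint32_t>& new_shape,
    const std::optional<MemoryConfigLike>& requested_config) {
    const bool same_volume = volume(input.shape) == volume(new_shape);
    const bool same_inner_dim = !input.shape.empty() && !new_shape.empty() &&
                                input.shape.back() == new_shape.back();
    const bool same_config = !requested_config.has_value() || *requested_config == input.memory_config;
    return same_volume && same_inner_dim && same_config;
}
```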
}
}
}
auto override_runtime_args_callback = [reader_kernel_id, compute_with_storage_grid_size](
move to a separate function?
Almost every OP does it this way though; I wanted to maintain consistency.
That style is deprecated now. For existing ops it's fine, but we do prefer this variant:
https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/adding_new_ttnn_operation.html
#define max_packet_size 8192

template <uint32_t max_transfer_size, bool only_reads>
FORCE_INLINE void enhanced_noc_async_read(
    const uint64_t src_noc_addr, const uint32_t dst_l1_addr, const uint32_t bytes) {
    // If you do not know the max_transfer_size at compile time write 0 to it.
    // only_reads is true if we ONLY use noc_async_read and all calls to tt_memmove have use_read_datamover as True
    if constexpr (((max_transfer_size < max_packet_size) && (max_transfer_size != 0)) || only_reads) {
        noc_async_read_one_packet(src_noc_addr, dst_l1_addr, bytes);
    } else {
        noc_async_read(src_noc_addr, dst_l1_addr, bytes);
    }
}

template <uint32_t max_transfer_size, bool only_writes>
FORCE_INLINE void enhanced_noc_async_write(
    const uint32_t src_l1_addr, const uint64_t dst_noc_addr, const uint32_t bytes) {
    // If you do not know the max_transfer_size at compile time write 0 to it.
    // only_writes is true if we ONLY use noc_async_write and all calls to tt_memmove have use_read_datamover as False
    if constexpr (((max_transfer_size < max_packet_size) && (max_transfer_size != 0)) || only_writes) {
        noc_async_write_one_packet(src_l1_addr, dst_noc_addr, bytes);
    } else {
        noc_async_write(src_l1_addr, dst_noc_addr, bytes);
    }
}
Please do not add a new hardcoded define for max_packet_size. We already have NOC_MAX_BURST_SIZE.
We also already updated some APIs to do this sort of thing, e.g. noc_async_write is already templated to support using the one-packet API if you specify a size less than the max transfer size. You could instead copy how noc_async_write was templated for noc_async_read.
It has the max packet size support but not the only_writes support, which gives us just a bit more perf, as some of the data mover OPs only have a reader or a writer in a kernel. I think we should keep it as is but will migrate when the only-writes optimization is also there. I am changing the max packet size to NOC_MAX_BURST_SIZE though.
We didn't expose only-read/only-write flags in the original NOC APIs because initial testing seemed to show it was mostly functional, but there were some kernels that seemed to hang even though they should have only been issuing reads/writes, so it was never merged, since it risks users running into issues. It's fine if you want to use/expose it outside of dataflow_api, but you should be careful about exposing a common API for users that hasn't been fully tested. E.g. GS does not support > 8K transactions at all, but that is not handled correctly here to ignore the flag for GS.
My suggestion here is to remove your logic regarding max transfer size, since it is already handled in noc_async_write, and instead pass the value down.
Ex:
template <uint32_t max_transfer_size, bool only_writes>
FORCE_INLINE void enhanced_noc_async_write(
    const uint32_t src_l1_addr, const uint64_t dst_noc_addr, const uint32_t bytes) {
    // If you do not know the max_transfer_size at compile time write 0 to it.
    // only_writes is true if we ONLY use noc_async_write and all calls to tt_memmove have use_read_datamover as False
    if constexpr (only_writes) {
        noc_async_write_one_packet(src_l1_addr, dst_noc_addr, bytes);
    } else {
        noc_async_write<max_transfer_size>(src_l1_addr, dst_noc_addr, bytes);
    }
}
}
}

template <bool guaranteed_16B_alligned, bool copy_async, bool use_read_datamover, uint32_t max_transfer_size>
A number of typos: "alligned".
Lol, English not being my first language is showing
FORCE_INLINE void tt_memmove(const uint32_t dst_l1_addr, const uint32_t src_l1_addr, const uint32_t bytes) {
    //Function performs a memory copy between two l1 addresses in the local core
    //Uses noc_async_read when possible to copy the data over
    //Set guaranteed 16B alligned to true if the source and destination are externally guaranteed to be 16B alligned (dangerous)
You should also not be hardcoding alignment values like 16 and 64, since there's no guarantee they're always that value across arches; these can be queried and passed from host as compile-time args, or accessed in kernels, e.g. L1_ALIGNMENT.
This is a bit annoying to fix, as I am also using MASK and OFFSET values that are hardcoded to match alignments of 16 and 64 respectively, and they are also used in other files since this is the common code for all data mover OPs. I can't take the value as a compile-time arg because this is in common code. However, because it is in common code, I think updating it should a new arch use a new alignment would not be very difficult.
Come to think of it, I think the cleanest fix would be to define a DDR_ALIGNMENT at the same location where L1_ALIGNMENT is defined, set automatically at compile time depending on the ARCH type. The OPs would then use that constant value instead, which is auto-updated to match the architecture. However, that would not be the solution for this PR.
Would it be acceptable to let this merge in as is for now? I will then post an issue so I can fix it in a cleaner way. This is just in the interest of getting this in as soon as possible, since the change would alter multiple files and require a new function, and I am not comfortable doing that without re-running the pipelines, which take hours.
DRAM_ALIGNMENT is also already defined (is this what you meant by DDR_ALIGNMENT?). It's fine if you want to address it later; I'm not blocking the PR on any of these comments, but I wanted to make sure this was considered, as there is currently work being done regarding which alignments are used.
Oh cool, good to know. Yeah, I will do a subsequent PR, since I do want to avoid having to revisit alignment when the next arch comes in.
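A small sketch of the direction discussed in this thread: deriving the mask/offset from an architecture-provided alignment constant instead of hardcoding 16. The constant names L1_ALIGNMENT / DRAM_ALIGNMENT come from the discussion above; the helper below is illustrative only, not the PR's code:

```cpp
#include <cstdint>

// Stand-in for the architecture-provided constant (L1_ALIGNMENT / DRAM_ALIGNMENT in kernels).
constexpr uint32_t kAlignment = 16;

// Mask selecting the offset of an address within one alignment window.
constexpr uint32_t kAlignmentMask = kAlignment - 1;

// True when both addresses share the same offset within an alignment window,
// i.e. when an aligned bulk-copy path (such as the noc_async_read-based memmove) can be used.
constexpr bool same_alignment_offset(uint32_t src_l1_addr, uint32_t dst_l1_addr) {
    return (src_l1_addr & kAlignmentMask) == (dst_l1_addr & kAlignmentMask);
}
```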
@@ -119,7 +119,7 @@ def test_perf_device_bare_metal(batch_size, expected_perf):
    margin = 0.03

    if is_wormhole_b0():
        expected_perf = 4155.25
        expected_perf = 3990.0
is this a perf drop or increase (lower or higher is better)?
Perf increased, but the nature of the test makes it report a decrease. The device test currently uses Tracy to record how long the model runs on the device. However, we are now making the really expensive on-host reshapes (which waste time moving data in and out of the device) run on device instead, which gives a significant end-to-end perf improvement. This also means device run-time goes up, since the device is doing more and the host is doing less, and the very expensive host memory copy is no longer reflected in the measurement. That is why we also ran full-model experiments and saw that we significantly improved perf when taken holistically. We do, however, need to update these targets to reflect the new device runtime.
freaking excellent, great job @jvegaTT !
37 commits :), squash to 1 commit? e.g.,
Yup, doing a squash and merge now
This would be an amazing commit message
Ticket
Solves the following issues (and potentially others, still going through and testing opened issues):
#13889
#12153
#14676
#15048
#15269
#14515
#14513
#14972
#15524
Problem description
The Reshape OP was re-written from scratch and has no host fallback for legal uses.
Previous issues were:
A lack of higher-dimensionality support. I built a dimensionality reducer which shows that for any arbitrary ND shape there exists a 3D shape in tile layout, or a 2D shape in row major, where conversion between that shape and the original does not require data movement (metadata only). We now do an ND->MD reshape as ND->3D->reshape->3D->MD in tiled layout and ND->2D->reshape->2D->MD in row major, where we select the two intermediate 2/3D shapes such that ND->2/3D and 2/3D->MD are metadata-only changes (see the sketch after this list). This allows us to only have to support 3D->3D reshapes in tile and 2D->2D in row major in the kernel, and we get generality for free.
A PCC issue caused by an invalid optimization which would produce a metadata-only reshape even when some of the data should have been moved. In this case the data corruption was small but not zero, so it often ended up hidden in most invalid calls. This has been removed, which hurt perf a bit, but it is a necessary change.
The reshape function has been generalized to work in every case, removing the need for the reshape-on-host fallback for valid cases.
The old implementation unnecessarily needed a CB large enough to hold the entire tensor. The new version only needs to allocate CBs whose total size is page_size_of_input + page_size_of_output + a small constant. This enables the reshape of very large tensors that almost fully occupy L1. It also allows us to parallelize reading and writing by only adding barriers when needed, rather than a big barrier between the reader and the writer kernels.
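A minimal sketch of the ND->3D collapse for tiled tensors described in the first item above; the collapse rule and the helper name are illustrative assumptions rather than the PR's actual code. The idea: fold every dimension above the last two into one, which is a metadata-only change for a tile-ordered tensor, so the kernel only ever sees 3D->3D.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Collapse an ND shape to 3D by folding all leading dims into one, keeping the last two dims.
std::vector<uint32_t> collapse_to_3d(const std::vector<uint32_t>& shape) {
    if (shape.size() <= 3) {
        return shape;
    }
    const uint64_t folded = std::accumulate(
        shape.begin(), shape.end() - 2, uint64_t{1}, std::multiplies<uint64_t>());
    return {static_cast<uint32_t>(folded), shape[shape.size() - 2], shape.back()};
}

// Example: {2, 3, 4, 32, 32} -> {24, 32, 32}; the ND<->3D conversions on either side
// of the actual reshape are then metadata-only.
```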
What's changed
Complete re-write of reshape OP - Sweep pass rate is 98.57%
Checklist
- [x] Post commit CI passes: https://github.com/tenstorrent/tt-metal/actions/runs/12164049672
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable): https://github.com/tenstorrent/tt-metal/actions/runs/12166706962 https://github.com/tenstorrent/tt-metal/actions/runs/12157327133/job/33903077970
- [ ] New/Existing tests provide coverage for changes