#15171: Better parallelization strategy #15172
Conversation
I think you should run the whole bag of pipelines, like MCW does for their models, along with post-commit. If you're already doing that, sorry for the noise.
I already run all of these except nightly fast dispatch (on previous versions of the branch).
I have posted what is wrong here: #15144. For now, only mamba should be deterministically failing in nightly FD; otherwise, the other jobs in that pipeline are non-deterministic. ttnn integration tests for GS, N150, N300 should deterministically pass.
Force-pushed from b848fb5 to b0f0830.
    pad_and_fold_conv_activation_for_unity_stride,
)
from typing import List
from loguru import logger
from tests.ttnn.utils_for_testing import assert_with_pcc


def get_core_grid_from_num_cores(num_cores: int, grid_rows: int, grid_cols: int):
If the same function is defined in multiple places, maybe define it in a common location and import it.
Yeah, let's move this to a common utilities location -- I think we should already have such functions.
done
# and reshard that follows first conv fails with padding ATM.
if is_grayskull():
    compute_grid = device.compute_with_storage_grid_size()
    parallel_config.grid = get_core_grid_from_num_cores(98, compute_grid.x, compute_grid.y)
Maybe mention 98 or a reduction of 10 in the comment.
Also, I don't know what you mean by "first convs", and the grammar for "would got" is not correct.
Removed the hard-coded num cores.
@@ -349,7 +402,7 @@ static TensorMemoryLayout select_shard_spec(
     // Prefer block sharding over height sharding but make sure that we got at least
     // some blocking on width dimension as well.
-    if (cc_height > max_cc || (cc_height == max_cc && cc_height <= compute_grid_size.x)) {
+    if ((cc_height > max_cc && max_cc < 48) || (cc_height == max_cc && cc_height <= compute_grid_size.x)) {
Maybe put 48 into a variable with a descriptive name. I don't know the importance of 48 and why it is chosen.
done
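For illustration, a minimal sketch of what naming that threshold could look like; the constant's name and the helper wrapping the condition are hypothetical, not what was actually merged:

```cpp
#include <cstdint>

// Hypothetical name for the literal 48 from the diff above; the merged code may
// use a different name, value, or placement.
constexpr uint32_t kMaxCoresBeforePreferringBlockSharding = 48;

// Stand-in for the heuristic shown in the diff: height sharding is chosen when it
// uses strictly more cores than the current best and the current best is still
// below the threshold, or when it ties and does not exceed the grid's x dimension.
bool prefer_height_sharding(uint32_t cc_height, uint32_t max_cc, uint32_t compute_grid_x) {
    return (cc_height > max_cc && max_cc < kMaxCoresBeforePreferringBlockSharding) ||
           (cc_height == max_cc && cc_height <= compute_grid_x);
}
```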
[1, 816, 816, 19, 19, 5, 5, 1, 1, 2, 2, 816, False, 1],  # 373
[1, 816, 816, 23, 23, 5, 5, 2, 2, 0, 0, 816, False, 1],  # 374
[1, 960, 960, 24, 24, 5, 5, 1, 1, 2, 2, 960, False, 1],  # 394
[1, 960, 960, 27, 27, 5, 5, 2, 2, 0, 0, 960, False, 1],  # 395
Nice!!
Force-pushed from 20b3639 to 632ec64.
Less code and the model is faster 🔥 Amazing!!
@@ -125,6 +119,7 @@ sliding_window::ParallelConfig determine_parallel_config(
     uint32_t output_channels,
     const CoreCoord& compute_grid_size,
     ShardOrientation block_shard_orientation,
+    bool enable_channels_padding,
Let’s avoid adding new bool flags like this.
https://github.com/tenstorrent/tt-metal/blob/main/contributing/BestPractices.md#14-avoid-bool-arguments-in-apis
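For reference, the linked guideline usually translates into a scoped enum in place of the positional bool, so the intent stays visible at the call site. A minimal sketch with hypothetical names, not necessarily what this PR ended up doing:

```cpp
#include <cstdint>

// Hypothetical replacement for the `bool enable_channels_padding` parameter.
enum class ChannelsPadding : uint8_t { Disabled, Enabled };

// Sketch of the signature with the enum in place of the bool. With a bool, the
// call site reads `determine_parallel_config(..., true)` and the reader has to
// look up what `true` toggles; with the enum it reads
// `determine_parallel_config(..., ChannelsPadding::Enabled)`.
void determine_parallel_config_sketch(uint32_t output_channels,
                                      ChannelsPadding channels_padding);
```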
@@ -24,8 +25,11 @@ namespace optimized_conv_op_utils {
 using namespace tt;
 using namespace tt::tt_metal;

-std::pair<std::vector<uint32_t>, std::vector<uint32_t>> compute_opt_conv_activation_as_mm_shape(const tt::tt_metal::LegacyShape& conv_activation_shape, ttnn::operations::sliding_window::SlidingWindowConfig sliding_window_config, uint32_t act_block_h_ntiles) {
+std::pair<std::vector<uint32_t>, std::vector<uint32_t>> compute_opt_conv_activation_as_mm_shape(
These vectors are Shapes?
Consider if returning SimpleShape here makes sense
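A sketch of the shape-returning variant the comment suggests; the stand-in Shape type, and whether the PR adopted this, are assumptions:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Minimal stand-in for the project's shape type (e.g. ttnn::SimpleShape),
// used only so this sketch compiles on its own.
struct Shape {
    std::vector<uint32_t> dims;
};

// Instead of std::pair<std::vector<uint32_t>, std::vector<uint32_t>> as in the
// diff above, the function would hand back two shape objects, making the meaning
// of each element explicit to callers. Declaration only; the real computation
// stays in optimized_conv_op_utils.
std::pair<Shape, Shape> compute_opt_conv_activation_as_mm_shape(
    const Shape& conv_activation_shape, uint32_t act_block_h_ntiles);
```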
Force-pushed from 4d0a28a to 33fbf4f.
Force-pushed from 33fbf4f to 8ccf99e.
This change is motivated by improving pass rates on ttnn torch traces.
In conv2d we deal with lots of out-of-memory issues.
One class of such problems is conv2d mapping its internal work items to tensix cores.
This change improves on that by padding work items up to a number that is easier to distribute.
Adjusting models by overriding the parallelisation strategy was necessary in places, since changing the number of cores on which conv2d executes can open a Pandora's box of other ops failing (mostly DM).
This change also affects max_pool2d and transposed_conv2d, as they use the same utility methods for determining the parallelisation strategy.
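As a rough illustration of the padding idea (not the PR's actual implementation, which lives in determine_parallel_config and related utilities), rounding a work-item count up to a multiple of the available core count lets the work split evenly:

```cpp
#include <cstdint>

// Illustrative helper only: round `value` up to the nearest multiple of `multiple`.
constexpr uint32_t round_up(uint32_t value, uint32_t multiple) {
    return ((value + multiple - 1) / multiple) * multiple;
}

// Example with made-up numbers: 100 channel work items on an 8x8 (64-core) grid
// do not divide evenly, but padding to round_up(100, 64) = 128 gives every core
// exactly 2 items, trading a little padded work for an even distribution.
static_assert(round_up(100, 64) == 128, "padding example");
```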