
#15171: Better parallelization strategy #15172

Merged: 1 commit into main on Nov 27, 2024

Conversation

@pavlejosipovic (Contributor) commented Nov 18, 2024

This change is motivated by improving pass rates on ttnn torch traces.
In conv2d we deal with a lot of out-of-memory issues.
One class of these problems comes from how conv2d maps its internal work items to tensix cores.
This change improves on that by padding the work items up to a number that is easier to distribute.

Adjusting models by overriding the parallelization strategy was necessary in places, since changing the number of cores on which conv2d executes can open a Pandora's box of other ops failing (mostly DM).
This change also affects max_pool2d and transposed_conv2d, as they use the same utility methods for determining the parallelization strategy.
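
As a rough illustration of the padding idea (a minimal sketch with illustrative names, not the actual conv2d code), the work-item count can be rounded up to the next multiple of the core count so every core receives an equal share:

def pad_work_items(num_work_items: int, num_cores: int) -> int:
    # Round up to the nearest multiple of num_cores; the padded tail is wasted
    # work, but the items now split evenly across the grid.
    return ((num_work_items + num_cores - 1) // num_cores) * num_cores

# Example: 100 work items on a 64-core grid pad up to 128, i.e. 2 items per core.
assert pad_work_items(100, 64) == 128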

Ticket: #15171

@tt-rkim (Collaborator) commented Nov 18, 2024

I think you should run the whole bag of pipelines, like MCW does for their models

  • Single card device perf
  • Single card model perf (e2e)
  • Nightly fast dispatch
  • Single card demos

along with post commit

If you're already doing that, sorry for noise

@pavlejosipovic (Contributor, Author) replied:

> I think you should run the whole bag of pipelines, like MCW does for their models
>
>   • Single card device perf
>   • Single card model perf (e2e)
>   • Nightly fast dispatch
>   • Single card demos
>
> along with post commit
>
> If you're already doing that, sorry for noise

I already ran all of these except nightly fast dispatch (on previous versions of the branch), but the noise is heavy enough that I have a hard time figuring out what is just broken vs. what my change impacts.

@tt-rkim (Collaborator) commented Nov 18, 2024

I have posted what is wrong here: #15144

For now, only mamba should be deterministically failing in nightly FD. Otherwise, the other jobs in that pipeline are non-deterministic.

ttnn integration tests for GS, N150, and N300 should deterministically pass.

@pavlejosipovic force-pushed the pjosipovic/conv2d_better_parallelization branch from b848fb5 to b0f0830 on November 19, 2024 14:51
pad_and_fold_conv_activation_for_unity_stride,
)
from typing import List
from loguru import logger
from tests.ttnn.utils_for_testing import assert_with_pcc


def get_core_grid_from_num_cores(num_cores: int, grid_rows: int, grid_cols: int):
Contributor:

If the same function is defined in multiple places, maybe define it in a common location and import it.

Contributor:

Yeah, lets move this to a common utilities location -- I think we should already have such functions.

Contributor (Author):

done
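
For reference, a plausible sketch of the helper under discussion, assuming ttnn's Python bindings for CoreCoord, CoreRange, and CoreRangeSet (the body that ended up in the shared utilities module may differ):

import ttnn

def get_core_grid_from_num_cores(num_cores: int, grid_rows: int, grid_cols: int) -> ttnn.CoreRangeSet:
    # Fill complete rows of the grid first, then put any remainder on a partial row.
    rows = num_cores // grid_cols
    assert rows <= grid_rows, "Not enough cores for the requested grid"
    ranges = []
    if rows != 0:
        # Full rows: columns 0..grid_cols-1 on rows 0..rows-1.
        ranges.append(ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(grid_cols - 1, rows - 1)))
    remainder = num_cores % grid_cols
    if remainder != 0:
        assert rows + 1 <= grid_rows, "Not enough cores for the requested grid"
        # Partial row: columns 0..remainder-1 on row `rows`.
        ranges.append(ttnn.CoreRange(ttnn.CoreCoord(0, rows), ttnn.CoreCoord(remainder - 1, rows)))
    return ttnn.CoreRangeSet({*ranges})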

# and reshard that follows first conv fails with padding ATM.
if is_grayskull():
compute_grid = device.compute_with_storage_grid_size()
parallel_config.grid = get_core_grid_from_num_cores(98, compute_grid.x, compute_grid.y)
Contributor:

Maybe mention 98 or a reduction of 10 in the comment.

Also, I don't know what you mean by "first convs", and the grammar for "would got" is not correct.

Contributor (Author):

Removed hard coded num cores
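
For illustration only, a hypothetical way the override could have been expressed without the bare literal, naming the reduction the reviewer mentions (the merged code may differ):

if is_grayskull():
    compute_grid = device.compute_with_storage_grid_size()
    # Hypothetical: full grid minus a named reduction instead of a magic 98.
    core_count_reduction = 10  # illustrative value taken from the review discussion
    num_cores = compute_grid.x * compute_grid.y - core_count_reduction
    parallel_config.grid = get_core_grid_from_num_cores(num_cores, compute_grid.x, compute_grid.y)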

@@ -349,7 +402,7 @@ static TensorMemoryLayout select_shard_spec(

// Prefer block sharding over height sharding but make sure that we got at least
// some blocking on width dimension as well.
-    if (cc_height > max_cc || (cc_height == max_cc && cc_height <= compute_grid_size.x)) {
+    if ((cc_height > max_cc && max_cc < 48) || (cc_height == max_cc && cc_height <= compute_grid_size.x)) {
Contributor:

Maybe put 48 into a variable with a descriptive name. I don't know the importance of 48 and why it is chosen.

Contributor (Author):

done


[1, 816, 816, 19, 19, 5, 5, 1, 1, 2, 2, 816, False, 1], # 373
[1, 816, 816, 23, 23, 5, 5, 2, 2, 0, 0, 816, False, 1], # 374
[1, 960, 960, 24, 24, 5, 5, 1, 1, 2, 2, 960, False, 1], # 394
[1, 960, 960, 27, 27, 5, 5, 2, 2, 0, 0, 960, False, 1], # 395
Contributor:

Nice!!

@pavlejosipovic force-pushed the pjosipovic/conv2d_better_parallelization branch 4 times, most recently from 20b3639 to 632ec64 on November 25, 2024 17:20
@esmalTT (Contributor) left a comment:

Less code and model is faster 🔥 Amazing!!

@@ -125,6 +119,7 @@ sliding_window::ParallelConfig determine_parallel_config(
uint32_t output_channels,
const CoreCoord& compute_grid_size,
ShardOrientation block_shard_orientation,
bool enable_channels_padding,
@@ -24,8 +25,11 @@ namespace optimized_conv_op_utils {
using namespace tt;
using namespace tt::tt_metal;

std::pair<std::vector<uint32_t>, std::vector<uint32_t>> compute_opt_conv_activation_as_mm_shape(const tt::tt_metal::LegacyShape& conv_activation_shape, ttnn::operations::sliding_window::SlidingWindowConfig sliding_window_config, uint32_t act_block_h_ntiles) {

std::pair<std::vector<uint32_t>, std::vector<uint32_t>> compute_opt_conv_activation_as_mm_shape(
Member:

These vectors are Shapes?

Member:

Consider if returning SimpleShape here makes sense

@pavlejosipovic force-pushed the pjosipovic/conv2d_better_parallelization branch from 4d0a28a to 33fbf4f on November 27, 2024 10:01
@pavlejosipovic force-pushed the pjosipovic/conv2d_better_parallelization branch from 33fbf4f to 8ccf99e on November 27, 2024 10:46
@pavlejosipovic merged commit 236622e into main on Nov 27, 2024
9 checks passed
@pavlejosipovic deleted the pjosipovic/conv2d_better_parallelization branch on November 27, 2024 10:57