
#15171: Better parallelization strategy #15172

Merged: 1 commit into main on Nov 27, 2024

Conversation

@pavlejosipovic (Contributor) commented Nov 18, 2024

This change is motivated by improving pass rates on ttnn torch traces.
In conv2d we deal with a lot of out-of-memory issues.
One class of these problems comes from how conv2d maps its internal work items to tensix cores.
This change improves on that by padding the work items up to a number that is easier to distribute.

Adjusting models by overriding the parallelization strategy was necessary in places, since changing the number of cores on which conv2d executes can open a Pandora's box of other ops failing (mostly DM).
This change also affects max_pool2d and transposed_conv2d, as they use the same utility methods for determining the parallelization strategy.
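
As a rough illustration of the padding idea (a minimal sketch with illustrative names, not the actual conv2d code), the work-item count can be rounded up to the next multiple of the core count so every core receives an equal share:

def pad_work_items(num_work_items: int, num_cores: int) -> int:
    # Round up to the nearest multiple of num_cores; the padded tail is wasted
    # work, but the items now split evenly across the grid.
    return ((num_work_items + num_cores - 1) // num_cores) * num_cores

# Example: 100 work items on a 64-core grid pad up to 128, i.e. 2 items per core.
assert pad_work_items(100, 64) == 128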

Ticket: #15171

@tt-rkim (Collaborator) commented Nov 18, 2024

I think you should run the whole bag of pipelines, like MCW does for their models

  • Single card device perf
  • Single card model perf (e2e)
  • Nightly fast dispatch
  • Single card demos

along with post commit

If you're already doing that, sorry for noise

@pavlejosipovic (Contributor, Author) replied:

> I think you should run the whole bag of pipelines, like MCW does for their models
>
>   • Single card device perf
>   • Single card model perf (e2e)
>   • Nightly fast dispatch
>   • Single card demos
>
> along with post commit
>
> If you're already doing that, sorry for noise

I already ran all of these except nightly fast dispatch (on previous versions of the branch), but the noise is heavy enough that I have a hard time figuring out what is just broken vs. what my change impacts.

@tt-rkim (Collaborator) commented Nov 18, 2024

I have posted what is wrong here: #15144

For now, only mamba should be deterministically failing in nightly FD. Otherwise, the other jobs in that pipeline are non-deterministic.

ttnn integration tests for GS, N150, and N300 should deterministically pass.

@pavlejosipovic force-pushed the pjosipovic/conv2d_better_parallelization branch from b848fb5 to b0f0830 on November 19, 2024 14:51
pad_and_fold_conv_activation_for_unity_stride,
)
from typing import List
from loguru import logger
from tests.ttnn.utils_for_testing import assert_with_pcc


def get_core_grid_from_num_cores(num_cores: int, grid_rows: int, grid_cols: int):
Contributor:

If the same function is defined in multiple places, maybe define it in a common location and import it.

Contributor:

Yeah, lets move this to a common utilities location -- I think we should already have such functions.

Contributor (Author):

done
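
For reference, a plausible sketch of the helper under discussion, assuming ttnn's Python bindings for CoreCoord, CoreRange, and CoreRangeSet (the body that ended up in the shared utilities module may differ):

import ttnn

def get_core_grid_from_num_cores(num_cores: int, grid_rows: int, grid_cols: int) -> ttnn.CoreRangeSet:
    # Fill complete rows of the grid first, then put any remainder on a partial row.
    rows = num_cores // grid_cols
    assert rows <= grid_rows, "Not enough cores for the requested grid"
    ranges = []
    if rows != 0:
        # Full rows: columns 0..grid_cols-1 on rows 0..rows-1.
        ranges.append(ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(grid_cols - 1, rows - 1)))
    remainder = num_cores % grid_cols
    if remainder != 0:
        assert rows + 1 <= grid_rows, "Not enough cores for the requested grid"
        # Partial row: columns 0..remainder-1 on row `rows`.
        ranges.append(ttnn.CoreRange(ttnn.CoreCoord(0, rows), ttnn.CoreCoord(remainder - 1, rows)))
    return ttnn.CoreRangeSet({*ranges})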

# and reshard that follows first conv fails with padding ATM.
if is_grayskull():
compute_grid = device.compute_with_storage_grid_size()
parallel_config.grid = get_core_grid_from_num_cores(98, compute_grid.x, compute_grid.y)
Contributor:

Maybe mention 98 or a reduction of 10 in the comment.

Also, I don't know what you mean by "first convs", and the grammar for "would got" is not correct.

Contributor (Author):

Removed hard coded num cores
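
For illustration only, a hypothetical way the override could have been expressed without the bare literal, naming the reduction the reviewer mentions (the merged code may differ):

if is_grayskull():
    compute_grid = device.compute_with_storage_grid_size()
    # Hypothetical: full grid minus a named reduction instead of a magic 98.
    core_count_reduction = 10  # illustrative value taken from the review discussion
    num_cores = compute_grid.x * compute_grid.y - core_count_reduction
    parallel_config.grid = get_core_grid_from_num_cores(num_cores, compute_grid.x, compute_grid.y)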

@@ -349,7 +402,7 @@ static TensorMemoryLayout select_shard_spec(

// Prefer block sharding over height sharding but make sure that we got at least
// some blocking on width dimension as well.
-    if (cc_height > max_cc || (cc_height == max_cc && cc_height <= compute_grid_size.x)) {
+    if ((cc_height > max_cc && max_cc < 48) || (cc_height == max_cc && cc_height <= compute_grid_size.x)) {
Contributor:

Maybe put 48 into a variable with a descriptive name. I don't know the importance of 48 and why it is chosen.

Contributor (Author):

done


[1, 816, 816, 19, 19, 5, 5, 1, 1, 2, 2, 816, False, 1], # 373
[1, 816, 816, 23, 23, 5, 5, 2, 2, 0, 0, 816, False, 1], # 374
[1, 960, 960, 24, 24, 5, 5, 1, 1, 2, 2, 960, False, 1], # 394
[1, 960, 960, 27, 27, 5, 5, 2, 2, 0, 0, 960, False, 1], # 395
Contributor:

Nice!!

@pavlejosipovic force-pushed the pjosipovic/conv2d_better_parallelization branch 4 times, most recently from 20b3639 to 632ec64 on November 25, 2024 17:20
@esmalTT (Contributor) left a comment:

Less code and model is faster 🔥 Amazing!!

@@ -125,6 +119,7 @@ sliding_window::ParallelConfig determine_parallel_config(
uint32_t output_channels,
const CoreCoord& compute_grid_size,
ShardOrientation block_shard_orientation,
bool enable_channels_padding,
@@ -24,8 +25,11 @@ namespace optimized_conv_op_utils {
using namespace tt;
using namespace tt::tt_metal;

std::pair<std::vector<uint32_t>, std::vector<uint32_t>> compute_opt_conv_activation_as_mm_shape(const tt::tt_metal::LegacyShape& conv_activation_shape, ttnn::operations::sliding_window::SlidingWindowConfig sliding_window_config, uint32_t act_block_h_ntiles) {

std::pair<std::vector<uint32_t>, std::vector<uint32_t>> compute_opt_conv_activation_as_mm_shape(
Member:

These vectors are Shapes?

Member:

Consider if returning SimpleShape here makes sense

@pavlejosipovic force-pushed the pjosipovic/conv2d_better_parallelization branch from 4d0a28a to 33fbf4f on November 27, 2024 10:01
@pavlejosipovic force-pushed the pjosipovic/conv2d_better_parallelization branch from 33fbf4f to 8ccf99e on November 27, 2024 10:46
@pavlejosipovic merged commit 236622e into main on Nov 27, 2024
9 checks passed
@pavlejosipovic deleted the pjosipovic/conv2d_better_parallelization branch on November 27, 2024 10:57