Line/Ring All Gather Test Sweeps #10874

Open · 1 of 8 tasks · Tracked by #10873
SeanNijjar opened this issue Jul 30, 2024 · 0 comments
Labels: master, op_cat: ccl, Op Generalization, P1


SeanNijjar commented Jul 30, 2024

Summary (Copy For All CCL Ops)

This set of tasks should be completed for each CCL operation, though each op can progress independently.

For a given op, we improve test coverage by adding breadth (more shape and argument combinations) and depth (running the op's variations across more topologies and scale-out configurations). Together these form a 2D matrix of test coverage where each cell is itself a sweep over a multi-variable space.

Links to sub-tasks are tracked in the task list above. High-level info follows:

Testing Breadth

Priority Configs:

Prior to sweeping, the priority configs should be added and tested (a parametrization sketch follows the lists below).

From Llama 405B

  • line-all-gather: (4 chips, 3 links, input tensor (per chip) = [1,1,32,6.5*1024], dim=1 => input_tensor for test_case = [1,4,32,6.5*1024])
  • line-all-gather: (8 chips, {3,4} links, input tensor (per chip) = [1,1,32,2304], dim=1 => input_tensor for test_case = [1,8,32,2304])
  • line-all-gather: (8 chips, {3,4} links, input tensor (per chip) = [1,1,32,4k], dim=1 => input_tensor for test_case = [1,8,32,4k])
  • line-all-gather: (8 chips, {3,4} links, input tensor (per chip) = [1,1,8[padded to 32],4k], dim=2 => output shape (per chip) = [1,1,32,4k]) -> all-gather concatenates within tile
    • currently expected to fail as this feature is missing
  • line-all-gather: (8 chips, {3,4} links, input tensor (per chip) = [1,1,8[padded to 32],16k], dim=2 => output shape (per chip) = [1,1,32,16k]) -> all-gather concatenates within tile
    • currently expected to fail as this feature is missing

From Llama 70B

  • line-all-gather: (4 chips, 3 links, input tensor (per chip) = [1,1,32,3.5*1024], dim=1 => input_tensor for test_case = [1,4,32,(int)(3.5*1024)])
  • line-all-gather: (8 chips, {3,4} links, input tensor (per chip) = [1,1,32,1280], dim=1 => input_tensor for test_case = [1,8,32,1280])
  • line-all-gather: (8 chips, {3,4} links, input tensor (per chip) = [1,1,32,2048], dim=1 => input_tensor for test_case = [1,8,32,2048])
  • line-all-gather: (8 chips, {3,4} links, input tensor (per chip) = [1,1,8[padded to 32],2k], dim=2 => output shape (per chip) = [1,1,32,2k]) -> all-gather concatenates within tile
    • currently expected to fail as this feature is missing
  • line-all-gather: (8 chips, {3,4} links, input tensor (per chip) = [1,1,8[padded to 32],4k], dim=2 => output shape (per chip) = [1,1,32,4k]) -> all-gather concatenates within tile
    • currently expected to fail as this feature is missing
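
A minimal, illustrative sketch of how the dim=1 priority configs above could be parametrized in pytest (`run_line_all_gather` is a hypothetical placeholder, not the repo's actual harness):

```python
# Sketch only: dim=1 priority configs as pytest parameters.
import pytest

PRIORITY_CONFIGS = [
    # (num_chips, num_links, per_chip_shape, dim)
    # Llama 405B
    (4, 3, (1, 1, 32, int(6.5 * 1024)), 1),
    (8, 3, (1, 1, 32, 2304), 1),
    (8, 4, (1, 1, 32, 2304), 1),
    (8, 3, (1, 1, 32, 4096), 1),
    (8, 4, (1, 1, 32, 4096), 1),
    # Llama 70B
    (4, 3, (1, 1, 32, int(3.5 * 1024)), 1),
    (8, 3, (1, 1, 32, 1280), 1),
    (8, 4, (1, 1, 32, 1280), 1),
    (8, 3, (1, 1, 32, 2048), 1),
    (8, 4, (1, 1, 32, 2048), 1),
]

@pytest.mark.parametrize("num_chips,num_links,per_chip_shape,dim", PRIORITY_CONFIGS)
def test_line_all_gather_priority(num_chips, num_links, per_chip_shape, dim):
    # The test-case input stacks the per-chip shape along the gather dim,
    # e.g. [1,1,32,2304] on 8 chips with dim=1 -> [1,8,32,2304].
    input_shape = list(per_chip_shape)
    input_shape[dim] *= num_chips
    run_line_all_gather(num_chips, num_links, input_shape, dim)  # hypothetical helper
```

The dim=2 "concatenate within tile" cases would be added as `xfail` until that feature lands.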

Basic Sweep tests:

  • Sweep TensorMemoryLayout: {Single Bank, Interleaved, Width Sharded, Height Sharded, Block Sharded}
  • Dim: {3,2,1,0}
  • BufferType: {DRAM, L1}
  • Layout: {RowMajor, Tile}
  • Shapes: Constrained to tile/page aligned, Shard grids unpadded
    • For inner dims (y, x), increment by 32 in each direction. For outer dims, increment by 1.
      • Outer dims can be swept with small values first, then relatively prime values up to 128.
      • More values can be swept over after all-gather is migrated to make more use of runtime args.
  • Dataformat: {fp16, bfp8}
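
The basic sweep is effectively a Cartesian product over these axes. A minimal enumeration sketch (the string encodings are illustrative; per-op validity filtering is left as a stub):

```python
# Sketch only: enumerate the basic sweep space described above.
import itertools

memory_layouts = ["single_bank", "interleaved", "width_sharded", "height_sharded", "block_sharded"]
dims = [3, 2, 1, 0]
buffer_types = ["DRAM", "L1"]
layouts = ["row_major", "tile"]
dataformats = ["fp16", "bfp8"]
# Inner dims (y, x) step by 32 (tile aligned); outer dims step by 1.
# A later pass can add relatively prime outer-dim values up to 128.
shapes = [
    (w, z, y, x)
    for w in range(1, 3)
    for z in range(1, 3)
    for y in range(32, 129, 32)
    for x in range(32, 129, 32)
]

for mem_layout, dim, buf, layout, fmt, shape in itertools.product(
    memory_layouts, dims, buffer_types, layouts, dataformats, shapes
):
    # Per-op validity filters (e.g. sharding vs. row-major page size) go here.
    ...
```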

Tile Padding Sweep Tests:

After basic sweep tests are running

  • For Layout == Tile, sweep over padded tile configurations/shapes:
    • x-padded, y-aligned (pad x from 1 to 31)
    • x-aligned, y-padded (pad y from 1 to 31)
    • x-padded, y-padded (pad each from 1 to 31)
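
A minimal sketch of generating these padded-tile logical shapes, assuming 32x32 tiles:

```python
# Sketch only: logical (y, x) shapes that exercise tile padding.
TILE = 32

def padded_tile_shapes():
    for pad in range(1, TILE):          # x-padded, y-aligned
        yield (TILE, TILE - pad)
    for pad in range(1, TILE):          # x-aligned, y-padded
        yield (TILE - pad, TILE)
    for pad_y in range(1, TILE):        # x-padded, y-padded
        for pad_x in range(1, TILE):
            yield (TILE - pad_y, TILE - pad_x)
```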

Advanced Sharding Sweep Tests:

After basic sweep tests are running

  • For TensorMemoryLayout == (WIDTH|HEIGHT|BLOCK) sharded, sweep over padded shard grid
    • Sweep all possible shard grids
  • In addition to padded shard grids, also sweep lightly over grid offset
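
A minimal sketch of the shard-grid enumeration, assuming an 8x8 worker grid (the real grid extent and placement constraints are arch- and op-dependent):

```python
# Sketch only: all shard grid extents plus a light sweep over grid offsets
# on an assumed 8x8 worker grid.
GRID_Y, GRID_X = 8, 8

def shard_grids(light_offsets=(0, 1)):
    for gy in range(1, GRID_Y + 1):
        for gx in range(1, GRID_X + 1):
            for oy in light_offsets:            # light offset sweep only
                for ox in light_offsets:
                    if oy + gy <= GRID_Y and ox + gx <= GRID_X:
                        # (offset_y, offset_x, extent_y, extent_x)
                        yield (oy, ox, gy, gx)
```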

Basic Sweeps (Large Tensors)

  • Run the basic sweeps but only for very large tensor shapes (in the GBs); the goal is to make sure we can execute DRAM-filling CCLs
    • Assume 10GB usable space per WH chip
    • Be sure to include very short-and-wide tensors as well as tall-and-narrow tensors so we stress the individual dim sizes too
      • This will help flush out any integer overflow issues that might be lurking
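
A quick sketch of the size-budget math, assuming 10GB usable per chip and 2-byte (fp16) elements:

```python
# Sketch only: does the gathered output fit the assumed 10GB per-chip budget?
USABLE_BYTES = 10 * 1024**3   # assumed usable DRAM per WH chip
BYTES_PER_ELEM = 2            # fp16

def gathered_fits_per_chip(per_chip_shape, num_chips):
    elems = 1
    for d in per_chip_shape:
        elems *= d
    # After all-gather, every chip holds the full gathered tensor.
    return elems * num_chips * BYTES_PER_ELEM <= USABLE_BYTES

# Short-and-wide vs. tall-and-narrow candidates stress individual dim sizes;
# note the gathered element count here (2**32) already overflows 32-bit indexing.
print(gathered_fits_per_chip((1, 1, 32, 2**24), 8))  # wide: 8GB gathered -> True
print(gathered_fits_per_chip((1, 1, 2**24, 32), 8))  # tall: same size -> True
```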

Input/Output Tensor Attribute Mixing

  • Mix and match combinations of the above, but applied differently between the input and output tensors (see the sketch below)
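
A minimal sketch of the mix-and-match enumeration (encodings illustrative only):

```python
# Sketch only: apply memory configs independently to input and output tensors.
import itertools

configs = [
    (mem, buf)
    for mem in ("interleaved", "width_sharded", "height_sharded", "block_sharded")
    for buf in ("DRAM", "L1")
]

for in_cfg, out_cfg in itertools.product(configs, repeat=2):
    if in_cfg == out_cfg:
        continue  # matching configs are already covered by the basic sweeps
    ...
```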

Note:

The above should all be runnable on 8-chip/1-link and then on 4-chip/2-link configurations. 3- and 4-link variants are runnable on TG. TG generality testing will initially lag t3000 generality testing.

Adding Testing Depth

For a given test (or test list), enable it on the various multichip configurations below. Priorities may change over time and per op.

Basic Topology Configurations

| hardware | topology | #links | #chips | #instances | comment |
|----------|----------|--------|--------|------------|---------|
| n300 | line | x1 | 2 | 1 | |
| n300 | ring | x1 | 2 | 1 | |
| t3000 | ring | x1 | 8 | 1 | |
| t3000 | line | x1 | 8 | 1 | |
| t3000 | line | x1 | 4 | 2 | |
| t3000 | line | x1 | 3 | 1 | |
| t3000 | ring | x1 | 4 | 1 | |
| TG | line | x{1,2,3,4} | 8 | 4 | |
| TG | line | x{1,2,3} | 4 | 8 | |
| TGG | line | x{1,2,3} | 8 | 8 | |

Advanced Topology Configurations

| hardware | topology | #links | #chips | #concurrent_instances | comment |
|----------|----------|--------|--------|-----------------------|---------|
| TG | line | x4 | 4 | 8 | each column runs 2 separate line all-gathers; 8 all-gathers total |
| TG | line | x4 | 5 | 4 | |
| TG | line | x4 | 6 | 4 | |
| TG | line | x4 | 7 | 4 | |
| TG | ring | x3 | 4 (2x2) | 8 | 8 2x2 rings |
| TG | ring | x3 | 8 (2x4) | 4 | 4 2x4 rings |
| TG | ring | x3 | 8 (4x2) | 4 | 4 4x2 rings |
| TG | ring | x3 | 16 (8x2) | 2 | 2 8x2 rings |
| TGG | line | x4 | 4 | 16 | 0 |

"Random" Topology Configurations

  • Build a random topology generator (see the sketch below)

In the case of sweeping, we enumerate the various ways to map lines and rings onto the given cluster. Below are some non-typical but valid test cases that should be included in the sweep:

[image: examples of non-typical but valid line/ring mappings onto a cluster]
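
A minimal sketch of such a generator, assuming the cluster is a 2D grid of chips (e.g. TG's 8x4) and modeling a line as a self-avoiding path of adjacent chips and a ring as a line whose endpoints are also adjacent:

```python
# Sketch only: random line/ring placement on a 2D chip grid.
import random

def random_line(grid_y, grid_x, length, rng):
    """Random self-avoiding walk of `length` chips; None on dead end."""
    y, x = rng.randrange(grid_y), rng.randrange(grid_x)
    path = [(y, x)]
    while len(path) < length:
        steps = [
            (y + dy, x + dx)
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= y + dy < grid_y and 0 <= x + dx < grid_x
            and (y + dy, x + dx) not in path
        ]
        if not steps:
            return None  # dead end; caller retries
        y, x = rng.choice(steps)
        path.append((y, x))
    return path

def random_ring(grid_y, grid_x, length, seed=0, retries=1000):
    """A line whose endpoints are adjacent closes into a valid ring
    (only even lengths can close on a grid)."""
    rng = random.Random(seed)
    for _ in range(retries):
        path = random_line(grid_y, grid_x, length, rng)
        if path and abs(path[0][0] - path[-1][0]) + abs(path[0][1] - path[-1][1]) == 1:
            return path
    return None
```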