Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Incorrect assertion logic in check_barrier_in_range in barrier.h #1781

Closed
Algy opened this issue Sep 4, 2024 · 1 comment · Fixed by #1782
Closed

[BUG] Incorrect assertion logic in check_barrier_in_range in barrier.h #1781

Algy opened this issue Sep 4, 2024 · 1 comment · Fixed by #1782
Labels
bug Something isn't working

Comments

@Algy
Copy link
Contributor

Algy commented Sep 4, 2024

Describe the bug

When the Stream-K scheduler is used along with the TMA Cooperative kernel, which can contain more than 2 mma warp groups, stream-k raises a assertion error in fixup() method.

cutlass/include/cutlass/barrier.h:280: static void cutlass::NamedBarrierManager<ThreadCount_, Offset, MaxNumNamedBarriers>::check_barrier_in_range(unsigned int) [with unsigned int ThreadCount_ = 128U; unsigned int Offset = 3U; unsigned int MaxNumNamedBarriers = 2U]: block: [0,25,0], thread: [159,0,0] Assertion `(idx >= MaxNumNamedBarriers) && "Index exceeds barrier count"` failed

At sm90_tile_scheduler_stream_k.hpp : L363-378,

  // Performs the reduction across splits for a given output tile.
  template <class FrgTensorC>
  CUTLASS_DEVICE
  static void
  fixup(
    Params const& params,
    WorkTileInfo const& work_tile_info,
    FrgTensorC& accumulators,
    uint32_t num_barriers,
    uint32_t barrier_idx) {
    static constexpr uint32_t Offset = static_cast<int>(cutlass::arch::ReservedNamedBarriers::StreamkBarrier0);
    static constexpr uint32_t MaxNumNamedBarriers = 2;
    using BarrierManager = NamedBarrierManager<NumThreadsPerWarpGroup, Offset, MaxNumNamedBarriers>;
    return fixup_helper<FrgTensorC, BarrierManager>(
      params, work_tile_info, accumulators, num_barriers, barrier_idx);
  }

Here, MaxNumNamedBarriers is set to 2. However, when it is used in conjunction with the Cooperative kernel, it raises an assertion error caused by the wrong assumption on the number of MMA WGs, I guess.

EDIT: the bug seems to be caused by the wrong assertion logic described below.

Steps/Code to reproduce bug

I can't provide the code stub right now since it's proprietary, but the gemm configuration goes like this:

  • Type: e4m3 x e4m3 -> f16
  • Tile: 128x64x64
  • Cluster: 2x1x1
  • Tile Scheduler: Stream-K
  • Kernel Schedule: cutlass::gemm::KernelTmaWarpSpecializedCooperative
  • Epilogue Schedule: cutlass::epilogue::TmaWarpSpecializedCooperative
  • Simply built with CollectiveBuilder

I found this error in my project when I accidently entered Debug mode with no -NDEBUG flag. It works correctly when it compiles with Release mode in which assert() has no effect.

Expected behavior

MaxNumNamedBarriers should be set to 3 or larger.
EDIT: See below

Environment details (please complete the following information):

  • Bare-metal, H100
@Algy Algy added ? - Needs Triage bug Something isn't working labels Sep 4, 2024
@Algy
Copy link
Contributor Author

Algy commented Sep 4, 2024

Wait. It seems this is caused by the incorrect assertion logic?

https://github.com/NVIDIA/cutlass/blob/e1976daacc7b030ba672217eb5d96f5a663df4ab/include/cutlass/barrier.h#L276-282

Currently the assert statement is always evalauted to false in normal situations.

    assert((idx >= MaxNumNamedBarriers) && "Index exceeds barrier count");

Therefore, the predicate should be inverted like this:

    assert((idx < MaxNumNamedBarriers) && "Index exceeds barrier count");

pre-v3.5.0 used to check with a branch.

  if (idx >= MaxNumNamedBarriers) {
    CUTE_RUNTIME_ASSERT("Index exceeds barrier count");
  }

This bug was introduced in v3.5.0, where the predicate is blindly copied to the assert(). However, the logic should be inverted.

@Algy Algy changed the title [BUG] wrong number of max named barriers in fixup() of Stream-K scheduler [BUG] Incorrect assertion logic in check_barrier_in_range in barrier.h Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants