
MI300A Race Condition Fix #185

Open
wants to merge 3 commits into develop

Conversation

PaulMullowney
Contributor

This PR addresses race conditions that occur when running on MI300A as an APU. Host and device kernels using the same memory simultaneously can lead to incorrect results. The solution is to refactor and stash the results of the host computation so that the GPU kernels can continue to be run asynchronously. In the case of trltog, the refactored host code can be executed during MPI communication.

This branch has been run on MI300A nodes (XNACK=0, discrete-GPU mode, and XNACK=1, APU mode) and on A100 nodes.

@samhatfield self-requested a review on December 13, 2024 10:22
@samhatfield
Collaborator

Currently we are prone to a race condition on APUs in TRLTOG/TRGTOL in the following way:

DO J = 1, N
  VARIABLE = VALUE

  !$ACC DATA COPYIN(VARIABLE) ASYNC(1)
  !$ACC PARALLEL LOOP ASYNC(1)
  DO I = 1, M
    ! do something with VARIABLE
  END DO
  !$ACC END DATA
END DO

On APUs the COPYIN essentially does nothing because host and device share the same memory. And because the PARALLEL region is executed asynchronously, the next iteration of the loop modifies VARIABLE while the previous iteration is still operating on it.

On GPUs you don't have this problem because the device-side VARIABLE isn't modified (by the COPYIN) until the PARALLEL region on the previous loop iteration has completed. This is because both are submitted to stream 1.

Paul's fix here is to essentially make VARIABLE an array of length N and precompute all values in a prior loop.
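
A minimal sketch of that pattern (hypothetical names, not the actual TRLTOG/TRGTOL code): the host writes are hoisted into a precompute loop that finishes before any device work is queued, so nothing on the host races with the async kernels.

DO J = 1, N
  VARIABLE_ALL(J) = VALUE        ! host precompute, before any device work is queued
END DO

!$ACC DATA COPYIN(VARIABLE_ALL) ASYNC(1)
DO J = 1, N
  !$ACC PARALLEL LOOP ASYNC(1)
  DO I = 1, M
    ! do something with VARIABLE_ALL(J)
  END DO
END DO
!$ACC WAIT(1)
!$ACC END DATA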

Is that right @PaulMullowney?

@lukasm91
Collaborator

lukasm91 commented Dec 13, 2024

This is a race condition in the code and we have been lucky so far that it never showed up - nice catch! Though, my recommendation for a fix here is to rather stay with a one-dimensional IFLDA array, something along the lines of (pseudocode)

  1. add ISEND_FIELD_OFFSET = CUMSUM(ISEND_FIELD_COUNT)
  2. compute IFLDA as - not tested, semi-pseudocode:
    DO I=1, NPRTRV
      IFLDS = 1
      DO JFLD=1,KF_GP
        IF (IVSET(JFLD) == I) THEN
          IF (PRESENT(KPTRGP)) THEN
            IFLDA(ISEND_FIELD_OFFSET(I)+IFLDS) = KPTRGP(JFLD)
          ELSE
            IFLDA(ISEND_FIELD_OFFSET(I)+IFLDS) = JFLD
          ENDIF
          IFLDS = IFLDS + 1
        ENDIF
      ENDDO
    ENDDO
    
  3. Do not copyin IFLDA(1:FIELD_COUNT), but copyin the full IFLDA.

Later, just use IFLDA(ISEND_FIELD_OFFSET(ISETV)+X) rather than IFLDA(X). This saves us an array whose size scales with the number of ranks times the number of fields, which is something I think we want to avoid if possible.
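
For step 1, the offset would presumably be an exclusive prefix sum of the per-rank field counts, so that rank I's fields occupy IFLDA(ISEND_FIELD_OFFSET(I)+1 : ISEND_FIELD_OFFSET(I)+ISEND_FIELD_COUNT(I)). A sketch in the same untested, semi-pseudocode spirit:

    ISEND_FIELD_OFFSET(1) = 0
    DO I = 2, NPRTRV
      ISEND_FIELD_OFFSET(I) = ISEND_FIELD_OFFSET(I-1) + ISEND_FIELD_COUNT(I-1)
    ENDDO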

@PaulMullowney
Contributor Author

> Currently we are prone to a race condition on APUs in TRLTOG/TRGTOL in the following way:
>
> DO J = 1, N
>   VARIABLE = VALUE
>
>   !$ACC DATA COPYIN(VARIABLE) ASYNC(1)
>   !$ACC PARALLEL LOOP ASYNC(1)
>   DO I = 1, M
>     ! do something with VARIABLE
>   END DO
>   !$ACC END DATA
> END DO
>
> On APUs the COPYIN essentially does nothing because host and device share the same memory. And because the PARALLEL region is executed asynchronously, the next iteration of the loop modifies VARIABLE while the previous iteration is still operating on it.
>
> On GPUs you don't have this problem because the device-side VARIABLE isn't modified (by the COPYIN) until the PARALLEL region on the previous loop iteration has completed. This is because both are submitted to stream 1.
>
> Paul's fix here is to essentially make VARIABLE an array of length N and precompute all values in a prior loop.
>
> Is that right @PaulMullowney?

Correct. Provided that the COPYIN and PARALLEL region are submitted to the same stream in every loop iteration, I don't think this can be an issue on discrete GPUs. This is because subsequent COPYINs (iteration i) will wait until the previous PARALLEL region (iteration i-1) has finished.

@samhatfield added the bug and gpu labels on Dec 13, 2024
@samhatfield
Collaborator

A very interesting suggestion from Lukas there. I'll try coding that up. It seems obvious now, I wonder why it wasn't already like that.

@PaulMullowney
Contributor Author

PaulMullowney commented Dec 13, 2024

> A very interesting suggestion from Lukas there. I'll try coding that up. It seems obvious now, I wonder why it wasn't already like that.

Is the primary advantage of this approach memory savings? From my perspective, simpler, easier to read (and debug) code is of high value when the resource/performance benefits are insignificant.

In any event, I am about to push one more commit that refactors a bit more. In the two modules, IFLDA was being computed in multiple places depending on whether one had self communication or not. I am trying to combine these into a single initialization with a copy to device. If I were you, I would implement off this newer version.

Also, this new commit will be rebased onto the latest develop.

@lukasm91
Collaborator

lukasm91 commented Dec 13, 2024

Well, I don't want to fight for one or the other solution, I just wanted to lay out an alternative I think is worth investigating.

> Is the primary advantage of this approach memory savings? From my perspective, simpler, easier to read (and debug) code is of high value when the resource/performance benefits are insignificant.

Not quite memory savings, but yes, I think we should avoid copies and allocations that scale with the number of ranks times the number of fields. This might not be as dramatic here, so I totally agree, this is not the only argument.
But I would also argue that code should be concise and self-documenting. In the end, IFLDA is just a permutation of 1...KF_GP, which my proposal makes clear by giving it size KF_GP. On top of that, it documents by itself that each "W-rank" has the same subset of layers, because there is simply only one array subsection; hence its definition needs no special cases (IFLDA(:,1) in the 2D case). When there is an IFLDA sub-array for each send, you can introduce inconsistencies that are not possible with my proposal; one might say this risk was there before, true, but why not remove it when rewriting the code.

I really didn't want to offend anyone, we can also do it in two steps; I really just looked at this change and thought - very valid point you bring up, but I also see an alternative :)

Finally, regarding:

> I don't think this can be an issue on discrete GPUs

It can, at least on NVIDIA hardware, so I am surprised we didn't run into it. Since all kernels and data regions are async, the CPU loop counter can run ahead and pile up queued copies of the 1D IFLDA. The actual execution of a copy can then happen after IFLDA has already been overwritten by a later iteration.
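
As an illustration only (not what this PR does), one way to close that window in the original pattern is to make the host wait on the queue before it overwrites the staging array again, at the cost of serializing the host against the device queue:

DO J = 1, N
  !$ACC WAIT(1)                 ! the previous iteration's COPYIN must have read IFLDA by now
  IFLDA(:) = VALUE              ! placeholder host write for iteration J
  !$ACC DATA COPYIN(IFLDA) ASYNC(1)
  !$ACC PARALLEL LOOP ASYNC(1)
  DO I = 1, M
    ! use the device copy of IFLDA
  END DO
  !$ACC END DATA
END DO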

@PaulMullowney
Contributor Author

> Well, I don't want to fight for one or the other solution, I just wanted to lay out an alternative I think is worth investigating.
>
> > Is the primary advantage of this approach memory savings? From my perspective, simpler, easier to read (and debug) code is of high value when the resource/performance benefits are insignificant.
>
> Not quite memory savings, but yes, I think we should avoid copies and allocations that scale with the number of ranks times the number of fields. This might not be as dramatic here, so I totally agree, this is not the only argument. But I would also argue that code should be concise and self-documenting. In the end, IFLDA is just a permutation of 1...KF_GP, which my proposal makes clear by giving it size KF_GP. On top of that, it documents by itself that each "W-rank" has the same subset of layers, because there is simply only one array subsection; hence its definition needs no special cases (IFLDA(:,1) in the 2D case). When there is an IFLDA sub-array for each send, you can introduce inconsistencies that are not possible with my proposal; one might say this risk was there before, true, but why not remove it when rewriting the code.
>
> I really didn't want to offend anyone, we can also do it in two steps; I really just looked at this change and thought - very valid point you bring up, but I also see an alternative :)
>
> Finally, regarding:
>
> > I don't think this can be an issue on discrete GPUs
>
> It can, at least on NVIDIA hardware, so I am surprised we didn't run into it. Since all kernels and data regions are async, the CPU loop counter can run ahead and pile up queued copies of the 1D IFLDA. The actual execution of a copy can then happen after IFLDA has already been overwritten by a later iteration.

No offense taken. My motivation is to get this branch and one other into develop sooner rather than later. I am playing a constant rebasing game because we're touching a lot of the same code. I'll let others decide on the preferred solution here.

On second thought, I think you're right about the discrete GPU case, though I haven't seen it happen either, even at scale.

…ously with device kernels on the same memory. This leads to race conditions and incorrect results. In this commit, the host code is rewritten to avoid this situation. Then the kernels can continue to be executed asynchronously.
…al big jobs 32 (truncation 639) and 64 APUs (truncation 1279) for an MI300A in both discrete and APU mode. Norms look good.
@PaulMullowney
Contributor Author

Any update on getting this merged?

@samhatfield
Collaborator

Yesterday I started coding up the fix suggested by Lukas. Once that's done we can compare the two and pick our favourite to merge.

Alternatively, if you'd like this PR merged urgently, I can do that, and then we can consider a PR from my branch later.

By the way, @PaulMullowney have you tried the develop branch recently on MI250x? Is it stable for you?
