
MI300A Race Condition Fix #185

Open
wants to merge 3 commits into develop

Conversation

PaulMullowney
Contributor

This PR addresses race conditions that occur when running on MI300A as an APU. Host and device kernels using the same memory simultaneously can lead to incorrect results. The solution is to refactor and stash the results of the host computation so that the GPU kernels can continue to be run asynchronously. In the case of trltog, the refactored host code can be executed during MPI communication.

This branch has been run on MI300A nodes (XNACK=0, discrete-GPU mode, and XNACK=1, APU mode) and on A100 nodes.

@samhatfield self-requested a review on December 13, 2024 10:22
@samhatfield
Collaborator

Currently we are prone to a race condition on APUs in TRLTOG/TRGTOL in the following way:

DO J = 1, N
  VARIABLE = VALUE

  !$ACC DATA COPYIN(VARIABLE) ASYNC(1)
  !$ACC PARALLEL LOOP ASYNC(1)
  DO I = 1, M
    ! do something with VARIABLE
  END DO
  !$ACC END DATA
END DO

On APUs the COPYIN essentially does nothing because host and device share the same memory. And because the PARALLEL region is executed asynchronously, the next iteration of the loop modifies VARIABLE while the previous iteration is still operating on it.

On GPUs you don't have this problem because the device-side VARIABLE isn't modified (by the COPYIN) until the PARALLEL region on the previous loop iteration has completed. This is because both are submitted to stream 1.

Paul's fix here is to essentially make VARIABLE an array of length N and precompute all values in a prior loop.
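
A minimal sketch of that pattern (hypothetical names, not the actual TRLTOG/TRGTOL code): the host writes are hoisted into a precompute loop that finishes before any device work is queued, so nothing on the host races with the async kernels.

DO J = 1, N
  VARIABLE_ALL(J) = VALUE        ! host precompute, before any device work is queued
END DO

!$ACC DATA COPYIN(VARIABLE_ALL) ASYNC(1)
DO J = 1, N
  !$ACC PARALLEL LOOP ASYNC(1)
  DO I = 1, M
    ! do something with VARIABLE_ALL(J)
  END DO
END DO
!$ACC WAIT(1)
!$ACC END DATA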

Is that right @PaulMullowney?

@lukasm91
Collaborator

lukasm91 commented Dec 13, 2024

This is a race condition in the code and we have been lucky so far that it never showed up - nice catch! Though, my recommendation for a fix here is to rather stay with a one-dimensional IFLDA array, something along the lines of (pseudocode)

  1. add ISEND_FIELD_OFFSET = CUMSUM(ISEND_FIELD_COUNT)
  2. compute IFLDA as - not tested, semi-pseudocode:
    DO I=1, NPRTRV
      IFLDS = 1
      DO JFLD=1,KF_GP
        IF (IVSET(JFLD) == I) THEN
          IF (PRESENT(KPTRGP)) THEN
            IFLDA(ISEND_FIELD_OFFSET(I)+IFLDS) = KPTRGP(JFLD)
          ELSE
            IFLDA(ISEND_FIELD_OFFSET(I)+IFLDS) = JFLD
          ENDIF
          IFLDS = IFLDS + 1
        ENDIF
      ENDDO
    ENDDO
    
  3. Do not copyin IFLDA(1:FIELD_COUNT), but copyin the full IFLDA.

Later, just use IFLDA(ISEND_FIELD_OFFSET(ISETV)+X) rather than IFLDA(X). This saves us an array whose size scales with the number of ranks times the number of fields, which is something I think we want to avoid if possible.
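
For step 1, the offset would presumably be an exclusive prefix sum of the per-rank field counts, so that rank I's fields occupy IFLDA(ISEND_FIELD_OFFSET(I)+1 : ISEND_FIELD_OFFSET(I)+ISEND_FIELD_COUNT(I)). A sketch in the same untested, semi-pseudocode spirit:

    ISEND_FIELD_OFFSET(1) = 0
    DO I = 2, NPRTRV
      ISEND_FIELD_OFFSET(I) = ISEND_FIELD_OFFSET(I-1) + ISEND_FIELD_COUNT(I-1)
    ENDDO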

@PaulMullowney
Contributor Author

> Currently we are prone to a race condition on APUs in TRLTOG/TRGTOL in the following way:
>
> DO J = 1, N
>   VARIABLE = VALUE
>
>   !$ACC DATA COPYIN(VARIABLE) ASYNC(1)
>   !$ACC PARALLEL LOOP ASYNC(1)
>   DO I = 1, M
>     ! do something with VARIABLE
>   END DO
>   !$ACC END DATA
> END DO
>
> On APUs the COPYIN essentially does nothing because host and device share the same memory. And because the PARALLEL region is executed asynchronously, the next iteration of the loop modifies VARIABLE while the previous iteration is still operating on it.
>
> On GPUs you don't have this problem because the device-side VARIABLE isn't modified (by the COPYIN) until the PARALLEL region on the previous loop iteration has completed. This is because both are submitted to stream 1.
>
> Paul's fix here is to essentially make VARIABLE an array of length N and precompute all values in a prior loop.
>
> Is that right @PaulMullowney?

Correct. Provided that the COPYIN and PARALLEL region are submitted to the same stream in every loop iteration, I don't think this can be an issue on discrete GPUs. This is because subsequent COPYINs (iteration i) will wait until the previous PARALLEL region (iteration i-1) has finished.

@samhatfield added the bug and gpu labels on Dec 13, 2024
@samhatfield
Collaborator

A very interesting suggestion from Lukas there. I'll try coding that up. It seems obvious now, I wonder why it wasn't already like that.

@PaulMullowney
Contributor Author

PaulMullowney commented Dec 13, 2024

> A very interesting suggestion from Lukas there. I'll try coding that up. It seems obvious now, I wonder why it wasn't already like that.

Is the primary advantage of this approach memory savings? From my perspective, simpler, easier to read (and debug) code is of high value when the resource/performance benefits are insignificant.

In any event, I am about to push one more commit that refactors a bit more. In the two modules, IFLDA was being computed in multiple places depending on whether one had self communication or not. I am trying to combine these into a single initialization with a copy to device. If I were you, I would implement off this newer version.

Also, this new commit will be rebased onto the latest develop.

@lukasm91
Collaborator

lukasm91 commented Dec 13, 2024

Well, I don't want to fight for one or the other solution, I just wanted to lay out an alternative I think is worth investigating.

> Is the primary advantage of this approach memory savings? From my perspective, simpler, easier to read (and debug) code is of high value when the resource/performance benefits are insignificant.

Not quite memory savings, but yes, I think we should avoid copies and allocations that scale with the number of ranks times the number of fields. This might not be as dramatic here, so I totally agree, this is not the only argument.
But I would also argue that code should be concise and self-documenting. In the end, IFLDA is just a permutation of 1...KF_GP, which my proposal makes clear by giving it size KF_GP. On top of that, it documents by itself that each "W-rank" has the same subset of layers, because there is simply only one array subsection; hence its definition needs no special cases (IFLDA(:,1) in the 2D case). When there is an IFLDA sub-array for each send, you can introduce inconsistencies that are not possible with my proposal; one might say this risk was there before, true, but why not remove it when rewriting the code.

I really didn't want to offend anyone, we can also do it in two steps; I really just looked at this change and thought - very valid point you bring up, but I also see an alternative :)

Finally, regarding:

> I don't think this can be an issue on discrete GPUs

It can, at least on NVIDIA hardware, so I am surprised we didn't run into it. Since all kernels and data regions are async, the CPU loop counter can run ahead and pile up queued copies of the 1D IFLDA. The actual execution of a copy can then happen after IFLDA has already been overwritten by a later iteration.
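
As an illustration only (not what this PR does), one way to close that window in the original pattern is to make the host wait on the queue before it overwrites the staging array again, at the cost of serializing the host against the device queue:

DO J = 1, N
  !$ACC WAIT(1)                 ! the previous iteration's COPYIN must have read IFLDA by now
  IFLDA(:) = VALUE              ! placeholder host write for iteration J
  !$ACC DATA COPYIN(IFLDA) ASYNC(1)
  !$ACC PARALLEL LOOP ASYNC(1)
  DO I = 1, M
    ! use the device copy of IFLDA
  END DO
  !$ACC END DATA
END DO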

@PaulMullowney
Contributor Author

> Well, I don't want to fight for one or the other solution, I just wanted to lay out an alternative I think is worth investigating.
>
> > Is the primary advantage of this approach memory savings? From my perspective, simpler, easier to read (and debug) code is of high value when the resource/performance benefits are insignificant.
>
> Not quite memory savings, but yes, I think we should avoid copies and allocations that scale with the number of ranks times the number of fields. This might not be as dramatic here, so I totally agree, this is not the only argument. But I would also argue that code should be concise and self-documenting. In the end, IFLDA is just a permutation of 1...KF_GP, which my proposal makes clear by giving it size KF_GP. On top of that, it documents by itself that each "W-rank" has the same subset of layers, because there is simply only one array subsection; hence its definition needs no special cases (IFLDA(:,1) in the 2D case). When there is an IFLDA sub-array for each send, you can introduce inconsistencies that are not possible with my proposal; one might say this risk was there before, true, but why not remove it when rewriting the code.
>
> I really didn't want to offend anyone, we can also do it in two steps; I really just looked at this change and thought - very valid point you bring up, but I also see an alternative :)
>
> Finally, regarding:
>
> > I don't think this can be an issue on discrete GPUs
>
> It can, at least on NVIDIA hardware, so I am surprised we didn't run into it. Since all kernels and data regions are async, the CPU loop counter can run ahead and pile up queued copies of the 1D IFLDA. The actual execution of a copy can then happen after IFLDA has already been overwritten by a later iteration.

No offense taken. My motivation is to get this branch and one other into develop sooner rather than later. I am playing a constant rebasing game because we're touching a lot of the same code. I'll let others decide on the preferred solution here.

On second thought, I think you're right about the discrete GPU case, though I haven't seen it happen either, even at scale.

…ously with device kernels on the same memory. This leads to race conditions and incorrect results. In this commit, the host code is rewritten to avoid this situation. Then the kernels can continue to be executed asynchronously.
…al big jobs 32 (truncation 639) and 64 APUs (truncation 1279) for an MI300A in both discrete and APU mode. Norms look good.
@PaulMullowney
Contributor Author

Any update on getting this merged?

@samhatfield
Collaborator

Yesterday I started coding up the fix suggested by Lukas. Once that's done we can compare the two and pick our favourite to merge.

Alternatively, if you'd like this PR merged urgently, I can do that, and then we can consider a PR from my branch later.

By the way, @PaulMullowney have you tried the develop branch recently on MI250x? Is it stable for you?
