Document MPI communicator setup protocol #712
Conversation
I recommend viewing the diff in this PR in the "rich diff" mode, so that formatting is rendered.

There are a few distinct concerns I can think of to be reviewed here:
@josephzhang8 You may be interested in this too - this describes how I'm planning to feed the setup for a private MPI communicator into SCHISM via the BMI interface that Jason has developed.
Thx
@peckhams I forget if we discussed my plans around passing through arguments to
@hrajagers (Bert Jagers) from Deltares has also been working on BMI MPI parallelization and is interested in how the NextGen project has approached this particular problem. This describes in theory how @PhilMiller plans to feed the setup for a private MPI communicator into NextGen coastal models (SCHISM/DFlowFM).
If we solely wanted to work through this at an ABI level, and have the framework do all the
doc/BMIconventions.md
For MPI-parallel BMI models, we modify the model lifecycle relative to models that will run within a single process. Specifically, between instantiation of the model instance and the call to `initialize`, the framework will make a single collective call to `set_value_at_indices`
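As a rough illustration of the quoted convention, a framework-side call sequence might look like the following sketch. The payload layout (an array of global ranks assigned to the model) and the helper function are hypothetical; `struct Bmi`, `BMI_SUCCESS`, and `BMI_FAILURE` come from the CSDMS BMI C binding's `bmi.h`.

```c
/* Hypothetical framework-side sketch: between instantiation and initialize(),
 * each participating process makes one collective call to
 * set_value_at_indices() carrying the model's rank assignment. The payload
 * layout is assumed for illustration only. */
#include <stdlib.h>
#include "bmi.h"

static int assign_ranks_then_initialize(struct Bmi *model,
                                        const char *config_file,
                                        int *assigned_ranks, int n_assigned)
{
    /* Indices 0..n_assigned-1 of the conceptual rank-assignment variable. */
    int *inds = malloc(sizeof(int) * (size_t) n_assigned);
    if (!inds)
        return BMI_FAILURE;
    for (int i = 0; i < n_assigned; ++i)
        inds[i] = i;

    /* Collective across all processes hosting this model instance. */
    int status = model->set_value_at_indices(model, "bmi_mpi_rank_assignment",
                                             inds, n_assigned, assigned_ranks);
    free(inds);
    if (status != BMI_SUCCESS)
        return status;

    /* Only after communicator setup does the usual lifecycle begin. */
    return model->initialize(model, config_file);
}
```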
It would be good to note here that, as long as the implementation follows the guidance below on only establishing the `MPI_Group` and `MPI_Comm`, this is a safe call outside of the traditional `Initialize -> do stuff -> Finalize` cycle, since the NextGen registration establishes the BMI model structure.

It may also be a good idea to document the expected semantics of multiple calls to `set_value_at_indices` with a `name=bmi_mpi_rank_assignment` variable. If it is allowed to be dynamic, then add a quick note on that and what it means. If it shouldn't be called twice, it might be good to include some boilerplate logic to check and skip re-initializing the MPI group/comm.
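A minimal sketch of that suggested boilerplate, assuming the model keeps its private group/communicator in static state and builds it with `MPI_Comm_create_group` when `set_value_at_indices` arrives with `name=bmi_mpi_rank_assignment`; the helper name and storage are hypothetical, not the actual SCHISM/ngen implementation.

```c
/* Idempotent handling of repeated rank-assignment calls: if the private
 * communicator already exists, skip re-initialization. */
#include <mpi.h>

static MPI_Comm  model_comm  = MPI_COMM_NULL;
static MPI_Group model_group = MPI_GROUP_NULL;

static int handle_rank_assignment(const int *ranks, int count)
{
    if (model_comm != MPI_COMM_NULL)
        return 0;   /* already initialized: skip, per the suggested boilerplate */

    MPI_Group world_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, count, ranks, &model_group);

    /* Collective only over the processes named in `ranks`. */
    MPI_Comm_create_group(MPI_COMM_WORLD, model_group, /*tag=*/0, &model_comm);

    MPI_Group_free(&world_group);
    return 0;
}
```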
Thank you @jduckerOWP for bringing this thread to my attention. I have had two meetings on the topic of a BMI extension for parallel computing with the CSDMS group and NCAR developers, and the draft approach that we're following currently is documented at https://github.com/csdms/bmi/tree/hrajagers/parallel-bmi or more precisely in the bmi.****.rst files in https://github.com/csdms/bmi/tree/hrajagers/parallel-bmi/docs/source. I see two fundamental differences:
In our proposal we have also defined how the various other calls should behave in a parallel environment: most calls, such as get and set, work only on the local data, and that must be the same for you since you don't allow inter-rank communication in such calls. However, the grid_funcs that are used to determine grid dimensions and grid partitioning across the ranks needed some special attention.
Thanks for sharing @hrajagers! It is quite interesting to see a distributed BMI from the perspective that the caller (framework) manages the entire collective (e.g. stitching together all local data). What I see as an interesting question to ask based on that idea is "How does a framework mix and match MPI-compatible components and non-MPI components?" It seems that the ABI would require the complete set of functions from both. We have been thinking about this problem through the lens of "what's the least invasive, minimal requirement" which would allow multiple MPI components to co-exist as components without interfering with each other.
I don't think you are reading this correctly: we are not putting a blanket restriction on MPI communication in these functions, we are restricting which type, specifically.

One thing that we must maintain for our work is ABI compatibility across multiple language/runtime interfaces, and the ability to run both MPI and non-MPI components in the same runtime. This has led us to try some possibly "unorthodox" methods of using the BMI interface.
Hi @hellkite500, I realize that you're trying to work within the scope of the existing BMI 2.0 API whereas our design was based on the assumption that this would be a BMI extension ... where it's still an open discussion on how to recognize and identify BMI extensions (csdms/bmi#138).

I realize that the Fortran API based on an abstract BMI type with deferred procedures does force you into a direction that requires you to implement all procedures defined. Maybe the extension should then be defined as a second type, but I can see potential issues with that as well.

I understand that you're wrapping this Fortran BMI type layer with a C interoperability layer in NextGen. We're doing something similar ourselves where we're moving the effective interface layer to the interface of a shared library. The possibilities for extensions may change somewhat if we look at that type of interface.

My interpretation was that only
I need to answer the calling sequence stuff, but I'm not focused on that at the moment. I'm making a note that there are at least two more reasonable non-extension approaches besides a protocol like what I described in the initial PR:
I had briefly looked into the c2f/f2c approach when I started down this path. I should have documented that at the time, along with my contemporary reason for rejecting it: it didn't look like mpi4py had corresponding bindings. That was wrong, because I searched poorly. Every MPI object type represented in mpi4py has `py2f`/`f2py` methods.

MPI Sessions might allow for a broader range of sophisticated behaviors among different models. They're standardized in MPI 4.0, released in June 2021. They're supported in OpenMPI, MPICH, and Intel MPI, but seemingly not in Cray MPI, which may be the baseline on our operational target of WCOSS2.

I'll do some looking to see whether an approach based on Sessions is worth pursuing, but the integer handle approach seems like the expedient course of action. I'll amend this PR's contents accordingly.
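For reference, a sketch of the integer-handle (c2f/f2c) idea discussed above; the helper functions are hypothetical, but `MPI_Comm_c2f`/`MPI_Comm_f2c` are standard MPI, and the same conversion is what mpi4py's `py2f`/`f2py` expose.

```c
/* Passing a communicator across a language-agnostic BMI boundary as a plain
 * integer: the caller flattens it with MPI_Comm_c2f, the model rebuilds it
 * with MPI_Comm_f2c. */
#include <mpi.h>

/* Framework/caller side: flatten the communicator to an int. */
static int comm_to_handle(MPI_Comm comm)
{
    MPI_Fint fhandle = MPI_Comm_c2f(comm);
    return (int) fhandle;   /* MPI_Fint is an integer type by definition */
}

/* Model side: rebuild a usable MPI_Comm from the received integer. */
static MPI_Comm handle_to_comm(int handle)
{
    return MPI_Comm_f2c((MPI_Fint) handle);
}
```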
Hi @PhilMiller, thank you for the update. Good to hear that you think the c2f/f2c approach will be portable to a wider set of languages. It's what we now use in-house between C and Fortran, but I hadn't investigated it enough to determine whether it would be portable to all platforms.

I'm new to MPI Sessions. The description in Holmes et al. (2016) is rather concise. I'm wondering how the local MPI_Session_Init would know where to instantiate or group the MPI threads, but that's of less importance right now.

My main concern with MPI Sessions that are created within a component and not visible to the outside is that it would make all communication between components sequential, i.e. single-threaded. This may be good enough (or even preferable) for some use cases, but it would limit the scalability of BMI communication towards bigger computations in which two parallel models try to communicate. If we move the MPI Session creation outside the component, then I'm not sure that it differs significantly from the other method.
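For comparison, a sketch of what a component-local Sessions setup could look like, assuming an MPI 4.0 implementation. The process-set name and string tag are illustrative only; how a component would learn which pset to use is exactly the open question raised above.

```c
/* MPI 4.0 Sessions: a component derives its own communicator from a process
 * set, without touching MPI_COMM_WORLD or MPI_Init. */
#include <mpi.h>

static MPI_Session session = MPI_SESSION_NULL;

static int make_component_comm(MPI_Comm *out)
{
    MPI_Group group;

    if (MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session) != MPI_SUCCESS)
        return -1;

    /* "mpi://WORLD" is a built-in process set; a real component would
     * presumably be handed a narrower, named pset. */
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    MPI_Comm_create_from_group(group, "ngen.component.example", MPI_INFO_NULL,
                               MPI_ERRORS_RETURN, out);
    MPI_Group_free(&group);

    /* The session stays live for the component's lifetime; it would be
     * finalized (after freeing the communicator) in the teardown path. */
    return 0;
}
```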
The description here is concise and easy to understand.
There's an implied constraint in this design that any gather operations of global fields would have to happen in

From discussion in today's stand-up, Donald noted that this constraint means that code may run into resource limits if it proactively gathers all such fields. The alternative for the time being is that such fields will be enumerated/selected in the runtime configuration. We don't immediately have any models for which that's an issue, so we're not going to block on it.
Following discussion with folks working on a parallel BMI extension, we've concluded that the path forward for ngen is to use what's described here solely as a temporary measure in SCHISM to allow development progress, and to plan to adopt the community-consensus extension design as it solidifies. So, protocol documentation will be in code comments, whose lifetime will line up exactly with actual use of the functionality: when we move away from it, the comments can get deleted with the code.
Describes how we want coastal models and eventually any other MPI-parallelized BMI modules to be informed of the ranks of MPI processes available for their use.