
Document MPI communicator setup protocol #712

Closed

Conversation

@PhilMiller (Contributor) commented Jan 19, 2024

Describes how we want coastal models and eventually any other MPI-parallelized BMI modules to be informed of the ranks of MPI processes available for their use.

Changes

  • Add description to BMI conventions doc

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

@PhilMiller (Contributor Author)

I recommend viewing the diff in this PR in the "rich diff" mode, so that formatting is rendered.

@PhilMiller (Contributor Author) commented Jan 19, 2024

This is a revision of my earlier idea about passing in arguments for a call to MPI_Comm_split. The upshot of the revision is that the framework no longer needs to ensure any corresponding code runs on non-participating processes to satisfy a collective over MPI_COMM_WORLD - only processes on which a given model instance will exist need to be called.
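To make the distinction concrete, here is a rough sketch (illustrative names only, not the text of the conventions doc) of how a model-private communicator can be built from a list of world ranks supplied by the framework. `MPI_Comm_create_group` is collective only over the listed group, whereas `MPI_Comm_split` is collective over its parent communicator:

```c
/* Hypothetical sketch: building a model-private communicator from a list of
 * world ranks supplied by the framework. Only the listed ranks need to make
 * these calls; MPI_Comm_create_group is collective over `model_group` alone,
 * whereas MPI_Comm_split would be collective over all of MPI_COMM_WORLD. */
#include <mpi.h>

MPI_Comm make_model_comm(const int *ranks, int n_ranks, int tag)
{
    MPI_Group world_group, model_group;
    MPI_Comm model_comm;

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, n_ranks, ranks, &model_group);
    MPI_Comm_create_group(MPI_COMM_WORLD, model_group, tag, &model_comm);

    MPI_Group_free(&world_group);
    MPI_Group_free(&model_group);
    return model_comm;
}
```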

@PhilMiller (Contributor Author)

There are a few distinct concerns I can think of to be reviewed here:

  • Does this write-up clearly describe what we're (prospectively) doing and why?
  • Is what's described here sensible and desirable?
  • Is the implementation it entails reasonable?

@PhilMiller (Contributor Author)

@josephzhang8 You may be interested in this too - this describes how I'm planning to feed the setup for a private MPI communicator into SCHISM via the BMI interface that Jason has developed.

@josephzhang8

Thx

@PhilMiller (Contributor Author)

@peckhams I forget whether we discussed my plans around passing through arguments to MPI_Comm_split at AGU, or whether the idea hadn't occurred to me yet. This PR describes a refinement of that idea, and how I intend to get MPI-parallelized model code working within the NextGen Framework in the absence of any better ideas or movement on the standardization front.

@jduckerOWP

@hrajagers (Bert Jagers) from Deltares has also been working on BMI MPI parallelization and is interested in how the NextGen project has approached this particular problem. This describes in theory how @PhilMiller plans to feed the setup for a private MPI communicator into NextGen coastal models (SCHISM/DFlowFM).

@PhilMiller (Contributor Author) commented Jan 24, 2024

If we solely wanted to work through this at an ABI level, and have the framework do all the MPI_Comm juggling itself, we could alternatively use the MPI_Comm_c2f family of routines and pass integers (MPI_Fint) through BMI methods. We'd still have the weird lifecycle issues to deal with. It's also not applicable to Python, since mpi4py doesn't expose those routines. [That turned out to be wrong - mpi4py has py2f and f2py.]

A Contributor left a review comment on this excerpt from the proposed conventions text:

"For MPI-parallel BMI models, we modify the model lifecycle relative to models that will run within a single process. Specifically, between instantiation of the model instance and the call to `initialize`, the framework will make a single collective call to `set_value_at_indices` ..."

It would be good to note here that, as long as the implementation follows the guidance below on only establishing the MPI_Group and MPI_Comm, this is a safe call outside of the traditional Initialize -> do stuff -> Finalize cycle, since the NextGen registration establishes the BMI model structure.

It may also be a good idea to document the expected semantics of multiple calls to set_value_at_indices with a name=bmi_mpi_rank_assignment variable. If the assignment is allowed to be dynamic, then a quick note on that and what it means would help. If it shouldn't be called twice, it might be good to include some boilerplate logic to check for and skip re-initializing the MPI group/comm, as in the sketch below.
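A minimal sketch of such a guard, with hypothetical struct, field, and function names (the actual model code may differ):

```c
/* Hypothetical guard against repeated rank-assignment calls. The struct,
 * field, and variable names here are illustrative only. */
#include <string.h>
#include <mpi.h>

#define BMI_SUCCESS 0   /* matches the BMI C convention of 0 for success */

typedef struct {
    MPI_Comm mpi_comm;  /* starts as MPI_COMM_NULL before any assignment */
    /* ... other model state ... */
} ModelState;

int handle_rank_assignment(ModelState *self, const char *name,
                           const int *ranks, int n_ranks)
{
    if (strcmp(name, "bmi_mpi_rank_assignment") != 0)
        return BMI_SUCCESS;        /* not our variable; nothing to do */

    if (self->mpi_comm != MPI_COMM_NULL)
        return BMI_SUCCESS;        /* already initialized; skip re-init */

    /* ... build the MPI_Group / MPI_Comm from ranks[0..n_ranks-1] here ... */
    (void) ranks; (void) n_ranks;
    return BMI_SUCCESS;
}
```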

@hrajagers

Thank you @jduckerOWP for bringing this thread to my attention. I have had two meetings on the topic of a BMI extension for parallel computing with the CSDMS group and NCAR developers, and the draft approach that we're following currently is documented at https://github.com/csdms/bmi/tree/hrajagers/parallel-bmi or more precisely in the bmi.****.rst files in https://github.com/csdms/bmi/tree/hrajagers/parallel-bmi/docs/source. I see two fundamental differences:

  1. In your approach the BMI component generates its communicator from MPI_COMM_WORLD by means of an MPI_Comm_create_group call. In our approach the model_group communicator is assumed to be created by the calling component/framework and then passed down to the component. I like the fact that your solution avoids passing communicators across language interfaces ... we haven't run into issues on this topic yet, but there are some concerns about whether it's portable to all environments. It would be nice to perform some basic tests across the Python, C and Fortran languages with both approaches.

  2. You assume that all calls except for initialize, finalize, update and update_until can be called independently per rank, i.e. no MPI communication is allowed within the model for such calls. On the other hand, we wrote in our proposal that all calls must be performed across all ranks, since communication might be needed within the model component. It would be great for the calling side if it could rely on the fact that no MPI communication is needed, but I'm not sure that we're ready yet to make that conclusion. Especially when getting a list of input/output quantities, we have been discussing whether this should be one consistent list across all ranks, or whether the list may differ per rank. We have opted for the former ...

In our proposal we have also defined how the various other calls should behave in a parallel environment: most calls, such as get and set, work only on the local data, and that must be the same since you don't allow inter-rank communication in such calls. However, the grid_funcs that are used to determine grid dimensions and grid partitioning across the ranks needed some special attention.

@hellkite500 (Contributor)

Thanks for sharing @hrajagers! It is quite interesting to see a distributed BMI from the perspective that the caller (framework) manages the entire collective (e.g. stitching together all local data). An interesting question to ask based on that idea is "How does a framework mix and match MPI-compatible components and non-MPI components?" It seems that the ABI would require the complete set of functions from both. We have been thinking about this problem through the lens of "what's the least invasive, minimal requirement" which would allow multiple MPI components to co-exist without interfering with each other.

no MPI communication is allowed within the model for such calls

I don't think you are reading this correctly: we are not putting a blanket restriction on MPI communication in these functions; we are restricting a specific type, collective communication, from being used in those functions.

One thing that we must maintain for our work is ABI compatibility across multiple language/runtime interfaces, along with the ability to run both MPI and non-MPI components in the same runtime. This has led us to try some possibly "unorthodox" methods of using the BMI interface.

@hrajagers commented Jan 26, 2024

Hi @hellkite500, I realize that you're trying to work within the scope of the existing BMI 2.0 API, whereas our design was based on the assumption that this would be a BMI extension ... where it's still an open discussion how to recognize and identify BMI extensions (csdms/bmi#138). I realize that the Fortran API based on an abstract BMI type with deferred procedures does force you into a direction that requires you to implement all procedures defined. Maybe the extension should then be defined as a second type, but I can see potential issues with that as well.

I understand that you're wrapping this Fortran BMI type layer with a C interoperability layer in NextGen. We're doing something similar ourselves where we're moving the effective interface layer to the interface of a shared library. The possibilities for extensions may change somewhat if we look at that type of interface.

My interpretation was that only initialize, finalize, update and update_until would be called collectively, and that the other routines may or may not be called on all ranks. If the called component doesn't know whether the other MPI ranks are called ... it cannot initiate MPI communication with any other rank without risking hanging the program. Without a synchronous BMI call to the MPI rank with which it wants to communicate, there is no activity thread in the component on that rank to "answer" the MPI communication.

@PhilMiller (Contributor Author)

I need to answer the calling sequence stuff, but I'm not focused on that at the moment.

I'm making a note that there are at least two more reasonable non-extension approaches besides a protocol like what I described in the initial PR:

  • Use the MPI_Foo_{c2f,f2c} interfaces (C/Fortran ABI interop routines) to simplify the inter-language ABI issue to one of passing integers across the boundaries
  • Use MPI Sessions from the MPI 4.0 standard

I had briefly looked into the c2f/f2c approach when I started down this path. I should have documented that at the time, and my contemporary reason for rejecting it: it didn't look like mpi4py had corresponding bindings. That was wrong, because I searched poorly. Every MPI object type represented in mpi4py has methods f2py and py2f that are equivalents. Given that, the framework can uniformly create communicators as it pleases, call MPI_Comm_c2f on them to get an integer handle to them, and set a scalar variable on BMI models with the handle. All of C, C++, Fortran, and Python code can then convert that to a language-native MPI communicator object. This seems strictly superior to asking BMI models to fiddle with their own groups and related communicator creation, and especially to passing lists of processes as an array variable.
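A rough sketch of that hand-off, assuming the handle is exposed through a scalar integer variable (the variable name "bmi_mpi_comm_handle" and the use of the C BMI struct's set_value are illustrative, not settled conventions):

```c
/* Hypothetical integer-handle hand-off between the framework and a model. */
#include <mpi.h>
#include "bmi.h"   /* assumes the standard C BMI struct definition */

/* Framework side: convert the communicator it created into a portable
 * integer handle and push it to the model as a scalar value. */
static void give_comm_to_model(Bmi *model, MPI_Comm model_comm)
{
    int handle = (int) MPI_Comm_c2f(model_comm);
    model->set_value(model, "bmi_mpi_comm_handle", &handle);
}

/* Model side (C/C++): recover a native communicator from the handle.
 * Fortran code can use the integer directly; Python can use
 * mpi4py's MPI.Comm.f2py(handle). */
static MPI_Comm comm_from_handle(int handle)
{
    return MPI_Comm_f2c((MPI_Fint) handle);
}
```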

MPI Sessions might allow for a broader range of sophisticated behaviors among different models. They're standardized in MPI 4.0, released in June 2021. They're supported in OpenMPI, MPICH, and Intel MPI, but seemingly not in Cray MPI, which may be the baseline on our operational target of WCOSS2.
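For reference, a Sessions-based construction (MPI 4.0) looks roughly like the following sketch; "mpi://WORLD" is one of the predefined process sets, and the string tag is arbitrary here but must match across the participating processes:

```c
/* Rough sketch of creating a communicator from an MPI Session (MPI 4.0).
 * The string tag "ngen.model.example" is illustrative only. */
#include <mpi.h>

MPI_Comm comm_from_session(void)
{
    MPI_Session session;
    MPI_Group group;
    MPI_Comm comm;

    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    MPI_Comm_create_from_group(group, "ngen.model.example", MPI_INFO_NULL,
                               MPI_ERRORS_RETURN, &comm);
    MPI_Group_free(&group);
    /* The session itself would be finalized when the model shuts down:
     * MPI_Session_finalize(&session); */
    return comm;
}
```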

I'll do some looking to see whether an approach based on Sessions is worth pursuing, but the integer handle approach seems like the expedient course of action. I'll amend this PR's contents accordingly.

@hrajagers

Hi @PhilMiller, thank you for the update.

Good to hear that you think that the c2f/f2c approach will be portable to a wider set of languages. It's what we now use in-house between C and Fortran, but I hadn't investigated it enough to determine whether it would be portable to all platforms.

I'm new to MPI Sessions. The description in Holmes et al. (2016) is rather concise. I'm wondering how the local MPI_Session_init would know where to instantiate or group the MPI threads, but that's of less importance right now. My main concern with MPI Sessions that are created within a component and not visible to the outside is that they would make all communication between components sequential, i.e. single-threaded. This may be good enough (or even preferable) for some use cases, but it would limit the scalability of BMI communication towards bigger computations in which two parallel models try to communicate. If we move the MPI Session creation outside the component, then I'm not sure that it differs significantly from the other method.

@donaldwj (Contributor) left a comment

The description here is concise and easy to understand.

@PhilMiller (Contributor Author)

There's an implied constraint in this design that any gather operations of global fields would have to happen in update.

From discussion in today's stand-up, Donald noted that this constraint means that models may run into resource limits if they proactively gather all such fields. The alternative for the time being is that such fields will be enumerated/selected in the runtime configuration. We don't immediately have any models for which that's an issue, so we're not going to block on it.

@PhilMiller (Contributor Author)

Following discussion with folks working on a parallel BMI extension, we've concluded that the path forward for ngen is to use what's described here solely as a temporary measure in SCHISM to allow development progress, and plan to adopt the community-consensus extension design as it solidifies. So, protocol documentation will be in code comments, whose lifetime will line up exactly with actual use of the functionality - when we move away from it the comments can get deleted with the code.

@PhilMiller closed this Mar 8, 2024