
Document MPI communicator setup protocol #712

Closed

Conversation

@PhilMiller (Contributor) commented Jan 19, 2024

Describes how we want coastal models and eventually any other MPI-parallelized BMI modules to be informed of the ranks of MPI processes available for their use.

Changes

  • Add description to BMI conventions doc

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows project standards (link if applicable)
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

@PhilMiller (Contributor Author)

I recommend viewing the diff in this PR in the "rich diff" mode, so that formatting is rendered.

@PhilMiller (Contributor Author) commented Jan 19, 2024

This is a revision of my earlier idea about passing in arguments for a call to MPI_Comm_split. The upshot of the revision is that the framework no longer needs to ensure any corresponding code runs on non-participating processes to satisfy a collective over MPI_COMM_WORLD - only processes on which a given model instance will exist need to be called.
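To make the distinction concrete, here is a rough sketch (illustrative names only, not the text of the conventions doc) of how a model-private communicator can be built from a list of world ranks supplied by the framework. `MPI_Comm_create_group` is collective only over the listed group, whereas `MPI_Comm_split` is collective over its parent communicator:

```c
/* Hypothetical sketch: building a model-private communicator from a list of
 * world ranks supplied by the framework. Only the listed ranks need to make
 * these calls; MPI_Comm_create_group is collective over `model_group` alone,
 * whereas MPI_Comm_split would be collective over all of MPI_COMM_WORLD. */
#include <mpi.h>

MPI_Comm make_model_comm(const int *ranks, int n_ranks, int tag)
{
    MPI_Group world_group, model_group;
    MPI_Comm model_comm;

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, n_ranks, ranks, &model_group);
    MPI_Comm_create_group(MPI_COMM_WORLD, model_group, tag, &model_comm);

    MPI_Group_free(&world_group);
    MPI_Group_free(&model_group);
    return model_comm;
}
```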

@PhilMiller (Contributor Author)

There are a few distinct concerns I can think of to be reviewed here:

  • Does this write-up clearly describe what we're (prospectively) doing and why?
  • Is what's described here sensible and desirable?
  • Is the implementation it entails reasonable?

@PhilMiller (Contributor Author)

@josephzhang8 You may be interested in this too - this describes how I'm planning to feed the setup for a private MPI communicator into SCHISM via the BMI interface that Jason has developed.

@josephzhang8

Thx

@PhilMiller (Contributor Author)

@peckhams I forget whether we discussed my plans around passing through arguments to MPI_Comm_split at AGU, or whether the idea hadn't occurred to me yet. This PR describes a refinement of that idea, and how I intend to get MPI-parallelized model code working within the NextGen Framework in the absence of any better ideas or movement on the standardization front.

@jduckerOWP

@hrajagers (Bert Jagers) from Deltares has also been working on BMI MPI parallelization and is interested in how the NextGen project has approached this particular problem. This describes in theory how @PhilMiller plans to feed the setup for a private MPI communicator into NextGen coastal models (SCHISM/DFlowFM).

@PhilMiller (Contributor Author) commented Jan 24, 2024

If we solely wanted to work through this at an ABI level, and have the framework do all the MPI_Comm juggling itself, we could alternatively use the MPI_Comm_c2f family of routines and pass integers (MPI_Fint) through BMI methods. We'd still have the weird lifecycle issues to deal with. It's also not applicable to Python, since mpi4py doesn't expose those routines. [That turned out to be wrong - mpi4py has py2f and f2py.]

A Contributor left a review comment on this excerpt from the proposed conventions text:

"For MPI-parallel BMI models, we modify the model lifecycle relative to models that will run within a single process. Specifically, between instantiation of the model instance and the call to `initialize`, the framework will make a single collective call to `set_value_at_indices` ..."

It would be good to note here that, as long as the implementation follows the guidance below on only establishing the MPI_Group and MPI_Comm, this is a safe call outside of the traditional Initialize -> do stuff -> Finalize cycle, since the NextGen registration establishes the BMI model structure.

It may also be a good idea to document the expected semantics of multiple calls to set_value_at_indices with a name=bmi_mpi_rank_assignment variable. If the assignment is allowed to be dynamic, then a quick note on that and what it means would help. If it shouldn't be called twice, it might be good to include some boilerplate logic to check for and skip re-initializing the MPI group/comm, as in the sketch below.
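A minimal sketch of such a guard, with hypothetical struct, field, and function names (the actual model code may differ):

```c
/* Hypothetical guard against repeated rank-assignment calls. The struct,
 * field, and variable names here are illustrative only. */
#include <string.h>
#include <mpi.h>

#define BMI_SUCCESS 0   /* matches the BMI C convention of 0 for success */

typedef struct {
    MPI_Comm mpi_comm;  /* starts as MPI_COMM_NULL before any assignment */
    /* ... other model state ... */
} ModelState;

int handle_rank_assignment(ModelState *self, const char *name,
                           const int *ranks, int n_ranks)
{
    if (strcmp(name, "bmi_mpi_rank_assignment") != 0)
        return BMI_SUCCESS;        /* not our variable; nothing to do */

    if (self->mpi_comm != MPI_COMM_NULL)
        return BMI_SUCCESS;        /* already initialized; skip re-init */

    /* ... build the MPI_Group / MPI_Comm from ranks[0..n_ranks-1] here ... */
    (void) ranks; (void) n_ranks;
    return BMI_SUCCESS;
}
```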

@hrajagers

Thank you @jduckerOWP for bringing this thread to my attention. I have had two meetings on the topic of a BMI extension for parallel computing with the CSDMS group and NCAR developers, and the draft approach that we're following currently is documented at https://github.com/csdms/bmi/tree/hrajagers/parallel-bmi or more precisely in the bmi.****.rst files in https://github.com/csdms/bmi/tree/hrajagers/parallel-bmi/docs/source. I see two fundamental differences:

  1. In your approach the BMI component generates its communicator from MPI_COMM_WORLD by means of an MPI_Comm_create_group call. In our approach the model_group communicator is assumed to be created by the calling component/framework and then passed down to the component. I like the fact that your solution avoids passing communicators across language interfaces ... we haven't run into issues on this topic yet, but there are some concerns about whether it's portable to all environments. It would be nice to perform some basic tests across the Python, C and Fortran languages with both approaches.

  2. You assume that all calls except for initialize, finalize, update and update_until can be called independently per rank, i.e. no MPI communication is allowed within the model for such calls. On the other hand, we wrote in our proposal that all calls must be performed across all ranks, since communication might be needed within the model component. It would be great for the calling side if it could rely on the fact that no MPI communication is needed, but I'm not sure that we're ready yet to make that conclusion. Especially when getting a list of input/output quantities, we have been discussing whether this should be one consistent list across all ranks, or whether the list may differ per rank. We have opted for the former ...

In our proposal we have also defined how the various other calls should behave in a parallel environment: most calls, such as get and set, work only on the local data, and that must be the same since you don't allow inter-rank communication in such calls. However, the grid_funcs that are used to determine grid dimensions and grid partitioning across the ranks needed some special attention.

@hellkite500 (Contributor)

Thanks for sharing @hrajagers! It is quite interesting to see a distributed BMI from the perspective that the caller (framework) manages the entire collective (e.g. stitching together all local data). An interesting question to ask based on that idea is "How does a framework mix and match MPI-compatible components and non-MPI components?" It seems that the ABI would require the complete set of functions from both. We have been thinking about this problem through the lens of "what's the least invasive, minimal requirement" which would allow multiple MPI components to co-exist without interfering with each other.

no MPI communication is allowed within the model for such calls

I don't think you are reading this correctly: we are not putting a blanket restriction on MPI communication in these functions; we are restricting a specific type, collective communication, from being used in those functions.

One thing that we must maintain for our work is ABI compatibility across multiple language/runtime interfaces, along with the ability to run both MPI and non-MPI components in the same runtime. This has led us to try some possibly "unorthodox" methods of using the BMI interface.

@hrajagers commented Jan 26, 2024

Hi @hellkite500, I realize that you're trying to work within the scope of the existing BMI 2.0 API, whereas our design was based on the assumption that this would be a BMI extension ... where it's still an open discussion how to recognize and identify BMI extensions (csdms/bmi#138). I realize that the Fortran API based on an abstract BMI type with deferred procedures does force you into a direction that requires you to implement all procedures defined. Maybe the extension should then be defined as a second type, but I can see potential issues with that as well.

I understand that you're wrapping this Fortran BMI type layer with a C interoperability layer in NextGen. We're doing something similar ourselves where we're moving the effective interface layer to the interface of a shared library. The possibilities for extensions may change somewhat if we look at that type of interface.

My interpretation was that only initialize, finalize, update and update_until would be called collectively, and that the other routines may or may not be called on all ranks. If the called component doesn't know whether the other MPI ranks are called ... it cannot initiate MPI communication with any other rank without risking hanging the program. Without a synchronous BMI call to the MPI rank with which it wants to communicate, there is no activity thread in the component on that rank to "answer" the MPI communication.

@PhilMiller (Contributor Author)

I need to answer the calling sequence stuff, but I'm not focused on that at the moment.

I'm making a note that there are at least two more reasonable non-extension approaches besides a protocol like what I described in the initial PR:

  • Use the MPI_Foo_{c2f,f2c} interfaces (C/Fortran ABI interop routines) to simplify the inter-language ABI issue to one of passing integers across the boundaries
  • Use MPI Sessions from the MPI 4.0 standard

I had briefly looked into the c2f/f2c approach when I started down this path. I should have documented that at the time, and my contemporary reason for rejecting it: it didn't look like mpi4py had corresponding bindings. That was wrong, because I searched poorly. Every MPI object type represented in mpi4py has methods f2py and py2f that are equivalents. Given that, the framework can uniformly create communicators as it pleases, call MPI_Comm_c2f on them to get an integer handle to them, and set a scalar variable on BMI models with the handle. All of C, C++, Fortran, and Python code can then convert that to a language-native MPI communicator object. This seems strictly superior to asking BMI models to fiddle with their own groups and related communicator creation, and especially to passing lists of processes as an array variable.
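A rough sketch of that hand-off, assuming the handle is exposed through a scalar integer variable (the variable name "bmi_mpi_comm_handle" and the use of the C BMI struct's set_value are illustrative, not settled conventions):

```c
/* Hypothetical integer-handle hand-off between the framework and a model. */
#include <mpi.h>
#include "bmi.h"   /* assumes the standard C BMI struct definition */

/* Framework side: convert the communicator it created into a portable
 * integer handle and push it to the model as a scalar value. */
static void give_comm_to_model(Bmi *model, MPI_Comm model_comm)
{
    int handle = (int) MPI_Comm_c2f(model_comm);
    model->set_value(model, "bmi_mpi_comm_handle", &handle);
}

/* Model side (C/C++): recover a native communicator from the handle.
 * Fortran code can use the integer directly; Python can use
 * mpi4py's MPI.Comm.f2py(handle). */
static MPI_Comm comm_from_handle(int handle)
{
    return MPI_Comm_f2c((MPI_Fint) handle);
}
```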

MPI Sessions might allow for a broader range of sophisticated behaviors among different models. They're standardized in MPI 4.0, released in June 2021. They're supported in OpenMPI, MPICH, and Intel MPI, but seemingly not in Cray MPI, which may be the baseline on our operational target of WCOSS2.
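For reference, a Sessions-based construction (MPI 4.0) looks roughly like the following sketch; "mpi://WORLD" is one of the predefined process sets, and the string tag is arbitrary here but must match across the participating processes:

```c
/* Rough sketch of creating a communicator from an MPI Session (MPI 4.0).
 * The string tag "ngen.model.example" is illustrative only. */
#include <mpi.h>

MPI_Comm comm_from_session(void)
{
    MPI_Session session;
    MPI_Group group;
    MPI_Comm comm;

    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    MPI_Comm_create_from_group(group, "ngen.model.example", MPI_INFO_NULL,
                               MPI_ERRORS_RETURN, &comm);
    MPI_Group_free(&group);
    /* The session itself would be finalized when the model shuts down:
     * MPI_Session_finalize(&session); */
    return comm;
}
```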

I'll do some looking to see whether an approach based on Sessions is worth pursuing, but the integer handle approach seems like the expedient course of action. I'll amend this PR's contents accordingly.

@hrajagers

Hi @PhilMiller, thank you for the update.

Good to hear that you think that the c2f/f2c approach will be portable to a wider set of languages. It's what we now use in-house between C and Fortran, but I hadn't investigated it enough to determine whether it would be portable to all platforms.

I'm new to MPI Sessions. The description in Holmes et al. (2016) is rather concise. I'm wondering how the local MPI_Session_init would know where to instantiate or group the MPI threads, but that's of less importance right now. My main concern with MPI Sessions that are created within a component and not visible to the outside is that they would make all communication between components sequential, i.e. single-threaded. This may be good enough (or even preferable) for some use cases, but it would limit the scalability of BMI communication towards bigger computations in which two parallel models try to communicate. If we move the MPI Session creation outside the component, then I'm not sure that it differs significantly from the other method.

@donaldwj (Contributor) left a comment

The description here is concise and easy to understand.

@PhilMiller (Contributor Author)

There's an implied constraint in this design that any gather operations of global fields would have to happen in update.

From discussion in today's stand-up, Donald noted that this constraint means that models may run into resource limits if they proactively gather all such fields. The alternative for the time being is that such fields will be enumerated/selected in the runtime configuration. We don't immediately have any models for which that's an issue, so we're not going to block on it.

@PhilMiller (Contributor Author)

Following discussion with folks working on a parallel BMI extension, we've concluded that the path forward for ngen is to use what's described here solely as a temporary measure in SCHISM to allow development progress, and plan to adopt the community-consensus extension design as it solidifies. So, protocol documentation will be in code comments, whose lifetime will line up exactly with actual use of the functionality - when we move away from it the comments can get deleted with the code.

@PhilMiller closed this Mar 8, 2024