MPI-PR: quadratic scaling of the number of open file descriptors with respect to the number of processes per node #257
Comments
We definitely need to hold on to shared memory. The alternative would be to run everything through the progress rank, which would hit performance, probably significantly. Have you tried increasing the number of progress ranks? There is a good chance this would be equivalent to creating virtual nodes. If it isn't, then we could probably fix it up so that it is. The variable to set this is GA_NUM_PROGRESS_RANKS_PER_NODE.
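A minimal sketch of where that variable comes into play, assuming (my assumption, not stated in this thread) that ComEx reads it from the environment during GA initialization. Normally you would simply export GA_NUM_PROGRESS_RANKS_PER_NODE in the job script; the setenv() call below is only to make the point that it has to be visible to every rank before GA_Initialize():

```c
/* Sketch only: run with more progress ranks per node.
 * Assumption: ComEx reads GA_NUM_PROGRESS_RANKS_PER_NODE from the
 * environment when GA_Initialize() sets things up; in practice you would
 * export the variable in the job script instead of calling setenv(). */
#include <stdlib.h>
#include <mpi.h>
#include "ga.h"

int main(int argc, char **argv)
{
    setenv("GA_NUM_PROGRESS_RANKS_PER_NODE", "2", 1); /* 2 subgroups per node */

    MPI_Init(&argc, &argv);
    GA_Initialize();

    /* ... application code using Global Arrays ... */

    GA_Terminate();
    MPI_Finalize();
    return 0;
}
```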
It's possible to allocate one SHM slab per GA. MPI RMA does this under the hood.
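For comparison, here is a self-contained sketch (standard MPI-3 calls, not comex code) of the one-segment-per-allocation idea: each MPI_Win_allocate_shared call creates a single shared slab on the node, and peers get direct pointers into it via MPI_Win_shared_query, so in principle the per-node file count per allocation need not grow quadratically with the ranks per node:

```c
/* Sketch: one shared-memory segment per allocation using MPI-3 windows.
 * Standard MPI only; this is not the comex mpi-pr implementation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Communicator containing the ranks that share this node. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);

    int me, np;
    MPI_Comm_rank(node, &me);
    MPI_Comm_size(node, &np);

    /* One shared slab for this "allocation": each rank contributes 1 MiB. */
    MPI_Aint bytes = 1 << 20;
    double *mine;
    MPI_Win win;
    MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL, node,
                            &mine, &win);

    /* Direct load/store access to a neighbour's part of the same slab. */
    MPI_Aint peer_bytes;
    int disp_unit;
    double *peer;
    MPI_Win_shared_query(win, (me + 1) % np, &peer_bytes, &disp_unit, &peer);
    printf("rank %d maps neighbour's buffer at %p\n", me, (void *)peer);

    MPI_Win_free(&win);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}
```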
Okay, I'll take a look. How many progress ranks per node were you using? My guess is that if you double the number of progress ranks per node, it should be possible to decrease the number of file descriptors by a factor of 4.
Now that I am setting it correctly, here are the results of a single-node run with 128 processes:
So what happens when you increase the number of progress ranks?
You are correct. Each process creates its own mapped file but only sees the other processes in its own subgroup, so I guess a linear decrease is what you would expect. Is this good enough? The only other possibility would be to have one process do a single large allocation and then divide that up among all the other processes. That would probably take some significant effort. Also, it looks like the code opens a shared memory segment (which creates a file descriptor), gets the pointer using mmap, and then closes the file descriptor, so the descriptors are not hanging around for any great length of time. Do your numbers reflect the total number of file descriptors created or the maximum open at any one time?
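For readers unfamiliar with the pattern being described, here is a minimal POSIX sketch of that open/mmap/close sequence. It is not the comex mpi-pr source; the helper name and arguments are made up, and error handling is minimal:

```c
/* Sketch of the shm_open -> mmap -> close idiom described above.
 * Hypothetical helper, not taken from comex. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Attach to an existing named shared segment of a known size. */
static void *attach_segment(const char *name, size_t bytes)
{
    int fd = shm_open(name, O_RDWR, 0600);   /* this opens a file descriptor */
    if (fd < 0)
        return NULL;

    void *ptr = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                               /* fd is not needed once mapped */

    return (ptr == MAP_FAILED) ? NULL : ptr;
}
```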
This is probably good enough for the time being. On a related topic, I still don't understand how the comex mpi-pr code decides the size and number of allocations. I have edited the part of the NWChem code I am using for this experiment by drastically reducing the number of allocations.
The number of file descriptors I am quoting is the output of
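The command itself is cut off above, so as a stand-in, here is one way one might count a process's open descriptors on Linux (an assumed helper, not necessarily the method used for the numbers quoted):

```c
/* Count the open file descriptors of the calling process by listing
 * /proc/self/fd (Linux-specific). Assumed helper, not from the thread. */
#include <dirent.h>
#include <stdio.h>

static int count_open_fds(void)
{
    DIR *d = opendir("/proc/self/fd");
    if (!d)
        return -1;

    int n = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (e->d_name[0] != '.')   /* skip "." and ".." */
            n++;
    closedir(d);

    return n - 1;                  /* discount the fd opendir() itself holds */
}

int main(void)
{
    printf("open file descriptors: %d\n", count_open_fds());
    return 0;
}
```

Note that a snapshot like this counts descriptors open at the moment of the call; if comex closes each descriptor right after mmap, the snapshot can look much smaller than the total number of descriptors ever created.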
Since every shared memory allocation in MPI-PR opens a memory-mapped file for each rank, and each rank needs to know the file descriptors of all the ranks on the same node ... you end up with (procs per node)^2 file descriptors opened for every shared memory allocation.
Since 128-core hardware is becoming commonplace and 128*128=16K, we have already seen reports of Global Arrays runs that required increasing the kernel limit
/proc/sys/fs/file-max
to values O(10^6)-O(10^7).
https://groups.google.com/g/nwchem-forum/c/Q-qvcHP9vP4
nwchemgit/nwchem#338
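For reference, writing P for the number of processes per node and G for the number of progress-rank subgroups per node (my notation, not from the original report), the scaling discussed above is roughly

$$
\text{descriptors per shared-memory allocation} \;\approx\; G\left(\frac{P}{G}\right)^{2} = \frac{P^{2}}{G},
\qquad P = 128,\; G = 1 \;\Rightarrow\; 16384,
$$

so adding progress ranks per node reduces the count linearly in 1/G but does not remove the quadratic dependence on P.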
Can we try to address this from the GA side?
Possible solutions that come to mind (I have no idea about their feasibility):