Scaling issue with UltraNest #159
Are the two nodes identical (in terms of CPU speed)? I guess there is some MPI communication overhead to consider, especially since you are not using vectorization features.
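For reference, a minimal sketch of what the vectorization feature looks like; the toy likelihood and parameter names are made up, and the relevant switch is the `vectorized=True` keyword of `ReactiveNestedSampler`:

```python
import numpy as np
import ultranest

param_names = ["a", "b", "c"]  # illustrative toy problem

def prior_transform(cube):
    # with vectorized=True, cube arrives with shape (batch_size, ndim)
    return cube * 20 - 10

def loglike(theta):
    # toy Gaussian likelihood, evaluated for a whole batch of points at once
    return -0.5 * np.sum(theta**2, axis=1)

sampler = ultranest.ReactiveNestedSampler(
    param_names, loglike, transform=prior_transform, vectorized=True)
result = sampler.run()
```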
Yes, exactly the same. They are simply two nodes from the HPC cluster. I asked because this behavior is quite unusual: cross-node parallelization significantly increases the computing time, jumping from 14 minutes to over 120 minutes. It seems as if the code were effectively using a single thread when attempting cross-node parallelization. Based on your experience, have you ever tried parallelizing UltraNest across multiple nodes? If so, how was the scaling performance? I'll take a look at the SimpleSliceSampler class. The issue on my end is that the real problem I'm trying to estimate with UltraNest cannot be vectorized, due to its dependency on an upstream package. This means I need a solution that can be parallelized across multiple nodes without requiring vectorization.
To be honest, I have not worked across nodes much. Maybe you could also have a look at the CPU usage; if it is not 100%, MPI communication overhead may be the issue. I am also not sure how much control the mpi4py interface offers for handling this. Another thing, regarding your original setup: each MPI process runs its own independent step sampler, and when one yields a result, the nested sampler progresses and advances to the next likelihood threshold Lmin. Then all step samplers have to update, potentially walking back if they have stepped outside the L>Lmin region (see here). This happens often if you have a low number of likelihood evaluations until success and a high number of parallel MPI processes. Maybe this is part of the issue. If you had a larger number of steps, this inefficiency would occur less frequently.
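For what it's worth, a minimal sketch of dialling up the number of slice-sampling steps; the `4 * ndim` value is purely illustrative, not a recommendation from this thread:

```python
import ultranest.stepsampler

ndim = 22  # dimensionality of the toy problem in this issue

# More steps per accepted point make each walk-back (after Lmin advances)
# relatively cheaper, at the cost of more likelihood calls per point.
stepsampler = ultranest.stepsampler.SliceSampler(
    nsteps=4 * ndim,  # illustrative value
    generate_direction=ultranest.stepsampler.generate_mixture_random_direction,
)
# then attach it: sampler.stepsampler = stepsampler
```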
I think what you could also do is to run ultranest two times, once on each node, with half as many live points. Then take the two points.hdf5 files, merge them, and read the merged file back in with the […]. Merging hdf5 files has not been implemented, but you can concatenate the table inside (there is only one, have a look with h5ls) and sort it by Lmin. I would recommend running with […].
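A rough sketch of such a merge with h5py; the file paths are hypothetical, and it assumes the single dataset is named `points` with Lmin in the first column (check both with h5ls/h5dump before relying on this):

```python
import h5py
import numpy as np

# hypothetical paths for the two independent runs
files = ["run_node1/points.hdf5", "run_node2/points.hdf5"]

tables = []
for filename in files:
    with h5py.File(filename, "r") as f:
        # assumption: the single dataset is called "points"; verify with h5ls
        tables.append(f["points"][:])

merged = np.vstack(tables)
# assumption: Lmin is stored in the first column; sort the merged table by it
merged = merged[np.argsort(merged[:, 0])]

with h5py.File("points_merged.hdf5", "w") as fout:
    fout.create_dataset("points", data=merged)
    # if UltraNest needs dataset attributes to read this back,
    # copy them over from one of the input files as well
```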
Also, have a look at the number of nested sampling iterations.
Are these plots based on all the means or on the best fits? It would be better to look at each parameter's mean and stdev and see whether they agree with each other or not.
You would have to increase the number of live points as well to achieve that effect.
Another potential problem is that this setup runs N step samplers, one on each core, and the first one that finishes M steps is always accepted as the new independent point. A single slice sampling step (of M), however, may take a fairly random number of evaluations because of the stepping-out procedure. If the quickest stepping-out sampler is always put to the front, this could introduce a bias, and the more cores there are, the more this reordering could become an issue. The (vectorized, not MPI-aware) population step sampler https://johannesbuchner.github.io/UltraNest/ultranest.html#ultranest.popstepsampler.PopulationSliceSampler keeps track of the generation each step sampler is in and follows a FIFO model. It is not trivial to implement, because if the likelihood threshold is raised, some step samplers have to be reverted, which further diversifies how long it takes until they finish. I used a ring buffer to solve it: each element is advanced (if it has not done M steps yet), and a pointer to the next element to remove is maintained, so the order cannot change. In the MPI case, we would have to keep track of the MPI rank to pick from and wait for it, and all the step samplers that have already finished M steps would get to pause.
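For reference, a sketch of configuring that population step sampler; `popsize` and `nsteps` are illustrative values, and it assumes the direction helper `generate_cube_oriented_direction` from `ultranest.popstepsampler`:

```python
from ultranest.popstepsampler import (
    PopulationSliceSampler, generate_cube_oriented_direction)

ndim = 22

# popsize chains advance in lockstep; a ring buffer enforces FIFO order,
# so the fastest-finishing chain cannot jump the queue and bias the results.
stepsampler = PopulationSliceSampler(
    popsize=64,              # illustrative
    nsteps=2 * ndim,         # illustrative
    generate_direction=generate_cube_oriented_direction,
)
# requires a vectorized likelihood: ReactiveNestedSampler(..., vectorized=True)
# then attach it: sampler.stepsampler = stepsampler
```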
The 384-core case required 24,841 iterations to converge, the 128-core case required 25,114 iterations, and the 64-core case required 26,180 iterations. It seems like higher core counts may increase the likelihood of premature stopping, potentially making it harder to fully capture the data's complexity?
These plots are based on all the means. And here is the apples-to-apples comparison for each of the parameters:
I see. Based on your explanation, the most plausible reason for the variation is that a higher core count might introduce biased sampling in a few preferred (faster) regions. This could result in UltraNest favoring a subset of the parameter space, potentially contributing to the observed challenges. Did I understand that correctly?
I see how this approach could address the bias issue effectively. My concern is whether pausing and waiting for all step samplers to finish might significantly increase runtime, especially when parallelizing across multiple nodes with hundreds of threads. Could this impact scalability?
Yes, that's a problem; the CPU usage would not be 100%. One does not have to wait for all step samplers, though, but only for the rank = i % P'th step sampler, where i is the nested sampling iteration and P is the number of MPI processes. Oversubscribing could help a bit. Perhaps a better design would be to run the step samplers continuously and not revert them, but instead insert their results into the NS run at the point where they were launched (Lmin->Lnew). This would vary the number of live points quite heavily. In that case, integrator.py/_create_point would need to be modified to collect the step sampler results but not advance the NS iteration until all step samplers that were launched at the current Lmin have returned. This can be slightly more refined: one does not have to wait for all step samplers launched at the current Lmin, only for the first one in submission order. Also, the other step samplers that were launched at a lower Lmin can be carried over to the next iteration if their current L is above the next Lmin. This would also avoid waiting and burstiness in the number of live points.
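Purely as a conceptual sketch of the rank = i % P idea with mpi4py (this is not UltraNest's integrator code; the step sampler below is a stand-in, and a real implementation would only block on the scheduled rank rather than gathering from everyone):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, P = comm.Get_rank(), comm.Get_size()
rng = np.random.default_rng(rank)

def run_step_sampler(Lmin):
    # stand-in for walking a step sampler M steps within the L > Lmin region
    theta = rng.normal(size=22)
    return theta, float(-0.5 * theta @ theta)

Lmin = -np.inf
for i in range(10):  # nested sampling iterations (illustrative)
    theta, L = run_step_sampler(Lmin)
    # every rank proposes, but the accepted point always comes from
    # rank i % P, i.e. FIFO submission order, so speed cannot reorder ranks
    proposals = comm.gather((theta, L), root=0)
    if rank == 0:
        theta_new, L_new = proposals[i % P]
        Lmin = L_new  # advance the likelihood threshold
    Lmin = comm.bcast(Lmin, root=0)
```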
Description
I'm observing a significant slowdown when parallelizing UltraNest across multiple nodes. I conducted a scaling study using a toy problem with UltraNest. While I observed speedup for intra-node parallelization, performance dropped notably when using more than one node. For example, with two nodes (256 cores), UltraNest took > 1.5h to complete, compared to just 14 minutes when using a single node (128 cores). I wonder if this is due to an error in my parallelization implementation or an internal issue within UltraNest.
What I Did
On the cluster, I have this PBS script to schedule the job for 2-node parallelization:
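A sketch of what such a 2-node PBS/mpiexec submission could look like; the resource requests, queue, module names, and paths are placeholders, not the actual ones used here:

```bash
#!/bin/bash
#PBS -N un_toy
#PBS -l select=2:ncpus=128:mpiprocs=128   # placeholder resource request
#PBS -l walltime=02:00:00
#PBS -j oe

cd "$PBS_O_WORKDIR"
module load openmpi                        # placeholder module name

# one MPI rank per core across both nodes (2 x 128 = 256 ranks)
mpiexec -n 256 python3 UN_toy.py
```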
The UN_toy.py script contains a simple sinusoidal regression, similar to the example demonstrated at https://johannesbuchner.github.io/UltraNest/example-sine-highd.html, but extended to a 22-dimensional problem. This is the actual code file:
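A hedged sketch of what a 22-dimensional sine-regression setup in the spirit of the linked example could look like; the data, priors, and sampler settings below are made up and are not the ones from UN_toy.py:

```python
import numpy as np
import ultranest
import ultranest.stepsampler

# Synthetic data: a single sine plus noise (made-up values).
rng = np.random.default_rng(1)
n_components = 7                      # 1 + 3 * 7 = 22 parameters
t = np.linspace(0, 5, 100)
y_err = 0.1
y_obs = np.sin(2 * np.pi * t) + rng.normal(0, y_err, t.size)

param_names = ["B"] + [
    "%s%d" % (name, i)
    for i in range(1, n_components + 1)
    for name in ("A", "P", "t0")
]

def prior_transform(cube):
    params = np.empty_like(cube)
    params[0] = cube[0] * 2 - 1                      # baseline B in [-1, 1]
    for i in range(n_components):
        j = 1 + 3 * i
        params[j] = 10 ** (cube[j] * 2 - 1)          # amplitude, log-uniform
        params[j + 1] = 10 ** (cube[j + 1] * 2 - 1)  # period, log-uniform
        params[j + 2] = cube[j + 2]                  # phase fraction in [0, 1]
    return params

def model(params):
    y = np.full_like(t, params[0])
    for i in range(n_components):
        A, P, t0 = params[1 + 3 * i: 4 + 3 * i]
        y = y + A * np.sin(2 * np.pi * (t / P + t0))
    return y

def loglike(params):
    return -0.5 * np.sum(((y_obs - model(params)) / y_err) ** 2)

# When launched under mpiexec with mpi4py installed, the live points are
# shared across all MPI ranks automatically.
sampler = ultranest.ReactiveNestedSampler(param_names, loglike, transform=prior_transform)
sampler.stepsampler = ultranest.stepsampler.SliceSampler(
    nsteps=2 * len(param_names),
    generate_direction=ultranest.stepsampler.generate_mixture_random_direction,
)
result = sampler.run()
sampler.print_results()
```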