Scaling issue with UltraNest #159

Open
guannant opened this issue Jan 7, 2025 · 10 comments

@guannant

guannant commented Jan 7, 2025

  • UltraNest version: 4.4.0
  • Python version: 3.12
  • Operating System: Red Hat Enterprise Linux 8.10 (Ootpa)

Description

I'm observing a significant slowdown when parallelizing UltraNest across multiple nodes. In a scaling study on a toy problem, I saw speedups for intra-node parallelization, but performance dropped notably when using more than one node. For example, with two nodes (256 cores), UltraNest took more than 1.5 hours to complete, compared to just 14 minutes on a single node (128 cores). I wonder whether this is due to an error in my parallelization setup or an internal issue within UltraNest.

What I Did

On the cluster, I use this PBS script to schedule the job for two-node parallelization:

#PBS -l select=2:mpiprocs=256
#PBS -l walltime=20:00:00
.......
module load gcc/13.2.0
module load mpich/4.2.0-gcc-13.2.0
module load anaconda3/2024.02
conda activate espei_un
mpiexec -np 256 python UN_toy.py

The UN_toy.py script contains a simple sinusoidal regression, similar to the example demonstrated at https://johannesbuchner.github.io/UltraNest/example-sine-highd.html, but extended to a 22-dimensional problem.

This is the actual code file:

import numpy as np
import scipy.stats
from numpy import sin, pi
import ultranest
import ultranest.stepsampler
import matplotlib.pyplot as plt

def multi_sin(x, params):
    # sum of n sine components (amplitude A, period B, phase C) plus a constant offset
    total = 0
    n = len(params) // 3
    for i in range(n):
        A, B, C = params[i*3:i*3+3]
        total += A * sin(2*pi * (x/B + C))
    total += params[-1]
    return total

def transform_to_prior(cube):
    # map unit-cube samples to the Gaussian priors via their inverse CDFs
    params = cube.copy()
    for i in range(len(params)):
        params[i] = norm_list[i].ppf(params[i])
    return params

def log_likelihood(params):
    # Gaussian log-likelihood (up to a constant), assuming unit noise
    y_guess = multi_sin(t, params)
    return -0.5 * np.sum((y - y_guess)**2)


paramnames = ['param1', 'param2', 'param3','param4','param5','param6','param7','param8','param9','param10','param11',
              'param12', 'param13', 'param14', 'param15', 'param16', 'param17', 'param18', 'param19', 'param20', 'param21',
              'param22']

rng = np.random.default_rng(seed=42)


initial_value = [4.2, 3, 0,      # (A, B, C) for each of the seven sine components
                 1.2, 1.2, 1.2,
                 0.9, 1.8, 1.1,
                 1.4, 1.5, 1.6,
                 2.5, 1.2, 1.5,
                 3.4, 1.8, 1.4,
                 4.2, 1.5, 1.6]
initial_value.append(1.0)        # constant offset, the 22nd parameter

norm_list = []
t = rng.uniform(0, 5, size=60)                      # 60 sample locations
y = rng.normal(multi_sin(t, initial_value), 1)      # noisy synthetic data (unit noise)

for item in initial_value:
    # independent unit-width Gaussian prior centred on each true value
    norm_list.append(scipy.stats.norm(loc=item, scale=1))

sampler = ultranest.ReactiveNestedSampler(
    paramnames,
    log_likelihood,
    transform_to_prior,
    log_dir="scalibility_UN_toy_22para_step44_2node",
)

nsteps = 44  # 2 * len(paramnames)
sampler.stepsampler = ultranest.stepsampler.SliceSampler(
    nsteps=nsteps,
    generate_direction=ultranest.stepsampler.generate_mixture_random_direction,
)

result = sampler.run(min_num_live_points=400)
sampler.print_results()
sampler.plot()
@JohannesBuchner
Owner

Are the two nodes identical (in terms of CPU speed)?

I guess there is some MPI communication overhead to consider, especially since you are not using vectorization features.
Maybe rewrite your likelihood and transform to be vectorized, and have a look at population step samplers. There is an equivalent slice sampler, and recently the "SimpleSliceSampler" class was contributed, which trades algorithmic efficiency for better parallelisation.
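
For the toy problem above, a vectorized version could look roughly like this (an untested sketch; it assumes the vectorized=True convention of passing a (batch, n_params) array to both callbacks, and reuses t, y and norm_list from the posted script):

def transform_to_prior_vec(cube):
    # cube has shape (batch, n_params); apply each prior's inverse CDF column-wise
    params = cube.copy()
    for i, dist in enumerate(norm_list):
        params[:, i] = dist.ppf(cube[:, i])
    return params

def log_likelihood_vec(params):
    # evaluate the multi-sine model for all rows at once; returns shape (batch,)
    n = (params.shape[1] - 1) // 3
    y_guess = params[:, -1][:, None]                 # constant offset per sample
    for i in range(n):
        A, B, C = (params[:, 3*i + j][:, None] for j in range(3))
        y_guess = y_guess + A * np.sin(2 * np.pi * (t[None, :] / B + C))
    return -0.5 * np.sum((y[None, :] - y_guess) ** 2, axis=1)

sampler = ultranest.ReactiveNestedSampler(
    paramnames, log_likelihood_vec, transform_to_prior_vec,
    vectorized=True,
)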

@guannant
Author

guannant commented Jan 8, 2025

> Are the two nodes identical (in terms of CPU speed)?

> I guess there is some MPI communication overhead to consider, especially since you are not using vectorization features. Maybe rewrite your likelihood and transform to be vectorized, and have a look at population step samplers. There is an equivalent slice sampler, and recently the "SimpleSliceSampler" class was contributed, which trades algorithmic efficiency for better parallelisation.

Yes, exactly the same. They are simply two nodes from the HPC. I asked because this behavior is quite unusual: cross-node parallelization significantly increases the computing time, jumping from 14 minutes to over 120 minutes. It looks as if the code were using only a single thread when attempting cross-node parallelization. Based on your experience, have you ever tried parallelizing UltraNest across multiple nodes? If so, how was the scaling performance?

I'll take a look at the SimpleSliceSampler class. The issue on my end is that the real problem I’m trying to estimate with UltraNest cannot be vectorized due to its dependency on an upstream package. This means I need a solution that can be parallelized across multiple nodes without requiring vectorization.

@JohannesBuchner
Owner

To be honest, I have not worked across nodes much. Maybe you could also have a look at the CPU usage. If it is not 100%, the MPI communication overhead may be an issue?

I am also not sure how much control the mpi4py interface offers for handling this.
The parallelisation happens here:
https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/integrator.py#L1896
and is synced in the if self.use_mpi: branch.
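
As a quick sanity check that the MPI layer really spans both nodes (a trivial, UltraNest-independent sketch), one could print the rank layout at the top of UN_toy.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
# each of the 256 ranks should report a distinct rank, split across the two hosts
print(f"rank {comm.Get_rank()} / {comm.Get_size()} on {MPI.Get_processor_name()}")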

Another thing, regarding your original setup: each MPI process runs its own independent step sampler, and when one yields a result, the nested sampler progresses and increases to the next likelihood threshold Lmin. Then all step samplers have to update, potentially walking back if they have stepped outside the L>Lmin region (see here). This happens often if you have a low number of likelihood evaluations until success and a high number of parallel MPI processes. Maybe this is part of the issue. If you had a larger number of steps this inefficiency would occur less frequently.

@JohannesBuchner
Owner

I think what you could also do is run UltraNest twice, once on each node, with half as many live points each. Then take the points.hdf5 files, merge them, and either read the merged file with the read_file function or resume from it on a single CPU with max_ncalls set to 1, which hopefully does not compute anything further.

Merging hdf5 files has not been implemented, but you can concatenate the table inside (there is only one, have a look with h5ls) and sort it by Lmin. I would recommend running with max_num_improvement_loops=0 and setting num_test_samples=0 when initialising the sampler.
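
A rough sketch of such a merge with h5py (untested; the dataset name, the Lmin column index and the file paths are assumptions and should be checked with h5ls as suggested above):

import h5py
import numpy as np

def merge_points(file_a, file_b, file_out, dataset="points", lmin_col=0):
    # concatenate the single table from both runs and sort it by the Lmin column
    with h5py.File(file_a, "r") as fa, h5py.File(file_b, "r") as fb:
        table = np.concatenate([fa[dataset][...], fb[dataset][...]], axis=0)
    table = table[np.argsort(table[:, lmin_col])]
    with h5py.File(file_out, "w") as fo:
        fo.create_dataset(dataset, data=table)

merge_points("run_node1/points.hdf5", "run_node2/points.hdf5", "run_merged/points.hdf5")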

@guannant
Author

guannant commented Jan 13, 2025

I conducted a quick scaling test on my toy problem over the weekend, running up to a maximum of 500,000 calls and profiling the performance as well. Here are the results:

Cores    Total time (s)    Time in _create_point (s)    Efficiency
1        1209              304.52                       1.00
8        199               39.25                        0.76
16       117               31.25                        0.65
32       77                26.89                        0.49
64       84                24.57                        0.22
128      107               45.43                        0.09
256      137               85.97                        0.03
384      149               96.69                        0.02
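
(For reference, the efficiency column appears to follow the usual strong-scaling definition E(N) = T(1) / (N * T(N)); e.g. E(64) = 1209 / (64 * 84) ≈ 0.22.)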

From the results, it appears that UltraNest benefits from parallelization up to roughly 32-64 cores, after which the communication overhead outweighs the gains from additional cores. The primary computational cost stems from the _create_point function, which, as you mentioned, is where the parallelization occurs. You are right regarding the impact of communication costs.

What I find particularly unusual is the significant variability in the fitting results from UltraNest depending on the number of cores used for parallelization. I performed three full runs with different numbers of parallel cores (64, 128, and 384). The attached figure illustrates this:
[Attached figure: data and reconstructed fits for the 64-, 128-, and 384-core runs]
The yellow points represent the original data, and the black line is the reconstructed fit based on UltraNest's parameter estimates. Interestingly, the 64-core fit aligns best with the original data, whereas the fits with the higher core counts (128 and 384) struggle to describe the data accurately.

Upon examining the output files, I noticed that higher core counts resulted in significantly more likelihood calls by the time UltraNest converged: 19 million calls for 384 cores, 11 million for 128 cores, and 9 million for 64 cores. Based on my understanding of UltraNest's parallelization, it distributes the live points across the available processes and samples on each of them, with results communicated back to the root process. This likely explains why more cores result in more likelihood evaluations.

However, this raises a key question: does using more cores enable UltraNest to explore the parameter space more thoroughly? If so, why does the 384-core parallelization fail to fit the data as effectively as the 64-core run?

The same behavior was observed when I fixed the number of parallelization cores but increased the step size in the stepsampler. In cases where the parameter space was supposedly explored more thoroughly (higher step size), the fitting results struggled to explain the data. See the attached figure below for another experiment I conducted to investigate how the stepsampler's step size affects UltraNest's fitting performance:
[Attached figure: fits for different stepsampler step sizes]

@JohannesBuchner
Owner

Also have a look at the number of nested sampling iterations?

> What I find particularly unusual is the significant variability in the fitting results from UltraNest depending on the number of cores used for parallelization. I performed three full runs with different numbers of parallel cores (64, 128, and 384). The attached figure illustrates this:

Are these plots based on all the means or the best-fits? It would be better to look at each parameter mean and stdev and see if they agree with each other or not.

> However, this raises a key question: does using more cores enable UltraNest to explore the parameter space more thoroughly? If so, why does the 384-core parallelization fail to fit the data as effectively as the 64-core run?

You would have to increase the number of live points as well to achieve that effect.
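
For the toy script above, that would mean raising min_num_live_points in the run call, for example (illustrative value only):

result = sampler.run(min_num_live_points=1600)  # up from 400 in the original script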

@JohannesBuchner
Owner

Another potential problem is that this runs N step samplers, one on each core, and the first one that finishes its M steps is always accepted as the new independent point. A single slice sampling step (out of the M) can, however, take a fairly random number of evaluations because of the stepping-out procedure. If the quickest stepping-out sampler is always put to the front, this can introduce a bias, and the more cores there are, the more of an issue this reordering becomes.

The (vectorized, not MPI-aware) population step sampler https://johannesbuchner.github.io/UltraNest/ultranest.html#ultranest.popstepsampler.PopulationSliceSampler keeps track of the generation that each step sampler is in, and follows a FIFO model.
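
Attaching it to the toy script would look roughly like this (a sketch; it requires the vectorized likelihood and transform discussed earlier, and the exact keyword names should be checked against the linked documentation):

import ultranest.popstepsampler

sampler.stepsampler = ultranest.popstepsampler.PopulationSliceSampler(
    popsize=64,      # number of slice-sampling walkers advanced together
    nsteps=44,       # same number of steps as in the original script
    generate_direction=ultranest.popstepsampler.generate_region_oriented_direction,
)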

It is not so trivial to implement, because if the likelihood threshold is raised, then some step samplers have to be reverted, which further diversifies how long it takes until they finish.

I used a ring buffer to solve it, where each element is advanced (if it has not done M steps yet), and a pointer to the next element to remove is maintained, so the order cannot change. In the MPI case, we would have to keep track of the MPI rank to pick from and wait for it, and all the step samplers that have already finished M steps would get to pause.

@guannant
Author

guannant commented Jan 13, 2025

> Also have a look at the number of nested sampling iterations?

The 384-core case required 24,841 iterations to converge, the 128-core case 25,114, and the 64-core case 26,180. It seems that higher core counts may increase the likelihood of premature stopping, potentially making it harder to fully capture the data's complexity?

> Are these plots based on all the means or the best-fits? It would be better to look at each parameter mean and stdev and see if they agree with each other or not.

These plots are based on all the means. Here is the apples-to-apples comparison for each of the parameters:

[Attached figure: per-parameter comparison across core counts]

They vary to some extent depending on the number of cores used. The sampling bias seems to be real.

> You would have to increase the number of live points as well to achieve that effect.

I see. Based on your explanation, the most plausible reason for the variation is that a higher core count might introduce biased sampling toward a few preferred (faster) regions. This could result in UltraNest favoring a subset of the parameter space, potentially contributing to the observed discrepancies. Did I understand that correctly?

@guannant
Author

> I used a ring buffer to solve it, where each element is advanced (if it has not done M steps yet), and a pointer to the next element to remove is maintained, so the order cannot change. In the MPI case, we would have to keep track of the MPI rank to pick from and wait for it, and all the step samplers that have already finished M steps would get to pause.

I see how this approach could address the bias issue effectively. My concern is whether pausing and waiting for all step samplers to finish might significantly increase runtime, especially when parallelizing across multiple nodes with hundreds of threads. Could this impact scalability?

@JohannesBuchner
Owner

JohannesBuchner commented Jan 14, 2025

Yes, that's a problem: the CPU usage would not be 100%. One does not have to wait for all step samplers, though, only for the step sampler on rank i % P, where i is the nested sampling iteration and P is the number of MPI processes.

Oversubscribing could help a bit.

Perhaps a better design would be to run step samplers always and not revert them, but instead insert them into the NS run at the point where they were launched (Lmin->Lnew). This would quite heavily vary the number of live points. In that case integrator.py/_create_point would need to be modified to collect the step sampler results but not advance the NS iteration yet until all step samplers that were launched at the current Lmin have returned.
Then here https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/integrator.py#L2733 instead of adding one node one would add all nodes at Lmin->Lnew.

This can be refined slightly. One does not have to wait for all step samplers launched at the current Lmin, only for the first one in submission order. Also, the other step samplers that were launched at a lower Lmin can be carried over to the next iteration if their current L is above the next Lmin. This could also avoid waiting and burstiness in the number of live points.
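
Purely to illustrate the ordering idea (a toy sketch, not UltraNest code): a FIFO queue in which only the head walker, i.e. the first one submitted at the current Lmin, may be consumed, even if later walkers have already finished their M steps.

from collections import deque

class FifoWalkerQueue:
    """Toy illustration: proposed points are consumed strictly in submission order."""

    def __init__(self):
        self.queue = deque()              # (walker_id, Lmin_at_launch), oldest first

    def launch(self, walker_id, lmin):
        self.queue.append((walker_id, lmin))

    def pop_if_head_finished(self, finished_ids):
        # Only the head of the queue may be accepted, so a walker that happens
        # to finish its M steps quickly cannot jump ahead and bias the selection.
        if self.queue and self.queue[0][0] in finished_ids:
            return self.queue.popleft()
        return None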
