Scaling issue with UltraNest #159

Open
guannant opened this issue Jan 7, 2025 · 10 comments

@guannant

guannant commented Jan 7, 2025

  • UltraNest version: 4.4.0
  • Python version: 3.12
  • Operating System: Red Hat Enterprise Linux 8.10 (Ootpa)

Description

I'm observing a significant slowdown when parallelizing UltraNest across multiple nodes. In a scaling study on a toy problem, I saw speedups for intra-node parallelization, but performance dropped notably when using more than one node. For example, with two nodes (256 cores), UltraNest took more than 1.5 hours to complete, compared to just 14 minutes on a single node (128 cores). I wonder whether this is due to an error in my parallelization setup or an internal issue within UltraNest.

What I Did

On the cluster, I use this PBS script to schedule the job for two-node parallelization:

#PBS -l select=2:mpiprocs=256
#PBS -l walltime=20:00:00
.......
module load gcc/13.2.0
module load mpich/4.2.0-gcc-13.2.0
module load anaconda3/2024.02
conda activate espei_un
mpiexec -np 256 python UN_toy.py

The UN_toy.py script contains a simple sinusoidal regression, similar to the example demonstrated at https://johannesbuchner.github.io/UltraNest/example-sine-highd.html, but extended to a 22-dimensional problem.

This is the actual code file:

import numpy as np
import scipy.stats
from numpy import sin, pi
import ultranest
import ultranest.stepsampler
import matplotlib.pyplot as plt

def multi_sin(x, params):
    # sum of n sine components (amplitude A, period B, phase C) plus a constant offset
    total = 0
    n = len(params) // 3
    for i in range(n):
        A, B, C = params[i*3:i*3+3]
        total += A * sin(2*pi * (x/B + C))
    total += params[-1]
    return total

def transform_to_prior(cube):
    # map unit-cube samples to the Gaussian priors via their inverse CDFs
    params = cube.copy()
    for i in range(len(params)):
        params[i] = norm_list[i].ppf(params[i])
    return params

def log_likelihood(params):
    # Gaussian log-likelihood (up to a constant), assuming unit noise
    y_guess = multi_sin(t, params)
    return -0.5 * np.sum((y - y_guess)**2)


paramnames = ['param1', 'param2', 'param3','param4','param5','param6','param7','param8','param9','param10','param11',
              'param12', 'param13', 'param14', 'param15', 'param16', 'param17', 'param18', 'param19', 'param20', 'param21',
              'param22']

rng = np.random.default_rng(seed=42)


initial_value = [4.2, 3, 0,      # (A, B, C) for each of the seven sine components
                 1.2, 1.2, 1.2,
                 0.9, 1.8, 1.1,
                 1.4, 1.5, 1.6,
                 2.5, 1.2, 1.5,
                 3.4, 1.8, 1.4,
                 4.2, 1.5, 1.6]
initial_value.append(1.0)        # constant offset, the 22nd parameter

norm_list = []
t = rng.uniform(0, 5, size=60)                      # 60 sample locations
y = rng.normal(multi_sin(t, initial_value), 1)      # noisy synthetic data (unit noise)

for item in initial_value:
    # independent unit-width Gaussian prior centred on each true value
    norm_list.append(scipy.stats.norm(loc=item, scale=1))

sampler = ultranest.ReactiveNestedSampler(
    paramnames,
    log_likelihood,
    transform_to_prior,
    log_dir="scalibility_UN_toy_22para_step44_2node",
)

nsteps = 44  # 2 * len(paramnames)
sampler.stepsampler = ultranest.stepsampler.SliceSampler(
    nsteps=nsteps,
    generate_direction=ultranest.stepsampler.generate_mixture_random_direction,
)

result = sampler.run(min_num_live_points=400)
sampler.print_results()
sampler.plot()
@JohannesBuchner
Owner

Are the two nodes identical (in terms of CPU speed)?

I guess there is some MPI communication overhead to consider, especially since you are not using vectorization features.
Maybe rewrite your likelihood and transform to be vectorized, and have a look at population step samplers. There is an equivalent slice sampler, and recently the "SimpleSliceSampler" class was contributed, which trades algorithmic efficiency for better parallelisation.
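
For the toy problem above, a vectorized version could look roughly like this (an untested sketch; it assumes the vectorized=True convention of passing a (batch, n_params) array to both callbacks, and reuses t, y and norm_list from the posted script):

def transform_to_prior_vec(cube):
    # cube has shape (batch, n_params); apply each prior's inverse CDF column-wise
    params = cube.copy()
    for i, dist in enumerate(norm_list):
        params[:, i] = dist.ppf(cube[:, i])
    return params

def log_likelihood_vec(params):
    # evaluate the multi-sine model for all rows at once; returns shape (batch,)
    n = (params.shape[1] - 1) // 3
    y_guess = params[:, -1][:, None]                 # constant offset per sample
    for i in range(n):
        A, B, C = (params[:, 3*i + j][:, None] for j in range(3))
        y_guess = y_guess + A * np.sin(2 * np.pi * (t[None, :] / B + C))
    return -0.5 * np.sum((y[None, :] - y_guess) ** 2, axis=1)

sampler = ultranest.ReactiveNestedSampler(
    paramnames, log_likelihood_vec, transform_to_prior_vec,
    vectorized=True,
)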

@guannant
Author

guannant commented Jan 8, 2025

> Are the two nodes identical (in terms of CPU speed)?

> I guess there is some MPI communication overhead to consider, especially since you are not using vectorization features. Maybe rewrite your likelihood and transform to be vectorized, and have a look at population step samplers. There is an equivalent slice sampler, and recently the "SimpleSliceSampler" class was contributed, which trades algorithmic efficiency for better parallelisation.

Yes, exactly the same. They are simply two nodes from the HPC. I asked because this behavior is quite unusual: cross-node parallelization significantly increases the computing time, jumping from 14 minutes to over 120 minutes. It looks as if the code were using only a single thread when attempting cross-node parallelization. Based on your experience, have you ever tried parallelizing UltraNest across multiple nodes? If so, how was the scaling performance?

I'll take a look at the SimpleSliceSampler class. The issue on my end is that the real problem I’m trying to estimate with UltraNest cannot be vectorized due to its dependency on an upstream package. This means I need a solution that can be parallelized across multiple nodes without requiring vectorization.

@JohannesBuchner
Owner

To be honest, I have not worked across nodes much. Maybe you could also have a look at the CPU usage. If it is not 100%, the MPI communication overhead may be an issue?

I am also not sure how much control the mpi4py interface offers for handling this.
The parallelisation happens here:
https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/integrator.py#L1896
and is synced in the if self.use_mpi: branch.
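
As a quick sanity check that the MPI layer really spans both nodes (a trivial, UltraNest-independent sketch), one could print the rank layout at the top of UN_toy.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
# each of the 256 ranks should report a distinct rank, split across the two hosts
print(f"rank {comm.Get_rank()} / {comm.Get_size()} on {MPI.Get_processor_name()}")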

Another thing, regarding your original setup: each MPI process runs its own independent step sampler, and when one yields a result, the nested sampler progresses and increases to the next likelihood threshold Lmin. Then all step samplers have to update, potentially walking back if they have stepped outside the L>Lmin region (see here). This happens often if you have a low number of likelihood evaluations until success and a high number of parallel MPI processes. Maybe this is part of the issue. If you had a larger number of steps this inefficiency would occur less frequently.

@JohannesBuchner
Owner

I think what you could also do is run UltraNest twice, once on each node, with half as many live points each. Then take the points.hdf5 files, merge them, and either read the merged file with the read_file function or resume from it on a single CPU with max_ncalls set to 1, which hopefully does not compute anything further.

Merging hdf5 files has not been implemented, but you can concatenate the table inside (there is only one, have a look with h5ls) and sort it by Lmin. I would recommend running with max_num_improvement_loops=0 and setting num_test_samples=0 when initialising the sampler.
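
A rough sketch of such a merge with h5py (untested; the dataset name, the Lmin column index and the file paths are assumptions and should be checked with h5ls as suggested above):

import h5py
import numpy as np

def merge_points(file_a, file_b, file_out, dataset="points", lmin_col=0):
    # concatenate the single table from both runs and sort it by the Lmin column
    with h5py.File(file_a, "r") as fa, h5py.File(file_b, "r") as fb:
        table = np.concatenate([fa[dataset][...], fb[dataset][...]], axis=0)
    table = table[np.argsort(table[:, lmin_col])]
    with h5py.File(file_out, "w") as fo:
        fo.create_dataset(dataset, data=table)

merge_points("run_node1/points.hdf5", "run_node2/points.hdf5", "run_merged/points.hdf5")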

@guannant
Author

guannant commented Jan 13, 2025

I conducted a quick scaling test on my toy problem over the weekend, running up to a maximum of 500,000 calls and profiling the performance as well. Here are the results:

Cores    Total time (s)    Time in _create_point (s)    Efficiency
1        1209              304.52                       1.00
8        199               39.25                        0.76
16       117               31.25                        0.65
32       77                26.89                        0.49
64       84                24.57                        0.22
128      107               45.43                        0.09
256      137               85.97                        0.03
384      149               96.69                        0.02
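
(For reference, the efficiency column appears to follow the usual strong-scaling definition E(N) = T(1) / (N * T(N)); e.g. E(64) = 1209 / (64 * 84) ≈ 0.22.)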

From the results, it appears that UltraNest benefits from parallelization up to roughly 32-64 cores, after which the communication overhead outweighs the gains from additional cores. The primary computational cost stems from the _create_point function, which, as you mentioned, is where the parallelization occurs. You are right regarding the impact of communication costs.

What I find particularly unusual is the significant variability in the fitting results from UltraNest depending on the number of cores used for parallelization. I performed three full runs with different numbers of parallel cores (64, 128, and 384). The attached figure illustrates this:
[Attached figure: data and reconstructed fits for the 64-, 128-, and 384-core runs]
The yellow points represent the original data, and the black line is the reconstructed fit based on UltraNest's parameter estimates. Interestingly, the 64-core fit aligns best with the original data, whereas the fits with the higher core counts (128 and 384) struggle to describe the data accurately.

Upon examining the output files, I noticed that higher core counts resulted in significantly more likelihood calls by the time UltraNest converged: 19 million calls for 384 cores, 11 million for 128 cores, and 9 million for 64 cores. Based on my understanding of UltraNest's parallelization, it distributes the live points across the available processes and samples on each of them, with results communicated back to the root process. This likely explains why more cores result in more likelihood evaluations.

However, this raises a key question: does using more cores enable UltraNest to explore the parameter space more thoroughly? If so, why does the 384-core parallelization fail to fit the data as effectively as the 64-core run?

The same behavior was observed when I fixed the number of parallelization cores but increased the step size in the stepsampler. In cases where the parameter space was supposedly explored more thoroughly (higher step size), the fitting results struggled to explain the data. See the attached figure below for another experiment I conducted to investigate how the stepsampler's step size affects UltraNest's fitting performance:
[Attached figure: fits for different stepsampler step sizes]

@JohannesBuchner
Owner

Also have a look at the number of nested sampling iterations?

> What I find particularly unusual is the significant variability in the fitting results from UltraNest depending on the number of cores used for parallelization. I performed three full runs with different numbers of parallel cores (64, 128, and 384). The attached figure illustrates this:

Are these plots based on all the means or the best-fits? It would be better to look at each parameter mean and stdev and see if they agree with each other or not.

> However, this raises a key question: does using more cores enable UltraNest to explore the parameter space more thoroughly? If so, why does the 384-core parallelization fail to fit the data as effectively as the 64-core run?

You would have to increase the number of live points as well to achieve that effect.
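
For the toy script above, that would mean raising min_num_live_points in the run call, for example (illustrative value only):

result = sampler.run(min_num_live_points=1600)  # up from 400 in the original script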

@JohannesBuchner
Owner

Another potential problem is that this runs N step samplers, one on each core, and the first one that finishes its M steps is always accepted as the new independent point. A single slice sampling step (out of the M) can, however, take a fairly random number of evaluations because of the stepping-out procedure. If the quickest stepping-out sampler is always put to the front, this can introduce a bias, and the more cores there are, the more of an issue this reordering becomes.

The (vectorized, not MPI-aware) population step sampler https://johannesbuchner.github.io/UltraNest/ultranest.html#ultranest.popstepsampler.PopulationSliceSampler keeps track of the generation that each step sampler is in, and follows a FIFO model.
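
Attaching it to the toy script would look roughly like this (a sketch; it requires the vectorized likelihood and transform discussed earlier, and the exact keyword names should be checked against the linked documentation):

import ultranest.popstepsampler

sampler.stepsampler = ultranest.popstepsampler.PopulationSliceSampler(
    popsize=64,      # number of slice-sampling walkers advanced together
    nsteps=44,       # same number of steps as in the original script
    generate_direction=ultranest.popstepsampler.generate_region_oriented_direction,
)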

It is not so trivial to implement, because if the likelihood threshold is raised, then some step samplers have to be reverted, which further diversifies how long it takes until they finish.

I used a ring buffer to solve it, where each element is advanced (if it has not done M steps yet), and a pointer to the next element to remove is maintained, so the order cannot change. In the MPI case, we would have to keep track of the MPI rank to pick from and wait for it, and all the step samplers that have already finished M steps would get to pause.

@guannant
Author

guannant commented Jan 13, 2025

> Also have a look at the number of nested sampling iterations?

The 384-core case required 24,841 iterations to converge, the 128-core case 25,114, and the 64-core case 26,180. It seems that higher core counts may increase the likelihood of premature stopping, potentially making it harder to fully capture the data's complexity?

> Are these plots based on all the means or the best-fits? It would be better to look at each parameter mean and stdev and see if they agree with each other or not.

These plots are based on all the means. Here is the apples-to-apples comparison for each of the parameters:

[Attached figure: per-parameter comparison across core counts]

They vary to some extent depending on the number of cores used. The sampling bias seems to be real.

> You would have to increase the number of live points as well to achieve that effect.

I see. Based on your explanation, the most plausible reason for the variation is that a higher core count might introduce biased sampling toward a few preferred (faster) regions. This could result in UltraNest favoring a subset of the parameter space, potentially contributing to the observed discrepancies. Did I understand that correctly?

@guannant
Author

> I used a ring buffer to solve it, where each element is advanced (if it has not done M steps yet), and a pointer to the next element to remove is maintained, so the order cannot change. In the MPI case, we would have to keep track of the MPI rank to pick from and wait for it, and all the step samplers that have already finished M steps would get to pause.

I see how this approach could address the bias issue effectively. My concern is whether pausing and waiting for all step samplers to finish might significantly increase runtime, especially when parallelizing across multiple nodes with hundreds of threads. Could this impact scalability?

@JohannesBuchner
Owner

JohannesBuchner commented Jan 14, 2025

Yes, that's a problem: the CPU usage would not be 100%. One does not have to wait for all step samplers, though, only for the step sampler on rank i % P, where i is the nested sampling iteration and P is the number of MPI processes.

Oversubscribing could help a bit.

Perhaps a better design would be to run step samplers always and not revert them, but instead insert them into the NS run at the point where they were launched (Lmin->Lnew). This would quite heavily vary the number of live points. In that case integrator.py/_create_point would need to be modified to collect the step sampler results but not advance the NS iteration yet until all step samplers that were launched at the current Lmin have returned.
Then here https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/integrator.py#L2733 instead of adding one node one would add all nodes at Lmin->Lnew.

This can be refined slightly. One does not have to wait for all step samplers launched at the current Lmin, only for the first one in submission order. Also, the other step samplers that were launched at a lower Lmin can be carried over to the next iteration if their current L is above the next Lmin. This could also avoid waiting and burstiness in the number of live points.
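
Purely to illustrate the ordering idea (a toy sketch, not UltraNest code): a FIFO queue in which only the head walker, i.e. the first one submitted at the current Lmin, may be consumed, even if later walkers have already finished their M steps.

from collections import deque

class FifoWalkerQueue:
    """Toy illustration: proposed points are consumed strictly in submission order."""

    def __init__(self):
        self.queue = deque()              # (walker_id, Lmin_at_launch), oldest first

    def launch(self, walker_id, lmin):
        self.queue.append((walker_id, lmin))

    def pop_if_head_finished(self, finished_ids):
        # Only the head of the queue may be accepted, so a walker that happens
        # to finish its M steps quickly cannot jump ahead and bias the selection.
        if self.queue and self.queue[0][0] in finished_ids:
            return self.queue.popleft()
        return None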
