
Slow computation and sometimes biased results with warm starting #155

Open

jacopok opened this issue Nov 22, 2024 · 17 comments

jacopok commented Nov 22, 2024

  • UltraNest version: 4.3.4
  • Python version: 3.11
  • Operating System: Ubuntu 24.04

Summary

I have been experimenting with warm starting on a toy problem, and I found some strange behaviour.
The sampling reaches the correct region quickly, as expected, but it then takes a long time to converge to the right value of $Z$.
Also, the estimate for $Z$ is sometimes biased (the true value lies several sigmas from the estimate).

Description

The toy problem is an $n$-dimensional Gaussian likelihood with mean $\mu = 0.5$ on every axis and a small width $\sigma \sim 10^{-2}$. The prior transform is the identity on the unit cube.
The evidence is then expected to be $Z \approx 1$ (to a very good approximation).

The point I'm making here also shows up in 1D, but the script is versatile: it can run in higher dimensions if desired. The run times are reported for a 3D case; the trace plot is in 1D.

I am comparing a regular NS run to runs done with auxiliary likelihoods and priors obtained with get_auxiliary_contbox_parameterization, with the contours obtained from "guesses" as follows:

  • the correct posterior
  • the correct posterior with a reduced variance
  • the correct posterior with an increased variance

In the language of the SuperNest paper by Petrosyan and Handley, the KL divergence between the original prior and the posterior is (in the $\sigma \ll 1$ approximation, where it reduces to minus the differential entropy of the Gaussian posterior, since the prior density is 1)
$$\mathcal{D}_{\pi}(\mathcal{P}) \approx -\frac{1}{2} (1+\log(2\pi)) - \log \sigma$$
per dimension, which comes out to about 3.2 nats for $\sigma=10^{-2}$.

With the guesses, on the other hand, we are going from a Gaussian modified prior with width $k \sigma$ to a Gaussian posterior with width $\sigma$, so (the standard Gaussian-to-Gaussian KL divergence, expressed in nats)
$$\mathcal{D}_{\tilde{\pi}}(\mathcal{P}) = \log k + \frac{1}{2} \left( \frac{1}{k^{2}} - 1 \right)$$

The examples I'm considering are $k = [0.5, 1, 2]$, with corresponding divergences of $[0.81, 0, 0.32]$ nats.
The prior being equal to the posterior is a degenerate case, of course, but this still indicates that we should expect a good speed-up!
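For reference, a short snippet reproducing the numbers quoted above (plain numpy, nothing ultranest-specific):

import numpy as np

sigma = 1e-2
# uniform prior on [0, 1] -> Gaussian posterior of width sigma, per dimension
d_prior = -0.5 * (1 + np.log(2 * np.pi)) - np.log(sigma)
print(f"{d_prior:.2f} nats")  # ~3.19

# Gaussian modified prior of width k*sigma -> Gaussian posterior of width sigma
for k in [0.5, 1.0, 2.0]:
    d_guess = np.log(k) + 0.5 * (1 / k**2 - 1)
    print(f"k={k}: {d_guess:.2f} nats")  # 0.81, 0.00, 0.32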

Instead, when I run with the auxiliary sampler, the time performance is sometimes worse.

Also, although the evidence errors are indeed smaller (as they should be), in the case of a too-narrow modified prior the evidence is underestimated, and the quoted error does not account for the bias.

Ultranest 4.3.4
Scenario: Regular sampling
t=29.16s
logZ=-0.11 +- 0.14
error: 0.78 sigmas

Scenario: Underestimated standard deviation
t=36.75s
logZ=-0.45 +- 0.05
error: 8.89 sigmas

Scenario: Correct standard deviation
t=35.25s
logZ=-0.02 +- 0.06
error: 0.32 sigmas

Scenario: Overestimated standard deviation
t=35.40s
logZ=0.02 +- 0.09
error: 0.25 sigmas

The script is shown below.

This is what happens when frac_remain is set to a low value ($10^{-3}$) in all cases; if it is set higher ($0.5$), things are closer to expectations:

Ultranest 4.3.4
Scenario: Regular sampling
t=18.25s
logZ=-0.06 +- 0.43
error: 0.15 sigmas

Scenario: Underestimated standard deviation
t=3.96s
logZ=-0.56 +- 0.41
error: 1.37 sigmas

Scenario: Correct standard deviation
t=6.34s
logZ=-0.12 +- 0.41
error: 0.29 sigmas

Scenario: Overestimated standard deviation
t=9.72s
logZ=-0.06 +- 0.42
error: 0.15 sigmas

However, this does not clear up the issue: why does the sampler "get stuck" in the same region, making very slow progress on the last contributions to the integral?
Here is a trace plot for the same problem, with frac_remain=1e-3 but in 1D, in the correctly estimated standard deviation case.
What is going on? Why are the points getting more spread out?

[figure: trace plot]

import ultranest
import numpy as np
from ultranest.hotstart import get_auxiliary_contbox_parameterization, get_auxiliary_problem, get_extended_auxiliary_independent_problem
from time import perf_counter
import string
import matplotlib.pyplot as plt

N_live = 1000
N_post = 1000
N_par = 3
scale = 1e-2
frac_remain=1e-3

param_names = [string.ascii_lowercase[i] for i in range(N_par)]

def loglike(par):
    # normalized N_par-dimensional Gaussian centered at 0.5 with width `scale`
    return -0.5 * np.sum((par-0.5)**2, axis=1) / scale**2 - N_par / 2 * np.log(2*np.pi*scale**2)

def prior_transform(cube):
    # flat prior in [0, 1]^n
    return cube


run_times = []
z_estimates = []
z_errors = []

# baseline: regular nested sampling run, no warm start
sampler = ultranest.ReactiveNestedSampler(
    param_names, 
    loglike, 
    prior_transform,
    log_dir="regular_sampling", 
    resume='overwrite', 
    vectorized=True
)

t1 = perf_counter()
result = sampler.run(min_num_live_points=N_live, frac_remain=frac_remain)
t2 = perf_counter()

sampler.print_results()

sampler.plot_trace()
plt.savefig('regular_sampling.png')
plt.close()

z_estimates.append(result['logz'])
z_errors.append(result['logzerr'])

run_times.append(t2-t1)

rng = np.random.default_rng(1)

for scale_factor in [0.5, 1, 2.]:

    # posterior "guess": Gaussian samples with under-, correctly or over-estimated width
    simulated_posterior = rng.normal(
        loc=0.5*np.ones(N_par), 
        scale=scale*scale_factor, 
        size=(N_post, N_par)
    )

    aux_param_names, aux_loglike, aux_transform, vectorized = get_auxiliary_contbox_parameterization(
            param_names, 
            loglike, 
            prior_transform, 
            simulated_posterior, 
            np.ones(N_post) / N_post,
            vectorized=True,
        )
    
    sampler_aux = ultranest.ReactiveNestedSampler(
        aux_param_names, 
        aux_loglike, 
        aux_transform,
        log_dir=f"aux_sampling_{scale_factor}", 
        resume='overwrite', 
        vectorized=True
    )

    t3 = perf_counter()
    result_aux = sampler_aux.run(min_num_live_points=N_live, frac_remain=frac_remain)
    t4 = perf_counter()
    run_times.append(t4-t3)

    sampler_aux.print_results()
    sampler_aux.plot_trace()
    z_estimates.append(result_aux['logz'])
    z_errors.append(result_aux['logzerr'])


scenarios = [
    'Regular sampling',
    'Underestimated standard deviation',
    'Correct standard deviation',
    'Overestimated standard deviation',
]

print(f'Ultranest {ultranest.__version__}')

for i in range(4):
    print(f'Scenario: {scenarios[i]}')
    print(f't={run_times[i]:.2f}s')
    print(f'logZ={z_estimates[i]:.2f} +- {z_errors[i]:.2f}')
    print(f'error: {abs(z_estimates[i])/z_errors[i]:.2f} sigmas\n')

jacopok commented Nov 22, 2024

If changes are needed to address this, I am happy to work on them in a PR!

JohannesBuchner (Owner) commented:

Yeah, I found something similar, but am not sure why it occurs.

The code is here:
https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/hotstart.py#L346
which sets up the auxiliary transform functions. Maybe you can take this function apart?

Maybe the compute_quantile_intervals_refined function needs some explanation: it creates a sequence of intervals with their weights. This is how the parameter space is squeezed. In principle, any such interval and weight should give the same result, as the squeezing and reweighting should cancel each other. Obviously, that does not seem to be the case in practice.

JohannesBuchner (Owner) commented:

The efficiency can go down if the parameter-space squeezing creates holes, so that all the corners (of which there are many in high dimensions) need to be explored. Typically, elliptical likelihood contours are the most efficient to sample.


jacopok commented Nov 22, 2024

I'll delve into it! If you have any notes written down on the details of the mathematical formulation of the coordinate change, they would be useful; otherwise I'll just reverse-engineer it from the code.

JohannesBuchner (Owner) commented:

It is just that a zoom interval from xlo to xhi is downweighted by 1/(xhi - xlo).

A list of intervals is first derived from marginal quantiles.

An auxiliary variable t is introduced to interpolate between the intervals of interest.

My first attempt used a student-t distribution (get_extended_auxiliary_problem), but that also showed issues.
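As a minimal sketch of that squeeze-and-reweight idea (schematic only, not the actual hotstart.py code; the interval and the weight handling are simplified to one dimension):

import numpy as np

# zoom interval, e.g. derived from marginal posterior quantiles
xlo, xhi = 0.45, 0.55
width = xhi - xlo

def zoom_transform(u):
    # squeezed prior transform: all of [0, 1] is mapped into the interval of interest
    return xlo + u * width

def aux_loglike(u, loglike):
    # volume correction: the squeezed prior has density 1/width inside the interval,
    # so a log(width) term keeps the evidence unchanged
    return loglike(zoom_transform(u)) + np.log(width)

# note: the cancellation is exact only for likelihood mass inside [xlo, xhi];
# anything outside is cut off, which is why the intervals must stay conservative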


jacopok commented Nov 27, 2024

I think I understand the issue now.

This is what the transformation looks like, as a function of the coordinates $v$ and $t$: it "expands" the region of interest as it should.
[figure: the transformation in the $(v, t)$ coordinates]

Why is sampling from this not faster?
Plotting the loglikelihood shows the problem:
[figure: the auxiliary log-likelihood in the $(v, t)$ coordinates]

Due to the aux volume correction, the maximum likelihood is attained at $t=1$. Nested sampling therefore quickly reaches the region of small $t$ and moderately high likelihood, but then has to iterate until it finds the peak at $t=1$; this means it still needs to reach the same level of prior compression as in the original problem, negating the benefits of the transformation!

I think this can be solved by implementing a fixed prior transformation with no auxiliary variable, something like this:
[figure: proposed fixed prior transformation]

A spline representation of the downsampled CDF of the posterior guess is easy to sample from and differentiate to compute the volume compression.

I am working out the details (e.g. I'd like to ensure the CDF's slope is never lower than the slope outside the "compression region") and will make a PR shortly.
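To make the idea concrete, here is a minimal one-dimensional sketch of such a fixed transform (the helper name, knot placement and blending fraction are placeholders for illustration, not the PR code):

import numpy as np
from scipy.interpolate import PchipInterpolator

def make_spline_warmstart(samples_1d, n_knots=20, flat_fraction=0.5):
    # build a fixed prior transform T: u -> x on [0, 1] from 1D posterior-guess samples,
    # plus the volume correction log T'(u) to add to the log-likelihood
    x = np.concatenate([[0.0], np.sort(np.clip(samples_1d, 0.0, 1.0)), [1.0]])
    cdf = np.linspace(0.0, 1.0, len(x))
    # blend the empirical guess CDF with the identity so the full prior keeps some support
    u = flat_fraction * x + (1.0 - flat_fraction) * cdf

    # downsample to a few knots and fit a monotone spline to the inverse of the blended CDF
    idx = np.linspace(0, len(x) - 1, n_knots).astype(int)
    transform = PchipInterpolator(u[idx], x[idx])  # T(0) = 0, T(1) = 1, monotone
    dT = transform.derivative()
    return transform, lambda t: np.log(dT(t))

Since any monotone map of the unit interval onto itself, combined with the log T'(u) term, is just a change of variables, $Z$ stays exact; the choice of spline only affects the sampling efficiency. In several dimensions one such transform per parameter would be applied, summing the per-dimension corrections.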


jacopok commented Nov 27, 2024

Preliminary results, sampling a Gaussian with $\sigma = 10^{-6}$ along each axis in 6D, 200 live points, frac_remain=1e-2:

Ultranest 4.3.4
Scenario: Regular sampling
t=88.10s
logZ=-0.00 +- 0.64
error: 0.00 sigmas

Scenario: Correct standard deviation
t=8.46s
logZ=-0.01 +- 0.10
error: 0.09 sigmas

Scenario: Overestimated standard deviation
t=4.97s
logZ=0.11 +- 0.14
error: 0.82 sigmas

Scenario: Way overestimated standard deviation
t=7.31s
logZ=0.13 +- 0.33
error: 0.39 sigmas

JohannesBuchner (Owner) commented:

Thanks for looking into this.

My thought was that even with a "wrong" transformation, the sampler could zoom back out to the original prior. But yes, the transform should ideally be built a bit broader than the expected posterior, so that nested sampling is actually zooming in (which is efficient) and does not need to zoom out and navigate the difficult funnel. That is, the modified likelihood should rise monotonically to the peak, like the original likelihood.

The current compute_quantile_intervals_refined function doesn't do a good job of that.

In addition, there is the problem that the parameter space is currently cut off by the zooming, which can lose posterior mass. The transform should probably retain 1% of the original full prior space and 99% of the zoomed space, or something like that.


jacopok commented Nov 27, 2024

> Thanks for looking into this.
>
> My thought was that even with a "wrong" transformation, the sampler could zoom back out to the original prior. But yes, the transform should ideally be built a bit broader than the expected posterior, so that nested sampling is actually zooming in (which is efficient) and does not need to zoom out and navigate the difficult funnel. That is, the modified likelihood should rise monotonically to the peak, like the original likelihood.

I agree with these considerations, but I do not think that is the problem with the current implementation: inference is slow regardless of the breadth of the transform, since the aux_likelihood always peaks at (its untransformed peak, t=1), and the prior volume compression needed to reach that region is the same as in the original problem.

> The current compute_quantile_intervals_refined function doesn't do a good job of that.
>
> In addition, there is the problem that the parameter space is currently cut off by the zooming, which can lose posterior mass. The transform should probably retain 1% of the original full prior space and 99% of the zoomed space, or something like that.

Indeed, in the implementation I'm drafting (sketched in the figure in my previous comment) some space is reserved for the original prior - I think even 50% is fine, since it should take only $O(n_{\text{live}} \log 2)$ iterations to traverse, which is not that many.
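(For reference, with the usual nested-sampling compression estimate $X_i \approx e^{-i/N}$ for the remaining prior volume after $i$ iterations with $N$ live points, compressing away a reserved half of the prior takes $i \approx N \ln 2 \approx 0.69\,N$ iterations.)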

JohannesBuchner (Owner) commented:

I am also a bit worried about the rectangles, as they introduce many corners, and therefore the modified likelihood contours become non-elliptical and highly inefficient to sample. Would applying your curve in a spherically symmetric way be feasible? An affine transformation could account for the scale of each variable -- I think the code for this is already there for the student-t case.
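For illustration, a minimal sketch of such an affine (Gaussian) repartitioning in the SuperNest spirit; make_affine_aux_problem and its interface are hypothetical names rather than an existing ultranest helper, and the original prior is assumed uniform on the unit cube with a vectorized loglike:

import numpy as np
from scipy.stats import norm, multivariate_normal

def make_affine_aux_problem(loglike, samples):
    # fit an affine (Gaussian) guess to the posterior-guess samples
    mean = samples.mean(axis=0)
    cov = np.atleast_2d(np.cov(samples, rowvar=False))
    chol = np.linalg.cholesky(cov)
    guess = multivariate_normal(mean, cov)

    def aux_transform(cube):
        # map the unit cube through the Gaussian guess (vectorized over rows)
        return mean + norm.ppf(cube) @ chol.T

    def aux_loglike(x):
        # reweight by (original prior density) / (modified prior density);
        # points the guess places outside the unit cube carry no original prior mass
        inside = np.all((x > 0.0) & (x < 1.0), axis=-1)
        return np.where(inside, loglike(x) - guess.logpdf(x), -1e300)

    return aux_loglike, aux_transform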


jacopok commented Nov 27, 2024

I agree that corners are to be avoided, and indeed I iterated towards a smoother parameterization - with the interpolator I'm using now in the PR, the derivative of the deformation map (whose log is shown as a dashed line) is constrained to be smooth. I'm thinking of implementing some refinement algorithm to start from the full guess CDF sampled at all the posterior points, and remove sampling points (greedily?) until the transform is "smooth enough".

I don't quite understand the points about symmetry and the multivariate Student-t distribution. The approach I'm proposing is essentially non-parametric: the marginal posterior guesses are allowed to be skewed or even multimodal - this seems like a good feature to have, as long as the transform remains smooth enough, right?

JohannesBuchner (Owner) commented:

Maybe it is okay.

I really liked your insightful visualizations of the transformation.

Could you make a 2D version, looking at how a randomly sampled point with circles around it would look in the transformed space?

Something like:

import numpy as np
import matplotlib.pyplot as plt

ctr = np.random.uniform(size=2)
length = np.random.uniform()
theta = np.linspace(0, 2 * np.pi, 200)  # full circle
vec = np.zeros((len(theta), len(ctr))) + ctr.reshape((1, -1))
vec[:, 0] += length * np.sin(theta)
vec[:, 1] += length * np.cos(theta)
vec_transformed = prior_transform(vec)  # the (auxiliary) prior transform under test
plt.plot(vec_transformed[:, 0], vec_transformed[:, 1])

I have trouble imagining how this typically looks as the circles cross the "compression region" - whether they become plus-shaped or boxy.


jacopok commented Nov 30, 2024

It looks something like this:

[figure: circles mapped through the transform across the unit square]

In the pre-transform space, the circles all have radius 0.05 and their centers are randomly sampled in [0.05, 0.95]^2; the transform is computed from a 2D Gaussian at the center of the unit box with standard deviations of 0.05.

Here is a zoom on the transition region:

[figure: zoom on the transition region]

(This is with the new transform I'm proposing, which I think is what you were referring to, right?)


jacopok commented Nov 30, 2024

By the way, a neat thing about this approach is that the log_auxweight in the samples' visualization can indicate whether the transform was chosen sensibly: if, after a few iterations, a significant region is being sampled with logweight > 0, we probably set the prior transformation too tight, while reaching logweight << 0 means we did things correctly (and we are gaining speed and accuracy thanks to the transform).

Currently, I'm working on a data-driven way to downsample the CDF to make a smooth spline. I like the spline approach since it's naturally differentiable but also flexible, so we can be sure we are correcting for the volume by exactly the right amount.

JohannesBuchner (Owner) commented:

Thanks for the visualisations. That's enlightening and looks encouraging.

It perhaps shows that it is important for the compression region to be conservative: if ellipsoids are laid around the live points and their padding extends outside the compression region, they become strongly elongated. It shouldn't be too bad, because 1) the sampling density in that stretched outside part should be lower than in the compression region, and 2) the wrapping ellipsoid in the u-space and t-space of ultranest can cut off sampling there if the likelihood is Gaussian-like.

JohannesBuchner (Owner) commented:

Splines sound scary to me because they can be unstable. I hope the code is not becoming too complicated.
I guess you could also go for a getdist-style KDE (it can take the uniform bounds into account; see the last example in https://getdist.readthedocs.io/en/latest/plot_gallery.html).


jacopok commented Nov 30, 2024

Right, if the splines are to work they will generally need to be heavily downsampled; my idea for determining how much to downsample (which I haven't really tried implementing yet) is to do a sort of k-fold validation, using as a cost function the cross-entropy between the distribution estimated from one subset of the points and a disjoint subset.
At the moment I am simply evaluating them at 20 points, which seems to work well.

This analysis is probably overkill, and it would be enough to use some heuristic to estimate a sensible value (I guess this goes for the splines but also for the hyperparameters of any method chosen).
I'll also look into the approach you suggest!
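A rough sketch of that validation idea (a single train/test split rather than full k-fold; the function name and the knot placement are just for illustration):

import numpy as np
from scipy.interpolate import PchipInterpolator

def held_out_cross_entropy(samples_1d, n_knots, rng):
    # fit a monotone-spline CDF on half the samples, then score the held-out half
    shuffled = rng.permutation(samples_1d)
    half = len(shuffled) // 2
    train, test = np.sort(shuffled[:half]), shuffled[half:]

    # monotone spline through the empirical CDF of the training half, downsampled to n_knots
    idx = np.linspace(0, len(train) - 1, n_knots).astype(int)
    cdf = PchipInterpolator(train[idx], np.linspace(0.0, 1.0, len(train))[idx])
    pdf = cdf.derivative()

    # average negative log-density of the held-out points (ignoring, for simplicity,
    # the few that fall outside the training range)
    inside = (test > train[idx][0]) & (test < train[idx][-1])
    return -np.mean(np.log(np.clip(pdf(test[inside]), 1e-300, None)))

# hypothetical usage: pick the knot count with the lowest held-out cross-entropy
# scores = {k: held_out_cross_entropy(samples[:, 0], k, np.random.default_rng(0)) for k in (5, 10, 20, 40)}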
