Slow computation and sometimes biased results with warm starting #155
Comments
If it will be necessary to make changes to address this, I am happy to work on them in a PR!
Yeah, I found something similar, but am not sure why it occurs. The code is here: … Maybe the …
The efficiency can go down if the parameter space squeezing creates holes, and thus all the corners (in high dimensions there are many) need to be explored. Typically, elliptical likelihood contours are the most efficient to sample.
I'll delve into it! If you have any notes written down on the details of the mathematical formulation of the coordinate change, they'd be useful; otherwise I'll just reverse-engineer it from the code.
It is just that a zoom interval from xlo to xhi is downweighted by 1/(xhi - xlo). A list of intervals is first derived from marginal quantiles. An auxiliary variable t is introduced to interpolate over the interval of interest. My first attempt used a Student-t distribution (…
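For illustration, here is a minimal 1D sketch of that zoom idea (not the ultranest code; the reserved fraction `w` and the branch layout are made-up choices):

```python
import numpy as np

def zoom_transform(u, xlo, xhi, w=0.9):
    """Map a uniform auxiliary coordinate u (array in [0, 1]) to x in [0, 1],
    spending a fraction w of u-space inside the zoom interval [xlo, xhi]."""
    u = np.asarray(u, dtype=float)
    x = np.empty_like(u)
    inside = u < w
    # stretch the first w of u-space onto the zoom interval
    x[inside] = xlo + (u[inside] / w) * (xhi - xlo)
    # spread the remaining 1 - w uniformly over [0, xlo) and (xhi, 1]
    rest = (u[~inside] - w) / (1.0 - w) * (1.0 - (xhi - xlo))
    x[~inside] = np.where(rest < xlo, rest, rest + (xhi - xlo))
    return x

def log_aux_weight(x, xlo, xhi, w=0.9):
    """Log-correction to add to the log-likelihood so that the squeezed
    prior times this weight reproduces the original flat prior."""
    inside = (x >= xlo) & (x <= xhi)
    log_density = np.where(inside,
                           np.log(w) - np.log(xhi - xlo),
                           np.log(1.0 - w) - np.log(1.0 - (xhi - xlo)))
    # points inside the zoom are downweighted by a factor (xhi - xlo)/w
    return -log_density
```

The real implementation interpolates a list of such intervals with the auxiliary variable t, but the bookkeeping is the same: whatever density the modified prior puts on a region has to be compensated in the likelihood.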
Preliminary results, sampling a Gaussian with … [figure]
Thanks for looking into this. My thought was that even with a "wrong" transformation, the sampler could zoom back out to the original prior. But yes, the transform should ideally be built a bit broader than the expected posterior, so that nested sampling is actually zooming in (which is efficient) and does not need to zoom out and navigate the difficult funnel. That is, the modified likelihood should monotonically rise to the peak, like the original likelihood. The current … In addition, there is the problem that parameter space is currently cut off by the zooming, which can lose posterior mass. Probably the transform should give 1% to the original full prior space and 99% to the zoomed space, or something like that.
I agree with these considerations, but I do not think that is the problem with the current implementation: inference is slow regardless of the breadth of the transform, since the aux_likelihood always peaks at (its untransformed peak, t=1), and the prior volume compression needed to reach that region is the same as in the original problem.
Indeed, in the implementation I'm drafting, which is sketched in the figure in my previous comment, some space is reserved for the original prior - I think even 50% is fine, since it should just take …
I am also a bit worried about the rectangles, as they introduce many corners, and therefore the modified likelihood contours become non-elliptical and highly inefficient to sample. Would applying your curve in a spherically symmetric way be feasible? Applying an affine transformation could take the scale in each variable into account -- I think the code for this is already there for the student-t case.
I agree that corners are to be avoided, and indeed I iterated towards a smoother parameterization - with the interpolator I'm using now in the PR, the derivative of the deformation map (whose log is shown as a dashed line) is constrained to be smooth. I'm thinking of implementing some refinement algorithm that starts from the full guess CDF sampled at all the posterior points and removes sampling points (greedily?) until the transform is "smooth enough". I don't quite understand the points about symmetry and the multivariate Student-t distribution. The approach I'm proposing is essentially non-parametric: the marginal posterior guesses are allowed to have skewness or even be multimodal - this seems like a good feature to have, as long as the transform remains smooth enough, right?
Maybe it is okay. I really liked your insightful visualizations of the transformation. Could you make a 2D one with circles, looking at how a randomly sampled point with circles around it would look in the transformed space? Something like:

```python
import numpy as np
import matplotlib.pyplot as plt

ctr = np.random.uniform(size=2)
length = np.random.uniform()
theta = np.linspace(0, 2 * np.pi, 200)  # full circle
vec = np.zeros((len(theta), len(ctr))) + ctr.reshape((1, -1))
vec[:, 0] += length * np.sin(theta)
vec[:, 1] += length * np.cos(theta)
# prior_transform: the transform under discussion (assumed to accept an array of points)
vec_transformed = prior_transform(vec)
plt.plot(vec_transformed[:, 0], vec_transformed[:, 1])
```

I have trouble imagining how this typically looks as the circles cross the "compression region", whether they look plus-shaped or boxy.
It looks something like this: In the pre-transform space, the circles all have radius 0.05 and their centers are randomly sampled in [0.05, 0.95]^2; the transform is computed based on a 2D Gaussian at the center of the unit box with standard deviations equal to 0.05. Here is a zoom on the transition region: (This is with the new transform I'm proposing, which I think is what you were referring to, right?)
By the way, a neat thing about this way of doing it is that the … Currently, I'm working on a data-driven way to downsample the CDF to make a smooth spline. I like the spline approach since it's naturally differentiable but also flexible, so we can be sure to be correcting for the volume by exactly the right amount.
Thanks for the visualisations. That's enlightening and looks encouraging. It shows perhaps that it is important for the compression region to be conservative, because if ellipsoids are laid around the live points and the padding of these ellipsoids extends outside the compression region, then they are strongly elongated. It shouldn't be too bad because 1) the sampling density in that stretched outside part should be lower than in the compression region, and 2) the wrapping ellipsoid in u-space and t-space of ultranest can cut off sampling there if the likelihood is Gaussian-like.
Splines sound scary to me because they can be unstable. Hope the code is not becoming too complicated.
Right, if the splines can somehow work they will generally need to be heavily downsampled; my idea for determining how much to downsample (which I haven't really tried implementing yet) is to do a sort of k-fold validation, using as a cost function the cross-entropy between the distribution estimated from a subset of the points and another disjoint subset of points. Probably this analysis is overkill, and it would be enough to use some heuristic to estimate a sensible value (I guess this goes for the splines but also for the hyperparameters of any method chosen).
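Roughly, the idea would be something like this (an illustrative sketch only: the PCHIP spline, the knot counts and the toy samples are placeholder choices, not the PR code):

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def score_knots(train, test, n_knots):
    """Fit a monotone spline CDF on `train` using n_knots quantile knots,
    then return the mean log-density it assigns to the held-out `test` set."""
    qs = np.linspace(0.0, 1.0, n_knots)
    knots = np.quantile(train, qs)
    knots[0], knots[-1] = 0.0, 1.0            # cover the whole unit interval
    cdf = PchipInterpolator(knots, qs)        # monotone cubic CDF estimate
    pdf = cdf.derivative()
    return np.mean(np.log(np.clip(pdf(test), 1e-300, None)))

# toy 1D "posterior" samples on the unit interval
rng = np.random.default_rng(1)
samples = np.clip(rng.normal(0.5, 0.05, size=2000), 0.0, 1.0)
train, test = samples[::2], samples[1::2]
for n in (5, 10, 20, 40, 80):
    print(n, score_knots(train, test, n))     # stop refining once the score plateaus
```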
Summary
I have been experimenting with warm starting on a toy problem, and I found some strange behaviour. The sampling seems to reach the correct region quickly, as expected, but then it takes a long time to converge to the right value of $Z$. Also, sometimes the estimate for $Z$ is biased (true value several sigmas from the estimate).
Description
The toy problem is: an $n$-dimensional Gaussian likelihood with mean $\mu = 0.5$ on every axis and a small $\sigma \sim 10^{-2}$. The prior transform is the identity for the unit cube. The evidence is then expected to be $Z \approx 1$ (to a very good approximation).
The point I'm making here also shows in 1D, but the script is versatile: it can run in higher dimensions if desired. The run times are reported for a 3D case; the trace plot is in 1D.
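For orientation, the plain-NS side of the setup looks roughly like this (an illustrative sketch, not the actual script mentioned below; `ndim`, `sigma` and the run settings are placeholders):

```python
import scipy.stats
import ultranest

ndim = 3
sigma = 1e-2
param_names = ["x%d" % i for i in range(ndim)]

def transform(u):
    # identity prior transform on the unit cube
    return u

def loglike(theta):
    # n-dimensional Gaussian centred at 0.5 with width sigma on each axis
    return scipy.stats.norm(0.5, sigma).logpdf(theta).sum()

sampler = ultranest.ReactiveNestedSampler(param_names, loglike, transform)
result = sampler.run(frac_remain=1e-3)
print(result["logz"], result["logzerr"])  # expected: logZ close to 0
```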
I am comparing a regular NS run to runs done with auxiliary likelihoods and priors obtained with `get_auxiliary_contbox_parameterization`, with the contours obtained from "guesses" as described below (Gaussians of width $k\sigma$ centred on the truth).

In the language of the SuperNest paper by Petrosyan and Handley, the KL divergence between the original prior and the posterior is (in the $\sigma \ll 1$ approximation)

$$\mathcal{D}_{\pi}(\mathcal{P}) \approx -\frac{1}{2} (1+\log(2\pi)) - \log \sigma$$

per dimension, which comes out to about 3.2 nats for $\sigma=10^{-2}$.
With the guesses, on the other hand, we are going from a Gaussian modified prior with width $k\sigma$ to a Gaussian posterior with width $\sigma$, therefore (the same result as here, but in nats)

$$\mathcal{D}_{\tilde{\pi}}(\mathcal{P}) = \log k + \frac{1}{2} \left( \frac{1}{k^{2}} - 1 \right)$$
The examples I'm considering are $k = [0.5, 1, 2]$, with corresponding distances $[0.81, 0, 0.32]$ nats.
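As a quick sanity check of these numbers (just my own arithmetic, not part of the script):

```python
import numpy as np

sigma = 1e-2
# KL of the posterior against the original flat prior, per dimension
print(-0.5 * (1 + np.log(2 * np.pi)) - np.log(sigma))    # ~3.19 nats

# KL of the posterior against Gaussian guesses of width k * sigma
for k in (0.5, 1.0, 2.0):
    print(k, np.log(k) + 0.5 * (1.0 / k**2 - 1.0))        # ~0.81, 0, 0.32
```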
The prior being equal to the posterior is a degenerate case, of course, but this still indicates that we should expect a good speed up!
Instead, when I run with the auxiliary sampler, the time performance is sometimes worse.
Also, although the evidence errors are indeed smaller (as they should be), in the case of a too-thin prior the evidence is underestimated (and the error is not correctly estimated).
The script is as shown below.
This is what happens when `frac_remain` is set to a low number ($10^{-3}$) in all cases; if this is higher ($0.5$), things are closer to the expectations.

However, this does not clear up the issue: why does the sampler "get stuck" in the same region, making very slow progress on the last contributions to the integral?
Here is a trace plot for the same problem, with `frac_remain=1e-3` but in 1D, in the correctly estimated standard deviation case.

What is going on? Why are the points getting more spread out?