Pitfalls in the Bayesian GP-LVM GPyTorch implementation? #1879
-
Hi Joaquin, thanks for the message above. With these models there are several subtle variations possible; I will try to list them below. The classes have been written to support several of these variations without needing to modify any of the base code. But I will also clarify some other things below.
I definitely think the closed-form psi statistics for the SE kernel would improve the implementation, and perhaps @gpleiss would welcome a PR for that? But we should try to keep the whole thing as generalised as possible, i.e. it should work with any kernel or likelihood.
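For reference, the closed-form psi statistics for the SE/ARD-RBF kernel under a factorised Gaussian $q(X)$ can be sketched as below. This is a hypothetical NumPy helper written for this discussion (the name `psi_statistics_rbf` is not from GPyTorch, GPy, or GPflow); the formulas follow the closed forms in Titsias and Lawrence, 2010, so please double-check against the paper before using them:

```python
import numpy as np

def psi_statistics_rbf(mu, S, Z, variance, lengthscale):
    """Closed-form psi statistics for the ARD RBF (SE) kernel under
    q(X) = prod_n N(x_n; mu_n, diag(S_n)).

    mu, S: (N, Q) means and per-dimension variances of q(X)
    Z: (M, Q) inducing inputs
    variance: kernel signal variance sigma_f^2
    lengthscale: (Q,) ARD lengthscales
    Returns psi0 (scalar), Psi1 (N, M), Psi2 (M, M).
    """
    N, Q = mu.shape
    ell2 = np.asarray(lengthscale, float) ** 2  # (Q,)

    # psi0 = sum_n <k(x_n, x_n)> = N * sigma_f^2 for a stationary kernel
    psi0 = N * variance

    # Psi1[n, m] = <k(x_n, z_m)>_{q(x_n)}
    d1 = S + ell2                                     # (N, Q)
    diff = mu[:, None, :] - Z[None, :, :]             # (N, M, Q)
    norm1 = np.prod(np.sqrt(ell2 / d1), axis=1)       # = prod_q 1/sqrt(S/ell^2 + 1)
    Psi1 = variance * norm1[:, None] * np.exp(
        -0.5 * np.sum(diff ** 2 / d1[:, None, :], axis=2))

    # Psi2[m, m'] = sum_n <k(x_n, z_m) k(x_n, z_m')>_{q(x_n)}
    d2 = 2.0 * S + ell2                               # (N, Q)
    Zbar = 0.5 * (Z[:, None, :] + Z[None, :, :])      # midpoints (M, M, Q)
    dZ = Z[:, None, :] - Z[None, :, :]
    expo_z = -0.25 * np.sum(dZ ** 2 / ell2, axis=2)   # (M, M)
    dmu = mu[:, None, None, :] - Zbar[None, :, :, :]  # (N, M, M, Q)
    expo_mu = -np.sum(dmu ** 2 / d2[:, None, None, :], axis=3)
    norm2 = np.prod(np.sqrt(ell2 / d2), axis=1)       # (N,)
    Psi2 = variance ** 2 * np.exp(expo_z) * np.sum(
        norm2[:, None, None] * np.exp(expo_mu), axis=0)
    return psi0, Psi1, Psi2
```

A useful sanity check: as the latent variances $S \to 0$, $\Psi_1$ reduces to the plain kernel matrix $K_{NM}$ evaluated at the means, and $\Psi_2$ reduces to $K_{MN} K_{NM}$.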
-
Thanks for your reply @vr308. Your implementation is very elegant in its generality and simplicity, but I want to know how it compares with previous ones; I think this is a crucial comparison. I am surprised by the large difference between the lower bounds achieved by the implementation I proposed (~9000) and yours (~ -5), and I want to understand the reason for it. Could it be due to initial conditions? Could it be that our lower bounds ignore constants with respect to the model parameters? Another way of comparing these implementations is through predictions, as described in Section 4 of Titsias and Lawrence, 2010. I will try to support these predictions in the implementation I proposed.

A crucial difference is that the implementation I proposed does not require estimating the variational distribution over the inducing points. This estimation can be quite costly, especially when using a large number of inducing points. Note that we can avoid estimating the variational distribution over inducing points for any kernel; what is valid only for the SE kernel (or for the linear kernel) is the availability of closed-form solutions for the psi statistics. Also note that integrating out the variational distribution over the inducing points assumes that the inducing points are independent of the latents, which may be a poor assumption, and your stochastic estimation may work better in that case.

If we conclude that the implementation I proposed performs better than yours, it would be interesting to understand whether this difference is a consequence of not estimating the variational distribution over inducing points, or a consequence of using closed-form expressions for the psi statistics. In any case, it should be useful to compare the performance of our implementations. I will start by supporting predictions on new data in the implementation I proposed. Would it be difficult to support these predictions in your implementation? Thanks again, Joaquin
-
@vr308 @gpleiss
It is useful that Titsias and Lawrence, 2010 derived analytically the optimal variational distribution over the inducing points, which allowed them to remove this distribution from the lower bound. That is, in Titsias and Lawrence, 2010, the variational distribution over the inducing points is not estimated. However, in the current GPyTorch Bayesian GPLVM implementation, it is estimated.
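For concreteness, the collapsed per-dimension bound has the following form (a sketch from memory of the expression in Titsias and Lawrence, 2010, with $\beta = \sigma^{-2}$; check the paper before relying on the constants):

$$
\tilde{F}_d = \log \left[ \frac{\beta^{N/2} \, |K_{MM}|^{1/2}}{(2\pi)^{N/2} \, |\beta \Psi_2 + K_{MM}|^{1/2}} \, e^{-\frac{1}{2} \mathbf{y}_d^{\top} W \mathbf{y}_d} \right] - \frac{\beta \psi_0}{2} + \frac{\beta}{2} \operatorname{tr}\!\left(K_{MM}^{-1} \Psi_2\right),
$$

where $W = \beta I_N - \beta^2 \Psi_1 (\beta \Psi_2 + K_{MM})^{-1} \Psi_1^{\top}$ and the psi statistics are the kernel expectations under $q(X)$: $\psi_0 = \sum_n \langle k(\mathbf{x}_n, \mathbf{x}_n)\rangle$, $(\Psi_1)_{nm} = \langle k(\mathbf{x}_n, \mathbf{z}_m)\rangle$, $(\Psi_2)_{mm'} = \sum_n \langle k(\mathbf{x}_n, \mathbf{z}_m)\, k(\mathbf{x}_n, \mathbf{z}_{m'})\rangle$. The full bound sums $\tilde{F}_d$ over output dimensions and subtracts $\mathrm{KL}(q(X) \| p(X))$; the optimal $q(U)$ never appears because it has been marginalised analytically.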
Also, from the introduction of the GPyTorch Bayesian GPLVM notebook it seems that the GPyTorch implementation uses diagonal matrices with equal values along the diagonal as the covariance matrices of the variational distribution over the latents (i.e., $q(X) = \prod_{n=1}^{N}\mathcal{N}(\mathbf{x}_{n}; \mathbf{\mu}_{n}, s_{n}\mathbb{I}_{Q})$). However, in Titsias and Lawrence, 2010, these diagonal matrices are not constrained to have equal values along the diagonal.
To test the relevance of the points above, I wrote a new implementation of Bayesian GPLVM based on GPyTorch, borrowing code from GPflow and GPy: repository, Google Colab. I posted a few derivations documenting the GPflow implementation here. I used the psi statistics from the GPy implementation. From GPyTorch I used the RBF kernel as well as the parameter constraints.
The previous points seem to make a difference in the quality of the model estimates. I estimated a model with the same data and initial parameters used to generate Figure 1 in Titsias and Lawrence, 2010, and the latents and lengthscales estimated by my implementation are closer to those estimated by Titsias and Lawrence, 2010 than the estimates from the GPyTorch implementation. Compare the two-dimensional latent subspace estimated by my implementation here with that of the current GPyTorch implementation here.
The ELBO achieved with my implementation is on the order of 9000, while that of the GPyTorch implementation is on the order of -5. I don't understand the reason for this large difference. After estimating the parameters of the GPyTorch GPLVM model, following this notebook, I calculated the lower bound over the whole dataset and obtained a value on that order.
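As a sanity check on reported bound values, the collapsed bound can be evaluated directly for a single output dimension. Below is a minimal, self-contained NumPy sketch written for this discussion (the names `rbf` and `collapsed_bound` are hypothetical, not from either implementation); it assumes an ARD RBF kernel and omits the $\mathrm{KL}(q(X)\|p(X))$ term, so it is not the full ELBO:

```python
import numpy as np

def rbf(A, B, var, ell):
    """ARD RBF kernel matrix between the rows of A and B."""
    d = (A[:, None, :] - B[None, :, :]) / ell
    return var * np.exp(-0.5 * np.sum(d ** 2, axis=2))

def collapsed_bound(y, mu, S, Z, var, ell, noise):
    """Collapsed lower bound on log p(y) for one output dimension,
    with q(U) marginalised analytically (Titsias & Lawrence style).
    y: (N,) outputs; mu, S: (N, Q) means/variances of q(X);
    Z: (M, Q) inducing inputs; var, ell, noise: kernel/noise params.
    The KL(q(X) || p(X)) term is deliberately omitted."""
    N, Q = mu.shape
    M = Z.shape[0]
    beta = 1.0 / noise
    ell2 = np.asarray(ell, float) ** 2

    # closed-form psi statistics for the RBF kernel under q(X)
    psi0 = N * var
    d1 = S + ell2
    Psi1 = var * np.prod(np.sqrt(ell2 / d1), axis=1)[:, None] * np.exp(
        -0.5 * np.sum((mu[:, None, :] - Z[None, :, :]) ** 2
                      / d1[:, None, :], axis=2))
    d2 = 2.0 * S + ell2
    Zbar = 0.5 * (Z[:, None, :] + Z[None, :, :])
    ez = np.exp(-0.25 * np.sum((Z[:, None, :] - Z[None, :, :]) ** 2 / ell2,
                               axis=2))
    em = np.exp(-np.sum((mu[:, None, None, :] - Zbar[None, :, :, :]) ** 2
                        / d2[:, None, None, :], axis=3))
    Psi2 = var ** 2 * ez * np.sum(
        np.prod(np.sqrt(ell2 / d2), axis=1)[:, None, None] * em, axis=0)

    Kmm = rbf(Z, Z, var, ell) + 1e-8 * np.eye(M)  # jitter for stability
    A = beta * Psi2 + Kmm
    # collapsed Gaussian term with W = beta*I - beta^2 Psi1 A^{-1} Psi1^T
    Wy = beta * y - beta ** 2 * Psi1 @ np.linalg.solve(A, Psi1.T @ y)
    F = (0.5 * (N * np.log(beta) + np.linalg.slogdet(Kmm)[1]
                - np.linalg.slogdet(A)[1])
         - 0.5 * N * np.log(2.0 * np.pi) - 0.5 * y @ Wy)
    # trace corrections from integrating out the latent function values
    F += -0.5 * beta * psi0 + 0.5 * beta * np.trace(np.linalg.solve(Kmm, Psi2))
    return F
```

Two properties make this easy to check: with $S = 0$ and $Z = X$ the bound collapses to the exact Gaussian log marginal likelihood, and with fewer inducing points it stays below it. Comparing numbers this way requires making sure both implementations report the bound on the same scale (e.g. summed over the dataset rather than averaged per data point), which could itself account for a large apparent discrepancy.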
It seems that my new implementation of Bayesian GP-LVM performs better than the current GPyTorch one. However, to be certain about this, a more thorough comparison is needed. If building a better implementation of Bayesian GP-LVM in GPyTorch is of interest to the community, I could try to do it. I would appreciate feedback.
Joaquin