Pitfalls in the Bayesian GP-LVM GPyTorch implementation? #1879
-
Hi Joaquin, thanks for the message above. With these models there are several subtle variations possible; I will try to list them below. The classes have been written to support several of these variations without needing to modify any of the base code. But I will also clarify some other things below.
I definitely think the closed-form psi statistics for the SE kernel would improve the implementation, and perhaps @gpleiss would welcome a PR for that? But we should try to keep the whole thing as generalised as possible, i.e. it should work with any kernel or likelihood.
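For reference, the closed-form psi statistics for the SE/ARD-RBF kernel under a factorised Gaussian $q(X)$ can be sketched as below. This is a hypothetical NumPy helper written for this discussion (the name `psi_statistics_rbf` is not from GPyTorch, GPy, or GPflow); the formulas follow the closed forms in Titsias and Lawrence, 2010, so please double-check against the paper before using them:

```python
import numpy as np

def psi_statistics_rbf(mu, S, Z, variance, lengthscale):
    """Closed-form psi statistics for the ARD RBF (SE) kernel under
    q(X) = prod_n N(x_n; mu_n, diag(S_n)).

    mu, S: (N, Q) means and per-dimension variances of q(X)
    Z: (M, Q) inducing inputs
    variance: kernel signal variance sigma_f^2
    lengthscale: (Q,) ARD lengthscales
    Returns psi0 (scalar), Psi1 (N, M), Psi2 (M, M).
    """
    N, Q = mu.shape
    ell2 = np.asarray(lengthscale, float) ** 2  # (Q,)

    # psi0 = sum_n <k(x_n, x_n)> = N * sigma_f^2 for a stationary kernel
    psi0 = N * variance

    # Psi1[n, m] = <k(x_n, z_m)>_{q(x_n)}
    d1 = S + ell2                                     # (N, Q)
    diff = mu[:, None, :] - Z[None, :, :]             # (N, M, Q)
    norm1 = np.prod(np.sqrt(ell2 / d1), axis=1)       # = prod_q 1/sqrt(S/ell^2 + 1)
    Psi1 = variance * norm1[:, None] * np.exp(
        -0.5 * np.sum(diff ** 2 / d1[:, None, :], axis=2))

    # Psi2[m, m'] = sum_n <k(x_n, z_m) k(x_n, z_m')>_{q(x_n)}
    d2 = 2.0 * S + ell2                               # (N, Q)
    Zbar = 0.5 * (Z[:, None, :] + Z[None, :, :])      # midpoints (M, M, Q)
    dZ = Z[:, None, :] - Z[None, :, :]
    expo_z = -0.25 * np.sum(dZ ** 2 / ell2, axis=2)   # (M, M)
    dmu = mu[:, None, None, :] - Zbar[None, :, :, :]  # (N, M, M, Q)
    expo_mu = -np.sum(dmu ** 2 / d2[:, None, None, :], axis=3)
    norm2 = np.prod(np.sqrt(ell2 / d2), axis=1)       # (N,)
    Psi2 = variance ** 2 * np.exp(expo_z) * np.sum(
        norm2[:, None, None] * np.exp(expo_mu), axis=0)
    return psi0, Psi1, Psi2
```

A useful sanity check: as the latent variances $S \to 0$, $\Psi_1$ reduces to the plain kernel matrix $K_{NM}$ evaluated at the means, and $\Psi_2$ reduces to $K_{MN} K_{NM}$.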
-
Thanks for your reply @vr308. Your implementation is very elegant in its generality and simplicity, but I want to know how it compares with previous ones; I think this is a crucial comparison. I am surprised by the large difference between the lower bounds achieved by the implementation I proposed (~9000) and yours (~ -5), and I want to understand the reason for it. Could it be due to initial conditions? Could it be that our lower bounds ignore constants with respect to the model parameters? Another way of comparing these implementations is through predictions, as described in Section 4 of Titsias and Lawrence, 2010. I will try to support these predictions in the implementation I proposed.

A crucial difference is that the implementation I proposed does not require estimating the variational distribution over the inducing points. This estimation can be quite costly, especially when using a large number of inducing points. Note that we can avoid estimating the variational distribution over inducing points for any kernel; what is valid only for the SE kernel (or for the linear kernel) is the availability of closed-form solutions for the psi statistics. Also note that integrating out the variational distribution over the inducing points assumes that the inducing points are independent of the latents, which may be a poor assumption, and your stochastic estimation may work better in that case.

If we conclude that the implementation I proposed performs better than yours, it would be interesting to understand whether this difference is a consequence of not estimating the variational distribution over inducing points, or a consequence of using closed-form expressions for the psi statistics. In any case, it should be useful to compare the performance of our implementations. I will start by supporting predictions on new data in the implementation I proposed. Would it be difficult to support these predictions in your implementation? Thanks again, Joaquin
-
@vr308 @gpleiss
It is useful that Titsias and Lawrence, 2010 derived analytically the optimal variational distribution over the inducing points, which allowed them to remove this distribution from the lower bound. That is, in Titsias and Lawrence, 2010, the variational distribution over the inducing points is not estimated. However, in the current GPyTorch Bayesian GPLVM implementation, it is estimated.
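For concreteness, the collapsed per-dimension bound has the following form (a sketch from memory of the expression in Titsias and Lawrence, 2010, with $\beta = \sigma^{-2}$; check the paper before relying on the constants):

$$
\tilde{F}_d = \log \left[ \frac{\beta^{N/2} \, |K_{MM}|^{1/2}}{(2\pi)^{N/2} \, |\beta \Psi_2 + K_{MM}|^{1/2}} \, e^{-\frac{1}{2} \mathbf{y}_d^{\top} W \mathbf{y}_d} \right] - \frac{\beta \psi_0}{2} + \frac{\beta}{2} \operatorname{tr}\!\left(K_{MM}^{-1} \Psi_2\right),
$$

where $W = \beta I_N - \beta^2 \Psi_1 (\beta \Psi_2 + K_{MM})^{-1} \Psi_1^{\top}$ and the psi statistics are the kernel expectations under $q(X)$: $\psi_0 = \sum_n \langle k(\mathbf{x}_n, \mathbf{x}_n)\rangle$, $(\Psi_1)_{nm} = \langle k(\mathbf{x}_n, \mathbf{z}_m)\rangle$, $(\Psi_2)_{mm'} = \sum_n \langle k(\mathbf{x}_n, \mathbf{z}_m)\, k(\mathbf{x}_n, \mathbf{z}_{m'})\rangle$. The full bound sums $\tilde{F}_d$ over output dimensions and subtracts $\mathrm{KL}(q(X) \| p(X))$; the optimal $q(U)$ never appears because it has been marginalised analytically.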
Also, from the introduction of the GPyTorch Bayesian GPLVM notebook it seems that the GPyTorch implementation uses diagonal matrices with equal values along the diagonal as the covariance matrices of the variational distribution over the latents (i.e., $q(X) = \prod_{n=1}^{N}\mathcal{N}(\mathbf{x}_{n}; \mathbf{\mu}_{n}, s_{n}\mathbb{I}_{Q})$). However, in Titsias and Lawrence, 2010, these diagonal matrices are not constrained to have equal values along the diagonal.
To test the relevance of the points above, I wrote a new implementation of Bayesian GPLVM based on GPyTorch, borrowing code from GPflow and GPy: repository, Google Colab. I posted a few derivations documenting the GPflow implementation here. I used the psi statistics from the GPy implementation. From GPyTorch I used the RBF kernel as well as the parameter constraints.
The previous points seem to make a difference in the quality of the model estimates. I estimated a model with the same data and initial parameters used to generate Figure 1 in Titsias and Lawrence, 2010, and the latents and lengthscales estimated by my implementation are closer to those estimated by Titsias and Lawrence, 2010 than the estimates from the GPyTorch implementation. Compare the two-dimensional latent subspace estimated by my implementation here with that of the current GPyTorch implementation here.
The ELBO achieved with my implementation is on the order of 9000, while that of the GPyTorch implementation is on the order of -5. I don't understand the reason for this large difference. After estimating the parameters of the GPyTorch GPLVM model, following this notebook, I calculated the lower bound over the whole dataset and obtained a value on that order.
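As a sanity check on reported bound values, the collapsed bound can be evaluated directly for a single output dimension. Below is a minimal, self-contained NumPy sketch written for this discussion (the names `rbf` and `collapsed_bound` are hypothetical, not from either implementation); it assumes an ARD RBF kernel and omits the $\mathrm{KL}(q(X)\|p(X))$ term, so it is not the full ELBO:

```python
import numpy as np

def rbf(A, B, var, ell):
    """ARD RBF kernel matrix between the rows of A and B."""
    d = (A[:, None, :] - B[None, :, :]) / ell
    return var * np.exp(-0.5 * np.sum(d ** 2, axis=2))

def collapsed_bound(y, mu, S, Z, var, ell, noise):
    """Collapsed lower bound on log p(y) for one output dimension,
    with q(U) marginalised analytically (Titsias & Lawrence style).
    y: (N,) outputs; mu, S: (N, Q) means/variances of q(X);
    Z: (M, Q) inducing inputs; var, ell, noise: kernel/noise params.
    The KL(q(X) || p(X)) term is deliberately omitted."""
    N, Q = mu.shape
    M = Z.shape[0]
    beta = 1.0 / noise
    ell2 = np.asarray(ell, float) ** 2

    # closed-form psi statistics for the RBF kernel under q(X)
    psi0 = N * var
    d1 = S + ell2
    Psi1 = var * np.prod(np.sqrt(ell2 / d1), axis=1)[:, None] * np.exp(
        -0.5 * np.sum((mu[:, None, :] - Z[None, :, :]) ** 2
                      / d1[:, None, :], axis=2))
    d2 = 2.0 * S + ell2
    Zbar = 0.5 * (Z[:, None, :] + Z[None, :, :])
    ez = np.exp(-0.25 * np.sum((Z[:, None, :] - Z[None, :, :]) ** 2 / ell2,
                               axis=2))
    em = np.exp(-np.sum((mu[:, None, None, :] - Zbar[None, :, :, :]) ** 2
                        / d2[:, None, None, :], axis=3))
    Psi2 = var ** 2 * ez * np.sum(
        np.prod(np.sqrt(ell2 / d2), axis=1)[:, None, None] * em, axis=0)

    Kmm = rbf(Z, Z, var, ell) + 1e-8 * np.eye(M)  # jitter for stability
    A = beta * Psi2 + Kmm
    # collapsed Gaussian term with W = beta*I - beta^2 Psi1 A^{-1} Psi1^T
    Wy = beta * y - beta ** 2 * Psi1 @ np.linalg.solve(A, Psi1.T @ y)
    F = (0.5 * (N * np.log(beta) + np.linalg.slogdet(Kmm)[1]
                - np.linalg.slogdet(A)[1])
         - 0.5 * N * np.log(2.0 * np.pi) - 0.5 * y @ Wy)
    # trace corrections from integrating out the latent function values
    F += -0.5 * beta * psi0 + 0.5 * beta * np.trace(np.linalg.solve(Kmm, Psi2))
    return F
```

Two properties make this easy to check: with $S = 0$ and $Z = X$ the bound collapses to the exact Gaussian log marginal likelihood, and with fewer inducing points it stays below it. Comparing numbers this way requires making sure both implementations report the bound on the same scale (e.g. summed over the dataset rather than averaged per data point), which could itself account for a large apparent discrepancy.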
It seems that my new implementation of Bayesian GP-LVM performs better than the current GPyTorch one. However, to be certain about this, a more thorough comparison is needed. If building a better implementation of Bayesian GP-LVM in GPyTorch is of interest to the community, I could try to do it. I would appreciate feedback.
Joaquin