
HMC #265

Merged · 61 commits merged into probmods:dev · Jan 11, 2016
Conversation

@null-a (Member) commented Nov 9, 2015

This is a cleaned up version of my gradients branch. There are a few loose ends to tie up (see below) but I don't expect them to change things much, so now might be a good time to start reviewing my changes.

This PR includes:

  • ad.js integration
  • A revised MCMC interface which allows per-kernel options to be specified
  • A HMC kernel which can be used with MCMC and SMC

Using HMC with the new interface looks like this:

MCMC(model, { samples: 100, kernel: 'HMC' })
SMC(model, { particles: 100, rejuvSteps: 5, rejuvKernel: 'HMC' })
MCMC(model, { samples: 100, kernel: { sequence: { kernels: ['HMC', { MH: { discreteOnly: true } }] } } })

Before use, the ERP score code needs to be transformed; this is now done like so:

./scripts/transformERP

This should happen automatically after an npm install. If erp.js is missing when webppl is run, a descriptive error message is shown.

Still to do

  • Switch to the latest ad.js for tensor support. (See Merge variational inference #259.) Test Dirichlet. (See Should Dirichlet scorer check that values sum to 1? #275.) Switch to the version of isTape exported by ad.js (in aggregation.js) once available. Postponed; will tackle as a separate piece of work.
  • Avoid unnecessarily creating tapes.
  • Remove TODO notes. I've intentionally left these in for now as they might be useful while the PR is being reviewed.
  • Add more continuous and mixed continuous/discrete tests; we only have one or two right now. Suggestions for good tests welcome!
  • Add an HMC function or some other convenient way of saying kernel: { sequence: { kernels: ['HMC', { MH: { discreteOnly: true } }] } }. Thoughts?
  • Update inference docs.
  • Update MAP and entropy methods on ERP. (Not needed if we drop constraint stuff.)
  • Maybe: Add erp.js to repo.
  • Maybe: Reject proposal rather than throw error on structure change, if we think doing so is correct. Discussed in Implement HMC #81.

Speed

I ran a few quick benchmarks, and in the model I tested, the time spent doing inference increased by ~20% using MH and by ~40% using SMC after these changes. I've not spent any time trying to optimize that away. The inference tests are probably quite a bit more than 40% slower because of the increase in the time taken to compile programs.

Future work

As discussed in #81, there are several things not included here that we might like in the future:

  • Constraint handling by transformation. (Do we need to handle hard factor statements?)
  • Step size adaptation.
  • Support for nesting models.

The current test is hard to understand. It's also likely to generate
false positives as there are several unexpected values which e1 and e2
can take where the model returns true. (e.g. "isNaN({} + {})".) This
would not cause the test to fail as returning true with probability 1 is
within the histogram tolerance of most inference tests.

This new deterministic cache test would have failed prior to the recent
update which fixed serialization of ±Infinity.

This reverts the change to the stochastic test made in 875812b.
Functions which operate on reals need to be transformed for AD.
This is preferable for consistency, and because it avoids adding a non-WebPPL
function as a global variable.
@stuhlmueller (Member)

This looks good overall -- and like it was a lot of work! Thanks!

Are there any examples where the current implementation of HMC does visibly better than plain MH? Some model with multiple continuous variables that are strongly correlated under the posterior so that all need to be updated jointly? It would be good to include at least one such example in the tests.
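
For example, something along these lines (a rough sketch, not code from this PR; it assumes the usual gaussian/factor header functions and the ERP score(params, val) signature used elsewhere in this thread, and the model name and parameters are illustrative only):

    // Two standard Gaussians tied together by a factor so that the posterior
    // is a narrow ridge; single-site MH should mix slowly here, while HMC's
    // joint gradient updates should do visibly better.
    var correlatedGaussians = function() {
      var x = gaussian(0, 1);
      var y = gaussian(0, 1);
      factor(gaussianERP.score([0, 0.05], x - y));
      return x + y;
    };

    MCMC(correlatedGaussians, { samples: 1000, kernel: 'HMC' });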

In general, the main thing we should convince ourselves of is that the algorithm is correctly implemented. I'd add tests that include various sorts of dependencies between discrete and continuous erps and factors. For example:

  • Direct dependencies (a concrete sketch with specific ERPs follows this list):

    var x = sample(continuousERP)
    var y = sample(continuousERP, [x])
    
    var x = sample(discreteERP)
    var y = sample(continuousERP, [x])
    
    var x = sample(continuousERP)
    var y = sample(discreteERP, [x])
    
  • Indirect dependencies: Each of the three cases above, but with x transformed by a compound function that calls out to a primitive function:

    var f = function(x) {
      return Math.log(x + x);
    }
    
  • Single factor depending on multiple ERPs: Each of the six cases above, but including a factor with a score that depends on a function of x and y (probably need to write a different function for each combination of discrete/continuous).

  • Multiple factors depending on a single ERP

  • More complicated programs:

    var x = sample(continuousERP)
    var y = sample(continuousERP, [f(x)])
    var w = sample(discreteERP, [g(x, y)])
    factor(h(x, y, w))
    var z = sample(continuousERP, [i(x, y, w)])
    factor(j(y, z, w))
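
To make the direct-dependency cases above concrete, here's a sketch with specific ERPs filled in (the ERP choices, parameters, and names are my own illustration, not the PR's actual tests):

    // Continuous parent -> continuous child (CC)
    var cc = function() {
      var x = gaussian(0, 1);
      return gaussian(x, 1);
    };

    // Discrete parent -> continuous child (DC)
    var dc = function() {
      var x = flip(0.5);
      return gaussian(x ? 2 : -2, 1);
    };

    // Continuous parent -> discrete child (CD)
    var cd = function() {
      var x = gaussian(0, 1);
      return flip(Math.exp(-x * x)); // maps x to a probability in (0, 1]
    };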
    

Other comments:

  • The installation instructions in docs/development/install.rst now include a mention of ./scripts/transformERP, but you say it happens automatically with npm install -- are the extra install instructions unnecessary?
  • I'd rename scoreFunctions.js to something that makes it clear that it's about automatic differentiation (maybe adscorers.js). Also, to make it more like the other transforms, don't export the transform function directly, but rather as a property of an object?
  • Having a list of helpers in scoreFunctions.js seems a bit fragile. Could we have a way to mark the corresponding functions directly in erp.ad.js?
  • As you point out, untapify in src/aggregation.js should probably be integrated into untapify in ad.js. If not, the condition should be factored out as isNonTapeObject or something like that to make it clearer what's going on.
  • Is there a reference for the constraint handling mechanism you implemented, or did you make it up? It looks sensible, but it would be good to have some assurance that it's doing the right thing. At least we'd want some tests that exercise it.
  • Add a comment to erp.ad.js explaining why sum needs to be defined there even though we already have one in util.js?
  • hmckernel.js: Add a reference to the top of the file that points to the main source for the algorithm? (The Radford Neal handbook chapter?)
  • hmckernel.js: Add a comment that explains what ad.yGradientR in line 153 does?
  • What's the role that thisPositionStepCont plays? Add a comment to clarify?
  • What's the motivation for the interface change to Initialize and kernels? In particular, I'm not convinced that kernels should take runWppl as an argument. Maybe it should be part of options, or a method on the trace object?
  • Related to this interface change (but not entirely explained by it): I found src/inference/kernels.js difficult to understand, in particular the top half. I think it would have helped a little to rename MH, HMC, and sequence to MHKernel, HMCKernel, and SequenceKernel respectively.
  • In various places (hmckernel.js, initialize.js, smc.js, mhkernel.js, trace.js), ad is accessed without being explicitly imported. I assume this works because, by the time these modules are imported, ad has already been made available globally? If so, I would feel better if it were imported explicitly. If you want reference to a single instance, do this via env?
  • mhkernel.js:65 - I'd avoid initializing multiple variables in one line, or at least limit it to the case where all are uninitialized. I was confused for a moment, thinking we were destructuring the return value of this.oldTrace.findChoice.
  • In smc.js, I wonder if it would be useful to keep the particle logWeight (~ estimate of normalization constant) tapified. Isn't the gradient of this quantity one of the ingredients we need for variational inference? (If not, how is it different?)
  • Now that the MCMC/SMC interface is getting more complex, we should start documenting it.
  • Regarding a convenient way to say kernel: { sequence: { kernels: ['HMC', { MH: { discreteOnly: true } }] } }: I think your proposed HMC function would be reasonable (see the sketch after this list). We want to make it easier to use HMC+MH than pure HMC, so that users don't accidentally use the latter and then get confused about why the results for mixed continuous-discrete models aren't right.
  • Typos: hmckernel.js: structual->structural; mhkernel.js: indicies -> indices
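
A minimal sketch of what such an HMC helper could look like (hypothetical, just to make the proposal concrete; the exact form is up for discussion):

    // Convenience wrapper: HMC for continuous choices, with MH restricted to
    // discrete choices, run as a single sequence kernel.
    var HMC = function(model, options) {
      var kernel = { sequence: { kernels: ['HMC', { MH: { discreteOnly: true } }] } };
      return MCMC(model, { samples: options.samples, kernel: kernel });
    };

    // Usage: HMC(model, { samples: 100 });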

This is less brittle than the previous approach of maintaining a list in
adscorers.js.
Prior to this commit, when using sequenceKernel, only the final kernel
in the sequence was taken into account when computing the acceptance
ratio.

Query handling had to be modified for a similar reason: it needs to
happen after each kernel in the sequence, not once at the end of the
sequence.
@null-a (Member, Author) commented Nov 20, 2015

are the extra install instructions unnecessary?

You're right. I'd still like to include something about this; would you be in favor of including a section on this under workflow? Something like this.

Having a list of helpers in scoreFunctions.js seems a bit fragile.

Good point. I've changed this to transform functions with names ending in AD. There are more elegant ways, but this is nice and simple. It also makes it easier to see at a glance whether a score function only calls ADified helpers.

As you point out, untapify in src/aggregation.js should probably be integrated into untapify in ad.js.

I've started talking to Sid about this.

Is there a reference for the constraint handling mechanism you implemented

It's from Radford Neal's handbook chapter. Section 5.1, p36.
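
Roughly, the mechanism there modifies the leapfrog position update so that a variable which would step outside its support is reflected back inside and has its momentum negated. A sketch of the idea (not the PR's actual code; lower and upper stand for whatever bounds the variable's support gives):

    // Leapfrog position update for one coordinate with a bounded support
    // (Neal, "MCMC using Hamiltonian dynamics", section 5.1).
    function boundedPositionStep(x, p, stepSize, lower, upper) {
      x = x + stepSize * p;
      // Reflect off the boundaries until the position is back inside.
      while (x < lower || x > upper) {
        if (x < lower) { x = lower + (lower - x); p = -p; }
        if (x > upper) { x = upper - (x - upper); p = -p; }
      }
      return { position: x, momentum: p };
    }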

I don't think the current implementation is very general though. It only checks the constraint which comes from the support of the variable being updated, which I think can come apart when constraints depend on other variables.

It's conceivable that simply rejecting proposals that wind up in areas of zero probability is correct, but I don't know that it is.

In #81 the suggestion was to transform constrained variables into unbounded spaces which sounds reasonable. Assuming we tackle this as a separate piece of work, the immediate decision is whether to stick with what we have, take it out, or improve it. Of those, the first two probably make most sense.

Add a comment to erp.ad.js explaining why sum needs to be defined there even though we already have one in util.js?

This is now sumAD. Is that enough of a hint, or do you still want a comment?

hmckernel.js: Add a comment that explains what ad.yGradientR in line 153 does?

What do you have in mind? I'm happy to do it, I'm just not sure what jumps out to you about this compared to any other part of the algorithm. Does // Compute gradient of score w.r.t. continuous variables cover it?

What's the role that thisPositionStepCont plays? Add a comment to clarify?

This is indeed a little awkward. I've added a comment to explain.

What's the motivation for the interface change to Initialize and kernels? In particular, I'm not convinced that kernels should take runWppl as an argument. Maybe it should be part of options, or a method on the trace object?

The idea is that it's a convenient way to give HMC access to wpplFn and the appropriate store and address needed to run the program from the beginning. Initialize already took these as three separate parameters, it now just takes a single runWppl argument instead. This has the nice side-effect of allowing us to clone the store (before running the program) in just one place.
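
Concretely, I'm assuming runWppl amounts to something like the following closure (a sketch, not the exact code in this PR; it assumes the usual wpplFn(store, continuation, address) calling convention and an underscore-style clone):

    // Close over wpplFn, the initial store and address once; each call re-runs
    // the program from the beginning against a fresh copy of the store.
    var runWppl = function(k) {
      return wpplFn(_.clone(store), k, address);
    };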

I did consider including it on Trace but decided against it because:

  1. You have to include it in the logic for copying/slicing traces
  2. You have to pass it along to each new Trace() in HMC
  3. It doesn't work for Initialize as it doesn't take a Trace

It could go in options, but the problem is that it doesn't play so well with the current approach of gradually building up the kernel by partially applying arguments to the main kernel function. It's not impossible of course (we can do it by wrapping the original function with a function that merges option objects appropriately) but I'm not sure it's worth it in this case. Another way you could go is to question the current approach of partially applying arguments -- I'm happy to talk about that if you like.

I think it would have helped a little to rename MH, HMC, and sequence to MHKernel, HMCKernel, and SequenceKernel respectively.

Done. Also added a comment to parseOptions which might help.

In various places (hmckernel.js, initialize.js, smc.js, mhkernel.js, trace.js), ad is accessed without being explicitly imported.

Indeed, this works for the reason you mention. The reason I didn't do the obvious thing and require('ad.js') in each file is that doing so requires specifying which mode ad.js should use each time require is called. I was happier calling methods on a global ad variable (which has to be there for ad.js anyway) than repeatedly having to say "reverse mode please" everywhere.

I'm sure this could be cleaner, but does accessing it via env.ad buy us much since we'll still have a top-level ad hanging around?

In smc.js, I wonder if it would be useful to keep the particle logWeight (~ estimate of normalization constant) tapified. Isn't the gradient of this quantity one of the ingredients we need for variational inference?

Not sure. Won't we need to change lots of things for VI? Can we worry about it when we need it?

Now that the MCMC/SMC interface is getting more complex, we should start documenting it.

Sure. I've made a start on this here, see what you think. I'll consider this a separate piece of work. i.e. #131.

I think your proposed HMC function would be reasonable.

I've thought about this, and if people already use MCMC routinely then I don't think adding an HMC method is enough. Perhaps we should have the common case of HMC + MH be available using MCMC(model, { kernel: 'HMC' }) and come up with another name for the "HMC only" kernel. Alternatively we could have something like MCMC(model, { kernel: 'HMC+MH' }) be shorthand for the thing you usually want.

Everything else you mention under "Other comments" has been done.

@null-a (Member, Author) commented Nov 23, 2015

I think I've now added most of the tests you suggested. I hope the following makes sense:

| Var Types | Direct Dependence | Indirect Dependence | Direct + Factor | Indirect + Factor |
| --- | --- | --- | --- | --- |
| CC | bivariateGaussian | indirectDependency | bivariateGaussianFactor* | gaussianMeanVar |
| DC | mixed2 | mixed4 | mixed2Factor | mixed4Factor |
| CD | mixed1 | mixed3 | mixed1Factor | mixed3Factor |

e.g. DC = var x = discrete(); var y = continuous(x);

*Note -- in bivariateGaussianFactor the factor depends only on x, not x and y.

The "Multiple factors depending on a single ERP" case might be covered by gaussianMean?

As far as tests go, I think that just leaves your "More complicated programs" suggestion.

@stuhlmueller (Member)

Another reason it seems important to avoid always creating tapes is so that it remains possible to do inference with MH over models which pass numbers to external JS libraries.

Good point!

I have a working implementation of a more local method of deciding whether to create tapes here. The basic idea is to figure out whether tapes are needed by looking at which kernel we're using in the MCMC and SMC methods. What do you think of this approach?

This looks like a much better idea. I didn't realize all tapify calls are happening within the MCMC/SMC functions such that this works.

I wonder if this.ad should be this.adRequired everywhere. That would make it clearer that it's a Boolean flag.

Do you have thoughts on how to get rid of the overhead from AD-ified scorers? Maybe the ERP transform could keep around the original scorers, and dispatch calls to score functions of the right type (AD-ified or not) based on whether the current coroutine has an adRequired property that evaluates to true?

@null-a (Member, Author) commented Dec 15, 2015

this.ad should be this.adRequired everywhere

Definitely.

Do you have thoughts on how to get rid of the overhead from AD-ified scorers?

I had a half-baked idea that we might be able to have an adRequired flag on Trace, and that we might be able to use that to delegate to the appropriate score function. Your way seems much better though, so I'll take a look at that.

@null-a (Member, Author) commented Dec 18, 2015

Maybe the ERP transform could keep around the original scorers, and dispatch calls to score functions of the right type

Just to clarify, are you suggesting the score methods each become something like:

function score(params, x) {
    // Use the AD-ified scorer only when the current coroutine needs gradients.
    return env.coroutine.adRequired ? adScore(params, x) : originalScore(params, x);
}

(I think that would work.)

@stuhlmueller (Member)

Yes, that's what I was thinking.

@null-a (Member, Author) commented Dec 18, 2015

Yes, that's what I was thinking.

Great, thanks. I have a version working here. I've not yet checked how much of the performance drop this recovers. I'll do that next.

@null-a (Member, Author) commented Jan 5, 2016

I've not yet checked how much of the performance drop this recovers. I'll do that next.

The following shows the fractional slowdown incurred by some of the models in examples. "HMC" refers to this PR in its current state, and "HMC both scorers" refers to HMC with the additional change we've discussed in the last few comments (i.e. maintaining two score functions for each ERP).

| Model | HMC | HMC both scorers |
| --- | --- | --- |
| linearRegression | 0.27 | 0.1 |
| lda (20k) | 0.18 | 0.09 |
| logisticRegression | 0.19 | 0.1 |
| hmmIncremental (n=10, full enum) | 0.01 | 0.02 |
| ldaCollapsed (10k) | 0.81 | 0.74 |

(The reason ldaCollapsed is so much slower is that the normalize helper it uses is now written in webppl rather than JS. So there'll be some extra CPS overhead, and perhaps it's also affected by #174. I'll ignore this here.)

For the first 3 models listed it looks like this change helps a bit, although things weren't that much slower to begin with. Seeing these results, it's not 100% clear to me that it's worth the extra effort of keeping both scorers around. My main concern is the fact that the code to transform the scorers is now pretty complicated and rather tightly coupled to the current structure of erp.ad.js, which could make future changes to erp.ad.js painful.

If we do think it's worth keeping both scorers around, perhaps I could revisit Sid/Daniel's original approach of splitting the scorers out into a separate file. (With the hope that transforming that file will be simpler, and perhaps delegation to the correct scorer can be added in the ERP constructor or something.)

It's also a drag having to thread env through all the chains of requires which end with require('erp'), though that's not such a big deal.

Any thoughts?

@stuhlmueller (Member)

Thanks for running these experiments! I agree that the additional complexity together with the relatively small improvement in speed doesn't make it worth keeping both scorers around, but it's good that we checked.

Suppose we take the performance hit and decide to merge the PR essentially as is. Then what else is left to do? Based on the original comment, it looks like we want to switch to the latest ad.js for tensor support. What needs to happen there? Let's aim to merge the PR this week, and create issues for anything non-critical that remains.

@null-a (Member, Author) commented Jan 6, 2016

Updating to a newer version of ad.js is the only thing I have in mind to do. The main reason for doing so would be to gain the ability to take derivatives w.r.t. tensor valued variables. (This hack avoids creating tapes for Dirichlet/mv Gaussian at present.)

I'm not sure how far away ad.js is from being ready for this; I'd need to check. Elsewhere I've played with integrating Daniel's implementation of ad, so that might be an option if getting this working soon is important. It's not a trivial change though, since (as far as I can tell) it also requires reimplementing some basic matrix/vector operations we currently get from numeric.js. (Some of which I've done, though it would be a stretch to get it finished this week.) There was some talk of having both these ad libraries share a common interface, an idea which might be relevant here.

On a different note, can I just check, are you OK with the change (described here) to ERP.prototype.support? (It's used for constraint handling in HMC.)

@stuhlmueller (Member)

OK, keep me updated on the ad.js upgrade status.

If it looks like this is going to take a while, would it make sense to keep the PR as is, but detect when HMC is used with multivariate distributions and throw an error that says that it's not supported yet?

On a different note, can I just check, are you OK with the change (described here) to ERP.prototype.support? (It's used for constraint handling in HMC.)

Fine with me.

@null-a (Member, Author) commented Jan 11, 2016

I do think this will take a little while (at least a week, probably longer), so if you're happy to do so I think we should merge this. I've made a few final tweaks (including throwing an error if multivariate ERPs are used with HMC, as you suggested) so I think it's ready.

If we do merge it, I'll open a new issue for updating ad.js and also include the following on it:

  • Move the version of untapify in aggregation.js into the ad package.
  • Don't hardcode the path to the ad.js macros file.

@stuhlmueller (Member)

Sounds good. I'm going to merge this now—thanks for making this happen! :-)

stuhlmueller added a commit that referenced this pull request Jan 11, 2016
stuhlmueller merged commit 49bf7d0 into probmods:dev on Jan 11, 2016
null-a deleted the hmc branch on January 19, 2016