Laplace approximation

(This tutorial follows the Maximum a posteriori estimation tutorial.)

Maximum a posteriori (MAP) estimation approximates the posterior \(p(\mathbf{z} \mid \mathbf{x})\) with a point mass (delta function) by simply capturing its mode. MAP is attractive because it is fast and efficient. How can we use MAP to construct a better approximation to the posterior?

The Laplace approximation (Laplace, 1986) is one way of improving on an MAP estimate. The idea is to approximate the posterior with a normal distribution centered at the MAP estimate \[\begin{aligned} p(\mathbf{z} \mid \mathbf{x}) &\approx \text{Normal}(\mathbf{z}\;;\; \mathbf{z}_\text{MAP}, \Lambda^{-1}).\end{aligned}\] This requires computing a precision matrix \(\Lambda\). The Laplace approximation uses the Hessian of the log joint density at the MAP estimate, defined component-wise as \[\begin{aligned} \Lambda_{ij} &= \frac{\partial^2 \log p(\mathbf{x}, \mathbf{z})}{\partial z_i \partial z_j}.\end{aligned}\] Edward uses automatic differentiation, specifically with TensorFlow’s computational graphs, making this gradient computation both simple and efficient to distribute (although more expensive than a first-order gradient).

For more details, see the API as well as its implementation in Edward’s code base.

References

Laplace, P. S. (1986). Memoir on the probability of the causes of events. Statistical Science, 1(3), 364–378.