Laplace Approximation
(This tutorial follows the Maximum a posteriori estimation tutorial.)
Maximum a posteriori (MAP) estimation approximates the posterior $p(\mathbf{z} \mid \mathbf{x})$ with a point mass (delta function) by simply capturing its mode. MAP is attractive because it is fast and efficient. How can we use MAP to construct a better approximation to the posterior?
The Laplace approximation (Laplace, 1986) is one way of improving a MAP estimate. The idea is to approximate the posterior with a normal distribution centered at the MAP estimate, \[\begin{aligned} p(\mathbf{z} \mid \mathbf{x}) &\approx \text{Normal}(\mathbf{z}\;;\; \mathbf{z}_\text{MAP}, \Lambda^{-1}).\end{aligned}\] This requires computing a precision matrix $\Lambda$. Derived from a second-order Taylor expansion of the log posterior around its mode, the Laplace approximation uses the Hessian of the negative log joint density evaluated at the MAP estimate. It is defined component-wise as \[\begin{aligned} \Lambda_{ij} &= -\left.\frac{\partial^2}{\partial z_i \partial z_j} \log p(\mathbf{x}, \mathbf{z})\right|_{\mathbf{z} = \mathbf{z}_\text{MAP}}.\end{aligned}\] For flat priors (which reduce MAP to maximum likelihood estimation), the precision matrix is known as the observed Fisher information (Fisher, 1925). Edward uses TensorFlow’s automatic differentiation, making this second-order gradient computation both simple and efficient to distribute.
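To make the computation concrete, below is a minimal NumPy/SciPy sketch (not Edward code) of the Laplace approximation for a hypothetical Bayesian logistic regression model with a Normal(0, $\sigma^2 I$) prior on the weights. The data, model, and variable names are illustrative assumptions, and the Hessian is written in closed form rather than obtained by automatic differentiation.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical example: Bayesian logistic regression,
# y_n ~ Bernoulli(sigmoid(x_n' w)), w ~ Normal(0, sigma2 * I).
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ w_true)))
sigma2 = 1.0  # prior variance

def neg_log_joint(w):
    """-log p(x, z): negative log likelihood plus negative log prior."""
    logits = X @ w
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    log_prior = -0.5 * np.dot(w, w) / sigma2
    return -(log_lik + log_prior)

# MAP estimate: the mode of the posterior (see the MAP tutorial).
w_map = minimize(neg_log_joint, np.zeros(D)).x

# Precision matrix: Hessian of -log p(x, z) evaluated at the MAP estimate.
# For logistic regression this Hessian has a closed form.
p = 1.0 / (1.0 + np.exp(-X @ w_map))
Lambda = X.T @ (X * (p * (1.0 - p))[:, None]) + np.eye(D) / sigma2

# Laplace approximation: p(z | x) is approximated by Normal(w_map, Lambda^{-1}).
covariance = np.linalg.inv(Lambda)
```

Edward performs the analogous computation with TensorFlow’s automatic differentiation, so the Hessian does not need to be derived by hand for each model.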
For more details, see the API as well as its implementation in Edward’s code base.
References
Fisher, R. A. (1925). Theory of statistical estimation. Mathematical Proceedings of the Cambridge Philosophical Society, 22(5).
Laplace, P. S. (1986). Memoir on the probability of the causes of events. Statistical Science, 1(3), 364–378.