Maximum a Posteriori Estimation

Maximum a posteriori (MAP) estimation is a form of approximate posterior inference. It uses the mode of the posterior distribution as a point estimate, \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} p(\mathbf{z} \mid \mathbf{x})\\ &= \arg \max_\mathbf{z} \log p(\mathbf{z} \mid \mathbf{x}).\end{aligned}\] In practice, we work with logarithms of densities to avoid numerical underflow (Murphy, 2012).
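
As a concrete illustration, here is a standalone sketch (plain NumPy/SciPy, not Edward's API) of MAP as the posterior mode for a toy Beta-Bernoulli model, where the posterior happens to be available in closed form; the prior hyperparameters and data are made up for illustration.

```python
# Standalone sketch (NumPy/SciPy, not Edward): the MAP estimate is the mode of
# the posterior. Toy model: z ~ Beta(2, 2), x_n ~ Bernoulli(z), whose posterior
# is Beta(2 + sum(x), 2 + N - sum(x)).
import numpy as np
from scipy import optimize, stats

x = np.array([1, 0, 1, 1, 1, 0, 1, 1])               # made-up coin flips
alpha = 2.0 + x.sum()                                 # posterior parameters
beta = 2.0 + len(x) - x.sum()

def neg_log_posterior(z):
    # work with the log density to avoid numerical underflow
    return -stats.beta.logpdf(z, alpha, beta)

res = optimize.minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6),
                               method="bounded")
z_map = res.x

# matches the analytic mode of Beta(alpha, beta): (alpha - 1) / (alpha + beta - 2)
print(z_map, (alpha - 1) / (alpha + beta - 2))
```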

The MAP estimate is the most likely configuration of the latent variables \(\mathbf{z}\) under the model. However, we cannot directly solve this optimization problem because the posterior is typically intractable. To circumvent this, we use Bayes’ rule to optimize over the joint density instead, \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} \log p(\mathbf{z} \mid \mathbf{x})\\ &= \arg \max_\mathbf{z} \log p(\mathbf{x}, \mathbf{z}).\end{aligned}\] This is valid because \[\begin{aligned} \log p(\mathbf{z} \mid \mathbf{x}) &= \log p(\mathbf{x}, \mathbf{z}) - \log p(\mathbf{x})\\ &= \log p(\mathbf{x}, \mathbf{z}) - \text{constant in terms of } \mathbf{z}.\end{aligned}\] MAP estimation includes maximum likelihood estimation as a special case, \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} p(\mathbf{x}, \mathbf{z})\\ &= \arg \max_\mathbf{z} p(\mathbf{x}\mid \mathbf{z}),\end{aligned}\] where the prior \(p(\mathbf{z})\) is flat, placing uniform probability over all values in the support of \(\mathbf{z}\). Placing a nonuniform prior can be thought of as regularizing the estimation: it penalizes values far from the likelihood maximum, guarding against the overfitting that maximum likelihood alone is prone to. For example, a normal prior on \(\mathbf{z}\) corresponds to \(\ell_2\) penalization, also known as ridge regression, and a Laplace prior corresponds to \(\ell_1\) penalization, also known as the LASSO.
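
To make the regularization view concrete, here is a standalone sketch (plain NumPy, not Edward's API; the noise scale \(\sigma\) and prior scale \(\tau\) are made-up values) showing that for linear regression with a normal prior on the weights, the MAP estimate coincides with ridge regression with penalty \(\sigma^2/\tau^2\).

```python
# Standalone sketch (NumPy, not Edward): a normal prior on regression weights
# makes the MAP estimate coincide with ridge regression.
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
sigma, tau = 1.0, 2.0                        # likelihood and prior scales (made up)
y = X @ w_true + sigma * rng.normal(size=N)

# Under y ~ Normal(Xw, sigma^2 I) and w ~ Normal(0, tau^2 I),
#   arg max_w log p(y, w) = arg min_w ||y - Xw||^2 / sigma^2 + ||w||^2 / tau^2,
# i.e. ridge regression with penalty lambda = sigma^2 / tau^2.
lam = sigma**2 / tau**2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(w_map)
```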

Maximum likelihood can also be interpreted as minimizing the cross entropy between the data distribution and the model. For a data set \(\mathbf{x}=\{x_n\}\), \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} \log p(\mathbf{x}\mid \mathbf{z}) \\ &= \arg \max_\mathbf{z} \sum_{n=1}^N \log p(x_n\mid \mathbf{z}) \\ &= \arg \min_\mathbf{z} -\frac{1}{N}\sum_{n=1}^N \log p(x_n\mid \mathbf{z}).\end{aligned}\] The last expression is a Monte Carlo approximation, using the \(N\) data points, of the cross entropy between the true data distribution and \(p(\mathbf{x}\mid \mathbf{z})\).
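
To see the connection numerically, the following standalone sketch (plain NumPy/SciPy, not Edward; the Bernoulli model and the "true" parameter are made up) compares the average negative log-likelihood with the exact cross entropy for a Bernoulli model; both are minimized near the true parameter.

```python
# Standalone sketch (NumPy/SciPy, not Edward): maximum likelihood as approximate
# cross entropy minimization for a Bernoulli model with made-up true parameter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
z_true = 0.7
x = rng.binomial(1, z_true, size=10000)    # draws from the "true" data distribution

def avg_neg_log_lik(z):
    # -(1/N) sum_n log p(x_n | z): a Monte Carlo estimate of the cross entropy
    # between the true data distribution and Bernoulli(z)
    return -stats.bernoulli.logpmf(x, z).mean()

def cross_entropy(z):
    # exact cross entropy H(p_true, p_z) between the two Bernoulli distributions
    return -(z_true * np.log(z) + (1 - z_true) * np.log(1 - z))

for z in [0.5, 0.6, 0.7, 0.8]:
    print(z, avg_neg_log_lik(z), cross_entropy(z))
# both quantities are minimized near z = z_true = 0.7
```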

Gradient descent

To find the MAP estimate of the latent variables \(\mathbf{z}\), we compute the gradient of the log joint density \[\begin{aligned} \nabla_\mathbf{z} \log p(\mathbf{x}, \mathbf{z})\end{aligned}\] and follow it to a (local) optimum. Edward uses TensorFlow’s automatic differentiation, making this gradient computation both simple and efficient to distribute.
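
As a minimal, Edward-independent illustration of the idea, the sketch below runs gradient ascent on the log joint density of a conjugate normal model, for which the MAP estimate is also available in closed form; the model, step size, and number of iterations are made up for illustration.

```python
# Standalone sketch (NumPy, not Edward's ed.MAP): gradient ascent on the log
# joint density of the model z ~ Normal(0, 1), x_n ~ Normal(z, 1).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # made-up observations

def grad_log_joint(z):
    # d/dz [ log p(z) + sum_n log p(x_n | z) ] = -z + sum_n (x_n - z)
    return -z + np.sum(x - z)

z = 0.0                                       # initialization
step = 1e-2
for _ in range(1000):
    z += step * grad_log_joint(z)

# closed-form MAP estimate for this model: sum(x) / (N + 1)
print(z, x.sum() / (len(x) + 1))
```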

Edward currently does not support MAP for discrete latent variables. This amounts to discrete optimization, which is difficult.

For more details, see the API as well as its implementation in Edward’s code base.
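
For orientation, a minimal usage sketch in the style of Edward's tutorials might look like the following; the normal model and the synthetic data x_train are assumptions made here for illustration, not code from the API documentation.

```python
# Minimal Edward usage sketch (assumes Edward 1.x on TensorFlow 1.x).
# Model: mu ~ Normal(0, 1), x_n ~ Normal(mu, 1) for n = 1, ..., 50.
import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal

x_train = np.random.normal(2.0, 1.0, size=50).astype(np.float32)  # made-up data

mu = Normal(loc=0.0, scale=1.0)
x = Normal(loc=tf.ones(50) * mu, scale=1.0)

# ed.MAP finds a point estimate of mu by maximizing log p(x, mu)
inference = ed.MAP([mu], data={x: x_train})
inference.run(n_iter=1000)
```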

References

Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.