Maximum a posteriori estimation

Maximum a posteriori (MAP) estimation is a form of approximate posterior inference. It uses the mode as a point estimate of the posterior distribution, \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} p(\mathbf{z} \mid \mathbf{x})\\ &= \arg \max_\mathbf{z} \log p(\mathbf{z} \mid \mathbf{x}).\end{aligned}\] In practice, we work with logarithms of densities to avoid numerical underflow issues (Murphy, 2012).
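
To see why the log matters, here is a small sketch (not from the original text; it assumes SciPy's norm for the density): the product of many likelihood terms underflows in floating point, while the sum of their logs stays well behaved.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=2000)

print(np.prod(norm.pdf(x)))    # underflows to 0.0
print(np.sum(norm.logpdf(x)))  # a finite log density, roughly -2.8e3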

The MAP estimate is the most likely configuration of the hidden patterns \(\mathbf{z}\) under the model. However, we cannot directly solve this optimization problem because the posterior is typically intractable. To circumvent this, we use Bayes’ rule to optimize over the joint density instead, \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} \log p(\mathbf{z} \mid \mathbf{x})\\ &= \arg \max_\mathbf{z} \log p(\mathbf{x}, \mathbf{z}).\end{aligned}\] This is valid because \[\begin{aligned} \log p(\mathbf{z} \mid \mathbf{x}) &= \log p(\mathbf{x}, \mathbf{z}) - \log p(\mathbf{x})\\ &= \log p(\mathbf{x}, \mathbf{z}) - \text{constant in terms of } \mathbf{z}.\end{aligned}\] MAP estimation includes the common scenario of maximum likelihood estimation as a special case, \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} p(\mathbf{x}, \mathbf{z})\\ &= \arg \max_\mathbf{z} p(\mathbf{x}\mid \mathbf{z}),\end{aligned}\] where the prior \(p(\mathbf{z})\) is flat, placing uniform probability over all values in the support of \(\mathbf{z}\). Placing a nonuniform prior can be thought of as regularizing the estimation: it penalizes values far from the maximum of the likelihood, which guards against the overfitting that pure maximum likelihood is prone to. For example, a normal prior on \(\mathbf{z}\) corresponds to \(\ell_2\) penalization, also known as ridge regression, and a Laplace prior corresponds to \(\ell_1\) penalization, also known as the LASSO.
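
To make the regularization view concrete, here is a hedged sketch (not from the original text) for the simplest conjugate case: a normal likelihood with known variance \(\sigma^2\) and a normal prior with variance \(\tau^2\) on the mean. The MAP estimate shrinks the maximum likelihood estimate toward the prior mean, with \(\sigma^2/\tau^2\) acting as an \(\ell_2\) penalty strength.

import numpy as np

# Model (illustrative assumption): x_n ~ N(z, sigma^2), prior z ~ N(0, tau^2).
rng = np.random.default_rng(0)
sigma, tau, N = 1.0, 0.5, 10
x = rng.normal(loc=2.0, scale=sigma, size=N)

z_mle = x.mean()                           # flat prior: maximum likelihood
z_map = x.sum() / (N + sigma**2 / tau**2)  # posterior mode under the normal prior

print(z_mle, z_map)  # the MAP estimate is pulled toward the prior mean 0

As \(\tau \to \infty\) the prior flattens and the MAP estimate recovers the maximum likelihood estimate.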

Maximum likelihood can equivalently be viewed as minimizing a cross entropy, where for a data set \(\mathbf{x}=\{x_n\}\), \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} \log p(\mathbf{x}\mid \mathbf{z}) \\ &= \arg \max_\mathbf{z} \sum_{n=1}^N \log p(x_n\mid \mathbf{z}) \\ &= \arg \min_\mathbf{z} -\frac{1}{N}\sum_{n=1}^N \log p(x_n\mid \mathbf{z}).\end{aligned}\] The last expression is a Monte Carlo approximation, based on the \(N\) data points, of the cross entropy between the true data distribution and \(p(\mathbf{x}\mid \mathbf{z})\).
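
As a quick numerical illustration (a sketch assuming a unit-variance normal model, not part of the original text), the average negative log-likelihood can be evaluated directly and is smallest when the model matches the data distribution:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1000)

def avg_neg_log_lik(z, x):
  # -(1/N) sum_n log N(x_n | z, 1): an estimate of the cross entropy
  return -np.mean(norm.logpdf(x, loc=z, scale=1.0))

print(avg_neg_log_lik(2.0, x))  # lower (better) near the data-generating mean
print(avg_neg_log_lik(0.0, x))  # higher for a mismatched model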

Gradient descent

To find the MAP estimate of the latent variables \(\mathbf{z}\), we calculate the gradient of the log joint density \[\begin{aligned} \nabla_\mathbf{z} \log p(\mathbf{x}, \mathbf{z})\end{aligned}\] and follow it to a (local) optimum. Edward uses automatic differentiation, specifically with TensorFlow’s computational graphs, making this gradient computation both simple and efficient to distribute.
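
For intuition, here is a minimal sketch of MAP by gradient descent on the negative log joint, written against TensorFlow's eager GradientTape API rather than the graph-mode setup described above; the model (a normal likelihood with a broad normal prior on the mean) is an illustrative assumption, not part of the original text.

import numpy as np
import tensorflow as tf

x = np.random.normal(loc=3.0, scale=1.0, size=50).astype(np.float32)
z = tf.Variable(0.0)  # latent mean, initialized away from the optimum
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

for _ in range(200):
  with tf.GradientTape() as tape:
    # negative log joint, up to constants: -[log N(z | 0, 10^2) + sum_n log N(x_n | z, 1)]
    log_prior = -0.5 * tf.square(z / 10.0)
    log_lik = -0.5 * tf.reduce_sum(tf.square(x - z))
    loss = -(log_prior + log_lik)
  grads = tape.gradient(loss, [z])
  optimizer.apply_gradients(zip(grads, [z]))

print(z.numpy())  # close to the MAP estimate of the mean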

Implementation

Note: These details are outdated since Edward v1.1.3.

In Edward, we view MAP estimation as a special case of \(KL(q\|p)\) minimization, where the variational distribution is a point mass (delta function) distribution. This makes explicit the additional assumptions underlying MAP estimation from the viewpoint of variational inference: namely, it approximates the posterior using a point to summarize the full distribution.
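
To sketch why the point mass recovers MAP (a reasoning step added here, not a derivation from the original text): writing \(q(\mathbf{z}) = \delta_{\lambda}(\mathbf{z})\) for a point mass located at \(\lambda\), the variational objective becomes \[\begin{aligned} \mathbb{E}_{q}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q}[\log q(\mathbf{z})] &= \log p(\mathbf{x}, \lambda) + \text{constant in terms of } \lambda,\end{aligned}\] since the (degenerate) entropy of a point mass does not depend on its location. Maximizing over \(\lambda\) is therefore exactly the MAP problem above.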

class MAP(VariationalInference):
  def __init__(self, latent_vars, data=None, model_wrapper=None):
    ...
    super(MAP, self).__init__(latent_vars, data, model_wrapper)

  def build_loss(self):
    # The objective is the log joint density evaluated at the point estimate;
    # TensorFlow optimizers minimize, so the method returns its negation.
    x = self.data
    z = {key: rv.sample([1]) for key, rv in six.iteritems(self.latent_vars)}
    self.loss = tf.squeeze(self.model_wrapper.log_prob(x, z))
    return -self.loss

The MAP class inherits from VariationalInference and initializes a point mass as the variational distribution. It requires defining only one method, build_loss(), which specifies the objective and hence, implicitly, the gradient of the log joint. TensorFlow’s automatic differentiation applies to z, backpropagating (applying the chain rule of differentiation) through the computational graph to compute the gradient of the objective.

There is a nuance here. TensorFlow’s optimizers are configured to minimize an objective function, so build_loss() returns the negative of the log joint density; minimizing this loss is equivalent to maximizing the log joint.

References

Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.