Maximum a posteriori (MAP) estimation is a form of approximate posterior inference. It uses the mode of the posterior distribution as a point estimate, \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} p(\mathbf{z} \mid \mathbf{x})\\ &= \arg \max_\mathbf{z} \log p(\mathbf{z} \mid \mathbf{x}).\end{aligned}\] In practice, we work with logarithms of densities to avoid numerical underflow (Murphy, 2012).
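As a small numerical illustration of the underflow issue (the numbers here are arbitrary, not from any particular model): multiplying many small likelihood terms underflows in floating point, while summing their logarithms stays finite.

```
import numpy as np

probs = np.full(1000, 1e-5)    # 1000 likelihood terms, each equal to 1e-5
print(np.prod(probs))          # 0.0: the product underflows in float64
print(np.sum(np.log(probs)))   # about -11512.9: the log-space sum is finite
```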

The MAP estimate is the most probable configuration of the latent variables \(\mathbf{z}\) under the model. However, we cannot directly solve this optimization problem because the posterior is typically intractable. To circumvent this, we use Bayes’ rule to optimize over the joint density, \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} \log p(\mathbf{z} \mid \mathbf{x})\\ &= \arg \max_\mathbf{z} \log p(\mathbf{x}, \mathbf{z}).\end{aligned}\] This is valid because \[\begin{aligned} \log p(\mathbf{z} \mid \mathbf{x}) &= \log p(\mathbf{x}, \mathbf{z}) - \log p(\mathbf{x})\\ &= \log p(\mathbf{x}, \mathbf{z}) - \text{constant in terms of } \mathbf{z}.\end{aligned}\] MAP estimation includes the common scenario of maximum likelihood estimation as a special case, \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} p(\mathbf{x}, \mathbf{z})\\ &= \arg \max_\mathbf{z} p(\mathbf{x}\mid \mathbf{z}),\end{aligned}\] where the prior \(p(\mathbf{z})\) is flat, placing equal probability on all values in the support of \(\mathbf{z}\). Placing a nonuniform prior can be thought of as regularizing the estimation: it penalizes values that stray from the bulk of the prior, mitigating the overfitting that maximizing the likelihood alone can produce. For example, a normal prior on \(\mathbf{z}\) corresponds to \(\ell_2\) penalization, also known as ridge regression, and a Laplace prior corresponds to \(\ell_1\) penalization, also known as the LASSO.
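To make the correspondence concrete, suppose the prior is normal, \(p(\mathbf{z}) = \mathcal{N}(\mathbf{z}\mid \mathbf{0}, \sigma^2\mathbf{I})\). Then, up to additive constants, \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} \left[\log p(\mathbf{x}\mid \mathbf{z}) + \log p(\mathbf{z})\right]\\ &= \arg \max_\mathbf{z} \left[\log p(\mathbf{x}\mid \mathbf{z}) - \frac{1}{2\sigma^2}\|\mathbf{z}\|_2^2\right],\end{aligned}\] which is the \(\ell_2\)-penalized (ridge) maximum likelihood objective; a Laplace prior yields an \(\ell_1\) penalty in the same way.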

Maximum likelihood can also be interpreted as minimizing a cross entropy, where for a data set \(\mathbf{x}=\{x_n\}\) of \(N\) conditionally independent observations, \[\begin{aligned} \mathbf{z}_\text{MAP} &= \arg \max_\mathbf{z} \log p(\mathbf{x}\mid \mathbf{z}) \\ &= \arg \max_\mathbf{z} \sum_{n=1}^N \log p(x_n\mid \mathbf{z}) \\ &= \arg \min_\mathbf{z} -\frac{1}{N}\sum_{n=1}^N \log p(x_n\mid \mathbf{z}).\end{aligned}\] The last expression is a Monte Carlo approximation, using the \(N\) data points, of the cross entropy between the true data distribution and \(p(\mathbf{x}\mid \mathbf{z})\).
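Here is a minimal sketch of this interpretation, assuming (for illustration only) a unit-variance Gaussian likelihood \(p(x_n \mid z) = \mathcal{N}(x_n \mid z, 1)\); the variable names are not from Edward.

```
import numpy as np

np.random.seed(0)
x = np.random.normal(loc=1.5, scale=1.0, size=1000)  # data from the "true" distribution

def avg_neg_log_lik(z, x):
  # -1/N sum_n log N(x_n | z, 1): a Monte Carlo estimate of the cross entropy
  return np.mean(0.5 * np.log(2.0 * np.pi) + 0.5 * (x - z) ** 2)

# Minimizing over z recovers the maximum likelihood estimate (the sample mean here).
grid = np.linspace(-3.0, 5.0, 801)
z_hat = grid[np.argmin([avg_neg_log_lik(z, x) for z in grid])]
print(z_hat, x.mean())
```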

To find the MAP estimate of the latent variables \(\mathbf{z}\), we simply calculate the gradient of the log joint density \[\begin{aligned} \nabla_\mathbf{z} \log p(\mathbf{x}, \mathbf{z})\end{aligned}\] and follow it to a (local) optimum. Edward uses automatic differentiation, specifically with TensorFlow’s computational graphs, making this gradient computation both simple and efficient to distribute.
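As a minimal sketch of this computation (a toy model with \(p(z) = \mathcal{N}(0, 1)\) and \(p(x_n \mid z) = \mathcal{N}(z, 1)\), illustrative variable names, and TensorFlow's 1.x graph-mode API rather than Edward's internals):

```
import numpy as np
import tensorflow as tf

x_data = np.random.normal(3.0, 1.0, size=100).astype(np.float32)  # observed data

z = tf.Variable(0.0)                                    # point estimate of the latent variable
log_prior = -0.5 * tf.square(z)                         # log N(z | 0, 1), up to a constant
log_lik = tf.reduce_sum(-0.5 * tf.square(x_data - z))   # sum_n log N(x_n | z, 1), up to a constant
log_joint = log_prior + log_lik

# Reverse-mode automatic differentiation gives d/dz log p(x, z).
grad_z = tf.gradients(log_joint, [z])[0]

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print(sess.run(grad_z))
```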

**Note: These details are outdated since Edward v1.1.3.**

In Edward, we view MAP estimation as a special case of \(KL(q\|p)\) minimization, where the variational distribution is a point mass (delta function) distribution. This makes explicit the additional assumptions underlying MAP estimation from the viewpoint of variational inference: namely, it approximates the posterior using a point to summarize the full distribution.
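Informally, with \(q(\mathbf{z}) = \delta(\mathbf{z} - \mathbf{z}^*)\), the variational objective reduces to the log joint density evaluated at the point \(\mathbf{z}^*\), \[\begin{aligned} \mathbb{E}_{q}\left[\log p(\mathbf{x}, \mathbf{z})\right] - \mathbb{E}_{q}\left[\log q(\mathbf{z})\right] &= \log p(\mathbf{x}, \mathbf{z}^*) + \text{const},\end{aligned}\] treating the (degenerate) entropy term of the point mass as a constant that does not depend on \(\mathbf{z}^*\); maximizing over \(\mathbf{z}^*\) therefore recovers the MAP problem above.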

```
class MAP(VariationalInference):
  def __init__(self, latent_vars, data=None, model_wrapper=None):
    ...
    super(MAP, self).__init__(latent_vars, data, model_wrapper)

  def build_loss(self):
    x = self.data
    # Each latent variable is a point mass; sampling it returns the current
    # point estimate of that variable.
    z = {key: rv.sample([1]) for key, rv in six.iteritems(self.latent_vars)}
    # self.loss is the log joint density log p(x, z); return its negative
    # because TensorFlow optimizers minimize.
    self.loss = tf.squeeze(self.model_wrapper.log_prob(x, z))
    return -self.loss
```

The `MAP` class inherits from `VariationalInference` and initializes a point mass as the variational distribution. It requires defining only one method, `build_loss()`, which implicitly specifies the gradient of the log joint. TensorFlow’s automatic differentiation applies to `z`, backpropagating (applying the chain rule of differentiation) to compute the gradient of the objective.

There is a nuance here. TensorFlow’s optimizers are configured to *minimize* an objective function, so `build_loss()` returns the negative log joint density; the gradient that is followed is therefore the negative of the log density’s gradient.
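A self-contained sketch of this convention (toy log density, illustrative names, TensorFlow 1.x API):

```
import tensorflow as tf

z = tf.Variable(0.0)
log_joint = -0.5 * tf.square(z - 2.0)  # toy log density, maximized at z = 2
loss = -log_joint                      # optimizers minimize, so negate
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for _ in range(200):
    sess.run(train_op)
  print(sess.run(z))  # approximately 2.0
```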

Murphy, K. P. (2012). *Machine learning: A probabilistic perspective*. MIT Press.