## Variational inference

Variational inference is an umbrella term for algorithms that cast posterior inference as optimization (Hinton & Van Camp, 1993; Jordan, Ghahramani, Jaakkola, & Saul, 1999; Waterhouse, MacKay, & Robinson, 1996).

The core idea involves two steps:

- posit a family of distributions \(q(\mathbf{z}\;;\;\lambda)\) over the latent variables;
- match \(q(\mathbf{z}\;;\;\lambda)\) to the posterior by optimizing over its parameters \(\lambda\).

This strategy converts the problem of computing the posterior \(p(\mathbf{z} \mid \mathbf{x})\) into an optimization problem: minimize a divergence measure between the posterior and the approximating family,

\[\begin{aligned}
\lambda^*
&=
\arg\min_\lambda \text{divergence}(
p(\mathbf{z} \mid \mathbf{x}),
q(\mathbf{z}\;;\;\lambda)
).
\end{aligned}\]

The optimized distribution \(q(\mathbf{z}\;;\;\lambda^*)\) is then used as a proxy for the posterior \(p(\mathbf{z}\mid \mathbf{x})\).
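The two steps can be illustrated with a minimal NumPy sketch. It does not use Edward's API, and everything in it is an assumption made for illustration: the toy conjugate model, the Gaussian family \(q(z\;;\;\lambda)\) with \(\lambda = (m, \log s)\), the choice of \(\text{KL}(q \,\|\, p)\) as the divergence, and the hypothetical helper names `kl_q_p` and `fit`. The model is chosen so that the posterior is known in closed form, which lets the result be checked. In realistic problems the posterior is intractable, and the divergence has to be optimized indirectly.

```python
import numpy as np

# Toy conjugate model (assumed for illustration only):
#   z ~ Normal(0, 1),   x_i | z ~ Normal(z, 1)
# Its exact posterior is Normal(sum(x) / (n + 1), 1 / (n + 1)).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)
n = x.size
post_mean = x.sum() / (n + 1)
post_var = 1.0 / (n + 1)


def kl_q_p(m, log_s):
    """KL(q || p) with q = Normal(m, exp(log_s)^2) and p the exact posterior."""
    s2 = np.exp(2.0 * log_s)
    return 0.5 * (
        np.log(post_var) - 2.0 * log_s
        + (s2 + (m - post_mean) ** 2) / post_var
        - 1.0
    )


def fit(steps=2000, lr=0.01):
    """Gradient descent on KL(q || p) over lambda = (m, log_s)."""
    m, log_s = 0.0, 0.0  # step 1: posit q(z; lambda) = Normal(m, exp(log_s)^2)
    for _ in range(steps):
        # step 2: move lambda along the negative gradient of the divergence
        grad_m = (m - post_mean) / post_var
        grad_log_s = np.exp(2.0 * log_s) / post_var - 1.0
        m -= lr * grad_m
        log_s -= lr * grad_log_s
    return m, log_s


m, log_s = fit()
print("fitted q:        mean", m, "variance", np.exp(2.0 * log_s))
print("exact posterior: mean", post_mean, "variance", post_var)
```

Because the Gaussian family here contains the exact posterior, the fitted \(q\) can recover it. With a more restrictive family, the optimum would only be the closest member under the chosen divergence.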

Edward takes the perspective that the posterior is (typically) intractable, and thus we must build a model of latent variables that best approximates the posterior. This is analogous to the perspective that the true data-generating process is unknown, and thus we build models of data to best approximate the true process.

For details on the variational inference base class defined in Edward, see the inference API. For examples of specific variational inference algorithms in Edward, see the other inference tutorials.

### References

Hinton, G. E., & Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In *Proceedings of the sixth annual conference on computational learning theory* (pp. 5–13). ACM.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. *Machine Learning*, *37*(2), 183–233.

Waterhouse, S., MacKay, D., & Robinson, T. (1996). Bayesian methods for mixtures of experts. *Advances in Neural Information Processing Systems*, 351–357.