Variational inference is an umbrella term for algorithms which cast posterior inference as optimization (Hinton & Camp, 1993; Jordan, Ghahramani, Jaakkola, & Saul, 1999; Waterhouse, MacKay, & Robinson, 1996).
The core idea involves two steps:

1. posit a family of distributions \(q(\mathbf{z}\;;\;\lambda)\) over the latent variables, indexed by parameters \(\lambda\);
2. match \(q(\mathbf{z}\;;\;\lambda)\) to the posterior by optimizing over \(\lambda\).
This strategy converts the problem of computing the posterior \(p(\mathbf{z} \mid \mathbf{x})\) into an optimization problem: minimize a divergence measure \[\begin{aligned} \lambda^* &= \arg\min_\lambda \text{divergence}( p(\mathbf{z} \mid \mathbf{x}) , q(\mathbf{z}\;;\;\lambda) ).\end{aligned}\] The optimized distribution \(q(\mathbf{z}\;;\;\lambda^*)\) then serves as a proxy for the posterior \(p(\mathbf{z}\mid \mathbf{x})\).
Edward takes the perspective that the posterior is (typically) intractable, and thus we must build a model of latent variables that best approximates the posterior. This is analogous to the perspective that the true data-generating process is unknown, and thus we build models of data to best approximate the true process.
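As a minimal sketch of this workflow (assuming Edward 1.x on TensorFlow 1.x; the toy Normal-Normal model, the data `x_train`, and the variable names below are illustrative rather than taken from this tutorial), one can posit a Normal family \(q(\mathbf{z}\;;\;\lambda)\) and fit \(\lambda\) with the `ed.KLqp` inference class, which uses a Kullback-Leibler divergence as the divergence measure:

```python
import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Normal

# Toy model: z ~ Normal(0, 1), x_n | z ~ Normal(z, 1) for n = 1, ..., 50.
z = Normal(loc=0.0, scale=1.0)
x = Normal(loc=tf.ones(50) * z, scale=1.0)

# Variational family q(z; lambda); lambda is the (unconstrained) location
# and scale, with softplus keeping the scale positive.
qz = Normal(loc=tf.Variable(0.0),
            scale=tf.nn.softplus(tf.Variable(0.0)))

# Hypothetical observations; any length-50 float array would do.
x_train = np.random.normal(3.0, 1.0, size=50).astype(np.float32)

# Minimize the divergence between q(z; lambda) and p(z | x) over lambda.
inference = ed.KLqp({z: qz}, data={x: x_train})
inference.run(n_iter=1000)
```

Other inference classes in Edward follow the same pattern but differ in the divergence measure and the optimization strategy.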
For details on variational inference classes defined in Edward, see the inference API. For background on specific variational inference algorithms in Edward, see the other inference tutorials.
Hinton, G. E., & Camp, D. van. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Conference on learning theory. ACM.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.
Waterhouse, S., MacKay, D., & Robinson, T. (1996). Bayesian methods for mixtures of experts. Advances in Neural Information Processing Systems, 351–357.