\(\text{KL}(q\|p)\) minimization

One form of variational inference minimizes the Kullback-Leibler divergence from \(q(\mathbf{z}\;;\;\lambda)\) to \(p(\mathbf{z} \mid \mathbf{x})\), \[\begin{aligned} \lambda^* &= \arg\min_\lambda \text{KL}( q(\mathbf{z}\;;\;\lambda) \;\|\; p(\mathbf{z} \mid \mathbf{x}) )\\ &= \arg\min_\lambda\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log q(\mathbf{z}\;;\;\lambda) - \log p(\mathbf{z} \mid \mathbf{x}) \big].\end{aligned}\] The KL divergence is a non-symmetric, information-theoretic measure of how much one probability distribution differs from another (Hinton & Van Camp, 1993; Jordan, Ghahramani, Jaakkola, & Saul, 1999; Waterhouse, MacKay, & Robinson, 1996).

The Evidence Lower Bound

The above optimization problem is intractable because it depends directly on the posterior \(p(\mathbf{z} \mid \mathbf{x})\). To tackle this, consider the identity \[\begin{aligned} \log p(\mathbf{x}) &= \text{KL}( q(\mathbf{z}\;;\;\lambda) \;\|\; p(\mathbf{z} \mid \mathbf{x}) )\\ &\quad+\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big]\end{aligned}\] where the left-hand side is the logarithm of the marginal likelihood \(p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \text{d}\mathbf{z}\), also known as the model evidence. (Try deriving this using Bayes’ rule!)
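The identity can be checked numerically on a model small enough to enumerate. The following is a minimal sketch, not Edward code: it assumes a hypothetical latent variable with three states, and the joint probabilities and the choice of \(q\) are made-up numbers for illustration only.

```python
import math

# Hypothetical toy model: latent z takes values in {0, 1, 2} and x is one
# fixed observation. joint[z] holds p(x, z) for that x (made-up numbers).
joint = [0.10, 0.25, 0.05]

# The evidence p(x) sums the joint over z; the posterior is Bayes' rule.
evidence = sum(joint)
posterior = [pxz / evidence for pxz in joint]

# An arbitrary variational distribution q(z), also made up.
q = [0.5, 0.3, 0.2]

# First term: KL(q || p(z | x)) = E_q[log q(z) - log p(z | x)].
kl = sum(qz * (math.log(qz) - math.log(pz)) for qz, pz in zip(q, posterior))

# Second term: E_q[log p(x, z) - log q(z)].
second_term = sum(qz * (math.log(pxz) - math.log(qz)) for qz, pxz in zip(q, joint))

log_evidence = math.log(evidence)
print(log_evidence, kl + second_term)
```

The two printed numbers agree up to floating-point error for any valid choice of `q`. Note that the second term only needs the joint, never the posterior.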

The evidence is a constant with respect to the variational parameters \(\lambda\), so we can minimize \(\text{KL}(q\|p)\) by instead maximizing the Evidence Lower BOund, \[\begin{aligned} \text{ELBO}(\lambda) &=\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big].\end{aligned}\] In the ELBO, both \(p(\mathbf{x}, \mathbf{z})\) and \(q(\mathbf{z}\;;\;\lambda)\) are tractable. The optimization problem we seek to solve becomes \[\begin{aligned} \lambda^* &= \arg \max_\lambda \text{ELBO}(\lambda).\end{aligned}\] As its name suggests, the ELBO is a lower bound on the log evidence \(\log p(\mathbf{x})\), because the KL divergence is non-negative. Maximizing the ELBO over \(\lambda\) tightens this bound, and the gap that remains is exactly \(\text{KL}(q\|p)\).

What does maximizing the ELBO do? Splitting the ELBO reveals a trade-off \[\begin{aligned} \text{ELBO}(\lambda) &=\; \mathbb{E}_{q(\mathbf{z} \;;\; \lambda)}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q(\mathbf{z} \;;\; \lambda)}[\log q(\mathbf{z}\;;\;\lambda)],\end{aligned}\] where the first term represents an energy and the second term (including the minus sign) represents the entropy of \(q\). The energy encourages \(q\) to focus probability mass where the model puts high probability, \(p(\mathbf{x}, \mathbf{z})\). The entropy encourages \(q\) to spread probability mass, so that it does not concentrate in one location.
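The trade-off between energy and entropy can be seen on a model where both terms have closed forms. The sketch below is an illustration under stated assumptions and does not show how Edward computes gradients: it assumes a hypothetical model \(z \sim \mathcal{N}(0, 1)\), \(x \mid z \sim \mathcal{N}(z, 1)\) with a Gaussian \(q(z\;;\;\lambda) = \mathcal{N}(m, s^2)\), and it uses hand-derived gradients with plain gradient ascent. The function names and the step size are choices made for this example.

```python
import math

def energy(m, log_s, x):
    # E_q[log p(x, z)] for z ~ N(0, 1), x | z ~ N(z, 1), q = N(m, s^2).
    s2 = math.exp(2.0 * log_s)
    return -math.log(2.0 * math.pi) - 0.5 * (m * m + s2) - 0.5 * ((x - m) ** 2 + s2)

def entropy(log_s):
    # Entropy of N(m, s^2): 0.5 * log(2 * pi * e * s^2).
    return 0.5 * (1.0 + math.log(2.0 * math.pi)) + log_s

def elbo(m, log_s, x):
    return energy(m, log_s, x) + entropy(log_s)

def fit(x, steps=500, lr=0.1):
    # Gradient ascent on the ELBO over lambda = (m, log s).
    m, log_s = 0.0, 0.0
    for _ in range(steps):
        s2 = math.exp(2.0 * log_s)
        grad_m = x - 2.0 * m          # comes from the energy only
        grad_log_s = 1.0 - 2.0 * s2   # entropy contributes +1, energy -2 s^2
        m += lr * grad_m
        log_s += lr * grad_log_s
    return m, log_s

x = 1.5  # a made-up observation
m, log_s = fit(x)

# For this conjugate model the evidence is available: x ~ N(0, 2).
log_evidence = -0.5 * math.log(4.0 * math.pi) - x * x / 4.0
print(m, math.exp(2.0 * log_s), elbo(m, log_s, x), log_evidence)
```

In this model the exact posterior is \(\mathcal{N}(x/2, 1/2)\), which lies in the variational family, so the fitted mean and variance approach those values and the ELBO approaches the log evidence. Dropping the entropy gradient (the `1.0` in `grad_log_s`) would drive the variance toward zero, which is the collapse that the entropy term counteracts.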

Edward uses two generic strategies to obtain gradients for optimization.

References

Hinton, G. E., & Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on computational learning theory (pp. 5–13). ACM.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Waterhouse, S., MacKay, D., & Robinson, T. (1996). Bayesian methods for mixtures of experts. Advances in Neural Information Processing Systems, 351–357.