(This tutorial follows the \(\text{KL}(q\|p)\) minimization tutorial.)

We seek to maximize the ELBO, \[\begin{aligned} \lambda^* &= \arg \max_\lambda \text{ELBO}(\lambda)\\ &=\; \arg \max_\lambda\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big],\end{aligned}\] using a “black box” algorithm. This means generically inferring the posterior while making few assumptions about the model.

If the model has differentiable latent variables, then it is generally advantageous to leverage gradient information from the model in order to better traverse the optimization space. One approach to doing this is the reparameterization gradient.

Gradient descent is a standard approach for optimizing complicated objectives like the ELBO. The idea is to calculate its gradient \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &= \nabla_\lambda\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big],\end{aligned}\] and update the current set of parameters proportional to the gradient.
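As a minimal sketch of this update rule (not part of the tutorial), the loop below runs gradient ascent on a hypothetical stand-in objective whose gradient we can write down exactly; in practice `elbo_grad` would be a noisy Monte Carlo estimate of the ELBO's gradient.

```python
# Minimal sketch of gradient ascent, assuming a hypothetical
# elbo_grad(lam) that returns the objective's gradient.
def elbo_grad(lam):
    # Stand-in for the ELBO gradient: here the toy objective is
    # the concave function -(lam - 2)^2, maximized at lam = 2.
    return -2.0 * (lam - 2.0)

lam = 0.0
step_size = 0.1
for _ in range(200):
    lam = lam + step_size * elbo_grad(lam)  # ascend the objective

print(round(lam, 3))  # converges toward 2.0
```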

Some variational distributions \(q(\mathbf{z}\;;\;\lambda)\) admit useful reparameterizations. For example, we can reparameterize a normal distribution \(\mathbf{z} \sim \text{Normal}(\mu, \Sigma)\) as \(\mathbf{z} = \mu + L\epsilon\) with \(\epsilon \sim \text{Normal}(0, I)\), where \(\Sigma = LL^\top\). In general, write this as \[\begin{aligned}
\epsilon &\sim q(\epsilon)\\
\mathbf{z} &= \mathbf{z}(\epsilon \;;\; \lambda),\end{aligned}\] where \(\epsilon\) is a random variable that does **not** depend on the variational parameters \(\lambda\). The deterministic function \(\mathbf{z}(\cdot\;;\;\lambda)\) encapsulates the variational parameters instead, and following this process is equivalent to directly drawing \(\mathbf{z}\) from the original distribution.
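The normal example above can be checked numerically. The sketch below (a hypothetical 2-d example, not from the tutorial) draws \(\epsilon \sim \text{Normal}(0, I)\), applies \(\mathbf{z} = \mu + L\epsilon\), and verifies that the draws recover the moments of \(\text{Normal}(\mu, \Sigma)\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target distribution Normal(mu, Sigma).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
L = np.linalg.cholesky(Sigma)  # Sigma = L L^T

# Reparameterized draws: eps ~ Normal(0, I), z = mu + L eps.
eps = rng.standard_normal((100000, 2))
z = mu + eps @ L.T

# The draws recover the target's mean and covariance.
print(np.round(z.mean(axis=0), 1))  # ≈ mu
print(np.round(np.cov(z.T), 1))     # ≈ Sigma
```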

The reparameterization gradient leverages this property of the variational distribution to write the gradient as \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &=\; \mathbb{E}_{q(\epsilon)} \big[ \nabla_\lambda \big( \log p(\mathbf{x}, \mathbf{z}(\epsilon \;;\; \lambda)) - \log q(\mathbf{z}(\epsilon \;;\; \lambda) \;;\;\lambda) \big) \big].\end{aligned}\] The gradient of the ELBO is an expectation over the base distribution \(q(\epsilon)\), and the gradient can be applied directly to the inner expression (Kingma & Welling, 2014; Rezende, Mohamed, & Wierstra, 2014).

Edward uses automatic differentiation, specifically with TensorFlow’s computational graphs, making this gradient computation both simple and efficient to distribute.

We can use Monte Carlo integration to obtain noisy estimates of both the ELBO and its gradient. The basic procedure follows these steps:

- draw \(S\) samples \(\{\epsilon_s\}_1^S \sim q(\epsilon)\),
- evaluate the argument of the expectation using \(\{\epsilon_s\}_1^S\), and
- compute the empirical mean of the evaluated quantities.

A Monte Carlo estimate of the gradient is then \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &\approx\; \frac{1}{S} \sum_{s=1}^{S} \big[ \nabla_\lambda \big( \log p(\mathbf{x}, \mathbf{z}(\epsilon \;;\; \lambda)) - \log q(\mathbf{z}(\epsilon \;;\; \lambda) \;;\;\lambda) \big) \big].\end{aligned}\] This is an unbiased estimate of the actual gradient of the ELBO. Empirically, it exhibits lower variance than the score function gradient, leading to faster convergence across a wide range of problems.
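The three-step procedure above can be sketched on a toy objective (an illustration, not the ELBO itself): take \(f(\lambda) = \mathbb{E}_{z \sim \text{Normal}(\mu, 1)}[z^2]\), whose exact gradient is \(2\mu\). Reparameterizing \(z = \mu + \epsilon\) with \(\epsilon \sim \text{Normal}(0, 1)\) lets us push the gradient inside the expectation and average it over samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: f(mu) = E_{z ~ Normal(mu, 1)}[z^2], exact gradient 2*mu.
mu = 1.5
S = 100000

# Step 1: draw S samples from the base distribution q(eps) = Normal(0, 1).
eps = rng.standard_normal(S)

# Step 2: evaluate the gradient of the inner expression at z = mu + eps.
# Here d/dmu (z^2) = 2*z by the chain rule, since dz/dmu = 1.
grads = 2.0 * (mu + eps)

# Step 3: average to get the Monte Carlo gradient estimate.
grad_estimate = np.mean(grads)
print(grad_estimate)  # ≈ 2*mu = 3.0
```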

**Note: These details are outdated since Edward v1.1.3.**

We implement \(\text{KL}(q\|p)\) minimization with the reparameterization gradient in the class `KLqp`. It inherits from `VariationalInference`, which provides a collection of default methods such as an optimizer.

```
class KLqp(VariationalInference):
  def __init__(self, *args, **kwargs):
    super(KLqp, self).__init__(*args, **kwargs)

  def initialize(self, n_samples=1, *args, **kwargs):
    ...
    self.n_samples = n_samples
    return super(KLqp, self).initialize(*args, **kwargs)

  def build_reparam_loss(self):
    x = self.data
    # Draw n_samples samples from each variational factor.
    z = {key: rv.sample([self.n_samples])
         for key, rv in six.iteritems(self.latent_vars)}

    # Log joint density of the model, evaluated at the samples.
    p_log_prob = self.model_wrapper.log_prob(x, z)

    # Log density of the variational model, summing over each
    # variable's batch dimensions.
    q_log_prob = 0.0
    for key, rv in six.iteritems(self.latent_vars):
      q_log_prob += tf.reduce_sum(
          rv.log_prob(z[key]),
          list(range(1, len(rv.get_batch_shape()) + 1)))

    # Negative of the Monte Carlo ELBO estimate, to be minimized.
    self.loss = -tf.reduce_mean(p_log_prob - q_log_prob)
    return self.loss
```

Two methods are added: `initialize()` and `build_reparam_loss()`. The method `initialize()` follows the same initialization as `VariationalInference`, and adds an argument, `n_samples`, for the number of samples drawn from the variational model.

The method `build_reparam_loss()` implements the ELBO and its gradient. It draws `self.n_samples` samples from the variational model, registers the Monte Carlo estimate of the ELBO in TensorFlow’s computational graph, and stores its negative in `self.loss` to track the progress of inference for diagnostics.

The method returns an object whose automatic differentiation is the reparameterization gradient of the ELBO. The computational graph simply traverses the nodes during backpropagation; unlike the score function gradient, the reparameterization gradient does not need TensorFlow’s `tf.stop_gradient()` function.

There is a nuance here. TensorFlow’s optimizers are configured to *minimize* an objective function. So the gradient is set to be the negative of the ELBO’s gradient.

See the API for more details.

Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In *International Conference on Learning Representations*.

Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In *International Conference on Machine Learning* (pp. 1278–1286).