Reparameterization gradient

(This tutorial follows the \(\text{KL}(q\|p)\) minimization tutorial.)

We seek to maximize the ELBO, \[\begin{aligned} \lambda^* &= \arg \max_\lambda \text{ELBO}(\lambda)\\ &=\; \arg \max_\lambda\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big],\end{aligned}\] using a “black box” algorithm. This means generically inferring the posterior while making few assumptions about the model.

If the model's latent variables are continuous and its density is differentiable with respect to them, then it is generally advantageous to leverage gradient information from the model in order to better traverse the optimization space. One approach to doing this is the reparameterization gradient.

The reparameterization identity

Gradient ascent is a standard approach for maximizing complicated objectives like the ELBO. The idea is to calculate its gradient \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &= \nabla_\lambda\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big],\end{aligned}\] and update the current set of parameters in the direction of the gradient, \(\lambda \leftarrow \lambda + \rho\, \nabla_\lambda \text{ELBO}(\lambda)\) for some step size \(\rho\).

Some variational distributions \(q(\mathbf{z}\;;\;\lambda)\) admit useful reparameterizations. For example, we can reparameterize a normal distribution \(\mathbf{z} \sim \text{Normal}(\mu, \Sigma)\) as \(\mathbf{z} = \mu + L\epsilon\) with \(\epsilon \sim \text{Normal}(0, I)\), where \(\Sigma = LL^\top\). In general, write this as \[\begin{aligned} \epsilon &\sim q(\epsilon)\\ \mathbf{z} &= \mathbf{z}(\epsilon \;;\; \lambda),\end{aligned}\] where \(\epsilon\) is a random variable that does not depend on the variational parameters \(\lambda\). The deterministic function \(\mathbf{z}(\cdot\;;\;\lambda)\) encapsulates the variational parameters instead, and following this process is equivalent to directly drawing \(\mathbf{z}\) from the original distribution, as the sketch below illustrates.
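As a concrete check, here is a minimal NumPy sketch (not Edward code; the values of \(\mu\) and \(\Sigma\) are made up for illustration) showing that the transformed samples match the original distribution:

import numpy as np

# Hypothetical parameters of a bivariate normal.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
L = np.linalg.cholesky(Sigma)  # Sigma = L L^T

# Draw eps from the parameter-free base distribution, then transform.
eps = np.random.randn(10000, 2)
z = mu + eps @ L.T  # each row is a draw z ~ Normal(mu, Sigma)

print(z.mean(axis=0))           # close to mu
print(np.cov(z, rowvar=False))  # close to Sigma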

The reparameterization gradient leverages this property of the variational distribution to write the gradient as \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &=\; \mathbb{E}_{q(\epsilon)} \big[ \nabla_\lambda \big( \log p(\mathbf{x}, \mathbf{z}(\epsilon \;;\; \lambda)) - \log q(\mathbf{z}(\epsilon \;;\; \lambda) \;;\;\lambda) \big) \big].\end{aligned}\] Because the expectation is now taken over the base distribution \(q(\epsilon)\), which does not depend on \(\lambda\), the gradient can be moved inside the expectation and applied directly to the inner expression (Kingma & Welling, 2014; Rezende, Mohamed, & Wierstra, 2014).

Edward uses automatic differentiation, specifically with TensorFlow’s computational graphs, making this gradient computation both simple and efficient to distribute.
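As a rough illustration of differentiating through the sampling transformation, here is a sketch assuming the graph-based TensorFlow 1.x API of that era; the parameter names and the toy integrand f standing in for the ELBO are made up for this example:

import tensorflow as tf

# Variational parameters lambda = (mu, log_sigma); names are illustrative.
mu = tf.Variable(0.0)
log_sigma = tf.Variable(0.0)

eps = tf.random_normal([100])      # parameter-free noise, eps ~ q(eps)
z = mu + tf.exp(log_sigma) * eps   # z(eps; lambda)
f = tf.reduce_mean(z ** 2)         # toy stand-in for the ELBO integrand

# Backpropagation traverses the graph through z(eps; lambda).
grads = tf.gradients(f, [mu, log_sigma])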

Noisy estimates using Monte Carlo integration

We can use Monte Carlo integration to obtain noisy estimates of both the ELBO and its gradient. The basic procedure follows these steps:

  1. draw \(S\) samples \(\{\epsilon_s\}_1^S \sim q(\epsilon)\),
  2. evaluate the argument of the expectation using \(\{\epsilon_s\}_1^S\), and
  3. compute the empirical mean of the evaluated quantities.

A Monte Carlo estimate of the gradient is then \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &\approx\; \frac{1}{S} \sum_{s=1}^{S} \big[ \nabla_\lambda \big( \log p(\mathbf{x}, \mathbf{z}(\epsilon_s \;;\; \lambda)) - \log q(\mathbf{z}(\epsilon_s \;;\; \lambda) \;;\;\lambda) \big) \big].\end{aligned}\] This is an unbiased estimate of the actual gradient of the ELBO. Empirically, it tends to exhibit lower variance than the score function gradient, leading to faster convergence on a wide range of problems.
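To make the estimator concrete, here is a minimal NumPy sketch under an assumed toy integrand \(f(z) = z^2\) with \(q(z\;;\;\lambda) = \text{Normal}(\mu, \sigma^2)\), for which the exact gradient of \(\mathbb{E}[z^2] = \mu^2 + \sigma^2\) is known in closed form:

import numpy as np

# Toy setup: the true gradients are (2*mu, 2*sigma).
mu, sigma, S = 1.5, 0.8, 100000

eps = np.random.randn(S)           # step 1: draw S samples from q(eps)
z = mu + sigma * eps               # z(eps; lambda)
grad_mu = np.mean(2 * z)           # steps 2-3: average d/dmu f = 2z
grad_sigma = np.mean(2 * z * eps)  # average d/dsigma f = 2z * eps

print(grad_mu, 2 * mu)        # approximately equal
print(grad_sigma, 2 * sigma)  # approximately equal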

Implementation

Note: These details are outdated since Edward v1.1.3.

We implement \(\text{KL}(q\|p)\) minimization with the reparameterization gradient in the class KLqp. It inherits from VariationalInference, which provides a collection of default methods such as an optimizer.

import six
import tensorflow as tf

class KLqp(VariationalInference):
  def __init__(self, *args, **kwargs):
    super(KLqp, self).__init__(*args, **kwargs)

  def initialize(self, n_samples=1, *args, **kwargs):
    ...
    self.n_samples = n_samples
    return super(KLqp, self).initialize(*args, **kwargs)

  def build_reparam_loss(self):
    x = self.data
    # Draw n_samples reparameterized samples from each latent variable.
    z = {key: rv.sample([self.n_samples])
         for key, rv in six.iteritems(self.latent_vars)}

    # Evaluate log p(x, z) under the model and log q(z; lambda) under the
    # variational distribution, summing over each variable's dimensions.
    p_log_prob = self.model_wrapper.log_prob(x, z)
    q_log_prob = 0.0
    for key, rv in six.iteritems(self.latent_vars):
      q_log_prob += tf.reduce_sum(rv.log_prob(z[key]),
                                  list(range(1, len(rv.get_batch_shape()) + 1)))

    # Monte Carlo estimate of the negative ELBO, averaged over samples.
    self.loss = -tf.reduce_mean(p_log_prob - q_log_prob)
    return self.loss

Two methods are added: initialize() and build_reparam_loss(). The method initialize() follows the same initialization as VariationalInference and adds one argument, n_samples, the number of samples to draw from the variational model.

The method build_reparam_loss() implements the ELBO and its gradient. It draws self.n_samples samples from the variational model, registers the Monte Carlo estimate of the ELBO in TensorFlow's computational graph, and stores the estimate's negative in self.loss so that the progress of the inference can be tracked for diagnostics.

The method returns an object whose automatic differentiation is the reparameterization gradient of the ELBO. Backpropagation simply traverses the nodes of the computational graph; unlike the score function gradient, the reparameterization gradient does not require TensorFlow's tf.stop_gradient() function.

There is a nuance here: TensorFlow's optimizers are configured to minimize an objective function, so the loss is set to the negative of the ELBO, and its gradient is the negative of the ELBO's gradient.
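For example, here is a minimal sketch of this sign flip, assuming the TF 1.x optimizer API and a toy concave objective standing in for the ELBO:

import tensorflow as tf

# lam is a stand-in variational parameter; the objective is illustrative.
lam = tf.Variable(0.0)
elbo = -(lam - 2.0) ** 2

# Optimizers minimize, so pass the negative of the ELBO as the loss.
loss = -elbo
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)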

See the API for more details.

References

Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.

Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (pp. 1278–1286).