Score function gradient

(This tutorial follows the \(\text{KL}(q\|p)\) minimization tutorial.)

We seek to maximize the ELBO, \[\begin{aligned} \lambda^* &= \arg \max_\lambda \text{ELBO}(\lambda)\\ &=\; \arg \max_\lambda\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big],\end{aligned}\] using a “black box” algorithm. This means generically inferring the posterior while making few assumptions about the model.
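
Recall from that tutorial that the ELBO differs from the negative divergence \(-\text{KL}(q\|p)\) only by the constant \(\log p(\mathbf{x})\), \[\begin{aligned} \text{ELBO}(\lambda) &= \log p(\mathbf{x}) - \text{KL}\big(q(\mathbf{z}\;;\;\lambda)\,\|\,p(\mathbf{z}\mid\mathbf{x})\big),\end{aligned}\] so maximizing the ELBO over \(\lambda\) is equivalent to minimizing \(\text{KL}(q\|p)\), while requiring only the joint density \(p(\mathbf{x}, \mathbf{z})\) rather than the intractable posterior.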

The score function identity

Gradient ascent (equivalently, gradient descent on the negative ELBO) is a standard approach for optimizing complicated objectives like the ELBO. The idea is to calculate its gradient \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &= \nabla_\lambda\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big],\end{aligned}\] and update the current set of parameters by taking a step proportional to the gradient.

The score function gradient estimator leverages the log-derivative identity, \(\nabla_\lambda\, q(\mathbf{z}\;;\;\lambda) = q(\mathbf{z}\;;\;\lambda)\, \nabla_\lambda \log q(\mathbf{z}\;;\;\lambda)\), to write the gradient as \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &=\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \nabla_\lambda \log q(\mathbf{z}\;;\;\lambda) \: \big( \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big) \big].\end{aligned}\] The gradient of the ELBO is an expectation over the variational model \(q(\mathbf{z}\;;\;\lambda)\); the only new ingredient it requires is the score function \(\nabla_\lambda \log q(\mathbf{z}\;;\;\lambda)\) (Paisley, Blei, & Jordan, 2012; Ranganath, Gerrish, & Blei, 2014).
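
To see where this comes from, apply the log-derivative identity under the integral sign: for any function \(f\) that does not depend on \(\lambda\), \[\begin{aligned} \nabla_\lambda\, \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ f(\mathbf{z}) \big] &= \int \nabla_\lambda\, q(\mathbf{z}\;;\;\lambda)\; f(\mathbf{z})\; \text{d}\mathbf{z} \\ &= \int q(\mathbf{z}\;;\;\lambda)\, \nabla_\lambda \log q(\mathbf{z}\;;\;\lambda)\; f(\mathbf{z})\; \text{d}\mathbf{z} \\ &= \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \nabla_\lambda \log q(\mathbf{z}\;;\;\lambda)\; f(\mathbf{z}) \big].\end{aligned}\] Taking \(f(\mathbf{z}) = \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda)\) also produces one extra term from differentiating \(-\log q(\mathbf{z}\;;\;\lambda)\) inside the expectation, namely \(-\mathbb{E}_{q(\mathbf{z}\;;\;\lambda)}\big[\nabla_\lambda \log q(\mathbf{z}\;;\;\lambda)\big]\), but this term vanishes because the score function has zero expectation.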

Edward uses automatic differentiation, specifically with TensorFlow’s computational graphs, making this gradient computation both simple and efficient to distribute.
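
As a minimal, hypothetical sketch of this (not Edward code), assuming TensorFlow 1.x's graph-mode API and its tf.distributions module (since moved to TensorFlow Probability), automatic differentiation gives the score function of a Gaussian variational factor without deriving it by hand; the variable names below are illustrative only.

import tensorflow as tf

# Score function of q(z; lambda) = Normal(mu, sigma), via autodiff.
mu = tf.Variable(0.0)
log_sigma = tf.Variable(0.0)
q = tf.distributions.Normal(loc=mu, scale=tf.exp(log_sigma))

# Treat the samples as constants so gradients do not flow through them.
z = tf.stop_gradient(q.sample(5))

# List of gradients of the summed log density with respect to
# lambda = (mu, log_sigma), i.e. the score summed over the 5 samples.
score = tf.gradients(tf.reduce_sum(q.log_prob(z)), [mu, log_sigma])

Edward's implementation, sketched below, relies on the same mechanism inside its inference classes.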

Noisy estimates using Monte Carlo integration

We can use Monte Carlo integration to obtain noisy estimates of both the ELBO and its gradient. The basic procedure follows these steps:

  1. draw \(S\) samples \(\{\mathbf{z}_s\}_1^S \sim q(\mathbf{z}\;;\;\lambda)\),
  2. evaluate the argument of the expectation using \(\{\mathbf{z}_s\}_1^S\), and
  3. compute the empirical mean of the evaluated quantities.

A Monte Carlo estimate of the gradient is then \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &\approx\; \frac{1}{S} \sum_{s=1}^{S} \big[ \big( \log p(\mathbf{x}, \mathbf{z}_s) - \log q(\mathbf{z}_s\;;\;\lambda) \big) \: \nabla_\lambda \log q(\mathbf{z}_s\;;\;\lambda) \big].\end{aligned}\] This is an unbiased estimate of the actual gradient of the ELBO.
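
For concreteness, here is a self-contained sketch of this estimator on a toy conjugate model (illustrative Python, not Edward code): the model is \(z \sim \mathcal{N}(0, 1)\), \(x \mid z \sim \mathcal{N}(z, 1)\) with observed \(x = 3\), and the variational family is \(q(z\;;\;\lambda) = \mathcal{N}(\lambda, 1)\), whose score is \(\nabla_\lambda \log q(z\;;\;\lambda) = z - \lambda\).

import numpy as np

def log_normal(y, mean, var):
  """Log density of a univariate normal distribution."""
  return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y - mean) ** 2 / var

x, lam, S = 3.0, 0.0, 100000
z = np.random.normal(lam, 1.0, size=S)            # 1. draw S samples from q
log_p = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(x, z_s)
log_q = log_normal(z, lam, 1.0)                   # log q(z_s; lambda)
score = z - lam                                   # grad_lambda log q(z_s; lambda)
grad_estimate = np.mean(score * (log_p - log_q))  # 2.-3. evaluate and average

For this model the posterior is \(\mathcal{N}(1.5, 0.5)\) and the exact gradient of the ELBO at \(\lambda = 0\) is \(3.0\), so grad_estimate fluctuates around that value: the estimate is unbiased but noisy.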

Implementation

Note: These details are outdated since Edward v1.1.3.

We implement \(\text{KL}(q\|p)\) minimization with the score function gradient in the class KLqp. It inherits from VariationalInference, which provides a collection of default methods such as an optimizer.

import six
import tensorflow as tf


# VariationalInference is defined elsewhere in Edward's inference module.
class KLqp(VariationalInference):
  def __init__(self, *args, **kwargs):
    super(KLqp, self).__init__(*args, **kwargs)

  def initialize(self, n_samples=1, *args, **kwargs):
    ...
    # Number of samples from q(z; lambda) in the Monte Carlo estimates.
    self.n_samples = n_samples
    return super(KLqp, self).initialize(*args, **kwargs)

  def build_score_loss(self):
    x = self.data
    # Draw n_samples samples from each latent variable's variational factor.
    z = {key: rv.sample([self.n_samples])
         for key, rv in six.iteritems(self.latent_vars)}

    # log p(x, z_s) for each sample, via the model wrapper.
    p_log_prob = self.model_wrapper.log_prob(x, z)
    # log q(z_s; lambda), summing over each latent variable's dimensions;
    # stop_gradient() treats the samples as constants under differentiation.
    q_log_prob = 0.0
    for key, rv in six.iteritems(self.latent_vars):
      q_log_prob += tf.reduce_sum(rv.log_prob(tf.stop_gradient(z[key])),
                                  list(range(1, len(rv.get_batch_shape()) + 1)))

    # Negative Monte Carlo estimate of the ELBO, tracked for diagnostics.
    losses = p_log_prob - q_log_prob
    self.loss = -tf.reduce_mean(losses)
    # Surrogate objective whose gradient is the negative of the score
    # function gradient estimate of the ELBO.
    return -tf.reduce_mean(q_log_prob * tf.stop_gradient(losses))

Two methods are added: initialize() and build_score_loss(). The method initialize() follows the same initialization as VariationalInference, and adds an argument, n_samples, for the number of samples to draw from the variational model.

The method build_score_loss() implements the ELBO and its gradient. It draws self.n_samples samples from the variational model. It registers the Monte Carlo estimate of the ELBO in TensorFlow’s computational graph, and stores its negative in self.loss in order to track the progress of the inference for diagnostics.

The method returns a tensor whose automatic differentiation yields the score function gradient of the ELBO (up to the sign discussed below). The TensorFlow function tf.stop_gradient() tells the computational graph to stop traversing nodes to propagate gradients. In this case, the only gradients taken are \(\nabla_\lambda \log q(\mathbf{z}_s\;;\;\lambda)\), one for each sample \(\mathbf{z}_s\). We multiply them element-wise by losses and return the negative of their mean.

There is a nuance here. TensorFlow’s optimizers are configured to minimize an objective function. So the gradient is set to be the negative of the ELBO’s gradient.
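
As a minimal, self-contained sketch of this convention (illustrative code on the toy model from the Monte Carlo section, not Edward's actual training loop), handing a TensorFlow 1.x optimizer the negated surrogate makes each minimization step ascend the ELBO.

import tensorflow as tf

# Score function surrogate for the toy model: prior N(0, 1),
# likelihood N(x; z, 1) with x = 3, and q(z; lambda) = N(lambda, 1).
lam = tf.Variable(0.0)
q = tf.distributions.Normal(loc=lam, scale=1.0)
z = tf.stop_gradient(q.sample(50))                  # samples act as constants
log_p = (tf.distributions.Normal(loc=0.0, scale=1.0).log_prob(z) +
         tf.distributions.Normal(loc=z, scale=1.0).log_prob(3.0))
log_q = q.log_prob(z)
losses = log_p - log_q
surrogate = -tf.reduce_mean(log_q * tf.stop_gradient(losses))

# TensorFlow optimizers minimize, so minimizing the negated surrogate
# follows the (estimated) gradient of the ELBO itself.
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(surrogate)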

See the API for more details.

References

Paisley, J., Blei, D. M., & Jordan, M. I. (2012). Variational Bayesian inference with stochastic search. In International conference on machine learning (pp. 1367–1374).

Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In Artificial intelligence and statistics.