Core to Edward’s design is compositionality. Compositionality enables fine control of inference, where we can write inference as a collection of separate inference programs.

We outline how to write popular classes of compositional inferences using Edward: hybrid algorithms and message passing algorithms. We use the running example of a mixture model with latent mixture assignments `z`, latent cluster means `beta`, and observations `x`.

Hybrid algorithms leverage different inferences for each latent variable in the posterior. As an example, we demonstrate variational EM, with an approximate E-step over local variables and an M-step over global variables. We alternate with one update of each (Neal & Hinton, 1993).

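```
import edward as ed
import tensorflow as tf
from edward.models import Categorical, PointMass

# Point-mass approximation for the global cluster means (updated in the M-step).
qbeta = PointMass(params=tf.Variable(tf.zeros([K, D])))
# Categorical approximation for the local assignments (updated in the E-step).
qz = Categorical(logits=tf.Variable(tf.zeros([N, K])))

# Each inference conditions on the other's current approximation via `data`.
inference_e = ed.VariationalInference({z: qz}, data={x: x_data, beta: qbeta})
inference_m = ed.MAP({beta: qbeta}, data={x: x_data, z: qz})
...
for _ in range(10000):
    inference_e.update()
    inference_m.update()
```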

In `data`, we include bindings of prior latent variables (`z` or `beta`) to posterior latent variables (`qz` or `qbeta`). This performs conditional inference, where only a subset of the posterior is inferred while the rest are fixed using other inferences.
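To make the alternation concrete, here is a minimal plain-Python sketch of the two steps for a one-dimensional mixture. It does not use Edward. It assumes unit-variance Gaussian components, a uniform prior over assignments, and a standard normal prior on each cluster mean, and it uses closed-form updates in place of repeated `update()` calls. The names `e_step`, `m_step`, and `means` are illustrative and are not part of Edward's API.

```python
import math
import random


def e_step(x, means):
    """Update q(z) with the cluster means held fixed."""
    resp = []
    for xi in x:
        # Log-density of xi under each unit-variance component (up to a constant).
        logp = [-0.5 * (xi - m) ** 2 for m in means]
        top = max(logp)
        weights = [math.exp(lp - top) for lp in logp]
        total = sum(weights)
        resp.append([w / total for w in weights])
    return resp


def m_step(x, resp, k):
    """Update the point estimate of each mean with q(z) held fixed.

    Assumes a standard normal prior on each mean, so this is a MAP update:
    the prior contributes one unit of precision to the denominator.
    """
    means = []
    for j in range(k):
        weight = sum(r[j] for r in resp)
        weighted_sum = sum(r[j] * xi for r, xi in zip(resp, x))
        means.append(weighted_sum / (weight + 1.0))
    return means


random.seed(0)
x_data = ([random.gauss(-3.0, 1.0) for _ in range(200)]
          + [random.gauss(3.0, 1.0) for _ in range(200)])

means = [-1.0, 1.0]  # initial point estimates
for _ in range(50):
    resp = e_step(x_data, means)        # infer z given the current means
    means = m_step(x_data, resp, k=2)   # infer the means given q(z)
```

Each step holds the other block of variables fixed, which mirrors how `inference_e` and `inference_m` bind the other approximation through `data`.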

This extends to many algorithms: for example, exact EM for exponential families; contrastive divergence (Hinton, 2002); pseudo-marginal and ABC methods (Andrieu & Roberts, 2009); Gibbs sampling within variational inference (Wang & Blei, 2012); Laplace variational inference (Wang & Blei, 2013); and structured variational auto-encoders (Johnson, Duvenaud, Wiltschko, Datta, & Adams, 2016).

Message passing algorithms operate on the posterior distribution using a collection of local inferences (Koller & Friedman, 2009). As an example, we demonstrate expectation propagation. We split a mixture model to be over two random variables `x1` and `x2` along with their latent mixture assignments `z1` and `z2`.

```
import edward as ed
import tensorflow as tf
from edward.models import Categorical, Normal

N1 = 1000  # number of data points in first data set
N2 = 2000  # number of data points in second data set
D = 2  # data dimension
K = 5  # number of clusters

# MODEL
beta = Normal(loc=tf.zeros([K, D]), scale=tf.ones([K, D]))
z1 = Categorical(logits=tf.zeros([N1, K]))
z2 = Categorical(logits=tf.zeros([N2, K]))
x1 = Normal(loc=tf.gather(beta, z1), scale=tf.ones([N1, D]))
x2 = Normal(loc=tf.gather(beta, z2), scale=tf.ones([N2, D]))

# INFERENCE
qbeta = Normal(loc=tf.Variable(tf.zeros([K, D])),
               scale=tf.nn.softplus(tf.Variable(tf.zeros([K, D]))))
qz1 = Categorical(logits=tf.Variable(tf.zeros([N1, K])))
qz2 = Categorical(logits=tf.Variable(tf.zeros([N2, K])))

# Two local inferences that share the global factor qbeta.
inference_z1 = ed.KLpq({beta: qbeta, z1: qz1}, {x1: x1_train})
inference_z2 = ed.KLpq({beta: qbeta, z2: qz2}, {x2: x2_train})
...
for _ in range(10000):
    inference_z1.update()
    inference_z2.update()
```

We alternate updates for each local inference, where the global posterior factor \(q(\beta)\) is shared across both inferences (Gelman et al., 2017).
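The pattern of local updates around a shared global factor can be sketched in plain Python on a much simpler model than the mixture above. The sketch assumes a scalar parameter with a zero-mean Gaussian prior and unit-variance Gaussian observations split into partitions. In this conjugate case the tilted distribution is itself Gaussian, so the moment-matching step is exact and the result coincides with the exact posterior; for the mixture model the tilted moments would have to be approximated. The function `ep_partitioned` and its arguments are hypothetical names for illustration, not Edward functions.

```python
def ep_partitioned(partitions, prior_precision=1.0, sweeps=3):
    """Expectation propagation over data partitions for a Gaussian mean.

    Factors are stored as natural parameters: (precision, precision * mean).
    Returns the mean and variance of the global approximation q.
    """
    # One site approximation per partition, initialized to a flat factor.
    sites = [(0.0, 0.0) for _ in partitions]
    # Global factor starts at the prior (mean zero).
    q_prec, q_shift = prior_precision, 0.0

    for _ in range(sweeps):
        for k, xs in enumerate(partitions):
            # Cavity: remove site k's contribution from the global factor.
            cav_prec = q_prec - sites[k][0]
            cav_shift = q_shift - sites[k][1]
            # Tilted distribution: cavity times the likelihood of partition k.
            # Each unit-variance observation adds 1 to the precision and
            # its value to the shift, so the tilted distribution is Gaussian.
            tilted_prec = cav_prec + len(xs)
            tilted_shift = cav_shift + sum(xs)
            # Moment matching (exact here), then recover the updated site.
            q_prec, q_shift = tilted_prec, tilted_shift
            sites[k] = (q_prec - cav_prec, q_shift - cav_shift)

    return q_shift / q_prec, 1.0 / q_prec


partitions = [[1.0, 2.0, 3.0], [4.0, 5.0]]
post_mean, post_var = ep_partitioned(partitions)
```

Only the global factor is shared between the local updates, which is what allows each partition to be handled by a separate inference.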

With TensorFlow’s distributed training, compositionality enables *distributed* message passing over a cluster with many workers. The computation can be further sped up with the use of GPUs via data and model parallelism.

This extends to many algorithms: for example, classical message passing, which performs exact local inferences; Gibbs sampling, which draws samples from conditionally conjugate inferences (Geman & Geman, 1984); expectation propagation, which locally minimizes \(\text{KL}(p || q)\) over exponential families (Minka, 2001); integrated nested Laplace approximation, which performs local Laplace approximations (Rue, Martino, & Chopin, 2009); and all the instantiations of EP-like algorithms in Gelman et al. (2017).
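As a brief reminder of why minimizing \(\text{KL}(p || q)\) over an exponential family amounts to moment matching, consider the following standard derivation. The notation is ours and not Edward's: \(t(z)\) denotes the sufficient statistics, \(\eta\) the natural parameters, and \(A(\eta)\) the log-normalizer of \(q(z; \eta) = h(z) \exp(\eta^\top t(z) - A(\eta))\).

\[
\text{KL}(p || q) = -\eta^\top \mathbb{E}_{p}[t(z)] + A(\eta) + \text{const},
\]

\[
\nabla_\eta \text{KL}(p || q) = -\mathbb{E}_{p}[t(z)] + \nabla_\eta A(\eta) = 0
\quad\Longleftrightarrow\quad
\mathbb{E}_{q}[t(z)] = \mathbb{E}_{p}[t(z)],
\]

where the last step uses \(\nabla_\eta A(\eta) = \mathbb{E}_{q}[t(z)]\). For a Gaussian \(q\), this means matching the mean and variance of \(p\).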

In the above, we perform local inferences split over individual random variables. At the moment, Edward does not support local inferences within a random variable itself. We cannot do local inferences when representing the random variable for all data points and their cluster membership as `x` and `z` rather than `x1`, `x2`, `z1`, and `z2`.

Andrieu, C., & Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. *The Annals of Statistics*, 697–725.

Gelman, A., Vehtari, A., Jylänki, P., Sivula, T., Tran, D., Sahai, S., … Robert, C. (2017). Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data. *arXiv Preprint arXiv:1412.4869v2*.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, (6), 721–741.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. *Neural Computation*, *14*(8), 1771–1800.

Johnson, M. J., Duvenaud, D., Wiltschko, A. B., Datta, S. R., & Adams, R. P. (2016). Composing graphical models with neural networks for structured representations and fast inference. *arXiv Preprint arXiv:1603.06277*.

Koller, D., & Friedman, N. (2009). *Probabilistic graphical models: Principles and techniques*. MIT Press.

Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In *Uncertainty in artificial intelligence*.

Neal, R. M., & Hinton, G. E. (1993). A new view of the EM algorithm that justifies incremental and other variants. In *Learning in graphical models* (pp. 355–368).

Rue, H., Martino, S., & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. *Journal of the Royal Statistical Society: Series B (Statistical Methodology)*, *71*(2), 319–392.

Wang, C., & Blei, D. M. (2012). Truncation-free online variational inference for Bayesian nonparametric models. In *Neural information processing systems*.

Wang, C., & Blei, D. M. (2013). Variational inference in nonconjugate models. *Journal of Machine Learning Research*, *14*, 1005–1031.