## Deep Probabilistic Programming

This webpage is a companion to the article Deep Probabilistic Programming (Tran et al., 2017). It provides additional details so that the code snippets can be run as-is. An interactive version as a Jupyter notebook is available here.

The code snippets assume the following versions.

pip install edward==1.3.1
pip install tensorflow==1.1.0  # alternatively, tensorflow-gpu==1.1.0
pip install keras==2.0.0

### Section 3. Compositional Representations for Probabilistic Models

Figure 1. Beta-Bernoulli program.

import tensorflow as tf
from edward.models import Bernoulli, Beta

theta = Beta(1.0, 1.0)
x = Bernoulli(probs=tf.ones(50) * theta)  # theta is a probability, not a logit

For an example of it in use, see examples/beta_bernoulli.py in the GitHub repository.
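
The Beta-Bernoulli program describes ancestral sampling: draw theta from its Beta prior, then draw 50 Bernoulli outcomes given theta. The sketch below mirrors that process in plain Python using only the standard library, so it is not Edward code. The function names are illustrative, and the conjugate update it shows is the standard closed-form posterior for this model rather than anything Edward computes here.

```python
import random


def sample_beta_bernoulli(n=50, a=1.0, b=1.0, seed=0):
    """Ancestral sampling: theta ~ Beta(a, b), then x_i ~ Bernoulli(theta)."""
    rng = random.Random(seed)
    theta = rng.betavariate(a, b)
    x = [1 if rng.random() < theta else 0 for _ in range(n)]
    return theta, x


def beta_posterior(x, a=1.0, b=1.0):
    """Conjugate update: the posterior over theta is Beta(a + heads, b + tails)."""
    heads = sum(x)
    return a + heads, b + len(x) - heads


theta, x = sample_beta_bernoulli()
a_post, b_post = beta_posterior(x)
posterior_mean = a_post / (a_post + b_post)
```

Because the prior is Beta(1, 1), the two posterior parameters always sum to 52 for 50 observations.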

Figure 2. Variational auto-encoder for a data set of 28 x 28 pixel images (Kingma & Welling, 2014; Rezende, Mohamed, & Wierstra, 2014).

import tensorflow as tf
from edward.models import Bernoulli, Normal
from keras.layers import Dense

N = 55000  # number of data points
d = 50  # latent dimension

# Probabilistic model
z = Normal(loc=tf.zeros([N, d]), scale=tf.ones([N, d]))
h = Dense(256, activation='relu')(z)
x = Bernoulli(logits=Dense(28 * 28, activation=None)(h))

# Variational model
qx = tf.placeholder(tf.float32, [N, 28 * 28])
qh = Dense(256, activation='relu')(qx)
qz = Normal(loc=Dense(d, activation=None)(qh),
            scale=Dense(d, activation='softplus')(qh))

For an example of it in use, see examples/vae.py in the GitHub repository.

Figure 3. Bayesian recurrent neural network (Neal, 2012). The program has an unspecified number of time steps; it uses a symbolic for loop (tf.scan).

import edward as ed
import tensorflow as tf
from edward.models import Normal

H = 50  # number of hidden units
D = 10  # number of features

def rnn_cell(hprev, xt):
  return tf.tanh(ed.dot(hprev, Wh) + ed.dot(xt, Wx) + bh)

Wh = Normal(loc=tf.zeros([H, H]), scale=tf.ones([H, H]))
Wx = Normal(loc=tf.zeros([D, H]), scale=tf.ones([D, H]))
Wy = Normal(loc=tf.zeros([H, 1]), scale=tf.ones([H, 1]))
bh = Normal(loc=tf.zeros(H), scale=tf.ones(H))
by = Normal(loc=tf.zeros(1), scale=tf.ones(1))

x = tf.placeholder(tf.float32, [None, D])
h = tf.scan(rnn_cell, x, initializer=tf.zeros(H))
y = Normal(loc=tf.matmul(h, Wy) + by, scale=1.0)
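
To make the role of the symbolic loop concrete, here is a minimal plain-Python sketch of what a scan does: it threads a state through a sequence and keeps every intermediate state. It is an illustration only, not TensorFlow's implementation. The scalar weights `wh`, `wx`, and `bh` are hypothetical fixed numbers standing in for the random variables `Wh`, `Wx`, and `bh`.

```python
import math


def scan(fn, elems, initializer):
    """Apply fn(state, elem) left to right and return every intermediate state."""
    states = []
    state = initializer
    for elem in elems:
        state = fn(state, elem)
        states.append(state)
    return states


def rnn_cell(hprev, xt, wh=0.5, wx=1.0, bh=0.0):
    # Scalar stand-in for tanh(hprev . Wh + xt . Wx + bh); the weights are
    # fixed illustrative numbers here, not random variables.
    return math.tanh(hprev * wh + xt * wx + bh)


x = [0.1, -0.3, 0.7, 0.0]  # the sequence can have any length
h = scan(rnn_cell, x, initializer=0.0)
```

The output has one hidden state per time step, which is why the program can leave the number of time steps unspecified.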

### Section 4. Compositional Representations for Inference

Figure 5. Hierarchical model (Gelman & Hill, 2006). It is a mixture of Gaussians over $$D$$-dimensional data $$\{x_n\}\in\mathbb{R}^{N\times D}$$. There are $$K$$ latent cluster means $$\beta\in\mathbb{R}^{K\times D}$$.

import tensorflow as tf
from edward.models import Categorical, Normal

N = 10000  # number of data points
D = 2  # data dimension
K = 5  # number of clusters

beta = Normal(loc=tf.zeros([K, D]), scale=tf.ones([K, D]))
z = Categorical(logits=tf.zeros([N, K]))
x = Normal(loc=tf.gather(beta, z), scale=tf.ones([N, D]))

It is used below in Figure 6 (left/right) and Figure * (variational EM).
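
As a rough illustration of the generative process behind Figure 5, the following standard-library sketch draws cluster means, cluster assignments, and data in that order. It is not Edward code, and the function name and the small default sizes are assumptions chosen for readability.

```python
import random


def sample_mixture(n=10, d=2, k=5, seed=0):
    """beta_k ~ N(0, I), z_n ~ Categorical(uniform), x_n ~ N(beta[z_n], I)."""
    rng = random.Random(seed)
    beta = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    z = [rng.randrange(k) for _ in range(n)]  # zero logits mean uniform clusters
    # Indexing beta by z plays the role of tf.gather(beta, z).
    x = [[rng.gauss(beta[zn][j], 1.0) for j in range(d)] for zn in z]
    return beta, z, x


beta, z, x = sample_mixture()
```

Each data point is centered on the mean of the cluster it was assigned to.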

Figure 6 (left). Variational inference (Jordan, Ghahramani, Jaakkola, & Saul, 1999). It performs inference on the model defined in Figure 5.

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Categorical, Normal

x_train = np.zeros([N, D])

qbeta = Normal(loc=tf.Variable(tf.zeros([K, D])),
               scale=tf.exp(tf.Variable(tf.zeros([K, D]))))
qz = Categorical(logits=tf.Variable(tf.zeros([N, K])))

inference = ed.VariationalInference({beta: qbeta, z: qz}, data={x: x_train})

Figure 6 (right). Monte Carlo (Robert & Casella, 1999). It performs inference on the model defined in Figure 5.

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Empirical

x_train = np.zeros([N, D])

T = 10000  # number of samples
qbeta = Empirical(params=tf.Variable(tf.zeros([T, K, D])))
qz = Empirical(params=tf.Variable(tf.zeros([T, N])))

inference = ed.MonteCarlo({beta: qbeta, z: qz}, data={x: x_train})

Figure 7. Generative adversarial network (Goodfellow et al., 2014).

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Normal
from keras.layers import Dense

N = 55000  # number of data points
d = 50  # latent dimension

def generative_network(eps):
  h = Dense(256, activation='relu')(eps)
  return Dense(28 * 28, activation=None)(h)

def discriminative_network(x):
  h = Dense(28 * 28, activation='relu')(x)
  return Dense(1, activation=None)(h)

# DATA
x_train = np.zeros([N, 28 * 28])

# MODEL
eps = Normal(loc=tf.zeros([N, d]), scale=tf.ones([N, d]))
x = generative_network(eps)

# INFERENCE
inference = ed.GANInference(data={x: x_train},
                            discriminator=discriminative_network)

For an example of it in use, see the generative adversarial networks tutorial.

Figure *. Variational EM (Neal & Hinton, 1993). It performs inference on the model defined in Figure 5.

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Categorical, PointMass

# DATA
x_train = np.zeros([N, D])

# INFERENCE
qbeta = PointMass(params=tf.Variable(tf.zeros([K, D])))
qz = Categorical(logits=tf.Variable(tf.zeros([N, K])))

inference_e = ed.VariationalInference({z: qz}, data={x: x_train, beta: qbeta})
inference_m = ed.MAP({beta: qbeta}, data={x: x_train, z: qz})

inference_e.initialize()
inference_m.initialize()

tf.global_variables_initializer().run()  # replaces deprecated tf.initialize_all_variables()

for _ in range(10000):
  inference_e.update()
  inference_m.update()

For more details, see the inference compositionality webpage. For a version that performs Monte Carlo EM for logistic factor analysis, see examples/factor_analysis.py in the GitHub repository. It uses Hamiltonian Monte Carlo for the E-step to perform maximum marginal a posteriori estimation.
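
The alternation between an E-step and an M-step can be sketched in plain Python for a one-dimensional, two-cluster version of the mixture model. This is a simplified illustration under stated assumptions, not the Edward algorithm. It uses closed-form updates instead of gradient steps, unit-variance likelihoods, uniform mixing weights, and a N(0, 1) prior on each mean. All function names are hypothetical.

```python
import math
import random


def e_step(x, beta):
    """Responsibilities q(z_n = k), proportional to N(x_n | beta_k, 1)."""
    resp = []
    for xn in x:
        logp = [-0.5 * (xn - bk) ** 2 for bk in beta]
        m = max(logp)
        w = [math.exp(lp - m) for lp in logp]  # subtract max for stability
        total = sum(w)
        resp.append([wk / total for wk in w])
    return resp


def m_step(x, resp, k):
    """MAP estimate of each mean under a N(0, 1) prior, unit-variance likelihood."""
    beta = []
    for j in range(k):
        weight = sum(r[j] for r in resp)
        weighted_sum = sum(r[j] * xn for r, xn in zip(resp, x))
        beta.append(weighted_sum / (1.0 + weight))
    return beta


rng = random.Random(0)
x = ([rng.gauss(-3.0, 1.0) for _ in range(200)] +
     [rng.gauss(3.0, 1.0) for _ in range(200)])

beta = [-1.0, 1.0]  # initial point estimates
for _ in range(50):
    resp = e_step(x, beta)     # analogue of inference_e.update()
    beta = m_step(x, resp, 2)  # analogue of inference_m.update()
```

The E-step holds the point estimate of the means fixed and updates the distribution over assignments. The M-step holds the assignments fixed and updates the point estimate, matching the roles of `inference_e` and `inference_m`.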

Figure *. Data subsampling.

import edward as ed
import tensorflow as tf
from edward.models import Categorical, Normal

N = 10000  # number of data points
M = 128  # batch size during training
D = 2  # data dimension
K = 5  # number of clusters

# DATA
x_batch = tf.placeholder(tf.float32, [M, D])

# MODEL
beta = Normal(loc=tf.zeros([K, D]), scale=tf.ones([K, D]))
z = Categorical(logits=tf.zeros([M, K]))
x = Normal(loc=tf.gather(beta, z), scale=tf.ones([M, D]))

# INFERENCE
qbeta = Normal(loc=tf.Variable(tf.zeros([K, D])),
               scale=tf.nn.softplus(tf.Variable(tf.zeros([K, D]))))
qz = Categorical(logits=tf.Variable(tf.zeros([M, K])))  # one logit per cluster

inference = ed.VariationalInference({beta: qbeta, z: qz}, data={x: x_batch})
inference.initialize(scale={x: float(N) / M, z: float(N) / M})

For more details, see the data subsampling webpage.
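
The reason for the `float(N) / M` factor can be checked numerically. The standard-library sketch below uses a hypothetical one-dimensional Gaussian likelihood with illustrative sizes. It shows that minibatch log-likelihoods scaled by N / M average to the full-data log-likelihood, so a scaled minibatch term is an unbiased estimate of it.

```python
import math
import random


def log_lik(xs, mu):
    """Sum of log N(x | mu, 1) over the points in xs."""
    return sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mu) ** 2 for x in xs)


N, M = 1000, 100
rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(N)]

full = log_lik(data, mu=0.5)

# Scale each minibatch term by N / M, as in scale={x: float(N) / M}.
batches = [data[i:i + M] for i in range(0, N, M)]
scaled = [float(N) / M * log_lik(batch, mu=0.5) for batch in batches]

# Averaging over all disjoint batches recovers the full-data value
# (up to floating point), which is why the scaled estimate is unbiased.
average_scaled = sum(scaled) / len(scaled)
```

Any single scaled minibatch value is noisy, but its expectation over uniformly chosen batches equals the full-data term.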

### Section 5. Experiments

Figure 9. Bayesian logistic regression with Hamiltonian Monte Carlo.

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Bernoulli, Empirical, Normal

N = 581012  # number of data points
D = 54  # number of features
T = 100  # number of empirical samples

# DATA
x_data = np.zeros([N, D], dtype=np.float32)  # float32 to match beta's dtype
y_data = np.zeros([N])

# MODEL
x = tf.Variable(x_data, trainable=False)
beta = Normal(loc=tf.zeros(D), scale=tf.ones(D))
y = Bernoulli(logits=ed.dot(x, beta))

# INFERENCE
qbeta = Empirical(params=tf.Variable(tf.zeros([T, D])))
inference = ed.HMC({beta: qbeta}, data={y: y_data})
inference.run(step_size=0.5 / N, n_steps=10)

For an example of it in use, see examples/bayesian_logistic_regression.py in the GitHub repository.
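
To give a sense of what the `step_size` and `n_steps` arguments control, here is a minimal Hamiltonian Monte Carlo sketch in plain Python. It targets a one-dimensional standard normal rather than the logistic regression posterior, and it is not Edward's implementation. The function names and the values `step_size=0.3` and `n_steps=5` are illustrative assumptions.

```python
import math
import random


def grad_u(q):
    """Gradient of the potential U(q) = 0.5 * q**2 (standard normal target)."""
    return q


def leapfrog(q, p, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p = p - 0.5 * step_size * grad_u(q)    # half step for momentum
    for i in range(n_steps):
        q = q + step_size * p              # full step for position
        if i < n_steps - 1:
            p = p - step_size * grad_u(q)  # full step for momentum
    p = p - 0.5 * step_size * grad_u(q)    # final half step for momentum
    return q, p


def hamiltonian(q, p):
    """Potential plus kinetic energy."""
    return 0.5 * q ** 2 + 0.5 * p ** 2


def hmc(n_samples, step_size, n_steps, seed=0):
    rng = random.Random(seed)
    q, samples = 0.0, []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)  # resample momentum
        q_new, p_new = leapfrog(q, p, step_size, n_steps)
        log_accept = hamiltonian(q, p) - hamiltonian(q_new, p_new)
        if math.log(1.0 - rng.random()) < log_accept:  # Metropolis correction
            q = q_new
        samples.append(q)
    return samples


samples = hmc(n_samples=5000, step_size=0.3, n_steps=5)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Smaller step sizes track the dynamics more accurately and raise the acceptance rate, at the cost of shorter moves per leapfrog step.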

### Appendix A. Model Examples

Figure 10. Bayesian neural network for classification (Denker, Schwartz, Wittner, & Solla, 1987).

import tensorflow as tf
from edward.models import Bernoulli, Normal

N = 1000  # number of data points
D = 528  # number of features
H = 256  # hidden layer size

W_0 = Normal(loc=tf.zeros([D, H]), scale=tf.ones([D, H]))
W_1 = Normal(loc=tf.zeros([H, 1]), scale=tf.ones([H, 1]))
b_0 = Normal(loc=tf.zeros(H), scale=tf.ones(H))
b_1 = Normal(loc=tf.zeros(1), scale=tf.ones(1))

x = tf.placeholder(tf.float32, [N, D])
y = Bernoulli(logits=tf.matmul(tf.nn.tanh(tf.matmul(x, W_0) + b_0), W_1) + b_1)

For an example of it in use, see examples/getting_started_example.py in the GitHub repository.

Figure 11. Latent Dirichlet allocation (Blei, Ng, & Jordan, 2003).

import tensorflow as tf
from edward.models import Categorical, Dirichlet

D = 4  # number of documents
N = [11502, 213, 1523, 1351]  # words per doc
K = 10  # number of topics
V = 100000  # vocabulary size

theta = Dirichlet(tf.zeros([D, K]) + 0.1)
phi = Dirichlet(tf.zeros([K, V]) + 0.05)
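# One independent list per document, sized by that document's word count.
z = [[0] * N[d] for d in range(D)]
w = [[0] * N[d] for d in range(D)]
for d in range(D):
  for n in range(N[d]):
    z[d][n] = Categorical(probs=theta[d, :])
    w[d][n] = Categorical(probs=phi[z[d][n], :])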

Figure 12. Gaussian matrix factorization (Salakhutdinov & Mnih, 2011).

import tensorflow as tf
from edward.models import Normal

N = 10
M = 10
K = 5  # latent dimension

U = Normal(loc=tf.zeros([M, K]), scale=tf.ones([M, K]))
V = Normal(loc=tf.zeros([N, K]), scale=tf.ones([N, K]))
Y = Normal(loc=tf.matmul(U, V, transpose_b=True), scale=tf.ones([M, N]))

Figure 13. Dirichlet process mixture model (Antoniak, 1974).

import tensorflow as tf
from edward.models import DirichletProcess, Normal

N = 1000  # number of data points
D = 5  # data dimensionality

dp = DirichletProcess(alpha=1.0, base=Normal(loc=tf.zeros(D), scale=tf.ones(D)))
mu = dp.sample(N)
x = Normal(loc=mu, scale=tf.ones([N, D]))

For the essential component that defines the DirichletProcess, see examples/pp_dirichlet_process.py in the GitHub repository. Its source implementation can be found at edward/models/dirichlet_process.py.
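
One way to build intuition for what N draws from a Dirichlet process look like is the Blackwell-MacQueen urn scheme (the Chinese restaurant process), which marginalizes out the random measure. The standard-library sketch below uses that scheme with a standard normal base distribution. It is an illustrative alternative construction, not necessarily how edward/models/dirichlet_process.py is implemented, and the function name is hypothetical.

```python
import random


def sample_dp_draws(n, alpha=1.0, d=5, seed=0):
    """Draw n points from G ~ DP(alpha, N(0, I)) with G marginalized out."""
    rng = random.Random(seed)
    draws = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            # New atom from the base distribution N(0, I) in d dimensions.
            draws.append(tuple(rng.gauss(0.0, 1.0) for _ in range(d)))
        else:
            # Repeat one of the earlier draws, chosen uniformly at random.
            draws.append(draws[rng.randrange(i)])
    return draws


mu = sample_dp_draws(1000)
n_clusters = len(set(mu))
```

Many of the draws coincide, so the number of distinct values is far smaller than the number of draws. This is what lets the values act as cluster means in the mixture model.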

### Appendix B. Inference Examples

Figure *. Stochastic variational inference (Hoffman, Blei, Wang, & Paisley, 2013). For more details, see the data subsampling webpage.

### Appendix C. Complete Examples

Figure 15. Variational auto-encoder (Kingma & Welling, 2014; Rezende et al., 2014). See the script examples/vae.py in the GitHub repository.

Figure 16. Exponential family embedding (Rudolph, Ruiz, Mandt, & Blei, 2016). A GitHub repository with comprehensive features is available at mariru/exponential_family_embeddings.

### References

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 1152–1174.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.

Denker, J., Schwartz, D., Wittner, B., & Solla, S. (1987). Large automatic learning, rule extraction, and generalization. Complex Systems.

Gelman, A., & Hill, J. L. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. In Neural information processing systems.

Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1), 1303–1347.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Kingma, D., & Welling, M. (2014). Auto-encoding variational Bayes. In International conference on learning representations.

Neal, R. M. (2012). Bayesian learning for neural networks (Vol. 118). Springer Science & Business Media.

Neal, R. M., & Hinton, G. E. (1993). A new view of the EM algorithm that justifies incremental and other variants. In Learning in graphical models (pp. 355–368).

Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML (pp. 1278–1286).

Robert, C. P., & Casella, G. (1999). Monte Carlo statistical methods. Springer.

Rudolph, M. R., Ruiz, F. J. R., Mandt, S., & Blei, D. M. (2016). Exponential Family Embeddings. In Neural information processing systems.

Salakhutdinov, R., & Mnih, A. (2011). Probabilistic matrix factorization. In Neural information processing systems.

Tran, D., Hoffman, M. D., Saurous, R. A., Brevdo, E., Murphy, K., & Blei, D. M. (2017). Deep probabilistic programming. In International conference on learning representations.