Deep Probabilistic Programming

This webpage is a companion to the article Deep Probabilistic Programming (Tran et al., 2017). It provides the details needed to run the article's code snippets as-is. An interactive version as a Jupyter notebook is available here.

The code snippets assume the following versions.

pip install edward==1.2.4
pip install tensorflow==1.0.0  # alternatively, tensorflow-gpu==1.0.0
pip install keras==1.0.0

Section 3. Compositional Representations for Probabilistic Models

Figure 1. Beta-Bernoulli program.

import tensorflow as tf
from edward.models import Bernoulli, Beta

theta = Beta(a=1.0, b=1.0)
x = Bernoulli(p=tf.ones(50) * theta)

For an example of it in use, see examples/beta_bernoulli.py in the GitHub repository.
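
As a rough illustration of the generative process that the program above encodes, the following NumPy sketch draws theta from a Beta(1, 1) prior and then 50 Bernoulli outcomes that share that theta. It does not use Edward; the seed and the names post_a, post_b, and post_mean are assumptions made for this sketch. The last lines use Beta-Bernoulli conjugacy to write down the exact posterior.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, chosen only for reproducibility

# Prior: theta ~ Beta(1, 1), i.e. uniform on [0, 1].
theta = rng.beta(1.0, 1.0)

# Likelihood: 50 Bernoulli draws that all share the same theta.
x = rng.binomial(1, theta, size=50)

# Conjugacy: the posterior over theta is Beta(1 + #ones, 1 + #zeros).
post_a = 1.0 + x.sum()
post_b = 1.0 + (50 - x.sum())
post_mean = post_a / (post_a + post_b)
```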

Figure 2. Variational auto-encoder for a data set of 28 x 28 pixel images (Kingma & Welling, 2014; Rezende, Mohamed, & Wierstra, 2014).

import tensorflow as tf
from edward.models import Bernoulli, Normal
from keras.layers import Dense

N = 55000  # number of data points
d = 50  # latent dimension

# Probabilistic model
z = Normal(mu=tf.zeros([N, d]), sigma=tf.ones([N, d]))
h = Dense(256, activation='relu')(z)
x = Bernoulli(logits=Dense(28 * 28, activation=None)(h))

# Variational model
qx = tf.placeholder(tf.float32, [N, 28 * 28])
qh = Dense(256, activation='relu')(qx)
qz = Normal(mu=Dense(d, activation=None)(qh),
            sigma=Dense(d, activation='softplus')(qh))

For an example of it in use, see examples/vae.py in the GitHub repository.

Figure 3. Bayesian recurrent neural network (Neal, 2012). The program has an unspecified number of time steps; it uses a symbolic for loop (tf.scan).

import edward as ed
import tensorflow as tf
from edward.models import Normal

H = 50  # number of hidden units
D = 10  # number of features

def rnn_cell(hprev, xt):
  return tf.tanh(ed.dot(hprev, Wh) + ed.dot(xt, Wx) + bh)

Wh = Normal(mu=tf.zeros([H, H]), sigma=tf.ones([H, H]))
Wx = Normal(mu=tf.zeros([D, H]), sigma=tf.ones([D, H]))
Wy = Normal(mu=tf.zeros([H, 1]), sigma=tf.ones([H, 1]))
bh = Normal(mu=tf.zeros(H), sigma=tf.ones(H))
by = Normal(mu=tf.zeros(1), sigma=tf.ones(1))

x = tf.placeholder(tf.float32, [None, D])
h = tf.scan(rnn_cell, x, initializer=tf.zeros(H))
y = Normal(mu=tf.matmul(h, Wy) + by, sigma=1.0)
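
To make the symbolic loop concrete, here is a minimal NumPy sketch of what a scan does: it applies the cell along the time axis, carries the hidden state forward, and stacks every intermediate state. It is an eager stand-in, not TensorFlow's implementation. The sequence length T, the seed, and the helper name scan are assumptions for this sketch, and the weights are a single draw from their priors.

```python
import numpy as np

rng = np.random.RandomState(0)
H, D, T = 50, 10, 7  # T (sequence length) is an arbitrary choice for this sketch

# One draw of the weights from their standard normal priors.
Wh = rng.randn(H, H)
Wx = rng.randn(D, H)
Wy = rng.randn(H, 1)
bh = rng.randn(H)
by = rng.randn(1)

def rnn_cell(hprev, xt):
    # Same recurrence as in the program above, written with NumPy.
    return np.tanh(hprev @ Wh + xt @ Wx + bh)

def scan(fn, elems, initializer):
    # Apply fn along the first axis, carrying the state and stacking every output.
    state, outputs = initializer, []
    for elem in elems:
        state = fn(state, elem)
        outputs.append(state)
    return np.stack(outputs)

x = rng.randn(T, D)
h = scan(rnn_cell, x, np.zeros(H))  # hidden states, shape [T, H]
y_mean = h @ Wy + by                # mean of the Normal over y, shape [T, 1]
```

Because the loop only depends on the leading dimension of x, the same code handles any number of time steps, which mirrors the None dimension of the placeholder.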

Section 4. Compositional Representations for Inference

Figure 5. Hierarchical model (Gelman & Hill, 2006). It is a mixture of Gaussians over \(D\)-dimensional data \(\{x_n\}\in\mathbb{R}^{N\times D}\). There are \(K\) latent cluster means \(\beta\in\mathbb{R}^{K\times D}\).

import tensorflow as tf
from edward.models import Categorical, Normal

N = 10000  # number of data points
D = 2  # data dimension
K = 5  # number of clusters

beta = Normal(mu=tf.zeros([K, D]), sigma=tf.ones([K, D]))
z = Categorical(logits=tf.zeros([N, K]))
x = Normal(mu=tf.gather(beta, z), sigma=tf.ones([N, D]))

It is used below in Figure 6 (left/right) and Figure * (variational EM).
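
The role of tf.gather in the model may be easier to see in eager form. The NumPy sketch below performs one ancestral sample from the same mixture, assuming (as in the program) uniform cluster probabilities and unit variances; the seed is an assumption. Indexing beta[z] selects, for each data point, the mean of its assigned cluster.

```python
import numpy as np

rng = np.random.RandomState(0)
N, D, K = 10000, 2, 5

beta = rng.randn(K, D)         # cluster means, beta ~ Normal(0, 1)
z = rng.randint(K, size=N)     # uniform assignments, since the logits are all zero
means = beta[z]                # plays the role of tf.gather(beta, z), shape [N, D]
x = means + rng.randn(N, D)    # x ~ Normal(beta[z], 1)
```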

Figure 6 (left). Variational inference (Jordan, Ghahramani, Jaakkola, & Saul, 1999). It performs inference on the model defined in Figure 5.

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Categorical, Normal

x_train = np.zeros([N, D])

qbeta = Normal(mu=tf.Variable(tf.zeros([K, D])),
               sigma=tf.exp(tf.Variable(tf.zeros([K, D]))))
qz = Categorical(logits=tf.Variable(tf.zeros([N, K])))

inference = ed.VariationalInference({beta: qbeta, z: qz}, data={x: x_train})

Figure 6 (right). Monte Carlo (Robert & Casella, 1999). It performs inference on the model defined in Figure 5.

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Empirical

x_train = np.zeros([N, D])

T = 10000  # number of samples
qbeta = Empirical(params=tf.Variable(tf.zeros([T, K, D])))
qz = Empirical(params=tf.Variable(tf.zeros([T, N])))

inference = ed.MonteCarlo({beta: qbeta, z: qz}, data={x: x_train})

Figure 7. Generative adversarial network (Goodfellow et al., 2014).

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Normal
from keras.layers import Dense

N = 55000  # number of data points
d = 50  # latent dimension

def generative_network(eps):
  h = Dense(256, activation='relu')(eps)
  return Dense(28 * 28, activation=None)(h)

def discriminative_network(x):
  h = Dense(28 * 28, activation='relu')(x)
  return Dense(1, activation=None)(h)

# DATA
x_train = np.zeros([N, 28 * 28])

# MODEL
eps = Normal(mu=tf.zeros([N, d]), sigma=tf.ones([N, d]))
x = generative_network(eps)

# INFERENCE
inference = ed.GANInference(data={x: x_train},
    discriminator=discriminative_network)

For an example of it in use, see the generative adversarial networks tutorial.

Figure *. Variational EM (Neal & Hinton, 1993). It performs inference on the model defined in Figure 5.

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Categorical, PointMass

# DATA
x_train = np.zeros([N, D])

# INFERENCE
qbeta = PointMass(params=tf.Variable(tf.zeros([K, D])))
qz = Categorical(logits=tf.Variable(tf.zeros([N, K])))

inference_e = ed.VariationalInference({z: qz}, data={x: x_train, beta: qbeta})
inference_m = ed.MAP({beta: qbeta}, data={x: x_train, z: qz})

inference_e.initialize()
inference_m.initialize()

tf.global_variables_initializer().run()

for _ in range(10000):
  inference_e.update()
  inference_m.update()

For more details, see the inference compositionality webpage. For a version that performs Monte Carlo EM for logistic factor analysis, see examples/factor_analysis.py in the GitHub repository; it uses Hamiltonian Monte Carlo for the E-step to perform maximum marginal a posteriori estimation.
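
The alternating E-step and M-step above can be sketched in plain NumPy. The following is a minimal sketch, not Edward's implementation: it swaps the gradient-based updates for the closed-form EM updates that happen to be available for this unit-variance mixture. The synthetic data, the smaller N and K, and the helper names e_step, m_step, and log_marginal are assumptions made for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
N, D, K = 500, 2, 3  # smaller than in Figure 5, to keep the sketch fast

# Synthetic data from well-separated clusters (hypothetical, for illustration only).
true_means = np.array([[-4.0, 0.0], [0.0, 4.0], [4.0, 0.0]])
x_train = true_means[rng.randint(K, size=N)] + rng.randn(N, D)

def sq_dist(x, beta):
    # Squared Euclidean distance between every data point and every cluster mean.
    return ((x[:, None, :] - beta[None, :, :]) ** 2).sum(axis=-1)

def e_step(x, beta):
    # q(z_n = k) is proportional to exp(-0.5 * ||x_n - beta_k||^2) under a uniform prior on z.
    log_r = -0.5 * sq_dist(x, beta)
    log_r -= log_r.max(axis=1, keepdims=True)  # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

def m_step(x, r):
    # MAP estimate of each beta_k under a Normal(0, 1) prior and unit-variance likelihood.
    return (r.T @ x) / (1.0 + r.sum(axis=0))[:, None]

def log_marginal(x, beta):
    # log p(x, beta) with z summed out, up to an additive constant.
    a = -0.5 * sq_dist(x, beta)
    m = a.max(axis=1, keepdims=True)
    log_lik = (m[:, 0] + np.log(np.exp(a - m).sum(axis=1))).sum()
    return log_lik - 0.5 * (beta ** 2).sum()

beta = rng.randn(K, D)
history = []
for _ in range(20):
    r = e_step(x_train, beta)      # update q(z) with beta held fixed
    beta = m_step(x_train, r)      # update the point estimate of beta with q(z) held fixed
    history.append(log_marginal(x_train, beta))
```

Each pass holds one set of variables fixed while updating the other, which is the same division of labor as inference_e and inference_m; with exact updates the tracked objective in history should never decrease.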

Figure *. Data subsampling.

import edward as ed
import tensorflow as tf
from edward.models import Categorical, Normal

N = 10000  # number of data points
M = 128  # batch size during training
D = 2  # data dimension
K = 5  # number of clusters

# DATA
x_batch = tf.placeholder(tf.float32, [M, D])

# MODEL
beta = Normal(mu=tf.zeros([K, D]), sigma=tf.ones([K, D]))
z = Categorical(logits=tf.zeros([M, K]))
x = Normal(mu=tf.gather(beta, z), sigma=tf.ones([M, D]))

# INFERENCE
qbeta = Normal(mu=tf.Variable(tf.zeros([K, D])),
               sigma=tf.nn.softplus(tf.Variable(tf.zeros([K, D]))))
qz = Categorical(logits=tf.Variable(tf.zeros([M, K])))

inference = ed.VariationalInference({beta: qbeta, z: qz}, data={x: x_batch})
inference.initialize(scale={x: float(N) / M, z: float(N) / M})

For more details, see the data subsampling webpage.
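
The scale argument compensates for seeing only M of the N data points at a time. As a hedged illustration of why N / M is the natural factor, the NumPy sketch below checks numerically that a minibatch log-likelihood scaled by N / M matches the full-data log-likelihood on average. The fixed mean mu, the number of repetitions, and the helper name log_lik are assumptions for this sketch; it does not reproduce Edward's internals.

```python
import numpy as np

rng = np.random.RandomState(0)
N, M, D = 10000, 128, 2

x_full = rng.randn(N, D)
mu = np.zeros(D)  # a fixed parameter value at which to evaluate the likelihood

def log_lik(x):
    # Sum of unit-variance Gaussian log-densities over the rows of x.
    per_point = -0.5 * ((x - mu) ** 2).sum(axis=1) - 0.5 * D * np.log(2 * np.pi)
    return per_point.sum()

full = log_lik(x_full)

# Scale each minibatch log-likelihood by N / M, mirroring scale={x: float(N) / M}.
estimates = []
for _ in range(2000):
    idx = rng.choice(N, size=M, replace=False)
    estimates.append(float(N) / M * log_lik(x_full[idx]))

relative_error = abs(np.mean(estimates) - full) / abs(full)
```

Any single minibatch estimate is noisy, but the average over many minibatches should land close to the full-data value, which is what makes the scaled stochastic gradients usable.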

Section 5. Experiments

Figure 9. Bayesian logistic regression with Hamiltonian Monte Carlo.

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Bernoulli, Empirical, Normal

N = 581012  # number of data points
D = 54  # number of features
T = 100  # number of empirical samples

# DATA
x_data = np.zeros([N, D], dtype=np.float32)  # float32 to match beta's dtype
y_data = np.zeros([N])

# MODEL
x = tf.Variable(x_data, trainable=False)
beta = Normal(mu=tf.zeros(D), sigma=tf.ones(D))
y = Bernoulli(logits=ed.dot(x, beta))

# INFERENCE
qbeta = Empirical(params=tf.Variable(tf.zeros([T, D])))
inference = ed.HMC({beta: qbeta}, data={y: y_data})
inference.run(step_size=0.5 / N, n_steps=10)

For an example of it in use, see examples/bayesian_logistic_regression.py in the GitHub repository.
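
For readers who want to see what the step_size and n_steps arguments control, the following is a minimal NumPy sketch of Hamiltonian Monte Carlo with a leapfrog integrator on the same model. It is a textbook version under stated assumptions, not the code behind ed.HMC: the data are synthetic, N and D are far smaller than in the experiment, and the names log_joint, grad_log_joint, hmc_step, and true_beta are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
N, D = 200, 3  # far smaller than the experiment above, to keep the sketch fast

# Synthetic data (hypothetical ground truth, for illustration only).
x_data = rng.randn(N, D)
true_beta = np.array([1.0, -2.0, 0.5])
y_data = (rng.rand(N) < 1.0 / (1.0 + np.exp(-(x_data @ true_beta)))).astype(float)

def log_joint(beta):
    # log p(y | x, beta) + log p(beta) with a Normal(0, 1) prior, up to a constant.
    logits = x_data @ beta
    return (y_data * logits - np.logaddexp(0.0, logits)).sum() - 0.5 * beta @ beta

def grad_log_joint(beta):
    # Gradient of log_joint with respect to beta.
    probs = 1.0 / (1.0 + np.exp(-(x_data @ beta)))
    return x_data.T @ (y_data - probs) - beta

def hmc_step(beta, step_size, n_steps):
    # Simulate Hamiltonian dynamics with the leapfrog integrator, then accept or reject.
    p0 = rng.randn(D)
    b, p = beta.copy(), p0.copy()
    p = p + 0.5 * step_size * grad_log_joint(b)      # initial half step for momentum
    for i in range(n_steps):
        b = b + step_size * p                        # full step for position
        if i < n_steps - 1:
            p = p + step_size * grad_log_joint(b)    # full step for momentum
    p = p + 0.5 * step_size * grad_log_joint(b)      # final half step for momentum
    log_accept = (log_joint(b) - 0.5 * p @ p) - (log_joint(beta) - 0.5 * p0 @ p0)
    return b if np.log(rng.rand()) < log_accept else beta

T = 500  # number of samples to keep
samples = np.zeros([T, D])
beta = np.zeros(D)
for t in range(T):
    beta = hmc_step(beta, step_size=0.5 / N, n_steps=10)
    samples[t] = beta
```

The samples array plays the role of the Empirical approximation's params: one row per retained draw. Scaling the step size by 1 / N, as in the experiment, keeps the integrator stable as the gradient of the log-likelihood grows with the number of data points.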

Appendix A. Model Examples

Figure 10. Bayesian neural network for classification (Denker, Schwartz, Wittner, & Solla, 1987).

import tensorflow as tf
from edward.models import Bernoulli, Normal

N = 1000  # number of data points
D = 528  # number of features
H = 256  # hidden layer size

W_0 = Normal(mu=tf.zeros([D, H]), sigma=tf.ones([D, H]))
W_1 = Normal(mu=tf.zeros([H, 1]), sigma=tf.ones([H, 1]))
b_0 = Normal(mu=tf.zeros(H), sigma=tf.ones(H))
b_1 = Normal(mu=tf.zeros(1), sigma=tf.ones(1))

x = tf.placeholder(tf.float32, [N, D])
y = Bernoulli(logits=tf.matmul(tf.nn.tanh(tf.matmul(x, W_0) + b_0), W_1) + b_1)

For an example of it in use, see examples/getting_started_example.py in the GitHub repository.

Figure 11. Latent Dirichlet allocation (Blei, Ng, & Jordan, 2003).

import tensorflow as tf
from edward.models import Categorical, Dirichlet

D = 4  # number of documents
N = [11502, 213, 1523, 1351]  # words per doc
K = 10  # number of topics
V = 100000  # vocabulary size

theta = Dirichlet(alpha=tf.zeros([D, K]) + 0.1)
phi = Dirichlet(alpha=tf.zeros([K, V]) + 0.05)
z = [[0] * N[d] for d in range(D)]  # one independent list per document
w = [[0] * N[d] for d in range(D)]
for d in range(D):
  for n in range(N[d]):
    z[d][n] = Categorical(pi=theta[d, :])
    w[d][n] = Categorical(pi=phi[z[d][n], :])
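
The generative process can also be written eagerly. The NumPy sketch below samples topic assignments and words for every document, using ragged containers with one independent list per document since the documents differ in length. The document lengths and vocabulary size are shrunk, and the seed is fixed; these are assumptions for this sketch only.

```python
import numpy as np

rng = np.random.RandomState(0)
D = 4                # number of documents
N = [12, 5, 8, 7]    # words per document (shrunk for this sketch)
K = 10               # number of topics
V = 50               # vocabulary size (shrunk for this sketch)

theta = rng.dirichlet(np.full(K, 0.1), size=D)  # per-document topic proportions, [D, K]
phi = rng.dirichlet(np.full(V, 0.05), size=K)   # per-topic word distributions, [K, V]

# Ragged containers: a separate list for each document.
z = [[0] * N[d] for d in range(D)]
w = [[0] * N[d] for d in range(D)]
for d in range(D):
    for n in range(N[d]):
        z[d][n] = rng.choice(K, p=theta[d])      # topic of word n in document d
        w[d][n] = rng.choice(V, p=phi[z[d][n]])  # word drawn from that topic
```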

Figure 12. Gaussian matrix factorization (Salakhutdinov & Mnih, 2011).

import tensorflow as tf
from edward.models import Normal

N = 10
M = 10
K = 5  # latent dimension

U = Normal(mu=tf.zeros([M, K]), sigma=tf.ones([M, K]))
V = Normal(mu=tf.zeros([N, K]), sigma=tf.ones([N, K]))
Y = Normal(mu=tf.matmul(U, V, transpose_b=True), sigma=tf.ones([M, N]))
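
A quick shape check may help here. In the NumPy sketch below, M is deliberately set to a different value than N (an assumption for this sketch) so that the two axes can be told apart: with U of shape [M, K] and V of shape [N, K], the product of U and the transpose of V has shape [M, N], and the noise must match it.

```python
import numpy as np

rng = np.random.RandomState(0)
N, M, K = 10, 8, 5  # M differs from N here only to make the axes distinguishable

U = rng.randn(M, K)          # one K-dimensional latent factor per row of Y
V = rng.randn(N, K)          # one K-dimensional latent factor per column of Y
mean = U @ V.T               # shape [M, N]
Y = mean + rng.randn(M, N)   # Y ~ Normal(U V^T, 1)
```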

Figure 13. Dirichlet process mixture model (Antoniak, 1974).

import tensorflow as tf
from edward.models import DirichletProcess, Normal

N = 1000  # number of data points
D = 5  # data dimensionality

dp = DirichletProcess(alpha=1.0, base=Normal(mu=tf.zeros(D), sigma=tf.ones(D)))
mu = dp.sample(N)
x = Normal(mu=mu, sigma=tf.ones([N, D]))

For the essential component that defines DirichletProcess, see examples/pp_dirichlet_process.py in the GitHub repository. Its source implementation can be found at edward/models/dirichlet_process.py.
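
One standard way to draw from a Dirichlet process is the stick-breaking construction. The NumPy sketch below is a truncated version of it, offered as an illustration under stated assumptions rather than as a description of Edward's source: the truncation level is hypothetical (the true process has infinitely many atoms), and the names v, pi, atoms, and k are chosen for this sketch.

```python
import numpy as np

rng = np.random.RandomState(0)
N, D = 1000, 5
alpha = 1.0
truncation = 100  # hypothetical truncation level for this sketch

# Stick-breaking weights: v_k ~ Beta(1, alpha), pi_k = v_k * prod_{j<k} (1 - v_j).
v = rng.beta(1.0, alpha, size=truncation)
remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
pi = v * remaining
pi /= pi.sum()  # renormalize the small mass lost to truncation

atoms = rng.randn(truncation, D)          # atom locations drawn from the base Normal(0, 1)
k = rng.choice(truncation, size=N, p=pi)  # which atom each data point picks
mu = atoms[k]                             # plays the role of dp.sample(N), shape [N, D]
x = mu + rng.randn(N, D)                  # x ~ Normal(mu, 1)
```

Because the weights decay quickly, many data points share the same atom, which is what induces clustering in the mixture model.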

Appendix B. Inference Examples

Figure *. Stochastic variational inference (Hoffman, Blei, Wang, & Paisley, 2013). For more details, see the data subsampling webpage.

Appendix C. Complete Examples

Figure 15. Variational auto-encoder (Kingma & Welling, 2014; Rezende et al., 2014). See the script examples/vae.py in the GitHub repository.

Figure 16. Exponential family embedding (Rudolph, Ruiz, Mandt, & Blei, 2016). A GitHub repository with comprehensive features is available at mariru/exponential_family_embeddings.

References

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 1152–1174.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Denker, J., Schwartz, D., Wittner, B., & Solla, S. (1987). Large automatic learning, rule extraction, and generalization. Complex Systems.

Gelman, A., & Hill, J. L. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. In Neural information processing systems.

Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1), 1303–1347.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Kingma, D., & Welling, M. (2014). Auto-encoding variational Bayes. In International conference on learning representations.

Neal, R. M. (2012). Bayesian learning for neural networks (Vol. 118). Springer Science & Business Media.

Neal, R. M., & Hinton, G. E. (1993). A new view of the EM algorithm that justifies incremental and other variants. In Learning in graphical models (pp. 355–368).

Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML (pp. 1278–1286).

Robert, C. P., & Casella, G. (1999). Monte Carlo statistical methods. Springer.

Rudolph, M. R., Ruiz, F. J. R., Mandt, S., & Blei, D. M. (2016). Exponential family embeddings. In Neural information processing systems.

Salakhutdinov, R., & Mnih, A. (2011). Probabilistic matrix factorization. In Neural information processing systems.

Tran, D., Hoffman, M. D., Saurous, R. A., Brevdo, E., Murphy, K., & Blei, D. M. (2017). Deep probabilistic programming. In International conference on learning representations.