Inference falls into three broad classes: variational inference, Monte Carlo, and exact inference. We highlight how to use inference algorithms from each class.
As an example, we assume a mixture model with latent mixture assignments \(\mathbf{z}\), latent cluster means \(\beta\), and observations \(\mathbf{x}\): \[p(\mathbf{x}, \mathbf{z}, \beta) = \text{Normal}(\mathbf{x} \mid \beta_{\mathbf{z}}, \mathbf{I}) ~ \text{Categorical}(\mathbf{z}\mid \pi) ~ \text{Normal}(\beta\mid \mathbf{0}, \mathbf{I}).\]
In variational inference, the idea is to posit a family of approximating distributions and to find the closest member in the family to the posterior (Jordan, Ghahramani, Jaakkola, & Saul, 1999). We write an approximating family, \[\begin{aligned} q(\beta;\mu,\sigma) &= \text{Normal}(\beta; \mu,\sigma), \\[1.5ex] q(\mathbf{z};\pi) &= \text{Categorical}(\mathbf{z};\pi),\end{aligned}\] using TensorFlow variables to represent its parameters \(\lambda=\{\pi,\mu,\sigma\}\).
from edward.models import Categorical, Normal
qbeta = Normal(mu=tf.Variable(tf.zeros([K, D])),
               sigma=tf.exp(tf.Variable(tf.zeros([K, D]))))
qz = Categorical(logits=tf.Variable(tf.zeros([N, K])))
inference = ed.VariationalInference({beta: qbeta, z: qz}, data={x: x_train})
Given an objective function, variational inference optimizes the family with respect to tf.Variables.
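The optimization this describes can be sketched in plain Python (not Edward code; the names kl_normal and fit are hypothetical), fitting a univariate Normal approximation q to a Normal target by gradient descent on an analytic KL objective:

```python
import math

def kl_normal(m, s, m0, s0):
    """Analytic KL(N(m, s^2) || N(m0, s0^2)) for univariate Normals."""
    return math.log(s0 / s) + (s**2 + (m - m0)**2) / (2 * s0**2) - 0.5

def fit(m0=3.0, s0=0.5, lr=0.1, steps=500):
    """Gradient descent on the variational parameters (m, log s)."""
    m, log_s = 0.0, 0.0  # q starts at N(0, 1)
    for _ in range(steps):
        s = math.exp(log_s)
        grad_m = (m - m0) / s0**2         # dKL/dm
        grad_log_s = -1.0 + s**2 / s0**2  # dKL/d(log s), via chain rule
        m -= lr * grad_m
        log_s -= lr * grad_log_s
    return m, math.exp(log_s)
```

Here the family is optimized in parameter space exactly as Edward does with tf.Variables, except that the gradients are written by hand instead of derived automatically.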
Specific variational inference algorithms inherit from the VariationalInference class to define their own methods, such as a loss function and gradient. For example, we represent MAP estimation with an approximating family (qbeta and qz) of PointMass random variables, i.e., with all probability mass concentrated at a point.
from edward.models import PointMass
qbeta = PointMass(params=tf.Variable(tf.zeros([K, D])))
qz = PointMass(params=tf.Variable(tf.zeros(N)))
inference = ed.MAP({beta: qbeta, z: qz}, data={x: x_train})
MAP inherits from VariationalInference and defines a loss function and update rules; it uses existing optimizers inside TensorFlow.
Monte Carlo approximates the posterior using samples (Robert & Casella, 1999). It can be viewed as inference where the approximating family is an empirical distribution, \[\begin{aligned} q(\beta; \{\beta^{(t)}\}) &= \frac{1}{T}\sum_{t=1}^T \delta(\beta, \beta^{(t)}), \\[1.5ex] q(\mathbf{z}; \{\mathbf{z}^{(t)}\}) &= \frac{1}{T}\sum_{t=1}^T \delta(\mathbf{z}, \mathbf{z}^{(t)}).\end{aligned}\] The parameters are \(\lambda=\{\beta^{(t)},\mathbf{z}^{(t)}\}\).
from edward.models import Empirical
T = 10000 # number of samples
qbeta = Empirical(params=tf.Variable(tf.zeros([T, K, D])))
qz = Empirical(params=tf.Variable(tf.zeros([T, N])))
inference = ed.MonteCarlo({beta: qbeta, z: qz}, data={x: x_train})
Monte Carlo algorithms proceed by updating one sample \(\beta^{(t)},\mathbf{z}^{(t)}\) at a time in the empirical approximation. Markov chain Monte Carlo does this sequentially, updating the current sample (index \(t\) of the tf.Variables) conditional on the last sample (index \(t-1\) of the tf.Variables). Specific Monte Carlo samplers determine the update rules; they can use gradients, as in Hamiltonian Monte Carlo (Neal, 2011), and graph structure, as in sequential Monte Carlo (Doucet, De Freitas, & Gordon, 2001).
As a library for probabilistic modeling (not necessarily Bayesian modeling), Edward is agnostic to the paradigm for inference. This means Edward can use frequentist (population-based) inferences, strictly point estimation, and alternative foundations for parameter uncertainty.
For example, Edward supports non-Bayesian methods such as generative adversarial networks (GANs) (Goodfellow et al., 2014). For more details, see the GAN tutorial.
In general, we think opening the door to non-Bayesian approaches is a crucial feature for probabilistic programming. It enables advances in other fields such as deep learning to be complementary: all are in service of probabilistic models, so it makes sense to combine efforts.
This approach also extends to algorithms that usually require tedious algebraic manipulation. With symbolic algebra on the nodes of the computational graph, we can uncover conjugacy relationships between random variables. Users can then integrate out variables to automatically derive classical Gibbs (Gelfand & Smith, 1990), mean-field updates (Bishop, 2006), and exact inference.
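As a minimal illustration of the kind of conjugacy relationship involved (a hand-written sketch, not Edward's symbolic algebra), the Beta prior is conjugate to Bernoulli observations, so the exact posterior update reduces to closed-form bookkeeping:

```python
def beta_bernoulli_posterior(a, b, data):
    """Conjugate update: Beta(a, b) prior + Bernoulli observations
    yields a Beta(a + #successes, b + #failures) posterior exactly."""
    successes = sum(data)
    return a + successes, b + len(data) - successes
```

This is the sort of update a derived Gibbs sampler or exact-inference routine applies per variable, once the conjugate pair has been detected in the graph.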
The classes below inherit methods from base inference classes; see the development page for more details.
edward.inferences.VariationalInference(*args, **kwargs)[source]
Abstract base class for variational inference. Specific variational inference methods inherit from VariationalInference, sharing methods such as a default optimizer.
To build an algorithm inheriting from VariationalInference, one must at the minimum implement build_loss_and_gradients: it determines the loss function and gradients to apply for a given optimizer.
Methods
initialize(optimizer=None, var_list=None, use_prettytensor=False, *args, **kwargs)[source]
Initialize variational inference.
Parameters: 

optimizer : str or tf.train.Optimizer, optional
var_list : list of tf.Variable, optional
use_prettytensor : bool, optional

update(feed_dict=None)[source]
Run one iteration of optimizer for variational inference.
Parameters: 

feed_dict : dict, optional

Returns: 
dict

print_progress(info_dict)[source]
Print progress to output.
build_loss_and_gradients(var_list)[source]
Build loss function and its gradients. They will be leveraged in an optimizer to update the model and variational parameters.
Any derived class of VariationalInference must implement this method.
Raises: 

NotImplementedError 
edward.inferences.KLqp(*args, **kwargs)[source]
Variational inference with the KL divergence \(\text{KL}( q(z; \lambda) \,\|\, p(z \mid x) )\).
This class minimizes the objective by automatically selecting from a variety of black box inference techniques.
Notes
KLqp also optimizes any model parameters \(p(z \mid x; \theta)\). It does this by variational EM, minimizing the objective with respect to \(\theta\).
In conditional inference, we infer \(z\) in \(p(z, \beta \mid x)\) while fixing inference over \(\beta\) using another distribution \(q(\beta)\). During gradient calculation, instead of using the model’s density \(\log p(x, z^{(s)})\) for each sample \(s=1,\ldots,S\), KLqp uses \(\log p(x, z^{(s)}, \beta^{(s)})\), where \(z^{(s)} \sim q(z; \lambda)\) and \(\beta^{(s)} \sim q(\beta)\).
Methods
initialize(n_samples=1, kl_scaling=None, *args, **kwargs)[source]
Initialization.
Parameters: 

n_samples : int, optional
kl_scaling : dict of RandomVariable to float, optional

build_loss_and_gradients(var_list)[source]
Wrapper for the KLqp loss function. KLqp supports both score function gradients and reparameterization gradients of the loss function.
If the KL divergence between the variational model and the prior is tractable, then the loss function can be written as \(-\mathbb{E}_{q(z; \lambda)}[\log p(x \mid z)] + \text{KL}( q(z; \lambda) \,\|\, p(z) )\), where the KL term is computed analytically (Kingma and Welling, 2014). We compute this automatically when \(p(z)\) and \(q(z; \lambda)\) are Normal.
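For intuition, the univariate case of this analytic KL term can be written out directly (a plain-Python sketch with a hypothetical name; Edward computes the multivariate/diagonal analogue internally):

```python
import math

def kl_to_std_normal(m, s):
    """Analytic KL(N(m, s^2) || N(0, 1)), the closed-form term used when
    p(z) is a standard Normal prior and q(z; lambda) = N(m, s^2).
    For diagonal Normals, this term is summed over dimensions."""
    return -math.log(s) + (s**2 + m**2 - 1.0) / 2.0
```

Note the term vanishes exactly when q matches the prior (m = 0, s = 1), as a KL divergence should.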
edward.inferences.ReparameterizationKLqp(*args, **kwargs)[source]
Variational inference with the KL divergence \(\text{KL}( q(z; \lambda) \,\|\, p(z \mid x) )\). This class minimizes the objective using the reparameterization gradient.
edward.inferences.ReparameterizationKLKLqp(*args, **kwargs)[source]
Variational inference with the KL divergence \(\text{KL}( q(z; \lambda) \,\|\, p(z \mid x) )\). This class minimizes the objective using the reparameterization gradient and an analytic KL term.
edward.inferences.ReparameterizationEntropyKLqp(*args, **kwargs)[source]
Variational inference with the KL divergence \(\text{KL}( q(z; \lambda) \,\|\, p(z \mid x) )\). This class minimizes the objective using the reparameterization gradient and an analytic entropy term.
edward.inferences.ScoreKLqp(*args, **kwargs)[source]
Variational inference with the KL divergence \(\text{KL}( q(z; \lambda) \,\|\, p(z \mid x) )\). This class minimizes the objective using the score function gradient.
edward.inferences.ScoreKLKLqp(*args, **kwargs)[source]
Variational inference with the KL divergence \(\text{KL}( q(z; \lambda) \,\|\, p(z \mid x) )\). This class minimizes the objective using the score function gradient and an analytic KL term.
edward.inferences.ScoreEntropyKLqp(*args, **kwargs)[source]
Variational inference with the KL divergence \(\text{KL}( q(z; \lambda) \,\|\, p(z \mid x) )\). This class minimizes the objective using the score function gradient and an analytic entropy term.
edward.inferences.GANInference(data, discriminator)[source]
Parameter estimation with GAN-style training (Goodfellow et al., 2014).
Works for the class of implicit (and differentiable) probabilistic models. These models do not require a tractable density and assume only a program that generates samples.
Methods
Parameters: 

data : dict
discriminator : function

Notes
GANInference does not support latent variable inference. Note that GAN-style training also samples from the prior: this does not work well for latent variables that are shared across many data points (global variables).
In building the computation graph for inference, the discriminator’s parameters can be accessed with the variable scope “Disc”.
GANs also only work for one observed random variable in data.
Examples
z = Normal(mu=tf.zeros([100, 10]), sigma=tf.ones([100, 10]))
x = generative_network(z)
inference = ed.GANInference({x: x_data}, discriminator)
Methods
initialize(optimizer=None, optimizer_d=None, global_step=None, global_step_d=None, var_list=None, *args, **kwargs)[source]
Initialize inference.
Parameters: 

optimizer : str or tf.train.Optimizer, optional
optimizer_d : str or tf.train.Optimizer, optional
global_step : tf.Variable, optional
global_step_d : tf.Variable, optional
var_list : list of tf.Variable, optional

update(feed_dict=None, variables=None)[source]
Run one iteration of optimization.
Parameters: 

feed_dict : dict, optional
variables : str, optional

Returns: 
dict

Notes
The outputted iteration number is the total number of calls to update. Each update may include updating only a subset of parameters.
print_progress(info_dict)[source]
Print progress to output.
edward.inferences.WGANInference(*args, **kwargs)[source]
Parameter estimation with GAN-style training (Goodfellow et al., 2014), using the Wasserstein distance (Arjovsky et al., 2017).
Works for the class of implicit (and differentiable) probabilistic models. These models do not require a tractable density and assume only a program that generates samples.
Methods
Notes
Argument-wise, the only difference from GANInference is conceptual: the discriminator is better described as a test function or critic. WGANInference continues to use discriminator only to share methods and attributes with GANInference.
Examples
z = Normal(mu=tf.zeros([100, 10]), sigma=tf.ones([100, 10]))
x = generative_network(z)
inference = ed.WGANInference({x: x_data}, discriminator)
edward.inferences.ImplicitKLqp(latent_vars, data=None, discriminator=None, global_vars=None)[source]
Variational inference with implicit probabilistic models (Tran et al., 2017).
It minimizes the KL divergence \(\text{KL}( q(z, \beta; \lambda) \,\|\, p(z, \beta \mid x) )\), where \(z\) are local variables associated to a data point and \(\beta\) are global variables shared across data points.
Global latent variables require log_prob() and need to return a random sample when fetched from the graph. Local latent variables and observed variables require only a random sample when fetched from the graph. (This is true for both \(p\) and \(q\).)
All variational factors must be reparameterizable: each of the random variables (rv) satisfies rv.is_reparameterized and rv.is_continuous.
Methods
Parameters: 

discriminator : function
global_vars : dict of RandomVariable to RandomVariable, optional

Notes
Unlike GANInference, discriminator takes dicts as input, and must subset to the appropriate values through lexical scoping from the previously defined model and latent variables. This is necessary as the discriminator can take an arbitrary set of data, latent, and global variables.
Note the type for discriminator’s output changes when one passes in the scale argument to initialize().
+ If scale has at most one item, then discriminator outputs a tensor whose multiplication with that element is broadcastable. (For example, the output is a tensor and the single scale factor is a scalar.)
+ If scale has more than one item, then in order to scale its corresponding output, discriminator must output a dictionary of the same size and with the same keys as scale.
Methods
initialize(ratio_loss='log', *args, **kwargs)[source]
Initialization.
Parameters: 

ratio_loss : str or fn, optional

build_loss_and_gradients(var_list)[source]
Build loss function. We minimize it with respect to parameterized variational families \(q(z, \beta; \lambda)\).
\(r^*(x_n, z_n, \beta)\) is a function of a single data point \(x_n\), a single local variable \(z_n\), and all global variables \(\beta\). It is equal to a log-ratio in which \(q(x_n)\) is the empirical data distribution. Rather than being calculated explicitly, \(r^*(x, z, \beta)\) is the solution to a ratio estimation problem, minimizing the specified ratio_loss.
Gradients are taken using the reparameterization trick (Kingma and Welling, 2014).
Notes
This also includes model parameters \(p(x, z, \beta; \theta)\) and variational distributions with inference networks \(q(z\mid x)\).
There are several extensions we could implement here, such as a copy() utility function for the q’s and an additional loop; we opt not to because it complicates the code.
edward.inferences.KLpq(*args, **kwargs)[source]
Variational inference with the KL divergence \(\text{KL}( p(z \mid x) \,\|\, q(z; \lambda) )\).
To perform the optimization, this class uses a technique from adaptive importance sampling (Cappe et al., 2008).
Notes
KLpq also optimizes any model parameters \(p(z \mid x; \theta)\). It does this by variational EM, minimizing the objective with respect to \(\theta\).
In conditional inference, we infer \(z\) in \(p(z, \beta \mid x)\) while fixing inference over \(\beta\) using another distribution \(q(\beta)\). During gradient calculation, instead of using the model’s density \(\log p(x, z^{(s)})\) for each sample \(s=1,\ldots,S\), KLpq uses \(\log p(x, z^{(s)}, \beta^{(s)})\), where \(z^{(s)} \sim q(z; \lambda)\) and \(\beta^{(s)} \sim q(\beta)\).
Methods
initialize(n_samples=1, *args, **kwargs)[source]
Initialization.
Parameters: 

n_samples : int, optional

build_loss_and_gradients(var_list)[source]
Build loss function and stochastic gradients based on importance sampling.
The loss function is estimated with samples \(z^s \sim q(z; \lambda)\), \(s=1,\ldots,S\), using self-normalized importance weights: each weight \(w(z^s; \lambda) = p(x, z^s) / q(z^s; \lambda)\) is divided by the sum of the weights. This provides a stochastic gradient.
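The self-normalized importance weighting used here can be sketched in plain Python (hypothetical names, not Edward's implementation), estimating an expectation under an unnormalized target using weights \(w(z^s) = p(x, z^s)/q(z^s)\) normalized to sum to one:

```python
import math
import random

def snis_mean(log_p, q_sample, q_logpdf, n=20000, seed=0):
    """Estimate E_p[z] by self-normalized importance sampling:
    draw z^s ~ q, weight by p(z^s)/q(z^s), normalize the weights."""
    rng = random.Random(seed)
    zs = [q_sample(rng) for _ in range(n)]
    logw = [log_p(z) - q_logpdf(z) for z in zs]
    mx = max(logw)
    w = [math.exp(lw - mx) for lw in logw]  # stabilize before normalizing
    total = sum(w)
    return sum(wi * zi for wi, zi in zip(w, zs)) / total

# sketch: unnormalized target N(1, 1), proposal q = N(0, 2)
log_p = lambda z: -0.5 * (z - 1.0) ** 2
q_sample = lambda rng: rng.gauss(0.0, 2.0)
q_logpdf = lambda z: -0.5 * (z / 2.0) ** 2 - math.log(2.0 * math.sqrt(2.0 * math.pi))
```

Because the weights are normalized, the target density only needs to be known up to a constant, which is exactly why the unnormalized joint \(p(x, z^s)\) suffices.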
edward.inferences.MAP(latent_vars=None, data=None)[source]
Maximum a posteriori.
This class implements gradient-based optimization to solve the optimization problem \(\min_z -\log p(z \mid x)\). This is equivalent to using a PointMass variational distribution and minimizing the unnormalized objective \(-\log p(x, z)\).
Notes
This class is currently restricted to optimization over differentiable latent variables. For example, it does not solve discrete optimization.
This class also minimizes the loss with respect to any model parameters \(p(z \mid x; \theta)\).
In conditional inference, we infer \(z\) in \(p(z, \beta \mid x)\) while fixing inference over \(\beta\) using another distribution \(q(\beta)\). MAP optimizes \(\mathbb{E}_{q(\beta)} [ \log p(x, z, \beta) ]\), leveraging a single Monte Carlo sample, \(\log p(x, z, \beta^*)\), where \(\beta^* \sim q(\beta)\). This is a lower bound to the marginal density \(\log p(x, z)\), and it is exact if \(q(\beta) = p(\beta \mid x)\) (up to stochasticity).
Methods
Parameters: 

latent_vars : list of RandomVariable or dict of RandomVariable to RandomVariable, optional

Examples
Most explicitly, MAP is specified via a dictionary:
qpi = PointMass(params=ed.to_simplex(tf.Variable(tf.zeros(K-1))))
qmu = PointMass(params=tf.Variable(tf.zeros(K*D)))
qsigma = PointMass(params=tf.nn.softplus(tf.Variable(tf.zeros(K*D))))
ed.MAP({pi: qpi, mu: qmu, sigma: qsigma}, data)
We also automate the specification of PointMass distributions, so one can pass in a list of latent variables instead:
ed.MAP([beta], data)
ed.MAP([pi, mu, sigma], data)
Currently, MAP can only instantiate PointMass random variables with unconstrained support. To constrain their support, one must manually pass in the PointMass family.
Methods
build_loss_and_gradients(var_list)[source]
Build loss function. Its automatic differentiation is the gradient of \(-\log p(x, z)\).
edward.inferences.Laplace(latent_vars, data=None)[source]
Laplace approximation (Laplace, 1774).
It approximates the posterior distribution using a multivariate normal distribution centered at the mode of the posterior.
It approximates the posterior distribution using a multivariate normal distribution centered at the mode of the posterior.
We implement this by running MAP to find the posterior mode. This forms the mean of the normal approximation. We then compute the inverse Hessian at the mode of the posterior. This forms the covariance of the normal approximation.
Methods
Parameters: 

latent_vars : list of RandomVariable or dict of RandomVariable to RandomVariable

Notes
If MultivariateNormalDiag random variables are specified as approximations, then the Laplace approximation will only produce the diagonal. This does not capture correlation among the variables, but it also does not require a potentially expensive matrix inversion.
Examples
X = tf.placeholder(tf.float32, [N, D])
w = Normal(mu=tf.zeros(D), sigma=tf.ones(D))
y = Normal(mu=ed.dot(X, w), sigma=tf.ones(N))
qw = MultivariateNormalFull(mu=tf.Variable(tf.random_normal([D])),
                            sigma=tf.Variable(tf.random_normal([D, D])))
inference = ed.Laplace({w: qw}, data={X: X_train, y: y_train})
Methods
finalize(feed_dict=None)[source]
Function to call after convergence. Computes the Hessian at the mode.
Parameters: 

feed_dict : dict, optional

edward.inferences.MonteCarlo(latent_vars=None, data=None)[source]
Abstract base class for Monte Carlo. Specific Monte Carlo methods inherit from MonteCarlo, sharing methods in this class.
To build an algorithm inheriting from MonteCarlo, one must at the minimum implement build_update: it determines how to assign the samples in the Empirical approximations.
Methods
Initialization.
Parameters: 

latent_vars : list or dict, optional
data : dict, optional

Notes
The number of Monte Carlo iterations is set according to the minimum of all Empirical sizes.
Initialization is assumed from params[0, :]. This generalizes initializing randomly and initializing from user input. Updates are along this outer dimension, where iteration \(t\) updates params[t, :] in each Empirical random variable.
No warm-up is implemented. Users must run MCMC for a long period of time, then manually burn in the Empirical random variable.
Examples
Most explicitly, MonteCarlo is specified via a dictionary:
qpi = Empirical(params=tf.Variable(tf.zeros([T, K-1])))
qmu = Empirical(params=tf.Variable(tf.zeros([T, K*D])))
qsigma = Empirical(params=tf.Variable(tf.zeros([T, K*D])))
ed.MonteCarlo({pi: qpi, mu: qmu, sigma: qsigma}, data)
The inferred posterior is comprised of Empirical random variables with T samples. We also automate the specification of Empirical random variables. One can pass in a list of latent variables instead:
ed.MonteCarlo([beta], data)
ed.MonteCarlo([pi, mu, sigma], data)
It defaults to Empirical random variables with 10,000 samples for each dimension.
Methods
update(feed_dict=None)[source]
Run one iteration of sampling for Monte Carlo.
Parameters: 

feed_dict : dict, optional

Returns: 
dict

Notes
We run the increment of t separately from the other ops. Whether the other ops run with t before or after incrementing depends on which runs faster in the TensorFlow graph; running the increment separately forces a consistent behavior.
print_progress(info_dict)[source]
Print progress to output.
build_update()[source]
Build update rules, returning an assign op for parameters in the Empirical random variables.
Any derived class of MonteCarlo must implement this method.
Raises: 

NotImplementedError 
edward.inferences.MetropolisHastings(latent_vars, proposal_vars, data=None)[source]
Metropolis-Hastings (Metropolis et al., 1953; Hastings, 1970).
Notes
In conditional inference, we infer \(z\) in \(p(z, \beta \mid x)\) while fixing inference over \(\beta\) using another distribution \(q(\beta)\).
To calculate the acceptance ratio, MetropolisHastings uses an estimate of the marginal density, \(p(x, z) \approx p(x, z, \beta^*)\), leveraging a single Monte Carlo sample, where \(\beta^* \sim q(\beta)\). This is unbiased (and therefore asymptotically exact as a pseudo-marginal method) if \(q(\beta) = p(\beta \mid x)\).
Methods
Parameters: 

proposal_vars : dict of RandomVariable to RandomVariable

Examples
z = Normal(mu=0.0, sigma=1.0)
x = Normal(mu=tf.ones(10) * z, sigma=1.0)
qz = Empirical(tf.Variable(tf.zeros(500)))
proposal_z = Normal(mu=z, sigma=0.5)
data = {x: np.array([0.0] * 10, dtype=np.float32)}
inference = ed.MetropolisHastings({z: qz}, {z: proposal_z}, data)
Methods
build_update()[source]
Draw a sample from the proposal conditional on the last sample, then accept or reject the sample based on the acceptance ratio.
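The propose-then-accept/reject loop can be sketched in plain Python for the symmetric random-walk special case (hypothetical names; Edward's version additionally handles asymmetric proposals and graph-level bookkeeping):

```python
import math
import random

def mh_chain(log_p, z0, n, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings: propose z' ~ N(z, step^2) and
    accept with probability min(1, p(z')/p(z)); the proposal is symmetric,
    so its densities cancel in the ratio."""
    rng = random.Random(seed)
    z, out = z0, []
    for _ in range(n):
        zp = rng.gauss(z, step)
        if math.log(rng.random()) < log_p(zp) - log_p(z):
            z = zp          # accept
        out.append(z)       # on rejection, the old sample repeats
    return out

# sketch: target a standard Normal, log p(z) = -z^2/2 up to a constant
samples = mh_chain(lambda z: -0.5 * z * z, z0=3.0, n=50000)
```

After burn-in, the retained samples form the empirical approximation that the Empirical random variables store.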
Notes
The updates assume each Empirical random variable is directly parameterized by tf.Variables.
edward.inferences.HMC(*args, **kwargs)[source]
Hamiltonian Monte Carlo, also known as hybrid Monte Carlo (Duane et al., 1987; Neal, 2011).
Notes
In conditional inference, we infer \(z\) in \(p(z, \beta \mid x)\) while fixing inference over \(\beta\) using another distribution \(q(\beta)\). HMC substitutes the model’s log marginal density with \(\log p(x, z, \beta^*)\), leveraging a single Monte Carlo sample, where \(\beta^* \sim q(\beta)\). This is unbiased (and therefore asymptotically exact as a pseudo-marginal method) if \(q(\beta) = p(\beta \mid x)\).
Methods
Examples
z = Normal(mu=0.0, sigma=1.0)
x = Normal(mu=tf.ones(10) * z, sigma=1.0)
qz = Empirical(tf.Variable(tf.zeros(500)))
data = {x: np.array([0.0] * 10, dtype=np.float32)}
inference = ed.HMC({z: qz}, data)
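A single HMC transition pairs a leapfrog simulation of Hamiltonian dynamics with a Metropolis correction; a plain-Python univariate sketch (hypothetical names, mirroring the step_size and n_steps parameters below but not Edward's implementation):

```python
import math
import random

def hmc_step(z, log_p, grad_log_p, step_size, n_steps, rng):
    """One HMC transition: sample a momentum, leapfrog-integrate the
    dynamics of (z, momentum), then accept/reject on the joint energy
    to correct the integrator's discretization error."""
    r0 = rng.gauss(0.0, 1.0)                       # resample momentum
    z_new = z
    r = r0 + 0.5 * step_size * grad_log_p(z)       # half step on momentum
    for i in range(n_steps):
        z_new = z_new + step_size * r              # full step on position
        if i < n_steps - 1:
            r = r + step_size * grad_log_p(z_new)  # full step on momentum
    r = r + 0.5 * step_size * grad_log_p(z_new)    # final half step
    log_accept = (log_p(z_new) - 0.5 * r * r) - (log_p(z) - 0.5 * r0 * r0)
    return z_new if math.log(rng.random()) < log_accept else z

# sketch: sample a standard Normal, log p(z) = -z^2/2 up to a constant
rng = random.Random(0)
z, samples = 3.0, []
for _ in range(20000):
    z = hmc_step(z, lambda t: -0.5 * t * t, lambda t: -t,
                 step_size=0.5, n_steps=5, rng=rng)
    samples.append(z)
```

The gradient of the log joint is what lets HMC make large, informed moves, which is the advantage alluded to over random-walk proposals.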
Methods
initialize(step_size=0.25, n_steps=2, *args, **kwargs)[source]
Parameters: 

step_size : float, optional
n_steps : int, optional

edward.inferences.SGLD(*args, **kwargs)[source]
Stochastic gradient Langevin dynamics (Welling and Teh, 2011).
Notes
In conditional inference, we infer \(z\) in \(p(z, \beta \mid x)\) while fixing inference over \(\beta\) using another distribution \(q(\beta)\). SGLD substitutes the model’s log marginal density with \(\log p(x, z, \beta^*)\), leveraging a single Monte Carlo sample, where \(\beta^* \sim q(\beta)\). This is unbiased (and therefore asymptotically exact as a pseudo-marginal method) if \(q(\beta) = p(\beta \mid x)\).
Methods
Examples
z = Normal(mu=0.0, sigma=1.0)
x = Normal(mu=tf.ones(10) * z, sigma=1.0)
qz = Empirical(tf.Variable(tf.zeros(500)))
data = {x: np.array([0.0] * 10, dtype=np.float32)}
inference = ed.SGLD({z: qz}, data)
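The SGLD update itself is a gradient step on the log joint plus injected Gaussian noise whose variance matches the step size; a plain-Python univariate sketch (hypothetical names, not Edward's implementation, and using the full gradient rather than a minibatch estimate):

```python
import math
import random

def sgld_samples(grad_log_p, z0, n, step_size, seed=0):
    """Langevin dynamics: z <- z + (step/2) * grad log p(z) + N(0, step).
    With decaying step sizes (and stochastic gradients over minibatches),
    this is the SGLD update of Welling and Teh (2011)."""
    rng = random.Random(seed)
    z, out = z0, []
    for _ in range(n):
        z = (z + 0.5 * step_size * grad_log_p(z)
               + rng.gauss(0.0, math.sqrt(step_size)))
        out.append(z)
    return out

# sketch: target a standard Normal, so grad log p(z) = -z
samples = sgld_samples(lambda z: -z, z0=3.0, n=50000, step_size=0.05)
```

With a fixed step size the chain carries an O(step_size) discretization bias; shrinking the step size trades mixing speed for accuracy.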
Methods
initialize(step_size=0.25, *args, **kwargs)[source]
Parameters: 

step_size : float, optional

edward.inferences.SGHMC(*args, **kwargs)[source]
Stochastic gradient Hamiltonian Monte Carlo (Chen et al., 2014).
Notes
In conditional inference, we infer \(z\) in \(p(z, \beta \mid x)\) while fixing inference over \(\beta\) using another distribution \(q(\beta)\). SGHMC substitutes the model’s log marginal density with \(\log p(x, z, \beta^*)\), leveraging a single Monte Carlo sample, where \(\beta^* \sim q(\beta)\). This is unbiased (and therefore asymptotically exact as a pseudo-marginal method) if \(q(\beta) = p(\beta \mid x)\).
Methods
Examples
z = Normal(mu=0.0, sigma=1.0)
x = Normal(mu=tf.ones(10) * z, sigma=1.0)
qz = Empirical(tf.Variable(tf.zeros(500)))
data = {x: np.array([0.0] * 10, dtype=np.float32)}
inference = ed.SGHMC({z: qz}, data)
Methods
initialize(step_size=0.25, friction=0.1, *args, **kwargs)[source]
Parameters: 

step_size : float, optional
friction : float, optional

build_update()[source]
Simulate Hamiltonian dynamics with friction using a discretized integrator. Its discretization error goes to zero as the learning rate decreases.
Implements the update equations from (15) of Chen et al. (2014).
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer New York.
Doucet, A., De Freitas, N., & Gordon, N. (2001). An introduction to sequential Monte Carlo methods. In Sequential monte carlo methods in practice (pp. 3–14). Springer.
Gelfand, A. E., & Smith, A. F. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410), 398–409.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. In Neural information processing systems.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.
Neal, R. M. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo.
Robert, C. P., & Casella, G. (1999). Monte carlo statistical methods. Springer.