## API and Documentation

### Classes of Inference

Inference algorithms fall broadly into three classes: variational inference, Monte Carlo, and exact inference. We highlight how to use inference algorithms from each class.

As an example, we assume a mixture model with latent mixture assignments z, latent cluster means beta, and observations x: $p(\mathbf{x}, \mathbf{z}, \beta) = \text{Normal}(\mathbf{x} \mid \beta_{\mathbf{z}}, \mathbf{I}) ~ \text{Categorical}(\mathbf{z}\mid \pi) ~ \text{Normal}(\beta\mid \mathbf{0}, \mathbf{I}).$
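
To make the generative process concrete, here is a minimal sketch in plain Python (no Edward or TensorFlow) that draws one realization of the cluster means, assignments, and observations. The function name `sample_mixture`, its arguments, and the mixing proportions are illustrative assumptions, not part of Edward's API.

```python
import random


def sample_mixture(n_points, n_clusters, n_dims, pi, seed=0):
    """Draw (beta, z, x) from the mixture model's generative process."""
    rng = random.Random(seed)
    # Cluster means: beta_k ~ Normal(0, I).
    beta = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)]
            for _ in range(n_clusters)]
    # Mixture assignments: z_n ~ Categorical(pi).
    z = rng.choices(range(n_clusters), weights=pi, k=n_points)
    # Observations: x_n ~ Normal(beta_{z_n}, I).
    x = [[rng.gauss(beta[z_n][d], 1.0) for d in range(n_dims)] for z_n in z]
    return beta, z, x


beta, z, x = sample_mixture(n_points=5, n_clusters=3, n_dims=2,
                            pi=[0.2, 0.5, 0.3])
```

Inference runs this process in reverse: given only `x`, it recovers a distribution over `beta` and `z`.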

### Variational Inference

In variational inference, the idea is to posit a family of approximating distributions and to find the closest member in the family to the posterior (Jordan, Ghahramani, Jaakkola, & Saul, 1999). We write an approximating family,

$\begin{aligned} q(\beta;\mu,\sigma) &= \text{Normal}(\beta; \mu,\sigma), \\[1.5ex] q(\mathbf{z};\pi) &= \text{Categorical}(\mathbf{z};\pi),\end{aligned}$

using TensorFlow variables to represent its parameters $$\lambda=\{\pi,\mu,\sigma\}$$.

import edward as ed
import tensorflow as tf
from edward.models import Categorical, Normal

qbeta = Normal(loc=tf.Variable(tf.zeros([K, D])),
               scale=tf.exp(tf.Variable(tf.zeros([K, D]))))
qz = Categorical(logits=tf.Variable(tf.zeros([N, K])))

inference = ed.VariationalInference({beta: qbeta, z: qz}, data={x: x_train})

Given an objective function, variational inference optimizes the family with respect to tf.Variables.

Specific variational inference algorithms inherit from the VariationalInference class to define their own methods, such as a loss function and gradient. For example, we represent MAP estimation with an approximating family (qbeta and qz) of PointMass random variables, i.e., with all probability mass concentrated at a point.

from edward.models import PointMass

qbeta = PointMass(params=tf.Variable(tf.zeros([K, D])))
qz = PointMass(params=tf.Variable(tf.zeros(N)))

inference = ed.MAP({beta: qbeta, z: qz}, data={x: x_train})

MAP inherits from VariationalInference and defines a loss function and update rules; it uses existing optimizers inside TensorFlow.

### Monte Carlo

Monte Carlo approximates the posterior using samples (Robert & Casella, 1999). It can be viewed as inference where the approximating family is an empirical distribution,

$\begin{aligned} q(\beta; \{\beta^{(t)}\}) &= \frac{1}{T}\sum_{t=1}^T \delta(\beta, \beta^{(t)}), \\[1.5ex] q(\mathbf{z}; \{\mathbf{z}^{(t)}\}) &= \frac{1}{T}\sum_{t=1}^T \delta(\mathbf{z}, \mathbf{z}^{(t)}).\end{aligned}$

The parameters are $$\lambda=\{\beta^{(t)},\mathbf{z}^{(t)}\}$$.

from edward.models import Empirical

T = 10000  # number of samples
qbeta = Empirical(params=tf.Variable(tf.zeros([T, K, D])))
qz = Empirical(params=tf.Variable(tf.zeros([T, N])))

inference = ed.MonteCarlo({beta: qbeta, z: qz}, data={x: x_train})

Monte Carlo algorithms proceed by updating one sample $$\beta^{(t)},\mathbf{z}^{(t)}$$ at a time in the empirical approximation. Markov chain Monte Carlo does this sequentially, updating the current sample (index $$t$$ of tf.Variables) conditional on the last sample (index $$t-1$$ of tf.Variables). Specific Monte Carlo samplers determine the update rules; they can use gradients such as in Hamiltonian Monte Carlo (Neal, 2011) and graph structure such as in sequential Monte Carlo (Doucet, De Freitas, & Gordon, 2001).

### Non-Bayesian Methods

As a library for probabilistic modeling (not necessarily Bayesian modeling), Edward is agnostic to the paradigm for inference. This means Edward can use frequentist (population-based) inferences, strictly point estimation, and alternative foundations for parameter uncertainty.

For example, Edward supports non-Bayesian methods such as generative adversarial networks (GANs) (Goodfellow et al., 2014). For more details, see the GAN tutorial.

In general, we think opening the door to non-Bayesian approaches is a crucial feature for probabilistic programming. It lets advances in other fields such as deep learning complement probabilistic modeling: all of these methods are in service of probabilistic models, so it makes sense to combine efforts.

### Exact Inference

This approach also extends to algorithms that usually require tedious algebraic manipulation. With symbolic algebra on the nodes of the computational graph, we can uncover conjugacy relationships between random variables. Users can then integrate out variables to automatically derive classical Gibbs (Gelfand & Smith, 1990), mean-field updates (Bishop, 2006), and exact inference.

The classes below inherit methods from base inference classes; see the development page for more details.

class edward.inferences.VariationalInference(*args, **kwargs)[source]

Abstract base class for variational inference. Specific variational inference methods inherit from VariationalInference, sharing methods such as a default optimizer.

To build an algorithm inheriting from VariationalInference, one must at the minimum implement build_loss_and_gradients: it determines the loss function and gradients to apply for a given optimizer.

Methods

initialize(optimizer=None, var_list=None, use_prettytensor=False, *args, **kwargs)[source]

Initialize variational inference.

Parameters:

optimizer : str or tf.train.Optimizer, optional

A TensorFlow optimizer, to use for optimizing the variational objective. Alternatively, one can pass in the name of a TensorFlow optimizer, and default parameters for the optimizer will be used.

var_list : list of tf.Variable, optional

List of TensorFlow variables to optimize over. Default is all trainable variables that latent_vars and data depend on, excluding those that are only used in conditionals in data.

use_prettytensor : bool, optional

True to use a PrettyTensor optimizer (when the model is built with PrettyTensor); False to use a TensorFlow optimizer. Defaults to False.

update(feed_dict=None)[source]

Run one iteration of optimizer for variational inference.

Parameters:

feed_dict : dict, optional

Feed dictionary for a TensorFlow session run. It is used to feed placeholders that are not fed during initialization.

Returns:

dict

Dictionary of algorithm-specific information. In this case, the loss function value after one iteration.

print_progress(info_dict)[source]

Print progress to output.

build_loss_and_gradients(var_list)[source]

Build loss function and its gradients. They will be leveraged in an optimizer to update the model and variational parameters.

Any derived class of VariationalInference must implement this method.

Raises:
NotImplementedError

class edward.inferences.KLqp(*args, **kwargs)[source]

Variational inference with the KL divergence

$\text{KL}( q(z; \lambda) \| p(z \mid x) ).$

This class minimizes the objective by automatically selecting from a variety of black box inference techniques.

Notes

KLqp also optimizes any model parameters $$\theta$$ in $$p(z \mid x; \theta)$$. It does this by variational EM, maximizing

$\mathbb{E}_{q(z; \lambda)} [ \log p(x, z; \theta) ]$

with respect to $$\theta$$.

In conditional inference, we infer $$z$$ in $$p(z, \beta \mid x)$$ while fixing inference over $$\beta$$ using another distribution $$q(\beta)$$. During gradient calculation, instead of using the model’s density

$\log p(x, z^{(s)}), z^{(s)} \sim q(z; \lambda),$

for each sample $$s=1,\ldots,S$$, KLqp uses

$\log p(x, z^{(s)}, \beta^{(s)}),$

where $$z^{(s)} \sim q(z; \lambda)$$ and $$\beta^{(s)} \sim q(\beta)$$.

Methods

initialize(n_samples=1, kl_scaling=None, *args, **kwargs)[source]

Initialization.

Parameters:

n_samples : int, optional

Number of samples from variational model for calculating stochastic gradients.

kl_scaling : dict of RandomVariable to float, optional

Provides option to scale terms when using ELBO with KL divergence. If the KL divergence terms are

$\alpha_p \mathbb{E}_{q(z\mid x, \lambda)} [ \log q(z\mid x, \lambda) - \log p(z)],$

then pass {$$p(z)$$: $$\alpha_p$$} as kl_scaling, where $$\alpha_p$$ is a float that specifies how much to scale the KL term.

build_loss_and_gradients(var_list)[source]

Wrapper for the KLqp loss function.

$-\text{ELBO} = -\mathbb{E}_{q(z; \lambda)} [ \log p(x, z) - \log q(z; \lambda) ]$

KLqp supports

1. score function gradients (Paisley et al., 2012)
2. reparameterization gradients (Kingma and Welling, 2014)

of the loss function.
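
The two gradient estimators can be compared on a toy problem. The sketch below is plain Python, not Edward's implementation: it estimates the gradient of $$\mathbb{E}_{q(z; \mu, \sigma)}[f(z)]$$ with respect to $$\mu$$ for a Normal $$q$$ and the stand-in objective $$f(z) = z^2$$, whose exact gradient is $$2\mu$$. All function names are hypothetical.

```python
import random


def f(z):
    """Stand-in integrand; its expectation under q is mu^2 + sigma^2."""
    return z * z


def df(z):
    """Derivative of f, needed by the reparameterization estimator."""
    return 2.0 * z


def score_function_grad(mu, sigma, n_samples, rng):
    """Average of f(z) * d/dmu log q(z; mu, sigma) with z ~ q."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu, sigma)
        total += f(z) * (z - mu) / sigma ** 2
    return total / n_samples


def reparameterization_grad(mu, sigma, n_samples, rng):
    """Average of f'(z) * dz/dmu, where z = mu + sigma * eps and dz/dmu = 1."""
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        total += df(z)
    return total / n_samples


rng = random.Random(0)
mu, sigma = 1.5, 1.0
score_est = score_function_grad(mu, sigma, 100000, rng)
reparam_est = reparameterization_grad(mu, sigma, 100000, rng)
```

Both estimates target the same value, but the score function estimator only needs to evaluate $$f$$ and $$\log q$$, while the reparameterization estimator needs a differentiable $$f$$ and typically has lower variance.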

If the KL divergence between the variational model and the prior is tractable, then the loss function can be written as

$-\mathbb{E}_{q(z; \lambda)}[\log p(x \mid z)] + \text{KL}( q(z; \lambda) \| p(z) ),$

where the KL term is computed analytically (Kingma and Welling, 2014). We compute this automatically when $$p(z)$$ and $$q(z; \lambda)$$ are Normal.
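
For reference, the closed-form KL divergence between two univariate Normal distributions can be written directly. This is a minimal sketch in plain Python with a hypothetical function name; it illustrates the formula rather than Edward's internal code path.

```python
import math


def kl_normal_normal(mu_q, sigma_q, mu_p, sigma_p):
    """KL( Normal(mu_q, sigma_q) || Normal(mu_p, sigma_p) ) in closed form."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


kl_same = kl_normal_normal(0.0, 1.0, 0.0, 1.0)      # identical distributions
kl_shifted = kl_normal_normal(1.0, 1.0, 0.0, 1.0)   # unit shift in the mean
```

Using the analytic term removes the sampling noise that a Monte Carlo estimate of the KL would otherwise add to the gradient.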

class edward.inferences.ReparameterizationKLqp(*args, **kwargs)[source]

Variational inference with the KL divergence

$\text{KL}( q(z; \lambda) \| p(z \mid x) ).$

This class minimizes the objective using the reparameterization gradient.

class edward.inferences.ReparameterizationKLKLqp(*args, **kwargs)[source]

Variational inference with the KL divergence

$\text{KL}( q(z; \lambda) \| p(z \mid x) ).$

This class minimizes the objective using the reparameterization gradient and an analytic KL term.

class edward.inferences.ReparameterizationEntropyKLqp(*args, **kwargs)[source]

Variational inference with the KL divergence

$\text{KL}( q(z; \lambda) \| p(z \mid x) ).$

This class minimizes the objective using the reparameterization gradient and an analytic entropy term.

class edward.inferences.ScoreKLqp(*args, **kwargs)[source]

Variational inference with the KL divergence

$\text{KL}( q(z; \lambda) \| p(z \mid x) ).$

This class minimizes the objective using the score function gradient.

class edward.inferences.ScoreKLKLqp(*args, **kwargs)[source]

Variational inference with the KL divergence

$\text{KL}( q(z; \lambda) \| p(z \mid x) ).$

This class minimizes the objective using the score function gradient and an analytic KL term.

class edward.inferences.ScoreEntropyKLqp(*args, **kwargs)[source]

Variational inference with the KL divergence

$\text{KL}( q(z; \lambda) \| p(z \mid x) ).$

This class minimizes the objective using the score function gradient and an analytic entropy term.

class edward.inferences.GANInference(data, discriminator)[source]

Parameter estimation with GAN-style training (Goodfellow et al., 2014).

Works for the class of implicit (and differentiable) probabilistic models. These models do not require a tractable density and assume only a program that generates samples.

Parameters:

data : dict

Data dictionary which binds observed variables (of type RandomVariable or tf.Tensor) to their realizations (of type tf.Tensor). It can also bind placeholders (of type tf.Tensor) used in the model to their realizations.

discriminator : function

Function (with parameters) to discriminate samples. It should output logit probabilities (real-valued) and not probabilities in [0, 1].

Notes

GANInference does not support latent variable inference. Note that GAN-style training also samples from the prior: this does not work well for latent variables that are shared across many data points (global variables).

In building the computation graph for inference, the discriminator’s parameters can be accessed with the variable scope “Disc”.

GANs also only work for one observed random variable in data.

Examples

z = Normal(loc=tf.zeros([100, 10]), scale=tf.ones([100, 10]))
x = generative_network(z)

inference = ed.GANInference({x: x_data}, discriminator)



Methods

initialize(optimizer=None, optimizer_d=None, global_step=None, global_step_d=None, var_list=None, *args, **kwargs)[source]

Initialize GAN inference.

Parameters:

optimizer : str or tf.train.Optimizer, optional

A TensorFlow optimizer, to use for optimizing the generator objective. Alternatively, one can pass in the name of a TensorFlow optimizer, and default parameters for the optimizer will be used.

optimizer_d : str or tf.train.Optimizer, optional

A TensorFlow optimizer, to use for optimizing the discriminator objective. Alternatively, one can pass in the name of a TensorFlow optimizer, and default parameters for the optimizer will be used.

global_step : tf.Variable, optional

Optional Variable to increment by one after the variables for the generator have been updated. See tf.train.Optimizer.apply_gradients.

global_step_d : tf.Variable, optional

Optional Variable to increment by one after the variables for the discriminator have been updated. See tf.train.Optimizer.apply_gradients.

var_list : list of tf.Variable, optional

List of TensorFlow variables to optimize over (in the generative model). Default is all trainable variables that latent_vars and data depend on.

update(feed_dict=None, variables=None)[source]

Run one iteration of optimization.

Parameters:

feed_dict : dict, optional

Feed dictionary for a TensorFlow session run. It is used to feed placeholders that are not fed during initialization.

variables : str, optional

Which set of variables to update. Either “Disc” or “Gen”. Default is both.

Returns:

dict

Dictionary of algorithm-specific information. In this case, the iteration number and generative and discriminative losses.

Notes

The outputted iteration number is the total number of calls to update. Each update may include updating only a subset of parameters.

print_progress(info_dict)[source]

Print progress to output.

class edward.inferences.BiGANInference(latent_vars, data, discriminator)[source]

Adversarially Learned Inference (Dumoulin et al., 2017) or Bidirectional Generative Adversarial Networks (Donahue et al., 2017) for joint learning of generator and inference networks.

Works for the class of implicit (and differentiable) probabilistic models. These models do not require a tractable density and assume only a program that generates samples.

Notes

BiGANInference matches a mapping from data to latent variables and a mapping from latent variables to data through a joint discriminator.

In building the computation graph for inference, the discriminator’s parameters can be accessed with the variable scope “Disc”. In building the computation graph for inference, the encoder and decoder parameters can be accessed with the variable scope “Gen”.

Examples

with tf.variable_scope("Gen"):
    xf = gen_data(z_ph)
    zf = gen_latent(x_ph)
inference = ed.BiGANInference({z_ph: zf}, {xf: x_ph}, discriminator)


class edward.inferences.WGANInference(*args, **kwargs)[source]

Parameter estimation with GAN-style training (Goodfellow et al., 2014), using the Wasserstein distance (Arjovsky et al., 2017).

Works for the class of implicit (and differentiable) probabilistic models. These models do not require a tractable density and assume only a program that generates samples.

Notes

Argument-wise, the only difference from GANInference is conceptual: the discriminator is better described as a test function or critic. WGANInference continues to use discriminator only to share methods and attributes with GANInference.

Examples

z = Normal(loc=tf.zeros([100, 10]), scale=tf.ones([100, 10]))
x = generative_network(z)

inference = ed.WGANInference({x: x_data}, discriminator)


class edward.inferences.ImplicitKLqp(latent_vars, data=None, discriminator=None, global_vars=None)[source]

Variational inference with implicit probabilistic models (Tran et al., 2017).

It minimizes the KL divergence

$\text{KL}( q(z, \beta; \lambda) \| p(z, \beta \mid x) ),$

where $$z$$ are local variables associated to a data point and $$\beta$$ are global variables shared across data points.

Global latent variables require log_prob() and need to return a random sample when fetched from the graph. Local latent variables and observed variables require only a random sample when fetched from the graph. (This is true for both $$p$$ and $$q$$.)

All variational factors must be reparameterizable: each of the random variables (rv) satisfies rv.is_reparameterized and rv.is_continuous.

Parameters:

discriminator : function

Function (with parameters). Unlike GANInference, it is interpreted as a ratio estimator rather than a discriminator. It takes three arguments: a data dict, local latent variable dict, and global latent variable dict. As with GAN discriminators, it can take a batch of data points and local variables, of size $$M$$, and output a vector of length $$M$$.

global_vars : dict of RandomVariable to RandomVariable, optional

Identifying which variables in latent_vars are global variables, shared across data points. These will not be encompassed in the ratio estimation problem, and will be estimated with tractable variational approximations.

Notes

Unlike GANInference, discriminator takes dicts as input, and must subset to the appropriate values through lexical scoping from the previously defined model and latent variables. This is necessary as the discriminator can take an arbitrary set of data, latent, and global variables.

Note the type for discriminator's output changes when one passes in the scale argument to initialize().

• If scale has at most one item, then discriminator outputs a tensor whose multiplication with that element is broadcastable. (For example, the output is a tensor and the single scale factor is a scalar.)
• If scale has more than one item, then in order to scale its corresponding output, discriminator must output a dictionary of same size and keys as scale.

Methods

initialize(ratio_loss='log', *args, **kwargs)[source]

Initialization.

Parameters:

ratio_loss : str or fn, optional

Loss function minimized to get the ratio estimator. ‘log’ or ‘hinge’. Alternatively, one can pass in a function of two inputs, psamples and qsamples, and output a point-wise value with shape matching the shapes of the two inputs.
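
As an illustration of the expected shape of a custom ratio_loss, here is a sketch of pointwise log and hinge losses on plain Python lists. The exact forms Edward uses internally are an assumption here; the functions below follow the standard logistic and hinge formulations, with psamples treated as outputs on samples from $$p$$ and qsamples as outputs on samples from $$q$$.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_loss(psamples, qsamples):
    """Pointwise -log sigmoid(p) - log(1 - sigmoid(q))."""
    return [softplus(-p) + softplus(q) for p, q in zip(psamples, qsamples)]


def hinge_loss(psamples, qsamples):
    """Pointwise max(0, 1 - p) + max(0, 1 + q)."""
    return [max(0.0, 1.0 - p) + max(0.0, 1.0 + q)
            for p, q in zip(psamples, qsamples)]


log_values = log_loss([0.0, 3.0], [0.0, -3.0])
hinge_values = hinge_loss([0.0, 2.0], [0.0, -2.0])
```

Each function takes two inputs of equal shape and returns one value per element, which matches the contract described for ratio_loss.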

build_loss_and_gradients(var_list)[source]

Build loss function

$-\Big(\mathbb{E}_{q(\beta)} [\log p(\beta) - \log q(\beta) ] + \sum_{n=1}^N \mathbb{E}_{q(\beta)q(z_n\mid\beta)} [ r^*(x_n, z_n, \beta) ] \Big).$

We minimize it with respect to parameterized variational families $$q(z, \beta; \lambda)$$.

$$r^*(x_n, z_n, \beta)$$ is a function of a single data point $$x_n$$, single local variable $$z_n$$, and all global variables $$\beta$$. It is equal to the log-ratio

$\log p(x_n, z_n\mid \beta) - \log q(x_n, z_n\mid \beta),$

where $$q(x_n)$$ is the empirical data distribution. Rather than explicit calculation, $$r^*(x, z, \beta)$$ is the solution to a ratio estimation problem, minimizing the specified ratio_loss.

Gradients are taken using the reparameterization trick (Kingma and Welling, 2014).

Notes

This also includes model parameters $$p(x, z, \beta; \theta)$$ and variational distributions with inference networks $$q(z\mid x)$$.

Several extensions to this implementation are possible:

• further factorizations can be used to better leverage the graph structure for more complicated models;
• score function gradients for global variables;
• use more samples; this would require the copy() utility function for q's as well, and an additional loop. We opt not to because it complicates the code;
• analytic KL/swapping out the penalty term for the globals.

class edward.inferences.KLpq(*args, **kwargs)[source]

Variational inference with the KL divergence

$\text{KL}( p(z \mid x) \| q(z) ).$

To perform the optimization, this class uses a technique from adaptive importance sampling (Cappe et al., 2008).
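
The core ingredient is a set of self-normalized importance weights computed from samples of $$q$$. The sketch below, in plain Python, uses them to estimate a posterior mean in a toy model with $$z \sim \text{Normal}(0, 1)$$ and $$x_i \mid z \sim \text{Normal}(z, 1)$$. The model, data, and function names are assumptions chosen so the exact answer is known; this is not Edward's implementation.

```python
import math
import random


def log_normal(x, mu, sigma):
    """Log density of Normal(mu, sigma) at x."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2.0 * sigma ** 2))


def log_joint(x_data, z):
    """log p(x, z) for z ~ Normal(0, 1) and x_i | z ~ Normal(z, 1)."""
    return log_normal(z, 0.0, 1.0) + sum(log_normal(x, z, 1.0) for x in x_data)


def normalized_weights(x_data, z_samples, mu_q, sigma_q):
    """Self-normalized weights w(z) = p(x, z) / q(z), computed in log space."""
    log_w = [log_joint(x_data, z) - log_normal(z, mu_q, sigma_q)
             for z in z_samples]
    max_log_w = max(log_w)  # subtract the maximum to avoid underflow
    w = [math.exp(lw - max_log_w) for lw in log_w]
    total = sum(w)
    return [w_s / total for w_s in w]


rng = random.Random(0)
x_data = [1.0, 2.0, 0.5]
mu_q, sigma_q = 0.0, 2.0
z_samples = [rng.gauss(mu_q, sigma_q) for _ in range(20000)]
w_norm = normalized_weights(x_data, z_samples, mu_q, sigma_q)
posterior_mean = sum(w * z for w, z in zip(w_norm, z_samples))
```

For this conjugate model the exact posterior mean is `sum(x_data) / (len(x_data) + 1) = 0.875`, which the weighted average approximates.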

Notes

KLpq also optimizes any model parameters $$\theta$$ in $$p(z\mid x; \theta)$$. It does this by variational EM, maximizing

$\mathbb{E}_{p(z \mid x; \lambda)} [ \log p(x, z; \theta) ]$

with respect to $$\theta$$.

In conditional inference, we infer $$z$$ in $$p(z, \beta \mid x)$$ while fixing inference over $$\beta$$ using another distribution $$q(\beta)$$. During gradient calculation, instead of using the model’s density

$\log p(x, z^{(s)}), z^{(s)} \sim q(z; \lambda),$

for each sample $$s=1,\ldots,S$$, KLpq uses

$\log p(x, z^{(s)}, \beta^{(s)}),$

where $$z^{(s)} \sim q(z; \lambda)$$ and $$\beta^{(s)} \sim q(\beta)$$.

Methods

initialize(n_samples=1, *args, **kwargs)[source]

Initialization.

Parameters:

n_samples : int, optional

Number of samples from variational model for calculating stochastic gradients.

build_loss_and_gradients(var_list)[source]

Build loss function

$\text{KL}( p(z \mid x) \| q(z) ) = \mathbb{E}_{p(z \mid x)} [ \log p(z \mid x) - \log q(z; \lambda) ]$

and stochastic gradients based on importance sampling.

The loss function can be estimated as

$\frac{1}{S} \sum_{s=1}^S [ w_{\text{norm}}(z^s; \lambda) (\log p(x, z^s) - \log q(z^s; \lambda)) ],$

where for $$z^s \sim q(z; \lambda)$$,

$w_{\text{norm}}(z^s; \lambda) = w(z^s; \lambda) / \sum_{s=1}^S w(z^s; \lambda)$

normalizes the importance weights, $$w(z^s; \lambda) = p(x, z^s) / q(z^s; \lambda)$$. The corresponding stochastic gradient with respect to $$\lambda$$ is

$- \frac{1}{S} \sum_{s=1}^S [ w_{\text{norm}}(z^s; \lambda) \nabla_{\lambda} \log q(z^s; \lambda) ].$

class edward.inferences.MAP(latent_vars=None, data=None)[source]

Maximum a posteriori.

This class implements gradient-based optimization to solve the optimization problem,

$\min_{z} - p(z \mid x).$

This is equivalent to using a PointMass variational distribution and minimizing the unnormalized objective,

$- \mathbb{E}_{q(z; \lambda)} [ \log p(x, z) ].$
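
With a point mass at $$z$$, the objective reduces to $$-\log p(x, z)$$, which can be minimized by gradient descent. The following is a minimal sketch in plain Python for the toy model $$z \sim \text{Normal}(0, 1)$$, $$x_i \mid z \sim \text{Normal}(z, 1)$$; the model, learning rate, and function names are illustrative assumptions, and Edward itself relies on TensorFlow optimizers instead.

```python
def neg_log_joint_grad(z, x_data):
    """Gradient of -log p(x, z) for z ~ Normal(0, 1), x_i | z ~ Normal(z, 1)."""
    return z - sum(x - z for x in x_data)


def map_estimate(x_data, learning_rate=0.05, n_iter=500):
    """Plain gradient descent on -log p(x, z), starting from z = 0."""
    z = 0.0
    for _ in range(n_iter):
        z -= learning_rate * neg_log_joint_grad(z, x_data)
    return z


x_data = [1.0, 2.0, 0.5]
z_map = map_estimate(x_data)
```

For this model the mode is available in closed form, `sum(x_data) / (len(x_data) + 1)`, which gives a direct check on the optimizer.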

Notes

This class is currently restricted to optimization over differentiable latent variables. For example, it does not solve discrete optimization.

This class also minimizes the loss with respect to any model parameters $$p(z \mid x; \theta)$$.

In conditional inference, we infer $$z$$ in $$p(z, \beta \mid x)$$ while fixing inference over $$\beta$$ using another distribution $$q(\beta)$$. MAP optimizes $$\mathbb{E}_{q(\beta)} [ \log p(x, z, \beta) ]$$, leveraging a single Monte Carlo sample, $$\log p(x, z, \beta^*)$$, where $$\beta^* \sim q(\beta)$$. This is a lower bound to the marginal density $$\log p(x, z)$$, and it is exact if $$q(\beta) = p(\beta \mid x)$$ (up to stochasticity).

Parameters:

latent_vars : list of RandomVariable or dict of RandomVariable to RandomVariable

Collection of random variables to perform inference on. If list, each random variable will be implicitly optimized using a PointMass random variable that is defined internally (with unconstrained support). If dictionary, each value in the dictionary must be a PointMass random variable.

Examples

Most explicitly, MAP is specified via a dictionary:

qpi = PointMass(params=ed.to_simplex(tf.Variable(tf.zeros(K-1))))
qmu = PointMass(params=tf.Variable(tf.zeros(K*D)))
qsigma = PointMass(params=tf.nn.softplus(tf.Variable(tf.zeros(K*D))))
ed.MAP({pi: qpi, mu: qmu, sigma: qsigma}, data)



We also automate the specification of PointMass distributions, so one can pass in a list of latent variables instead:

ed.MAP([beta], data)
ed.MAP([pi, mu, sigma], data)



Currently, MAP can only instantiate PointMass random variables with unconstrained support. To constrain their support, one must manually pass in the PointMass family.

Methods

build_loss_and_gradients(var_list)[source]

Build loss function. Its automatic differentiation is the gradient of

$- \log p(x,z)$

class edward.inferences.Laplace(latent_vars, data=None)[source]

Laplace approximation (Laplace, 1774).

It approximates the posterior distribution using a multivariate normal distribution centered at the mode of the posterior.

We implement this by running MAP to find the posterior mode. This forms the mean of the normal approximation. We then compute the inverse Hessian at the mode of the posterior. This forms the covariance of the normal approximation.
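
The two steps, finding the mode and inverting the Hessian there, can be sketched in one dimension. The example below is plain Python and makes two assumptions for illustration: the target is an unnormalized density proportional to $$z^{a-1} e^{-bz}$$, and the mode is found with Newton's method rather than the gradient-based MAP step described above. All names are hypothetical.

```python
def laplace_approximation(grad, hess, z_init, n_iter=50):
    """Return (mean, variance) of a 1-D Laplace approximation.

    grad and hess are the first and second derivatives of the negative
    log density. Newton's method locates the mode; the inverse Hessian
    at the mode gives the variance.
    """
    z = z_init
    for _ in range(n_iter):
        z -= grad(z) / hess(z)
    return z, 1.0 / hess(z)


a, b = 5.0, 2.0


def neg_log_density_grad(z):
    # d/dz of -(a - 1) * log(z) + b * z
    return -(a - 1.0) / z + b


def neg_log_density_hess(z):
    # d^2/dz^2 of -(a - 1) * log(z) + b * z
    return (a - 1.0) / z ** 2


mean, variance = laplace_approximation(
    neg_log_density_grad, neg_log_density_hess, z_init=1.0)
```

Here the mode is `(a - 1) / b` and the curvature at the mode is `b ** 2 / (a - 1)`, so the approximation is a normal distribution with mean 2 and variance 1.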

Parameters:

latent_vars : list of RandomVariable or dict of RandomVariable to RandomVariable

Collection of random variables to perform inference on. If list, each random variable will be implicitly optimized using a MultivariateNormalTriL random variable that is defined internally (with unconstrained support). If dictionary, each random variable must be a MultivariateNormalDiag, MultivariateNormalTriL, or Normal random variable.

Notes

If MultivariateNormalDiag or Normal random variables are specified as approximations, then the Laplace approximation will only produce the diagonal. This does not capture correlation among the variables but it does not require a potentially expensive matrix inversion.

Examples

X = tf.placeholder(tf.float32, [N, D])
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))
y = Normal(loc=ed.dot(X, w), scale=tf.ones(N))

qw = MultivariateNormalTriL(
    loc=tf.Variable(tf.random_normal([D])),
    scale_tril=tf.Variable(tf.random_normal([D, D])))

inference = ed.Laplace({w: qw}, data={X: X_train, y: y_train})



Methods

finalize(feed_dict=None)[source]

Function to call after convergence.

Computes the Hessian at the mode.

Parameters:

feed_dict : dict, optional

Feed dictionary for a TensorFlow session run during evaluation of Hessian. It is used to feed placeholders that are not fed during initialization.

class edward.inferences.MonteCarlo(latent_vars=None, data=None)[source]

Abstract base class for Monte Carlo. Specific Monte Carlo methods inherit from MonteCarlo, sharing methods in this class.

To build an algorithm inheriting from MonteCarlo, one must at the minimum implement build_update: it determines how to assign the samples in the Empirical approximations.

Methods

Initialization.

Parameters:

latent_vars : list or dict, optional

Collection of random variables (of type RandomVariable or tf.Tensor) to perform inference on. If list, each random variable will be approximated using an Empirical random variable that is defined internally (with unconstrained support). If dictionary, each value in the dictionary must be an Empirical random variable.

data : dict, optional

Data dictionary which binds observed variables (of type RandomVariable or tf.Tensor) to their realizations (of type tf.Tensor). It can also bind placeholders (of type tf.Tensor) used in the model to their realizations.

Notes

The number of Monte Carlo iterations is set according to the minimum of all Empirical sizes.

Initialization is assumed from params[0, :]. This generalizes initializing randomly and initializing from user input. Updates are along this outer dimension, where iteration t updates params[t, :] in each Empirical random variable.

No warm-up is implemented. Users must run MCMC for a long period of time, then manually burn in the Empirical random variable.
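
The buffer semantics and manual burn-in can be sketched without Edward. In the plain-Python example below, a list plays the role of the Empirical parameters: entry 0 is the initial state, entry t is written from entry t - 1, and the warm-up portion is dropped by slicing. The autoregressive kernel `ar1_step` is a hypothetical stand-in for a real sampler, chosen because its stationary distribution is known.

```python
import math
import random


def ar1_step(z_prev, rng, rho=0.9):
    """Toy Markov kernel whose stationary distribution is Normal(0, 1)."""
    return rho * z_prev + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)


def run_chain(n_samples, z_init, rng):
    """Fill a buffer where entry t is drawn conditional on entry t - 1."""
    params = [0.0] * n_samples
    params[0] = z_init  # the chain starts from the first entry
    for t in range(1, n_samples):
        params[t] = ar1_step(params[t - 1], rng)
    return params


rng = random.Random(0)
params = run_chain(n_samples=20000, z_init=10.0, rng=rng)
n_burn = 1000
kept = params[n_burn:]  # discard the warm-up portion by hand
kept_mean = sum(kept) / len(kept)
```

Starting far from the stationary region (here at 10.0) makes the early entries unrepresentative, which is why they are discarded before computing summaries.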

Examples

Most explicitly, MonteCarlo is specified via a dictionary:

qpi = Empirical(params=tf.Variable(tf.zeros([T, K-1])))
qmu = Empirical(params=tf.Variable(tf.zeros([T, K*D])))
qsigma = Empirical(params=tf.Variable(tf.zeros([T, K*D])))
ed.MonteCarlo({pi: qpi, mu: qmu, sigma: qsigma}, data)



The inferred posterior consists of Empirical random variables with T samples. We also automate the specification of Empirical random variables, so one can pass in a list of latent variables instead:

ed.MonteCarlo([beta], data)
ed.MonteCarlo([pi, mu, sigma], data)



It defaults to Empirical random variables with 10,000 samples for each dimension.

Methods

update(feed_dict=None)[source]

Run one iteration of sampling for Monte Carlo.

Parameters:

feed_dict : dict, optional

Feed dictionary for a TensorFlow session run. It is used to feed placeholders that are not fed during initialization.

Returns:

dict

Dictionary of algorithm-specific information. In this case, the acceptance rate of samples since (and including) this iteration.

Notes

We run the increment of t separately from other ops. Whether the other ops run with the value of t before or after the increment depends on which runs first in the TensorFlow graph. Running the increment separately forces consistent behavior.

print_progress(info_dict)[source]

Print progress to output.

build_update()[source]

Build update rules, returning an assign op for parameters in the Empirical random variables.

Any derived class of MonteCarlo must implement this method.

Raises:
NotImplementedError

class edward.inferences.MetropolisHastings(latent_vars, proposal_vars, data=None)[source]

Metropolis-Hastings (Metropolis et al., 1953; Hastings, 1970).

Notes

In conditional inference, we infer $$z$$ in $$p(z, \beta \mid x)$$ while fixing inference over $$\beta$$ using another distribution $$q(\beta)$$. To calculate the acceptance ratio, MetropolisHastings uses an estimate of the marginal density,

$p(x, z) = \mathbb{E}_{q(\beta)} [ p(x, z, \beta) ] \approx p(x, z, \beta^*)$

leveraging a single Monte Carlo sample, where $$\beta^* \sim q(\beta)$$. This is unbiased (and therefore asymptotically exact as a pseudo-marginal method) if $$q(\beta) = p(\beta \mid x)$$.

Parameters:

proposal_vars : dict of RandomVariable to RandomVariable

Collection of random variables to perform inference on; each is bound to a proposal distribution $$g(z' \mid z)$$.

Examples

z = Normal(loc=0.0, scale=1.0)
x = Normal(loc=tf.ones(10) * z, scale=1.0)

qz = Empirical(tf.Variable(tf.zeros(500)))
proposal_z = Normal(loc=z, scale=0.5)
data = {x: np.array([0.0] * 10, dtype=np.float32)}
inference = ed.MetropolisHastings({z: qz}, {z: proposal_z}, data)



Methods

build_update()[source]

Draw sample from proposal conditional on last sample. Then accept or reject the sample based on the ratio,

$\text{ratio} = \log p(x, z^{\text{new}}) - \log p(x, z^{\text{old}}) + \log g(z^{\text{old}} \mid z^{\text{new}}) - \log g(z^{\text{new}} \mid z^{\text{old}})$

Notes

The updates assume each Empirical random variable is directly parameterized by tf.Variables.
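
A random-walk Metropolis-Hastings sampler for the example model ($$z \sim \text{Normal}(0, 1)$$, ten observations $$x_i \mid z \sim \text{Normal}(z, 1)$$ all equal to zero) can be sketched in plain Python. This is an illustration of the accept/reject step under those assumptions, not Edward's TensorFlow implementation, and the function names are hypothetical.

```python
import math
import random


def log_joint(z, x_data):
    """log p(x, z) up to a constant: z ~ Normal(0, 1), x_i ~ Normal(z, 1)."""
    return -0.5 * z ** 2 - 0.5 * sum((x - z) ** 2 for x in x_data)


def log_proposal(z_to, z_from, scale):
    """log g(z_to | z_from) up to a constant for a Normal random walk."""
    return -0.5 * ((z_to - z_from) / scale) ** 2


def metropolis_hastings(x_data, n_samples, scale, rng):
    samples = [0.0] * n_samples  # samples[0] is the initial state
    n_accept = 0
    for t in range(1, n_samples):
        z_old = samples[t - 1]
        z_new = rng.gauss(z_old, scale)
        # Log acceptance ratio. The proposal terms cancel here because the
        # random walk is symmetric, but they are kept for generality.
        log_ratio = (log_joint(z_new, x_data) - log_joint(z_old, x_data)
                     + log_proposal(z_old, z_new, scale)
                     - log_proposal(z_new, z_old, scale))
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < log_ratio:
            samples[t] = z_new
            n_accept += 1
        else:
            samples[t] = z_old
    return samples, n_accept / (n_samples - 1)


x_data = [0.0] * 10
samples, accept_rate = metropolis_hastings(x_data, 20000, 0.5, random.Random(0))
kept = samples[1000:]  # manual burn-in
post_mean = sum(kept) / len(kept)
post_var = sum((z - post_mean) ** 2 for z in kept) / len(kept)
```

The exact posterior for this model is Normal with mean 0 and variance 1/11, so the sample mean and variance give a quick sanity check.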

class edward.inferences.Gibbs(latent_vars, proposal_vars=None, data=None)[source]

Gibbs sampling (Geman and Geman, 1984).

Parameters:

proposal_vars : dict of RandomVariable to RandomVariable, optional

Collection of random variables to perform inference on; each is bound to its complete conditional, which Gibbs draws from in each cycle. If not specified, default is to use ed.complete_conditional.

Examples

x_data = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1])

p = Beta(1.0, 1.0)
x = Bernoulli(probs=p, sample_shape=10)

qp = Empirical(tf.Variable(tf.zeros(500)))
inference = ed.Gibbs({p: qp}, data={x: x_data})
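
For this Beta-Bernoulli model the complete conditional of p is available in closed form by conjugacy, so each Gibbs update is a direct draw from a Beta distribution. The sketch below shows that update in plain Python with a hypothetical function name. Because the model has a single latent variable, the complete conditional coincides with the posterior.

```python
import random


def gibbs_beta_bernoulli(x_data, n_samples, rng, a=1.0, b=1.0):
    """Draw p from its complete conditional Beta(a + #ones, b + #zeros)."""
    n_ones = sum(x_data)
    n_zeros = len(x_data) - n_ones
    return [rng.betavariate(a + n_ones, b + n_zeros) for _ in range(n_samples)]


x_data = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
samples = gibbs_beta_bernoulli(x_data, 5000, random.Random(0))
posterior_mean = sum(samples) / len(samples)
```

With a Beta(1, 1) prior, two ones, and eight zeros, the conditional is Beta(3, 9), whose mean is 0.25.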



Methods

initialize(scan_order='random', *args, **kwargs)[source]

Parameters:

scan_order : list or str, optional

The scan order for each Gibbs update. If list, it is the deterministic order of latent variables. An element in the list can be a RandomVariable or itself a list of RandomVariables (this defines a blocked Gibbs sampler). If ‘random’, will use a random order at each update.

update(feed_dict=None)[source]

Run one iteration of Gibbs sampling.

Parameters:

feed_dict : dict, optional

Feed dictionary for a TensorFlow session run. It is used to feed placeholders that are not fed during initialization.

Returns:

dict

Dictionary of algorithm-specific information. In this case, the acceptance rate of samples since (and including) this iteration.

build_update()[source]

Notes

The updates assume each Empirical random variable is directly parameterized by tf.Variables.

class edward.inferences.HMC(*args, **kwargs)[source]

Hamiltonian Monte Carlo, also known as hybrid Monte Carlo (Duane et al., 1987; Neal, 2011).

Notes

In conditional inference, we infer $$z$$ in $$p(z, \beta \mid x)$$ while fixing inference over $$\beta$$ using another distribution $$q(\beta)$$. HMC substitutes the model’s log marginal density

$\log p(x, z) = \log \mathbb{E}_{q(\beta)} [ p(x, z, \beta) ] \approx \log p(x, z, \beta^*)$

leveraging a single Monte Carlo sample, where $$\beta^* \sim q(\beta)$$. This is unbiased (and therefore asymptotically exact as a pseudo-marginal method) if $$q(\beta) = p(\beta \mid x)$$.

Examples

z = Normal(loc=0.0, scale=1.0)
x = Normal(loc=tf.ones(10) * z, scale=1.0)

qz = Empirical(tf.Variable(tf.zeros(500)))
data = {x: np.array([0.0] * 10, dtype=np.float32)}
inference = ed.HMC({z: qz}, data)



Methods

initialize(step_size=0.25, n_steps=2, *args, **kwargs)[source]
Parameters:

step_size : float, optional

Step size of numerical integrator.

n_steps : int, optional

Number of steps of numerical integrator.

build_update()[source]

Simulate Hamiltonian dynamics using a numerical integrator. Correct for the integrator’s discretization error using an acceptance ratio.

Notes

The updates assume each Empirical random variable is directly parameterized by tf.Variables.
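
The two ingredients of the update, a numerical integrator and an acceptance step, can be sketched in pure Python for the model in the example above (z ~ Normal(0, 1), ten observations x_i ~ Normal(z, 1)). This is an illustrative sketch that assumes a leapfrog integrator and unit-variance momentum; it is not Edward's TensorFlow implementation, and `hmc_normal_mean` is a hypothetical name.

```python
import math
import random


def hmc_normal_mean(x, n_samples=5000, step_size=0.25, n_steps=2, seed=0):
    """HMC for z ~ Normal(0, 1), x_i ~ Normal(z, 1); returns (samples, accept_rate)."""
    rng = random.Random(seed)
    n, sx = len(x), sum(x)

    def log_joint(z):
        # log p(x, z) up to an additive constant
        return -0.5 * z * z - 0.5 * sum((xi - z) ** 2 for xi in x)

    def grad_log_joint(z):
        return -z + sx - n * z

    z, samples, accepted = 0.0, [], 0
    for _ in range(n_samples):
        r = rng.gauss(0.0, 1.0)  # resample the momentum
        z_new, r_new = z, r
        # Leapfrog: half momentum step, alternating full steps, half momentum step.
        r_new += 0.5 * step_size * grad_log_joint(z_new)
        for step in range(n_steps):
            z_new += step_size * r_new
            if step < n_steps - 1:
                r_new += step_size * grad_log_joint(z_new)
        r_new += 0.5 * step_size * grad_log_joint(z_new)
        # Acceptance ratio corrects for the integrator's discretization error.
        log_ratio = (log_joint(z_new) - 0.5 * r_new ** 2) - (log_joint(z) - 0.5 * r ** 2)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            z, accepted = z_new, accepted + 1
        samples.append(z)
    return samples, accepted / n_samples


samples, accept_rate = hmc_normal_mean([0.0] * 10)
```

For this conjugate model the exact posterior is Normal with mean 0 and variance 1/11, so the sample mean and variance give a quick sanity check on the step_size and n_steps settings.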

class edward.inferences.SGLD(*args, **kwargs)[source]

Stochastic gradient Langevin dynamics (Welling and Teh, 2011).

Notes

In conditional inference, we infer $$z$$ in $$p(z, \beta \mid x)$$ while fixing inference over $$\beta$$ using another distribution $$q(\beta)$$. SGLD substitutes the model’s log marginal density

$\log p(x, z) = \log \mathbb{E}_{q(\beta)} [ p(x, z, \beta) ] \approx \log p(x, z, \beta^*)$

leveraging a single Monte Carlo sample, where $$\beta^* \sim q(\beta)$$. This is unbiased (and therefore asymptotically exact as a pseudo-marginal method) if $$q(\beta) = p(\beta \mid x)$$.

Examples

z = Normal(loc=0.0, scale=1.0)
x = Normal(loc=tf.ones(10) * z, scale=1.0)

qz = Empirical(tf.Variable(tf.zeros(500)))
data = {x: np.array([0.0] * 10, dtype=np.float32)}
inference = ed.SGLD({z: qz}, data)



Methods

initialize(step_size=0.25, *args, **kwargs)[source]
Parameters:

step_size : float, optional

Constant scale factor of learning rate.

build_update()[source]

Simulate Langevin dynamics using a discretized integrator. Its discretization error goes to zero as the learning rate decreases.

Notes

The updates assume each Empirical random variable is directly parameterized by tf.Variables.
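
A minimal pure-Python sketch of a Langevin update with minibatch gradients follows, for a model of the same form as the example (z ~ Normal(0, 1), x_i ~ Normal(z, 1)). The decay schedule step_size / t^0.55, the minibatch size, and the name `sgld_normal_mean` are assumptions made for illustration and are not taken from Edward's source.

```python
import math
import random


def sgld_normal_mean(x, n_samples=5000, step_size=0.25, batch_size=5, seed=0):
    """SGLD for z ~ Normal(0, 1), x_i ~ Normal(z, 1); returns the chain of z values."""
    rng = random.Random(seed)
    n = len(x)
    z, samples = 0.0, []
    for t in range(1, n_samples + 1):
        eps = step_size / t ** 0.55  # decaying learning rate (assumed schedule)
        batch = rng.sample(x, batch_size)
        # Stochastic gradient of log p(x, z): prior term plus rescaled minibatch term.
        grad = -z + (n / batch_size) * sum(xi - z for xi in batch)
        # Langevin step: half a gradient step plus Gaussian noise with variance eps.
        z += 0.5 * eps * grad + rng.gauss(0.0, math.sqrt(eps))
        samples.append(z)
    return samples


x_data = [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
samples = sgld_normal_mean(x_data)
```

There is no accept/reject step here; accuracy relies on the learning rate shrinking over iterations, which is why early samples are noisier than later ones.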

class edward.inferences.SGHMC(*args, **kwargs)[source]

Stochastic gradient Hamiltonian Monte Carlo (Chen et al., 2014).

Notes

In conditional inference, we infer $$z$$ in $$p(z, \beta \mid x)$$ while fixing inference over $$\beta$$ using another distribution $$q(\beta)$$. SGHMC substitutes the model’s log marginal density

$\log p(x, z) = \log \mathbb{E}_{q(\beta)} [ p(x, z, \beta) ] \approx \log p(x, z, \beta^*)$

leveraging a single Monte Carlo sample, where $$\beta^* \sim q(\beta)$$. This is unbiased (and therefore asymptotically exact as a pseudo-marginal method) if $$q(\beta) = p(\beta \mid x)$$.

Examples

z = Normal(loc=0.0, scale=1.0)
x = Normal(loc=tf.ones(10) * z, scale=1.0)

qz = Empirical(tf.Variable(tf.zeros(500)))
data = {x: np.array([0.0] * 10, dtype=np.float32)}
inference = ed.SGHMC({z: qz}, data)



Methods

initialize(step_size=0.25, friction=0.1, *args, **kwargs)[source]
Parameters:

step_size : float, optional

Constant scale factor of learning rate.

friction : float, optional

Constant scale on the friction term in the Hamiltonian system.

build_update()[source]

Simulate Hamiltonian dynamics with friction using a discretized integrator. Its discretization error goes to zero as the learning rate decreases.

Implements the update equations from Eq. (15) of Chen et al. (2014).
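
As a rough illustration of updates of this form, the sketch below applies a position step followed by a velocity step with friction and injected noise, for a model z ~ Normal(0, 1), x_i ~ Normal(z, 1). It assumes the gradient-noise estimate is set to zero, and the values of eta and alpha are chosen for this toy problem; how they relate to Edward's step_size and friction arguments is not specified here. `sghmc_normal_mean` is a hypothetical name.

```python
import math
import random


def sghmc_normal_mean(x, n_samples=20000, eta=0.002, alpha=0.1, batch_size=5, seed=0):
    """SGHMC-style chain for z ~ Normal(0, 1), x_i ~ Normal(z, 1)."""
    rng = random.Random(seed)
    n = len(x)
    z, v, samples = 0.0, 0.0, []
    noise_scale = math.sqrt(2.0 * alpha * eta)  # gradient-noise estimate assumed 0
    for _ in range(n_samples):
        z += v  # position update with the current velocity
        batch = rng.sample(x, batch_size)
        # Stochastic gradient of log p(x, z), i.e. minus the gradient of the potential.
        grad = -z + (n / batch_size) * sum(xi - z for xi in batch)
        # Velocity update: gradient step, friction, and injected Gaussian noise.
        v += eta * grad - alpha * v + rng.gauss(0.0, noise_scale)
        samples.append(z)
    return samples


x_data = [-0.9, -0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7, 0.9]
samples = sghmc_normal_mean(x_data)
```

The friction term damps the velocity so that the noise from minibatch gradients does not accumulate; with these settings the chain's variance stays close to the exact posterior variance of 1/11.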

edward.inferences.conjugacy.complete_conditional(rv, cond_set=None)[source]

Returns the conditional distribution RandomVariable $$p(\text{rv} | \cdot)$$.

This function tries to infer the conditional distribution of rv given cond_set, a set of other RandomVariables in the graph. It will only be able to do this if

1. $$p(\text{rv} | \text{cond_set})$$ is in a tractable exponential family, AND
2. the truth of assumption 1 is not obscured in the TensorFlow graph.

In other words, this function will do its best to recognize conjugate relationships when they exist. But it may not always be able to do the necessary algebra.

Parameters:

rv : RandomVariable

The random variable whose conditional distribution we are interested in.

cond_set : iterable of RandomVariable, optional

The set of random variables to condition on. The default is all random variables in the graph. (It makes no difference whether cond_set includes rv.)

Notes

When calling complete_conditional() multiple times, one should usually pass an explicit cond_set. Otherwise complete_conditional() will try to condition on the RandomVariables returned by previous calls to itself, which may result in unpredictable behavior.
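
To show the kind of algebra a conjugate relationship involves, here is a hand-written sketch for one case: a Normal prior on a mean with a Normal likelihood of known scale. It uses the standard precision-weighted update and is independent of Edward; `normal_normal_complete_conditional` is a hypothetical helper, not part of the library.

```python
def normal_normal_complete_conditional(x, prior_loc=0.0, prior_scale=1.0, lik_scale=1.0):
    """Return (loc, scale) of p(z | x) for z ~ Normal(prior_loc, prior_scale)
    and x_i ~ Normal(z, lik_scale), using the standard conjugate update."""
    prior_precision = 1.0 / prior_scale ** 2
    lik_precision = len(x) / lik_scale ** 2
    post_precision = prior_precision + lik_precision
    # Posterior mean is the precision-weighted combination of prior and data.
    post_loc = (prior_precision * prior_loc + sum(x) / lik_scale ** 2) / post_precision
    return post_loc, post_precision ** -0.5


# Ten observations equal to zero under a standard Normal prior:
# the posterior has mean 0 and variance 1 / 11.
loc, scale = normal_normal_complete_conditional([0.0] * 10)
```

Here the result is again a Normal, which is what makes the conditional tractable; complete_conditional aims to recognize such relationships from the graph rather than requiring them to be coded by hand.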

### References

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer New York.

Doucet, A., De Freitas, N., & Gordon, N. (2001). An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice (pp. 3–14). Springer.

Gelfand, A. E., & Smith, A. F. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410), 398–409.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. In Neural information processing systems.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Neal, R. M. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo.

Robert, C. P., & Casella, G. (1999). Monte Carlo statistical methods. Springer.