Posterior predictive checks

Posterior predictive checks (PPCs) analyze the degree to which data generated from the model deviate from data generated from the true distribution. They can be used numerically, to quantify this deviation, or graphically, to visualize it. PPCs can be thought of as a probabilistic generalization of point-based evaluations (Box, 1980; Gelman, Meng, & Stern, 1996; Meng, 1994; Rubin, 1984).

PPCs focus on the posterior predictive distribution \[\begin{aligned} p(\mathbf{x}_\text{new} \mid \mathbf{x}) &= \int p(\mathbf{x}_\text{new} \mid \mathbf{z}) p(\mathbf{z} \mid \mathbf{x}) \text{d} \mathbf{z}.\end{aligned}\] The model’s posterior predictive can generate new data, or make predictions on new data, given past observations. It is formed by averaging the likelihood of the new data over the latent variables, weighted by the posterior distribution.
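As a rough illustration of the integral above, the sketch below draws from a posterior predictive by first sampling \(\mathbf{z}\) from the posterior and then sampling new data from the likelihood. It does not use Edward. The model is an assumption chosen because its posterior is available in closed form: a normal likelihood with known standard deviation sigma and a normal prior on the mean z. The helper names posterior_params and sample_posterior_predictive are hypothetical.

```python
import random
import statistics


def posterior_params(x, sigma=1.0, tau=10.0):
    """Posterior N(mean, sd^2) of z, assuming x_i ~ N(z, sigma^2), z ~ N(0, tau^2)."""
    precision = 1.0 / tau**2 + len(x) / sigma**2
    mean = (sum(x) / sigma**2) / precision
    return mean, precision ** -0.5


def sample_posterior_predictive(x, n_draws, rng, sigma=1.0, tau=10.0):
    """Draw x_new by sampling z ~ p(z | x), then x_new ~ p(x_new | z)."""
    mean, sd = posterior_params(x, sigma, tau)
    draws = []
    for _ in range(n_draws):
        z = rng.gauss(mean, sd)            # one posterior sample of the latent mean
        draws.append(rng.gauss(z, sigma))  # one new data point given that sample
    return draws


rng = random.Random(0)
x = [rng.gauss(2.0, 1.0) for _ in range(50)]  # synthetic "observed" data
x_new = sample_posterior_predictive(x, 20000, rng)
post_mean, post_sd = posterior_params(x)

# For this model the posterior predictive is N(post_mean, post_sd^2 + sigma^2),
# so the sample moments should be close to those values.
print(statistics.mean(x_new), post_mean)
print(statistics.pvariance(x_new), post_sd**2 + 1.0)
```

Averaging over posterior samples of z, rather than plugging in a single point estimate, is what makes the predictive variance larger than the likelihood variance alone.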

The simplest PPC works by applying a test statistic to new data generated from the posterior predictive, such as \(T(\mathbf{x}_\text{new}) = \max(\mathbf{x}_\text{new})\). Evaluating \(T(\mathbf{x}_\text{new})\) over many data replications induces a distribution, which serves as a reference distribution. We compare this distribution to the test statistic applied to the real data, \(T(\mathbf{x})\).


In the figure, \(T(\mathbf{x})\) falls in a low probability region of this reference distribution. This indicates that the model fits the data poorly according to this check, which suggests an area where the model can be improved.
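The check described above can be sketched in plain Python, without Edward. The example assumes a normal likelihood with known standard deviation and a normal prior on the mean, and it uses synthetic data with one injected outlier so that the maximum is poorly explained by the model. The names posterior_params and ppc_max are hypothetical, and the tail probability computed at the end is one common numerical summary (a posterior predictive p-value, as in Meng, 1994), not the only one.

```python
import random


def posterior_params(x, sigma=1.0, tau=10.0):
    """Posterior N(mean, sd^2) of z, assuming x_i ~ N(z, sigma^2), z ~ N(0, tau^2)."""
    precision = 1.0 / tau**2 + len(x) / sigma**2
    mean = (sum(x) / sigma**2) / precision
    return mean, precision ** -0.5


def ppc_max(x, n_rep, rng, sigma=1.0, tau=10.0):
    """Return T(x) and the reference distribution of T(x_rep) for T = max."""
    mean, sd = posterior_params(x, sigma, tau)
    t_rep = []
    for _ in range(n_rep):
        z = rng.gauss(mean, sd)                   # z ~ p(z | x)
        x_rep = [rng.gauss(z, sigma) for _ in x]  # replicated data set, same size as x
        t_rep.append(max(x_rep))
    return max(x), t_rep


rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(50)] + [8.0]  # one injected outlier
t_obs, t_rep = ppc_max(x, 1000, rng)

# Fraction of replications whose maximum is at least as large as the observed one.
p_value = sum(t >= t_obs for t in t_rep) / len(t_rep)
print(t_obs, p_value)
```

A tail probability near 0 or 1 means the observed statistic sits in a low probability region of the reference distribution. A histogram of t_rep with a vertical line at t_obs gives the graphical version of the same check.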

More generally, the test statistic can also be a function of the model’s latent variables \(T(\mathbf{x}, \mathbf{z})\), known as a discrepancy function. Examples of discrepancy functions are the metrics used for point-based evaluation. We can now interpret point-based evaluation as a special case of PPCs: it simply evaluates \(T(\mathbf{x}, \mathbf{z})\) on the real data, without a reference distribution. A reference distribution lets us make probabilistic statements about that point value, in reference to an overall distribution.
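To make the role of the latent variables concrete, the sketch below evaluates a discrepancy on both the real data and a replicated data set for each posterior sample of \(\mathbf{z}\), in the spirit of the realized discrepancies of Gelman, Meng, & Stern (1996). The model (normal likelihood with known standard deviation, normal prior on the mean), the chi-square style discrepancy, and the function names are all assumptions made for illustration. The synthetic data are deliberately more spread out than the model allows.

```python
import random


def posterior_params(x, sigma=1.0, tau=10.0):
    """Posterior N(mean, sd^2) of z, assuming x_i ~ N(z, sigma^2), z ~ N(0, tau^2)."""
    precision = 1.0 / tau**2 + len(x) / sigma**2
    mean = (sum(x) / sigma**2) / precision
    return mean, precision ** -0.5


def discrepancy(x, z, sigma=1.0):
    """Chi-square style discrepancy T(x, z): scaled squared distance of x from z."""
    return sum((xi - z) ** 2 for xi in x) / sigma**2


def realized_discrepancies(x, n_rep, rng, sigma=1.0, tau=10.0):
    """Pairs (T(x, z), T(x_rep, z)), one pair per posterior sample z."""
    mean, sd = posterior_params(x, sigma, tau)
    pairs = []
    for _ in range(n_rep):
        z = rng.gauss(mean, sd)                   # z ~ p(z | x)
        x_rep = [rng.gauss(z, sigma) for _ in x]  # x_rep ~ p(x | z)
        pairs.append((discrepancy(x, z, sigma), discrepancy(x_rep, z, sigma)))
    return pairs


rng = random.Random(2)
x = [rng.gauss(0.0, 3.0) for _ in range(50)]  # wider than the model's sigma = 1
pairs = realized_discrepancies(x, 1000, rng)

# Fraction of posterior samples where the replicated discrepancy is at least
# as large as the realized one.
p_value = sum(t_rep >= t_obs for t_obs, t_rep in pairs) / len(pairs)
print(p_value)
```

Unlike a plain test statistic, the observed value \(T(\mathbf{x}, \mathbf{z})\) here changes with each posterior sample, so the comparison is made pair by pair rather than against a single fixed number.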


To evaluate inferred models, we first form the posterior predictive distribution. A helpful utility function for this is ed.copy. For example, assume the model defines a likelihood x connected to a prior z. The posterior predictive distribution is

x_post = ed.copy(x, {z: qz})

Here, we copy the likelihood node x in the graph and replace dependence on the prior z with dependence on the inferred posterior qz.

The ed.ppc() function provides a scaffold for studying various discrepancy functions. For example, with the mean of the data as the test statistic:

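# Test statistic: the mean of the data bound to x_post.
def T(xs, zs):
  return tf.reduce_mean(xs[x_post])

# Compare T on replicated data against T on the observed x_train.
ed.ppc(T, data={x_post: x_train})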

The discrepancy can also take latent variables as input, which we pass into the PPC.

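# Discrepancy: the mean of the latent variable z, cast to float.
def T(xs, zs):
  return tf.reduce_mean(tf.cast(zs['z'], tf.float32))

# y_post, x_ph, and qbeta are assumed to be defined by the model and inference.
ed.ppc(T, data={y_post: y_train, x_ph: x_train},
       latent_vars={'z': qz, 'beta': qbeta})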

See the criticism API for further details.

PPCs are an excellent tool for revising models, simplifying or expanding the current model as one examines how well it fits the data. They are inspired by prior checks and classical hypothesis testing, under the philosophy that models should be criticized under the frequentist perspective of large sample assessment.

PPCs can also be applied to tasks such as hypothesis testing, model comparison, model selection, and model averaging. While they can be applied as a form of Bayesian hypothesis testing, hypothesis testing is generally not recommended: binary decision making from a single test is not as common a use case as one might believe. We recommend performing many PPCs to get a holistic understanding of the model fit.
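One way to follow this recommendation is to run the same check with several test statistics and read the results together. The sketch below is a minimal illustration in plain Python, not Edward's API. It assumes a normal likelihood with known standard deviation and a normal prior on the mean, uses synthetic data that are more dispersed than the model allows, and the helper names are hypothetical.

```python
import random
import statistics


def posterior_params(x, sigma=1.0, tau=10.0):
    """Posterior N(mean, sd^2) of z, assuming x_i ~ N(z, sigma^2), z ~ N(0, tau^2)."""
    precision = 1.0 / tau**2 + len(x) / sigma**2
    mean = (sum(x) / sigma**2) / precision
    return mean, precision ** -0.5


def ppc_p_value(x, stat, n_rep, rng, sigma=1.0, tau=10.0):
    """Estimate P(stat(x_rep) >= stat(x)) under the posterior predictive."""
    mean, sd = posterior_params(x, sigma, tau)
    t_obs = stat(x)
    exceed = 0
    for _ in range(n_rep):
        z = rng.gauss(mean, sd)                   # z ~ p(z | x)
        x_rep = [rng.gauss(z, sigma) for _ in x]  # replicated data set
        exceed += stat(x_rep) >= t_obs
    return exceed / n_rep


# Several test statistics, each probing a different aspect of the fit.
checks = {"min": min, "max": max, "mean": statistics.mean, "stdev": statistics.stdev}

rng = random.Random(3)
x = [rng.gauss(0.0, 3.0) for _ in range(50)]  # wider than the model's sigma = 1
results = {name: ppc_p_value(x, stat, 1000, rng) for name, stat in checks.items()}

for name, p in results.items():
    print(f"{name:>5}: {p:.3f}")
```

In this setup the mean is reproduced well while the standard deviation is not, which points at the fixed variance as the part of the model to revise. A single check on the mean alone would have missed that.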


Box, G. E. P. (1980). Sampling and Bayes’ inference in scientific modelling and robustness. Journal of the Royal Statistical Society. Series A (General), 143(4), 383–430.

Gelman, A., Meng, X.-L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4), 733–760.

Meng, X.-L. (1994). Posterior predictive p-values. The Annals of Statistics, 22(3), 1142–1160.

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4), 1151–1172.