Point-based evaluations

A point-based evaluation is a scalar-valued metric for assessing trained models (Gneiting & Raftery, 2007). For example, we can assess a classification model by predicting the label for each observation in the data and comparing it to the true label. Edward implements a variety of metrics, such as classification error and mean absolute error.
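
As a minimal sketch of what such metrics compute, the plain-Python functions below take point predictions and true values and return a single scalar. The function names are illustrative assumptions and this is not Edward's implementation.

```python
def classification_error(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mistakes = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mistakes / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between predictions and true values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


labels_true = [0, 1, 2, 1]
labels_pred = [0, 2, 2, 1]
error = classification_error(labels_true, labels_pred)  # one mistake out of four

values_true = [1.0, 2.0, 3.0]
values_pred = [1.5, 2.0, 2.0]
mae = mean_absolute_error(values_true, values_pred)  # (0.5 + 0.0 + 1.0) / 3
```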

Formally, point prediction in probabilistic models is given by taking the mean of the posterior predictive distribution, \[\begin{aligned} p(\mathbf{x}_\text{new} \mid \mathbf{x}) &= \int p(\mathbf{x}_\text{new} \mid \mathbf{z}) p(\mathbf{z} \mid \mathbf{x}) \text{d} \mathbf{z}.\end{aligned}\] The posterior predictive distribution can be used both to generate new data and to make predictions on new data, in each case conditioned on past observations. It is formed by calculating the likelihood of the new data, averaged over every setting of the latent variables according to the posterior distribution.
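
The integral above rarely has a closed form, but it can be approximated by sampling: draw z from the posterior, draw x_new from the likelihood given that z, and average the draws. The sketch below does this in plain Python for an assumed toy model (a Normal posterior over z and a Normal likelihood centred at z). The model and all names are illustrative assumptions, not part of Edward.

```python
import random


def posterior_predictive_mean(sample_posterior, sample_likelihood, n_samples, rng):
    """Monte Carlo estimate of the posterior predictive mean E[x_new | x].

    Each iteration samples z ~ p(z | x) and then x_new ~ p(x_new | z),
    which together give one draw from the posterior predictive.
    """
    total = 0.0
    for _ in range(n_samples):
        z = sample_posterior(rng)
        total += sample_likelihood(z, rng)
    return total / n_samples


# Assumed toy model: p(z | x) = Normal(2.0, 0.5^2), p(x_new | z) = Normal(z, 1.0^2).
rng = random.Random(0)
estimate = posterior_predictive_mean(
    sample_posterior=lambda r: r.gauss(2.0, 0.5),
    sample_likelihood=lambda z, r: r.gauss(z, 1.0),
    n_samples=20000,
    rng=rng,
)
# For this model the exact posterior predictive mean is 2.0, so the
# estimate should land close to that value.
```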


To evaluate inferred models, we first form the posterior predictive distribution. A helpful utility function for this is copy. For example, assume the model defines a likelihood x connected to a prior z. The posterior predictive distribution is

x_post = ed.copy(x, {z: qz})

Here, we copy the likelihood node x in the graph and replace dependence on the prior z with dependence on the inferred posterior qz.

The ed.evaluate() method takes as input a set of metrics to evaluate and a data dictionary. As with inference, the data dictionary binds the observed random variables in the model to realizations: in this case, it binds the posterior predictive random variable of outputs y_post to y_train, and a placeholder for inputs x to x_train.

ed.evaluate('categorical_accuracy', data={y_post: y_train, x: x_train})
ed.evaluate('mean_absolute_error', data={y_post: y_train, x: x_train})

The data can also be held out from training, which makes it easy to implement cross-validation techniques.
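
One way to organise such held-out data is k-fold cross-validation. The helper below is a hypothetical, Edward-independent sketch that only produces the train/test index splits; fitting the model on each training split and calling ed.evaluate on the corresponding held-out split is left to the caller.

```python
def k_fold_indices(n_data, n_folds):
    """Yield (train_indices, test_indices) pairs whose test sets partition range(n_data)."""
    if not 1 < n_folds <= n_data:
        raise ValueError("need 1 < n_folds <= n_data")
    indices = list(range(n_data))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_data, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop


folds = list(k_fold_indices(n_data=10, n_folds=3))
```

The splits here are contiguous; in practice the data would usually be shuffled once before splitting.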

Point-based evaluation applies generally to any setting, including unsupervised tasks. For example, we can evaluate the likelihood of observing the data.

ed.evaluate('log_likelihood', data={x_post: x_train})
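
As a rough illustration of what a log-likelihood score measures (not a description of Edward's internals), the sketch below averages the log density of each observation under an assumed Normal predictive distribution. The function names and the toy data are illustrative assumptions.

```python
import math


def normal_log_density(x, mean, stddev):
    """Log density of Normal(mean, stddev^2) evaluated at x."""
    return (-0.5 * math.log(2.0 * math.pi * stddev ** 2)
            - (x - mean) ** 2 / (2.0 * stddev ** 2))


def average_log_likelihood(data, mean, stddev):
    """Mean log density of the observations; higher means a better fit."""
    return sum(normal_log_density(x, mean, stddev) for x in data) / len(data)


observations = [0.1, -0.3, 0.2, 0.0]
# A predictive distribution centred near the data scores higher than
# one centred far away from it.
well_matched = average_log_likelihood(observations, mean=0.0, stddev=1.0)
poorly_matched = average_log_likelihood(observations, mean=3.0, stddev=1.0)
```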

It is common practice to criticize models with data held out from training. To do this, we first perform inference over any local latent variables of the held-out data, fixing the global variables. Then we make predictions on the held-out data.

import edward as ed
import tensorflow as tf
from edward.models import Categorical

# create local posterior factors for test data, assuming test data
# has N_test many data points
qz_test = Categorical(logits=tf.Variable(tf.zeros([N_test, K])))

# run local inference conditional on global factors
# (ed.Inference is the base class; in practice substitute a concrete
# algorithm such as ed.KLqp)
inference_test = ed.Inference({z: qz_test}, data={x: x_test, beta: qbeta})
inference_test.run()

# build posterior predictive on test data
x_post = ed.copy(x, {z: qz_test, beta: qbeta})
ed.evaluate('log_likelihood', data={x_post: x_test})

Point-based evaluations are formally known as scoring rules in decision theory. Scoring rules are useful for model comparison, model selection, and model averaging.
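
To make the connection concrete, the sketch below computes two standard scoring rules discussed by Gneiting & Raftery (2007), the logarithmic score and the Brier score, for probabilistic binary forecasts, and uses them to compare two hypothetical models. It is written in plain Python with illustrative names and made-up forecasts, and is independent of Edward.

```python
import math


def mean_log_score(probs, outcomes):
    """Average log probability assigned to the observed outcomes (higher is better)."""
    return sum(math.log(p if y == 1 else 1.0 - p)
               for p, y in zip(probs, outcomes)) / len(outcomes)


def mean_brier_score(probs, outcomes):
    """Average squared error of the predicted probability (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


outcomes = [1, 0, 1, 1]
model_a = [0.9, 0.2, 0.8, 0.7]  # hypothetical model with sharper forecasts
model_b = [0.6, 0.5, 0.5, 0.5]  # hypothetical model with vaguer forecasts

a_log, b_log = mean_log_score(model_a, outcomes), mean_log_score(model_b, outcomes)
a_brier, b_brier = mean_brier_score(model_a, outcomes), mean_brier_score(model_b, outcomes)

# Model selection: prefer the model with the better score under the chosen rule.
preferred_by_log = "a" if a_log > b_log else "b"
preferred_by_brier = "a" if a_brier < b_brier else "b"
```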

See the criticism API for further details. An example of point-based evaluation is in the supervised learning (regression) tutorial.


Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378.