• ed.criticisms.evaluate
  • ed.evaluate

Defined in edward/criticisms/evaluate.py.

Evaluate a fitted model using a set of metrics.

A metric, or scoring rule (Winkler, 1994), is a function of observed data under the posterior predictive distribution. For example, in supervised metrics such as classification accuracy, the observed data (true output) is compared to the posterior predictive’s mean (predicted output). In unsupervised metrics such as log-likelihood, the probability of observing the data is calculated under the posterior predictive’s log-density.
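To make the two kinds of metric concrete, here is a minimal sketch in plain Python that does not use Edward. It assumes the posterior predictive draws are already available as nested lists, and that a single posterior sample fixes the Normal parameters `loc` and `scale`. The helpers `binary_accuracy_from_samples` and `normal_log_lik` are hypothetical names for illustration, not part of Edward's API.

```python
import math


def binary_accuracy_from_samples(y_true, y_samples):
    """Supervised metric: compare true labels to the posterior predictive mean.

    y_samples[s][i] is the s-th posterior predictive draw for data point i.
    The mean over draws is thresholded at 0.5 to get the predicted label.
    """
    n_samples = len(y_samples)
    correct = 0
    for i, y in enumerate(y_true):
        mean_i = sum(draw[i] for draw in y_samples) / n_samples
        y_pred = 1 if mean_i >= 0.5 else 0
        correct += int(y_pred == y)
    return correct / len(y_true)


def normal_log_lik(x_obs, loc, scale):
    """Unsupervised metric: average log-density of x_obs under Normal(loc, scale)."""
    total = 0.0
    for x in x_obs:
        z = (x - loc) / scale
        total += -0.5 * z * z - math.log(scale) - 0.5 * math.log(2 * math.pi)
    return total / len(x_obs)


# Three posterior predictive draws for four data points (made-up values).
y_true = [1, 0, 1, 1]
y_samples = [
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
]
accuracy = binary_accuracy_from_samples(y_true, y_samples)

# Log-likelihood of three observations under a standard normal.
log_lik = normal_log_lik([0.0, 1.0, -1.0], loc=0.0, scale=1.0)
```

In this sketch the accuracy depends only on the mean of the draws, while the log-likelihood depends on the full predictive density. That is the distinction the paragraph above draws between supervised and unsupervised metrics.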


  • metrics: list of str and/or (str, params: dict) tuples, str, or (str, params: dict) tuple. List of metrics or a single metric: 'binary_accuracy', 'categorical_accuracy', 'sparse_categorical_accuracy', 'log_loss' or 'binary_crossentropy', 'categorical_crossentropy', 'sparse_categorical_crossentropy', 'hinge', 'squared_hinge', 'mse' or 'MSE' or 'mean_squared_error', 'mae' or 'MAE' or 'mean_absolute_error', 'mape' or 'MAPE' or 'mean_absolute_percentage_error', 'msle' or 'MSLE' or 'mean_squared_logarithmic_error', 'poisson', 'cosine' or 'cosine_proximity', 'log_lik' or 'log_likelihood'. In lieu of a metric string, this method also accepts (str, params: dict) tuples; the first element of this tuple is the metric string, and the second is a dict of associated params. At present, this dict only expects one key, 'average', which stipulates the type of averaging to perform on those metrics that permit binary averaging. Permissible options include: None, 'macro' and 'micro'.
  • data: dict. Data to evaluate model with. It binds observed variables (of type RandomVariable or tf.Tensor) to their realizations (of type tf.Tensor). It can also bind placeholders (of type tf.Tensor) used in the model to their realizations.
  • n_samples: int. Number of posterior samples for making predictions, using the posterior predictive distribution.
  • output_key: RandomVariable or tf.Tensor. It is the key in data which corresponds to the model’s output.
  • seed: a Python integer. Used to create a random seed for the distribution.
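The `metrics` argument accepts several shapes: a bare string, a `(str, params)` tuple, or a list mixing both. One way to picture this is as a normalization step that turns every form into a list of `(name, params)` pairs. The sketch below is plain Python; `normalize_metrics` is a hypothetical helper, and nothing here is taken from Edward's actual implementation.

```python
def normalize_metrics(metrics):
    """Return a list of (name, params) pairs from any accepted input form."""
    # A single metric (a str or a (str, dict) tuple) becomes a one-element list.
    if isinstance(metrics, (str, tuple)):
        metrics = [metrics]
    normalized = []
    for metric in metrics:
        if isinstance(metric, tuple):
            name, params = metric
        else:
            # A bare string carries no extra parameters.
            name, params = metric, {}
        normalized.append((name, dict(params)))
    return normalized


single = normalize_metrics('mse')
mixed = normalize_metrics([
    'binary_accuracy',
    ('mean_squared_logarithmic_error', {'average': 'micro'}),
])
```

After normalization, `single` holds one pair with an empty params dict, and `mixed` holds two pairs, the second carrying `{'average': 'micro'}`.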


Returns: list of float or float. A list of evaluations or a single evaluation.


Raises: NotImplementedError. If an input metric does not match an implemented metric in Edward.
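The sketch below illustrates the kind of name lookup that leads to a `NotImplementedError`. It uses a small hypothetical registry in plain Python; the registry `METRICS`, the helper `lookup_metric`, and the error message text are assumptions for illustration, not Edward's internals. Several aliases map to the same function, mirroring how 'mse', 'MSE' and 'mean_squared_error' name one metric.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between predictions and true values."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Mean of absolute differences between predictions and true values."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical registry: each alias points at the same underlying function.
METRICS = {
    'mse': mean_squared_error,
    'MSE': mean_squared_error,
    'mean_squared_error': mean_squared_error,
    'mae': mean_absolute_error,
    'MAE': mean_absolute_error,
    'mean_absolute_error': mean_absolute_error,
}


def lookup_metric(name):
    """Return the metric function for `name`, or raise NotImplementedError."""
    try:
        return METRICS[name]
    except KeyError:
        raise NotImplementedError("Metric is not implemented: {}".format(name))


mse = lookup_metric('MSE')([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# An unknown metric name is rejected rather than silently ignored.
try:
    lookup_metric('f1_score')
    error_message = None
except NotImplementedError as e:
    error_message = str(e)
```

Callers that build metric names dynamically can wrap the call in a `try`/`except NotImplementedError` block, as shown in the last lines of the sketch.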


# build posterior predictive after inference: it is
# parameterized by a posterior sample
x_post = ed.copy(x, {z: qz, beta: qbeta})

# log-likelihood performance
ed.evaluate('log_likelihood', data={x_post: x_train})

# classification accuracy
# here, `x_ph` is any features the model is defined with respect to,
# and `y_post` is the posterior predictive distribution
ed.evaluate('binary_accuracy', data={y_post: y_train, x_ph: x_train})

# mean squared error
ed.evaluate('mean_squared_error', data={y: y_data, x: x_data})

# mean squared logarithmic error with 'micro' averaging
ed.evaluate(('mean_squared_logarithmic_error', {'average': 'micro'}),
            data={y: y_data, x: x_data})

Winkler, R. L. (1994). Evaluating probabilities: Asymmetric scoring rules. Management Science, 40(11), 1395–1405.