API and Documentation

Criticism

We can never validate whether a model is true; in practice, “all models are wrong” (Box, 1976). However, we can try to uncover where the model goes wrong. Model criticism helps justify the model as an approximation or point toward directions for revising it. For background, see the criticism tutorial.

Edward explores model criticism using

  • point evaluations, such as mean squared error or classification accuracy;
  • posterior predictive checks, for making probabilistic assessments of the model fit using discrepancy functions (a minimal end-to-end sketch follows below).
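
For orientation, here is a minimal end-to-end sketch showing where ed.evaluate and ed.ppc fit into a workflow. It assumes Edward 1.x with TensorFlow 1.x, a toy Bayesian logistic regression, and hypothetical arrays x_train (float features) and y_train (binary labels); the model, variable names, and choice of ed.KLqp inference are illustrative, not required by the functions documented below.

import edward as ed
import tensorflow as tf
from edward.models import Bernoulli, Normal

# hypothetical data: x_train is (N, D) float32, y_train is (N,) in {0, 1}
N, D = x_train.shape

# model: Bayesian logistic regression
X = tf.placeholder(tf.float32, [N, D])
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))
b = Normal(loc=tf.zeros([]), scale=tf.ones([]))
y = Bernoulli(logits=ed.dot(X, w) + b)

# variational approximation and inference
qw = Normal(loc=tf.Variable(tf.zeros(D)),
            scale=tf.nn.softplus(tf.Variable(tf.zeros(D))))
qb = Normal(loc=tf.Variable(0.0),
            scale=tf.nn.softplus(tf.Variable(0.0)))
inference = ed.KLqp({w: qw, b: qb}, data={X: x_train, y: y_train})
inference.run(n_iter=500)

# posterior predictive: swap priors for their inferred posteriors
y_post = ed.copy(y, {w: qw, b: qb})

# point evaluation
print(ed.evaluate('binary_accuracy', data={y_post: y_train, X: x_train}))

# posterior predictive check with a mean discrepancy
T = lambda xs, zs: tf.reduce_mean(tf.cast(xs[y_post], tf.float32))
print(ed.ppc(T, data={y_post: y_train, X: x_train}))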

edward.criticisms.evaluate(metrics, data, n_samples=500, output_key=None)

Evaluate fitted model using a set of metrics.

A metric, or scoring rule (Winkler, 1994), is a function of observed data under the posterior predictive distribution. For example, in supervised metrics such as classification accuracy, the observed data (true output) is compared to the posterior predictive’s mean (predicted output). In unsupervised metrics such as log-likelihood, the log-density of the observed data is computed under the posterior predictive.

Parameters:

metrics : list of str or str

List of metrics or a single metric: 'binary_accuracy', 'categorical_accuracy', 'sparse_categorical_accuracy', 'log_loss' or 'binary_crossentropy', 'categorical_crossentropy', 'sparse_categorical_crossentropy', 'hinge', 'squared_hinge', 'mse' or 'MSE' or 'mean_squared_error', 'mae' or 'MAE' or 'mean_absolute_error', 'mape' or 'MAPE' or 'mean_absolute_percentage_error', 'msle' or 'MSLE' or 'mean_squared_logarithmic_error', 'poisson', 'cosine' or 'cosine_proximity', 'log_lik' or 'log_likelihood'.

data : dict

Data to evaluate model with. It binds observed variables (of type RandomVariable or tf.Tensor) to their realizations (of type tf.Tensor). It can also bind placeholders (of type tf.Tensor) used in the model to their realizations.

n_samples : int, optional

Number of posterior samples for making predictions from the posterior predictive distribution.

output_key : RandomVariable, optional

The key in data that corresponds to the model’s output.

Returns:

list of float or float

A list of evaluations or a single evaluation.

Raises:

NotImplementedError

If an input metric does not match an implemented metric in Edward.

Examples

# build posterior predictive after inference: it is
# parameterized by a posterior sample
x_post = ed.copy(x, {z: qz, beta: qbeta})

# log-likelihood performance
ed.evaluate('log_likelihood', data={x_post: x_train})

# classification accuracy
# here, ``x_ph`` is any features the model is defined with respect to,
# and ``y_post`` is the posterior predictive distribution
ed.evaluate('binary_accuracy', data={y_post: y_train, x_ph: x_train})

# mean squared error
ed.evaluate('mean_squared_error', data={y: y_data, x: x_data})
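
Because metrics accepts a list, several evaluations can be requested in a single call, and the output variable can be named explicitly. The snippet below is a hypothetical usage sketch reusing y_post, x_ph, y_train, and x_train from the examples above; output_key is optional here, but it resolves ambiguity when data binds more than one random variable.

# several metrics at once; returns one evaluation per requested metric
ed.evaluate(['binary_accuracy', 'log_loss'],
            data={y_post: y_train, x_ph: x_train},
            output_key=y_post)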

edward.criticisms.ppc(T, data, latent_vars=None, n_samples=100)

Posterior predictive check (Rubin, 1984; Meng, 1994; Gelman, Meng, and Stern, 1996).

PPCs form an empirical distribution for the predictive discrepancy,

\[p(T\mid x) = \int p(T(x^{\text{rep}})\mid z) p(z\mid x) dz\]

by drawing replicated data sets \(x^{\text{rep}}\) and calculating \(T(x^{\text{rep}})\) for each replication. This reference distribution is then compared to the realized discrepancy \(T(x)\).

If data is passed in with the prior predictive distribution rather than the posterior predictive, this is a prior predictive check (Box, 1980).

Parameters:

T : function

Discrepancy function, which takes a dictionary of data and dictionary of latent variables as input and outputs a tf.Tensor.

data : dict

Data to compare to. It binds observed variables (of type RandomVariable or tf.Tensor) to their realizations (of type tf.Tensor). It can also bind placeholders (of type tf.Tensor) used in the model to their realizations.

latent_vars : dict, optional

Collection of random variables (of type RandomVariable or tf.Tensor) bound to their inferred posteriors. This argument is used when the discrepancy is a function of latent variables.

n_samples : int, optional

Number of replicated data sets.

Returns:

list of np.ndarray

List containing the reference distribution, which is a NumPy array with n_samples elements,

\[(T(x^{\text{rep},1}, z^{1}), \dots, T(x^{\text{rep},n_{\text{samples}}}, z^{n_{\text{samples}}}))\]

and the realized discrepancy, which is a NumPy array with n_samples elements,

\[(T(x, z^{1}), \dots, T(x, z^{n_{\text{samples}}})).\]

Examples

# build posterior predictive after inference:
# it is parameterized by a posterior sample
x_post = ed.copy(x, {z: qz, beta: qbeta})

# posterior predictive check
# T is a user-defined function of data, T(data)
T = lambda xs, zs: tf.reduce_mean(xs[x_post])
ed.ppc(T, data={x_post: x_train})

# in general T is a discrepancy function of the data (both response and
# covariates) and latent variables, T(data, latent_vars)
T = lambda xs, zs: tf.reduce_mean(zs[z])
ed.ppc(T, data={y_post: y_train, x_ph: x_train},
       latent_vars={z: qz, beta: qbeta})

# prior predictive check
# run ppc on the original model variable x; with no latent_vars bound,
# T must be a function of the data alone
T = lambda xs, zs: tf.reduce_mean(xs[x])
ed.ppc(T, data={x: x_train})
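
The pair of arrays returned by ed.ppc can be summarized directly, for instance as a posterior predictive p-value. The sketch below is a hypothetical follow-up to the second example above, assuming NumPy is imported and the same T, data, and latent-variable bindings.

import numpy as np

# reference distribution T(x_rep, z) and realized discrepancies T(x, z)
T_rep, T_obs = ed.ppc(T, data={y_post: y_train, x_ph: x_train},
                      latent_vars={z: qz, beta: qbeta})

# posterior predictive p-value: fraction of replications whose discrepancy
# exceeds the realized one; values near 0 or 1 suggest model misfit
p_value = np.mean(T_rep > T_obs)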

References

Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799.