Supervised Learning (Classification)

In supervised learning, the task is to infer hidden structure from labeled data, consisting of training examples \(\{(x_n, y_n)\}\). Classification means the output \(y\) takes discrete values.

We demonstrate with an example in Edward. An interactive version is available as a Jupyter notebook.

Data

We use the crabs data set, which consists of morphological measurements on a crab species. We are interested in predicting whether a given crab has the color form blue (encoded as 0) or orange (encoded as 1). We use all five numeric morphological features in the data set.

from observations import crabs

data, metadata = crabs("~/data")
X_train = data[:100, 3:]  # morphological measurements as features
y_train = data[:100, 1]   # labels: color form, blue (0) or orange (1)

N = X_train.shape[0]  # number of data points
D = X_train.shape[1]  # number of features

print("Number of data points: {}".format(N))
print("Number of features: {}".format(D))

Number of data points: 100
Number of features: 5

Model

A Gaussian process is a powerful object for modeling nonlinear relationships between pairs of random variables. It defines a distribution over (possibly nonlinear) functions, which can be used to represent our uncertainty about the true functional relationship. Here we define a Gaussian process model for classification (Rasmussen & Williams, 2006).

Formally, a distribution over functions \(f:\mathbb{R}^D\to\mathbb{R}\) can be specified by a Gaussian process \[\begin{aligned} p(f) &= \mathcal{GP}(f\mid \mathbf{0}, k(\mathbf{x}, \mathbf{x}^\prime)),\end{aligned}\] whose mean function is the zero function, and whose covariance function is some kernel which describes dependence between any set of inputs to the function.

Given a set of input-output pairs \(\{\mathbf{x}_n\in\mathbb{R}^D,y_n\in\mathbb{R}\}\), the likelihood can be written as a multivariate normal \[\begin{aligned} p(\mathbf{y}) &= \text{Normal}(\mathbf{y} \mid \mathbf{0}, \mathbf{K})\end{aligned}\] where \(\mathbf{K}\) is a covariance matrix given by evaluating \(k(\mathbf{x}_n, \mathbf{x}_m)\) for each pair of inputs in the data set.
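
To make \(\mathbf{K}\) concrete, here is a minimal NumPy sketch that builds the covariance matrix from a squared exponential (RBF) kernel, the same kernel family used in the Edward code below; rbf_kernel, X_toy, and the lengthscale and variance defaults are illustrative assumptions, not part of the tutorial's code.

import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    # Squared Euclidean distances between all pairs of rows of X.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X.dot(X.T)
    # Squared exponential kernel: k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2)).
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

X_toy = np.random.randn(5, 3)  # five hypothetical inputs with three features each
K_toy = rbf_kernel(X_toy)      # 5 x 5 covariance matrix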

The above applies directly to regression, where \(\mathbf{y}\) is a real-valued response, but not to (binary) classification, where \(\mathbf{y}\) is a label in \(\{0,1\}\). To deal with classification, we interpret the response as latent variables which are squashed into \([0,1]\). We then draw from a Bernoulli to determine the label, with probability given by the squashed value.

Define the likelihood of an observation \((\mathbf{x}_n, y_n)\) as \[\begin{aligned} p(y_n \mid \mathbf{z}, \mathbf{x}_n) &= \text{Bernoulli}(y_n \mid \text{logit}^{-1}(z_n)).\end{aligned}\]

Define the prior to be a multivariate normal \[\begin{aligned} p(\mathbf{z}) &= \text{Normal}(\mathbf{z} \mid \mathbf{0}, \mathbf{K}),\end{aligned}\] with covariance matrix given as previously stated.
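
Putting the prior and likelihood together, here is a minimal NumPy sketch of the generative process, continuing from the toy covariance matrix K_toy above (again illustrative, not part of the Edward model): draw latent values from the multivariate normal prior, squash them through the inverse logit, and draw Bernoulli labels.

z_toy = np.random.multivariate_normal(mean=np.zeros(K_toy.shape[0]), cov=K_toy)  # latent function values
p_toy = 1.0 / (1.0 + np.exp(-z_toy))   # inverse logit squashes each z_n into [0, 1]
y_toy = np.random.binomial(n=1, p=p_toy)  # Bernoulli draws give the class labels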

Let’s build the model in Edward. We use a radial basis function (RBF) kernel, also known as the squared exponential or exponentiated quadratic. It returns the kernel matrix evaluated over all pairs of data points; we then Cholesky decompose the matrix to parameterize the multivariate normal distribution.

import edward as ed
import tensorflow as tf

from edward.models import Bernoulli, MultivariateNormalTriL
from edward.util import rbf

X = tf.placeholder(tf.float32, [N, D])  # features, fed in at inference time
f = MultivariateNormalTriL(loc=tf.zeros(N), scale_tril=tf.cholesky(rbf(X)))  # GP prior over latent function values
y = Bernoulli(logits=f)  # observation model

Here, we define a placeholder X. During inference, we feed it the observed training data.
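
As a quick sanity check, which is not part of the tutorial's workflow, one can draw labels from the prior by feeding the placeholder directly; in Edward, fetching a random variable in a session returns a sample of it.

sess = ed.get_session()
# One joint draw of N binary labels from the prior, conditioned on the training inputs.
prior_labels = sess.run(y, feed_dict={X: X_train})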

Inference

Perform variational inference. Define the variational model to be a fully factorized normal.
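
Written out in the notation above, this variational family is \[\begin{aligned} q(\mathbf{z}) &= \prod_{n=1}^N \text{Normal}(z_n \mid \mu_n, \sigma_n),\end{aligned}\] where the \(\mu_n\) are the loc variables below and the \(\sigma_n\) are the softplus-transformed scale variables, with the softplus keeping each standard deviation positive.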

from edward.models import Normal

qf = Normal(loc=tf.get_variable("qf/loc", [N]),
            scale=tf.nn.softplus(tf.get_variable("qf/scale", [N])))

Run variational inference for 500 iterations.

inference = ed.KLqp({f: qf}, data={X: X_train, y: y_train})
inference.run(n_iter=500)

In this case KLqp defaults to minimizing the \(\text{KL}(q\|p)\) divergence measure using the reparameterization gradient. For more details on inference, see the \(\text{KL}(q\|p)\) tutorial. (This example is slow because evaluating and inverting full covariance matrices in Gaussian processes is expensive.)
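
As a follow-up not covered in this tutorial, one way to check the fit is to form the posterior predictive by replacing the prior latent function with its approximation and evaluating training accuracy with Edward's built-in metrics; the sketch below assumes the f, qf, X, y, X_train, and y_train defined above.

y_post = ed.copy(y, {f: qf})  # posterior predictive: swap the prior f for the inferred qf
print(ed.evaluate('binary_accuracy', data={X: X_train, y_post: y_train}))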

References

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. MIT Press.