## Supervised Learning (Classification)

In supervised learning, the task is to infer hidden structure from labeled data consisting of training examples $$\{(x_n, y_n)\}$$. Classification means the output $$y$$ takes discrete values.

We demonstrate with an example in Edward. An interactive version is available as a Jupyter notebook.

### Data

We use the crabs data set, which consists of morphological measurements on a crab species. We are interested in predicting whether a given crab has the color form blue (encoded as 0) or orange (encoded as 1). We use all the numeric features in the dataset.

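from observations import crabs

# Load the data set; the download directory "~/data" is an assumption.
data, metadata = crabs("~/data")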

X_train = data[:100, 3:]
y_train = data[:100, 1]

N = X_train.shape[0]  # number of data points
D = X_train.shape[1]  # number of features

print("Number of data points: {}".format(N))
print("Number of features: {}".format(D))
Number of data points: 100
Number of features: 5

### Model

A Gaussian process is a powerful object for modeling nonlinear relationships between pairs of random variables. It defines a distribution over (possibly nonlinear) functions, which can be used to represent our uncertainty about the true functional relationship. Here we define a Gaussian process model for classification (Rasmussen & Williams, 2006).

Formally, a distribution over functions $$f:\mathbb{R}^D\to\mathbb{R}$$ can be specified by a Gaussian process $$\begin{aligned} p(f) &= \mathcal{GP}(f\mid \mathbf{0}, k(\mathbf{x}, \mathbf{x}^\prime)),\end{aligned}$$ whose mean function is the zero function, and whose covariance function is some kernel which describes dependence between any set of inputs to the function.

Given a set of input-output pairs $$\{\mathbf{x}_n\in\mathbb{R}^D,y_n\in\mathbb{R}\}$$, the likelihood can be written as a multivariate normal $$\begin{aligned} p(\mathbf{y}) &= \text{Normal}(\mathbf{y} \mid \mathbf{0}, \mathbf{K})\end{aligned}$$ where $$\mathbf{K}$$ is a covariance matrix given by evaluating $$k(\mathbf{x}_n, \mathbf{x}_m)$$ for each pair of inputs in the data set.

The above applies directly to regression, where $$\mathbf{y}$$ is a real-valued response, but not to (binary) classification, where each $$y_n$$ is a label in $$\{0,1\}$$. To deal with classification, we interpret the real-valued response as a vector of latent variables, each of which is squashed into $$[0,1]$$. We then draw each label from a Bernoulli distribution whose probability is the squashed value.

Define the likelihood of an observation $$(\mathbf{x}_n, y_n)$$ as $$\begin{aligned} p(y_n \mid \mathbf{z}, \mathbf{x}_n) &= \text{Bernoulli}(y_n \mid \text{logit}^{-1}(z_n)).\end{aligned}$$

Define the prior to be a multivariate normal $$\begin{aligned} p(\mathbf{z}) &= \text{Normal}(\mathbf{z} \mid \mathbf{0}, \mathbf{K}),\end{aligned}$$ with covariance matrix given as previously stated.
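
The generative process described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the inputs are a hypothetical toy grid rather than the crabs data, the covariance uses a squared-exponential kernel with unit lengthscale, a small jitter term is added for numerical stability, and `inv_logit` is a helper name introduced here.

```python
import numpy as np

def inv_logit(z):
    """Logistic sigmoid, the inverse of the logit function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical toy inputs: 6 points on a line (not the crabs data).
x = np.linspace(-2.0, 2.0, 6)

# Covariance matrix from a squared-exponential kernel, plus jitter (an assumption).
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(len(x))

# Prior draw z ~ Normal(0, K), using the Cholesky factor L with K = L L^T.
L = np.linalg.cholesky(K)
z = L @ rng.standard_normal(len(x))

# Squash into (0, 1) and draw labels y_n ~ Bernoulli(logit^{-1}(z_n)).
p = inv_logit(z)
y = rng.binomial(1, p)

# Log-likelihood of the drawn labels given the latent values z.
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))
```

Nearby inputs receive correlated latent values through $$\mathbf{K}$$, so their label probabilities tend to be similar.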

Let’s build the model in Edward. We use a radial basis function (RBF) kernel, also known as the squared exponential or exponentiated quadratic. The rbf function returns the kernel matrix evaluated over all pairs of data points; we then take the Cholesky decomposition of this matrix to parameterize the multivariate normal distribution.

import tensorflow as tf
from edward.models import Bernoulli, MultivariateNormalTriL
from edward.util import rbf

X = tf.placeholder(tf.float32, [N, D])
f = MultivariateNormalTriL(loc=tf.zeros(N), scale_tril=tf.cholesky(rbf(X)))
y = Bernoulli(logits=f)

Here, we define a placeholder X. During inference, we feed the training inputs into this placeholder.
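
To make the kernel step concrete, here is a minimal NumPy sketch of a squared-exponential kernel matrix and its Cholesky factor. It is not Edward's implementation: `rbf_kernel` is a hypothetical helper, the unit lengthscale and variance are assumed defaults, `X_toy` is random stand-in data, and the jitter term is an extra assumption that the model code above does not include.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix over all pairs of rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    # Pairwise squared distances: ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(0)
X_toy = rng.standard_normal((5, 3))   # hypothetical stand-in for X_train

K = rbf_kernel(X_toy)
jitter = 1e-6 * np.eye(K.shape[0])    # assumption: small diagonal term for stability
L = np.linalg.cholesky(K + jitter)    # lower-triangular factor with K + jitter = L L^T
```

The factor `L` plays the role of the `scale_tril` argument: a draw from the multivariate normal can be written as `L @ eps` with `eps` standard normal.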

### Inference

Perform variational inference. Define the variational model to be a fully factorized normal.

from edward.models import Normal

qf = Normal(loc=tf.get_variable("qf/loc", [N]),
            scale=tf.nn.softplus(tf.get_variable("qf/scale", [N])))
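
The softplus transform is what keeps the scale parameter positive while the underlying variable stays unconstrained. A small NumPy sketch of the same idea follows; the `softplus` helper and the example values are illustrative, not part of the tutorial's code.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x)); strictly positive for any real x."""
    return np.logaddexp(0.0, x)

raw = np.array([-5.0, 0.0, 5.0])   # unconstrained parameters
scale = softplus(raw)              # valid (positive) standard deviations
```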

Run variational inference for 500 iterations.

import edward as ed

inference = ed.KLqp({f: qf}, data={X: X_train, y: y_train})
inference.run(n_iter=500)

In this case, KLqp defaults to minimizing the $$\text{KL}(q\|p)$$ divergence measure using the reparameterization gradient. For more details on inference, see the $$\text{KL}(q\|p)$$ tutorial. (This example is slow because Gaussian processes require evaluating and inverting a full $$N\times N$$ covariance matrix, which costs $$O(N^3)$$ time in general.)
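
As a rough illustration of the objective behind this kind of inference, the sketch below forms a Monte Carlo estimate of the evidence lower bound (ELBO) using reparameterized draws $$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma}\odot\boldsymbol{\epsilon}$$. It is not Edward's KLqp implementation: all function names are hypothetical, the data are a toy example, and the gradient step (which would need automatic differentiation) is omitted.

```python
import numpy as np

def log_normal_diag(z, loc, scale):
    """Log density of a fully factorized normal q(z)."""
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(scale)
                  - 0.5 * ((z - loc) / scale) ** 2)

def log_mvn_zero_mean(z, L):
    """Log density of Normal(z | 0, K), given the Cholesky factor L of K."""
    alpha = np.linalg.solve(L, z)  # L^{-1} z, so alpha @ alpha = z^T K^{-1} z
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(L)))          # equals -0.5 * log det K
            - 0.5 * len(z) * np.log(2 * np.pi))

def log_bernoulli_logits(y, z):
    """Sum over n of log Bernoulli(y_n | logit^{-1}(z_n))."""
    return np.sum(y * z - np.logaddexp(0.0, z))

def elbo_estimate(y, L, loc, scale, n_samples, rng):
    """Average of log p(y | z) + log p(z) - log q(z) over reparameterized draws."""
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(len(loc))
        z = loc + scale * eps  # reparameterization: z is a function of (loc, scale)
        total += (log_bernoulli_logits(y, z) + log_mvn_zero_mean(z, L)
                  - log_normal_diag(z, loc, scale))
    return total / n_samples

# Toy problem (hypothetical data, not the crabs data set).
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 6)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(6)
L = np.linalg.cholesky(K)
y_toy = np.array([0, 0, 0, 1, 1, 1])

elbo = elbo_estimate(y_toy, L, loc=np.zeros(6), scale=np.ones(6),
                     n_samples=100, rng=rng)
```

Because each draw is a deterministic function of the variational parameters and independent noise, an autodiff framework can differentiate this estimate with respect to `loc` and `scale`. Minimizing $$\text{KL}(q\|p)$$ is equivalent to maximizing this bound.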

Rasmussen, C. E., & Williams, C. (2006). Gaussian processes for machine learning. MIT Press.