## Supervised Learning (Classification)

In supervised learning, the task is to infer hidden structure from labeled data, consisting of training examples $$\{(x_n, y_n)\}$$. Classification means the output $$y$$ takes discrete values.

We demonstrate with an example in Edward. An interactive version is available as a Jupyter notebook.

### Data

Use 25 data points from the crabs data set.

```python
import edward as ed
import numpy as np
import tensorflow as tf

# Load the crabs data set; the label is in column 0 and the features are in
# the remaining columns (the file path here is illustrative).
df = np.loadtxt('data/crabs_train.txt', dtype='float32', delimiter=',')
df[df[:, 0] == -1, 0] = 0  # replace -1 label with 0 label

N = 25  # number of data points
D = df.shape[1] - 1  # number of features

subset = np.random.choice(df.shape[0], N, replace=False)
X_train = df[subset, 1:]
y_train = df[subset, 0]
```

### Model

A Gaussian process is a powerful object for modeling nonlinear relationships between pairs of random variables. It defines a distribution over (possibly nonlinear) functions, which can be used to represent our uncertainty about the true functional relationship. Here we define a Gaussian process model for classification (Rasmussen & Williams, 2006).

Formally, a distribution over functions $$f:\mathbb{R}^D\to\mathbb{R}$$ can be specified by a Gaussian process \begin{aligned} p(f) &= \mathcal{GP}(f\mid \mathbf{0}, k(\mathbf{x}, \mathbf{x}^\prime)),\end{aligned} whose mean function is the zero function, and whose covariance function is some kernel which describes dependence between any set of inputs to the function.

Given a set of input-output pairs $$\{\mathbf{x}_n\in\mathbb{R}^D,y_n\in\mathbb{R}\}$$, the likelihood can be written as a multivariate normal \begin{aligned} p(\mathbf{y}) &= \text{Normal}(\mathbf{y} \mid \mathbf{0}, \mathbf{K})\end{aligned} where $$\mathbf{K}$$ is a covariance matrix given by evaluating $$k(\mathbf{x}_n, \mathbf{x}_m)$$ for each pair of inputs in the data set.

The above applies directly to regression, where $$\mathbf{y}$$ is a real-valued response, but not to (binary) classification, where $$\mathbf{y}$$ is a label in $$\{0,1\}$$. To deal with classification, we interpret the response as a latent variable which is squashed into $$[0,1]$$. We then draw from a Bernoulli to determine the label, with probability given by the squashed value.
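Concretely, the squashing function is the inverse logit (logistic sigmoid), \begin{aligned} \text{logit}^{-1}(z) &= \frac{1}{1 + \exp(-z)},\end{aligned} which maps any real value into $$[0,1]$$ so that it can serve as a Bernoulli probability.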

Define the likelihood of an observation $$(\mathbf{x}_n, y_n)$$ as \begin{aligned} p(y_n \mid \mathbf{f}, \mathbf{x}_n) &= \text{Bernoulli}(y_n \mid \text{logit}^{-1}(f(\mathbf{x}_n))).\end{aligned}

Define the prior on the vector of latent function values $$\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_N))$$ to be a multivariate normal \begin{aligned} p(\mathbf{f}) &= \text{Normal}(\mathbf{f} \mid \mathbf{0}, \mathbf{K}),\end{aligned} with covariance matrix given as previously stated.

Let’s build the model in Edward. We use a radial basis function (RBF) kernel, also known as the squared exponential or exponentiated quadratic.
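For reference, the RBF kernel takes the form \begin{aligned} k(\mathbf{x}, \mathbf{x}^\prime) &= \sigma^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}^\prime\|^2}{2\ell^2}\right),\end{aligned} where the signal variance $$\sigma^2$$ and lengthscale $$\ell$$ are hyperparameters; the code below relies on the default values used by `multivariate_rbf`.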

```python
from edward.models import Bernoulli, MultivariateNormalCholesky
from edward.util import multivariate_rbf


def kernel(x):
  """Compute the N x N covariance matrix, where entry (i, j) is the
  RBF kernel evaluated at the ith and jth data points."""
  mat = []
  for i in range(N):
    mat.append([])
    xi = x[i, :]
    for j in range(N):
      xj = x[j, :]
      mat[i].append(multivariate_rbf(xi, xj))

    mat[i] = tf.stack(mat[i])

  return tf.stack(mat)


X = tf.placeholder(tf.float32, [N, D])
# Add a small jitter to the diagonal for numerical stability, then take the
# Cholesky factor of the covariance matrix.
K = kernel(X) + 1e-6 * tf.eye(N)
f = MultivariateNormalCholesky(mu=tf.zeros(N), chol=tf.cholesky(K))
y = Bernoulli(logits=f)
```

Here, we define a placeholder X. During inference, we feed the observed features in through this placeholder.

### Inference

Perform variational inference. Define the variational model to be a fully factorized normal.
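In symbols, the variational family factorizes across the latent function values, \begin{aligned} q(\mathbf{f}) &= \prod_{n=1}^N \text{Normal}(f_n \mid \mu_n, \sigma_n^2),\end{aligned} where the means $$\mu_n$$ and standard deviations $$\sigma_n$$ are the free parameters optimized during inference; the softplus in the code below keeps the standard deviations positive.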

```python
from edward.models import Normal

qf = Normal(mu=tf.Variable(tf.random_normal([N])),
            sigma=tf.nn.softplus(tf.Variable(tf.random_normal([N]))))
```

Run variational inference for 500 iterations.

```python
inference = ed.KLqp({f: qf}, data={X: X_train, y: y_train})
inference.run(n_iter=500)
```

In this case KLqp defaults to minimizing the $$\text{KL}(q\|p)$$ divergence using the reparameterization gradient. For more details on inference, see the $$\text{KL}(q\|p)$$ tutorial. (This example is slow because evaluating and inverting full covariance matrices in Gaussian processes is expensive, scaling cubically with the number of data points.)
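As a quick sanity check after inference, one can draw samples from the variational approximation and squash them through the inverse logit to get predicted probabilities for the training points. A minimal sketch, assuming the objects defined above (qf, y_train):

```python
sess = ed.get_session()

# Draw samples of the latent function values from the variational
# approximation and squash them into Bernoulli probabilities.
f_samples = qf.sample(100)     # shape [100, N]
probs = tf.sigmoid(f_samples)  # probability of label 1 per sample

# Average over samples to get a predicted probability per data point,
# then compare against the observed labels.
pred = sess.run(tf.reduce_mean(probs, 0))
print(np.mean((pred > 0.5) == y_train))  # training accuracy
```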

### References

Rasmussen, C. E., & Williams, C. (2006). Gaussian processes for machine learning. MIT Press.