Supervised Learning (Classification)

In supervised learning, the task is to infer hidden structure from labeled data, consisting of training examples \(\{(x_n, y_n)\}\). Classification means the output \(y\) takes discrete values.

We demonstrate with an example in Edward. An interactive version is available as a Jupyter notebook.

Data

Use 25 data points from the crabs data set.

import numpy as np

df = np.loadtxt('data/crabs_train.txt', delimiter=',')
df[df[:, 0] == -1, 0] = 0  # replace the -1 label with a 0 label

N = 25  # number of data points
D = df.shape[1] - 1  # number of features

subset = np.random.choice(df.shape[0], N, replace=False)  # random subset of rows
X_train = df[subset, 1:]
y_train = df[subset, 0]

Model

A Gaussian process is a powerful object for modeling nonlinear relationships between pairs of random variables. It defines a distribution over (possibly nonlinear) functions, which can be used to represent our uncertainty about the true functional relationship. Here we define a Gaussian process model for classification (Rasmussen & Williams, 2006).

Formally, a distribution over functions \(f:\mathbb{R}^D\to\mathbb{R}\) can be specified by a Gaussian process \[\begin{aligned} p(f) &= \mathcal{GP}(f\mid \mathbf{0}, k(\mathbf{x}, \mathbf{x}^\prime)),\end{aligned}\] whose mean function is the zero function, and whose covariance function is some kernel which describes dependence between any set of inputs to the function.

Given a set of input-output pairs \(\{\mathbf{x}_n\in\mathbb{R}^D,y_n\in\mathbb{R}\}\), the likelihood can be written as a multivariate normal \[\begin{aligned} p(\mathbf{y}) &= \text{Normal}(\mathbf{y} \mid \mathbf{0}, \mathbf{K})\end{aligned}\] where \(\mathbf{K}\) is a covariance matrix given by evaluating \(k(\mathbf{x}_n, \mathbf{x}_m)\) for each pair of inputs in the data set.

The above applies directly for regression, where \(\mathbf{y}\) is a real-valued response, but not for (binary) classification, where \(\mathbf{y}\) is a vector of labels in \(\{0,1\}\). To deal with classification, we interpret the responses as latent variables which are squashed into \([0,1]\) by the inverse logit (sigmoid) function, \(\text{logit}^{-1}(x) = 1/(1 + \exp(-x))\). We then draw each label from a Bernoulli distribution, with probability given by the squashed value.

Define the likelihood of an observation \((\mathbf{x}_n, y_n)\), given the latent function values \(\mathbf{f} = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_N))\), as \[\begin{aligned} p(y_n \mid \mathbf{f}) &= \text{Bernoulli}(y_n \mid \text{logit}^{-1}(f_n)).\end{aligned}\]

Define the prior over these latent function values to be a multivariate normal \[\begin{aligned} p(\mathbf{f}) &= \text{Normal}(\mathbf{f} \mid \mathbf{0}, \mathbf{K}),\end{aligned}\] with covariance matrix given as previously stated.
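To make this generative process concrete, here is a minimal NumPy sketch (for illustration only; the rbf helper, the toy inputs, and the unit hyperparameters are assumptions, not part of this tutorial): draw latent function values from the multivariate normal prior, squash them through the inverse logit, and draw Bernoulli labels.

import numpy as np

def rbf(X, sigma=1.0, l=1.0):
  """Hypothetical helper: squared exponential kernel matrix over the rows of X."""
  sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
  return sigma ** 2 * np.exp(-0.5 * sq_dists / l ** 2)

X_sim = np.random.randn(25, 5)                          # toy inputs
K = rbf(X_sim) + 1e-6 * np.eye(25)                      # jitter for numerical stability
f_sim = np.random.multivariate_normal(np.zeros(25), K)  # latent function values
p_sim = 1.0 / (1.0 + np.exp(-f_sim))                    # squash into [0, 1]
y_sim = np.random.binomial(1, p_sim)                    # Bernoulli labels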

Let’s build the model in Edward. We use a radial basis function (RBF) kernel, also known as the squared exponential or exponentiated quadratic.
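For reference, this kernel has the form \[\begin{aligned} k(\mathbf{x}, \mathbf{x}^\prime) &= \sigma^2 \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}^\prime\|^2}{2\ell^2}\right),\end{aligned}\] where the signal variance \(\sigma^2\) and lengthscale \(\ell\) are hyperparameters; the code below does not set them and so relies on whatever defaults multivariate_rbf provides.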

import tensorflow as tf

from edward.models import Bernoulli, MultivariateNormalCholesky
from edward.util import multivariate_rbf

def kernel(x):
  """Compute the N x N covariance matrix, where entry (i, j) is the
  RBF kernel evaluated over the ith and jth data points."""
  mat = []
  for i in range(N):
    mat.append([])
    xi = x[i, :]
    for j in range(N):
      xj = x[j, :]
      mat[i].append(multivariate_rbf(xi, xj))

    mat[i] = tf.stack(mat[i])

  return tf.stack(mat)

X = tf.placeholder(tf.float32, [N, D])
f = MultivariateNormalCholesky(mu=tf.zeros(N), chol=tf.cholesky(kernel(X)))
y = Bernoulli(logits=f)

Here, we define a placeholder X. During inference, we pass in the value for this placeholder according to the observed data.

Inference

Perform variational inference. Define the variational model to be a fully factorized normal.

from edward.models import Normal

qf = Normal(mu=tf.Variable(tf.random_normal([N])),
            sigma=tf.nn.softplus(tf.Variable(tf.random_normal([N]))))

Run variational inference for 500 iterations.

import edward as ed

inference = ed.KLqp({f: qf}, data={X: X_train, y: y_train})
inference.run(n_iter=500)

In this case KLqp defaults to minimizing the \(\text{KL}(q\|p)\) divergence measure using the reparameterization gradient. For more details on inference, see the \(\text{KL}(q\|p)\) tutorial. (This example is slow because evaluating and factorizing the full covariance matrix of a Gaussian process is expensive, scaling cubically with the number of data points.)
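As a quick sanity check after inference (a minimal sketch, not part of the original tutorial; it assumes the Edward 1.x session API and only looks at fit on the training set), we can map the posterior mean of the latent function through the inverse logit to obtain predicted probabilities:

sess = ed.get_session()
f_post_mean = sess.run(qf.mean())                 # posterior mean of the latent function values
train_probs = 1.0 / (1.0 + np.exp(-f_post_mean))  # predicted P(y = 1) at the training inputs
train_acc = np.mean((train_probs > 0.5).astype(float) == y_train)
print("Training accuracy: {:.2f}".format(train_acc))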

References

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. MIT Press.