Supervised learning (Classification)

In supervised learning, the task is to infer hidden structure from labeled data, comprised of training examples \(\{(x_n, y_n)\}\). Classification means the output \(y\) takes discrete values.

We demonstrate how to do this in Edward with an example. The script is available here.


Use 25 data points from the crabs data set.

df = np.loadtxt('data/crabs_train.txt', dtype='float32', delimiter=',')
df[df[:, 0] == -1, 0] = 0  # replace -1 label with 0 label

N = 25  # number of data points
D = df.shape[1] - 1  # number of features

subset = np.random.choice(df.shape[0], N, replace=False)
X_train = df[subset, 1:]
y_train = df[subset, 0]


A Gaussian process is a powerful object for modeling nonlinear relationships between pairs of random variables. It defines a distribution over (possibly nonlinear) functions, which can be applied for representing our uncertainty around the true functional relationship. Here we define a Gaussian process model for classification (Rasmussen & Williams, 2006).

Formally, a distribution over functions \(f:\mathbb{R}^D\to\mathbb{R}\) can be specified by a Gaussian process \[\begin{aligned} p(f) &= \mathcal{GP}(f\mid \mathbf{0}, k(\mathbf{x}, \mathbf{x}^\prime)),\end{aligned}\] whose mean function is the zero function, and whose covariance function is some kernel which describes dependence between any set of inputs to the function.

Given a set of input-output pairs \(\{\mathbf{x}_n\in\mathbb{R}^D,y_n\in\mathbb{R}\}\), the likelihood can be written as a multivariate normal \[\begin{aligned} p(\mathbf{y}) &= \text{Normal}(\mathbf{y} \mid \mathbf{0}, \mathbf{K})\end{aligned}\] where \(\mathbf{K}\) is a covariance matrix given by evaluating \(k(\mathbf{x}_n, \mathbf{x}_m)\) for each pair of inputs in the data set.

The above applies directly for regression where \(\mathbb{y}\) is a real-valued response, but not for (binary) classification, where \(\mathbb{y}\) is a label in \(\{0,1\}\). To deal with classification, we interpret the response as latent variables which is squashed into \([0,1]\). We then draw from a Bernoulli to determine the label, with probability given by the squashed value.

Define the likelihood of an observation \((\mathbf{x}_n, y_n)\) as \[\begin{aligned} p(y_n \mid \mathbf{z}, x_n) &= \text{Bernoulli}(y_n \mid \text{logit}^{-1}(\mathbf{x}_n^\top \mathbf{z})).\end{aligned}\]

Define the prior to be a multivariate normal \[\begin{aligned} p(\mathbf{z}) &= \text{Normal}(\mathbf{z} \mid \mathbf{0}, \mathbf{K}),\end{aligned}\] with covariance matrix given as previously stated.

Let’s build the model in Edward. We use a radial basis function (RBF) kernel, also known as the squared exponential or exponentiated quadratic.

from edward.models import Bernoulli, Normal
from edward.util import multivariate_rbf

def kernel(x):
  mat = []
  for i in range(N):
    mat += [[]]
    xi = x[i, :]
    for j in range(N):
      if j == i:
        mat[i] += [multivariate_rbf(xi, xi)]
        xj = x[j, :]
        mat[i] += [multivariate_rbf(xi, xj)]

    mat[i] = tf.stack(mat[i])

  return tf.stack(mat)

X = tf.placeholder(tf.float32, [N, D])
f = MultivariateNormalFull(mu=tf.zeros(N), sigma=kernel(X))
y = Bernoulli(logits=f)


Perform variational inference. Define the variational model to be a fully factorized normal

qf = Normal(mu=tf.Variable(tf.random_normal([N])),

Run variational inference for 500 iterations.

inference = ed.KLqp({f: qf}, data={X: X_train, y: y_train})

In this case KLqp defaults to minimizing the \(\text{KL}(q\|p)\) divergence measure using the reparameterization gradient. For more details on inference, see the \(\text{KL}(q\|p)\) tutorial. (This example happens to be slow because evaluating and inverting full covariances in Gaussian processes happens to be slow.)


Rasmussen, C. E., & Williams, C. (2006). Gaussian processes for machine learning. MIT Press.