Unsupervised learning

In unsupervised learning, the task is to infer hidden structure from unlabeled data, consisting of training examples \(\{x_n\}\).

We demonstrate how to do this in Edward with an example. The script is available here.


Use a simulated dataset of 2-dimensional data points \(\mathbf{x}_n\in\mathbb{R}^2\).

import edward as ed
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from edward.models import (
    Categorical, InverseGamma, Mixture, MultivariateNormalDiag, Normal)
from matplotlib import cm

def build_toy_dataset(N):
  pi = np.array([0.4, 0.6])
  mus = [[1, 1], [-1, -1]]
  stds = [[0.1, 0.1], [0.1, 0.1]]
  x = np.zeros((N, 2), dtype=np.float32)
  for n in range(N):
    k = np.argmax(np.random.multinomial(1, pi))
    # np.diag(stds[k]) is the covariance matrix; its entries act as variances.
    x[n, :] = np.random.multivariate_normal(mus[k], np.diag(stds[k]))

  return x

N = 500  # number of data points
D = 2  # dimensionality of data

x_train = build_toy_dataset(N)

We visualize the generated data points.

plt.scatter(x_train[:, 0], x_train[:, 1])
plt.axis([-3, 3, -3, 3])
plt.show()



Posit the model as a mixture of Gaussians. For more details on the model, see the Mixture of Gaussians tutorial. We write it in collapsed form, marginalizing out the mixture assignments.

K = 2  # number of components

mu = Normal(mu=tf.zeros([K, D]), sigma=tf.ones([K, D]))
sigma = InverseGamma(alpha=tf.ones([K, D]), beta=tf.ones([K, D]))
cat = Categorical(logits=tf.zeros([N, K]))
components = [
    MultivariateNormalDiag(mu=tf.ones([N, 1]) * tf.gather(mu, k),
                           diag_stdev=tf.ones([N, 1]) * tf.gather(sigma, k))
    for k in range(K)]
x = Mixture(cat=cat, components=components)
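To build intuition for this collapsed form, the marginal density it represents, \(p(\mathbf{x}_n)=\sum_k \pi_k\,\text{Normal}(\mathbf{x}_n \mid \mu_k, \sigma_k)\), can be sketched directly in NumPy. This is an illustrative hand computation, not part of the Edward model; the function names and example parameters are our own.

```python
# Illustrative sketch (not part of the Edward model): evaluate the collapsed
# mixture density log p(x) = log sum_k pi_k Normal(x | mu_k, diag(stds_k^2)),
# working in log space with the log-sum-exp trick for numerical stability.
import numpy as np

def diag_normal_logpdf(x, mu, stds):
  """Log-density of a Gaussian with diagonal covariance (stds are std. devs.)."""
  return np.sum(-0.5 * np.log(2.0 * np.pi) - np.log(stds)
                - 0.5 * ((x - mu) / stds) ** 2)

def collapsed_log_prob(x, pi, mus, stds):
  """Mixture log-density with the assignment z marginalized out."""
  log_terms = np.array([np.log(pi[k]) + diag_normal_logpdf(x, mus[k], stds[k])
                        for k in range(len(pi))])
  m = np.max(log_terms)  # log-sum-exp trick
  return m + np.log(np.sum(np.exp(log_terms - m)))

# Density at the first component's mean, with example parameters.
print(collapsed_log_prob(np.array([1.0, 1.0]),
                         pi=[0.4, 0.6],
                         mus=np.array([[1.0, 1.0], [-1.0, -1.0]]),
                         stds=np.array([[0.1, 0.1], [0.1, 0.1]])))
```

Marginalizing out the discrete assignments like this is what lets us run gradient-based variational inference over only the continuous latent variables below.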


Perform variational inference. The latent variables are the component means and component variances; the mixture assignments were already marginalized out of the model above. Define the variational model to be \[\begin{aligned} q(\mu, \sigma \;;\; \lambda) &= \prod_{k=1}^K \text{Normal}(\mu_k; \lambda_{\mu_k}) ~ \text{InverseGamma}(\sigma_k; \lambda_{\sigma_k}).\end{aligned}\] The model in Edward is

qmu = Normal(
    mu=tf.Variable(tf.random_normal([K, D])),
    sigma=tf.nn.softplus(tf.Variable(tf.zeros([K, D]))))
qsigma = InverseGamma(
    alpha=tf.nn.softplus(tf.Variable(tf.random_normal([K, D]))),
    beta=tf.nn.softplus(tf.Variable(tf.random_normal([K, D]))))

Run variational inference for 4000 iterations, using 20 latent variable samples per iteration.

data = {x: x_train}
inference = ed.KLqp({mu: qmu, sigma: qsigma}, data)
inference.run(n_iter=4000, n_samples=20)

In this case, KLqp defaults to minimizing the \(\text{KL}(q\|p)\) divergence measure using the score function gradient estimator. For more details on inference, see the \(\text{KL}(q\|p)\) tutorial.
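To give intuition for that estimator, here is a hand-rolled NumPy sketch of the score function (REINFORCE) gradient on a toy expectation. It is illustrative only, not Edward's implementation: it estimates \(\nabla_\mu \mathbb{E}_{z\sim\text{Normal}(\mu,1)}[f(z)]\) as \(\mathbb{E}[f(z)\,\nabla_\mu \log q(z;\mu)]\), with an arbitrary choice of \(f\).

```python
# Sketch of the score function (REINFORCE) gradient estimator; illustrative
# only, not Edward's implementation. Estimates
#   d/dmu E_{z ~ N(mu, 1)}[f(z)]  as  E[f(z) * d/dmu log q(z; mu)],
# with f(z) = z**2, whose exact gradient is 2 * mu.
import numpy as np

rng = np.random.default_rng(0)
mu = 1.5
S = 200000  # Monte Carlo samples; far more than KLqp's n_samples, because
            # the raw score function estimator is high-variance

z = rng.normal(mu, 1.0, size=S)
score = z - mu  # d/dmu log N(z; mu, 1)
grad_est = np.mean(z ** 2 * score)
print(grad_est)  # should be close to 2 * mu = 3.0
```

KLqp forms this kind of estimator over the model's log joint density automatically; averaging over more samples per iteration (`n_samples`) reduces its variance at extra cost per step.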


We visualize the predicted memberships of each data point, picking the cluster assignment that produces the highest posterior predictive density for each point.

To do this, we first draw a sample from the posterior and calculate a \(K\times N\) matrix of log-likelihoods, one for each cluster assignment \(k\) and data point \(\mathbf{x}_n\). We average this over 100 posterior samples.

# Average per-cluster and per-data point likelihood over many posterior samples.
log_liks = []
for _ in range(100):
  mu_sample = qmu.sample()
  sigma_sample = qsigma.sample()
  # Take per-cluster and per-data point likelihood.
  log_lik = []
  for k in range(K):
    x_post = Normal(mu=tf.ones([N, 1]) * tf.gather(mu_sample, k),
                    sigma=tf.ones([N, 1]) * tf.gather(sigma_sample, k))
    log_lik.append(tf.reduce_sum(x_post.log_prob(x_train), 1))

  log_lik = tf.pack(log_lik)  # has shape (K, N)
  log_liks.append(log_lik)

log_liks = tf.reduce_mean(log_liks, 0)

We then take the \(\arg\max\) along the rows (the cluster dimension), yielding a cluster assignment for each data point.

clusters = tf.argmax(log_liks, 0).eval()

Plot the data points, colored by their predicted membership.

plt.scatter(x_train[:, 0], x_train[:, 1], c=clusters, cmap=cm.bwr)
plt.axis([-3, 3, -3, 3])
plt.title("Predicted cluster assignments")
plt.show()