Mixture of Gaussians

A mixture model is typically used for clustering. It assigns a mixture component to each data point, and this component determines the distribution from which the data point is generated. A mixture of Gaussians uses Gaussian distributions as the mixture components (Bishop, 2006).

For a set of \(N\) data points, the likelihood of each observation \(\mathbf{x}_n\) is \[\begin{aligned} p(\mathbf{x}_n \mid \pi, \mu, \sigma) &= \sum_{k=1}^K \pi_k \, \text{Normal}(\mathbf{x}_n \mid \mu_k, \sigma_k).\end{aligned}\] The latent variable \(\pi\) is a \(K\)-dimensional probability vector which mixes individual Gaussian distributions, each characterized by a mean \(\mu_k\) and standard deviation \(\sigma_k\).
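As a quick numerical illustration of this sum over components, here is a toy sketch (not part of the Edward model; it uses NumPy and SciPy with made-up values for \(K=2\) univariate components):

import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.7])      # mixing proportions
mu = np.array([-1.0, 2.0])     # component means
sigma = np.array([0.5, 1.5])   # component standard deviations

x_n = 0.8
# p(x_n | pi, mu, sigma) = sum_k pi_k * Normal(x_n | mu_k, sigma_k)
likelihood = np.sum(pi * norm.pdf(x_n, loc=mu, scale=sigma))
print(likelihood)  # approximately 0.136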

Define the prior on \(\pi\in[0,1]^K\) such that \(\sum_{k=1}^K\pi_k=1\) to be \[\begin{aligned} p(\pi) &= \text{Dirichlet}(\pi \mid \alpha \mathbf{1}_{K}).\end{aligned}\]

Define the prior on each component mean \(\mathbf{\mu}_k\in\mathbb{R}^D\) to be \[\begin{aligned} p(\mathbf{\mu}_k) &= \text{Normal}(\mathbf{\mu}_k \mid \mathbf{0}, \sigma^2\mathbf{I}).\end{aligned}\]

Define the prior on each component standard deviation \(\mathbf{\sigma}_k\in\mathbb{R}_{+}^D\) to be \[\begin{aligned} p(\mathbf{\sigma}_k) &= \text{InverseGamma}(\mathbf{\sigma}_k \mid a, b).\end{aligned}\]

We build two versions of the model in Edward: one explicitly with the mixture assignments \(c_n\in\{0,\ldots,K-1\}\) as latent variables, and another with them summed out.

The explicit version is as follows:

import edward as ed
import tensorflow as tf

from edward.models import Categorical, Dirichlet, InverseGamma, Normal

N = 500  # number of data points
K = 2  # number of components
D = 2  # dimensionality of data

pi = Dirichlet(alpha=tf.ones(K))  # mixing proportions
mu = Normal(mu=tf.zeros([K, D]), sigma=tf.ones([K, D]))  # component means
sigma = InverseGamma(alpha=tf.ones([K, D]), beta=tf.ones([K, D]))  # component standard deviations
c = Categorical(logits=tf.ones([N, 1]) * tf.log(pi))  # mixture assignments, p(c_n = k) = pi_k
x = Normal(mu=tf.gather(mu, c), sigma=tf.gather(sigma, c))  # observations
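As a quick sanity check on the shapes, one might simulate a single draw from this program's prior (a minimal sketch, assuming Edward random variables can be fetched directly in a TensorFlow session):

sess = ed.get_session()
c_draw, x_draw = sess.run([c, x])  # one joint draw from the prior
print(c_draw.shape)  # (N,): a component assignment for each data point
print(x_draw.shape)  # (N, D): a D-dimensional observation for each data point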

The collapsed version marginalizes out the mixture assignments. We implement this with the Mixture random variable.

import edward as ed
import tensorflow as tf

from edward.models import Categorical, Dirichlet, InverseGamma, Mixture, \
    MultivariateNormalDiag, Normal

N = 500  # number of data points
K = 2  # number of components
D = 2  # dimensionality of data

pi = Dirichlet(alpha=tf.ones(K))
mu = Normal(mu=tf.zeros([K, D]), sigma=tf.ones([K, D]))
sigma = InverseGamma(alpha=tf.ones([K, D]), beta=tf.ones([K, D]))
cat = Categorical(logits=tf.ones([N, 1]) * tf.log(pi))  # p(c_n = k) = pi_k
components = [
    MultivariateNormalDiag(mu=tf.ones([N, 1]) * tf.gather(mu, k),
                           diag_stdev=tf.ones([N, 1]) * tf.gather(sigma, k))
    for k in range(K)]  # the K Gaussian components, broadcast over the N data points

x = Mixture(cat=cat, components=components)

We experiment with this model using variational inference in the unsupervised learning tutorial. Example scripts using this model can be found here.
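As a rough sketch of what such an inference setup could look like for the collapsed model (an illustrative mean-field variational family with arbitrary initializations and simulated data, not the configuration used in the tutorial):

import numpy as np

x_train = np.random.randn(N, D).astype(np.float32)  # stand-in for a real data set

qpi = Dirichlet(alpha=tf.nn.softplus(tf.Variable(tf.random_normal([K]))))
qmu = Normal(mu=tf.Variable(tf.random_normal([K, D])),
             sigma=tf.nn.softplus(tf.Variable(tf.random_normal([K, D]))))
qsigma = InverseGamma(alpha=tf.nn.softplus(tf.Variable(tf.random_normal([K, D]))),
                      beta=tf.nn.softplus(tf.Variable(tf.random_normal([K, D]))))

inference = ed.KLqp({pi: qpi, mu: qmu, sigma: qsigma}, data={x: x_train})
inference.run(n_iter=500)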

Remarks: The log-sum-exp trick

For a collapsed mixture model, implementing the log density can be tricky. In general, the log joint density is \[\begin{aligned} \log p(\pi) + \Big[ \sum_{k=1}^K \log p(\mathbf{\mu}_k) + \log p(\mathbf{\sigma}_k) \Big] + \sum_{n=1}^N \log p(\mathbf{x}_n \mid \pi, \mu, \sigma),\end{aligned}\] where the log-likelihood is \[\begin{aligned} \sum_{n=1}^N \log p(\mathbf{x}_n \mid \pi, \mu, \sigma) &= \sum_{n=1}^N \log \sum_{k=1}^K \pi_k \, \text{Normal}(\mathbf{x}_n \mid \mu_k, \sigma_k).\end{aligned}\]

To prevent numerical instability, we'd like to work on the log scale, \[\begin{aligned} \sum_{n=1}^N \log p(\mathbf{x}_n \mid \pi, \mu, \sigma) &= \sum_{n=1}^N \log \sum_{k=1}^K \exp\Big( \log \pi_k + \log \text{Normal}(\mathbf{x}_n \mid \mu_k, \sigma_k)\Big).\end{aligned}\] This expression involves a log-sum-exp operation, which is numerically unstable when computed naively: exponentiating the individual log terms can overflow or underflow in floating point, so one term often dominates or vanishes. Therefore we use the log-sum-exp trick. It is based on the identity \[\begin{aligned} \mathbf{x}_{\mathrm{max}} &= \max_i \mathbf{x}_i, \\ \log \sum_i \exp(\mathbf{x}_i) &= \log \Big(\exp(\mathbf{x}_{\mathrm{max}}) \sum_i \exp(\mathbf{x}_i - \mathbf{x}_{\mathrm{max}})\Big) \\ &= \mathbf{x}_{\mathrm{max}} + \log \sum_i \exp(\mathbf{x}_i - \mathbf{x}_{\mathrm{max}}).\end{aligned}\] Subtracting the maximum before exponentiating keeps every term in \([0,1]\), which leads to numerically stable output. The Mixture random variable implements this trick when calculating its log-density.
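To make the trick concrete, here is a small NumPy sketch (illustrative only; the Mixture random variable already handles this internally). The values in log_terms are made up and stand in for \(\log \pi_k + \log \text{Normal}(\mathbf{x}_n \mid \mu_k, \sigma_k)\) for a single observation:

import numpy as np

log_terms = np.array([-1050.0, -1047.2, -1052.3])

# Naive evaluation underflows: exp(-1050) is 0 in float64, so the log is -inf.
naive = np.log(np.sum(np.exp(log_terms)))

# Log-sum-exp trick: subtract the maximum before exponentiating.
x_max = np.max(log_terms)
stable = x_max + np.log(np.sum(np.exp(log_terms - x_max)))

print(naive, stable)  # -inf versus approximately -1047.14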

References

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer New York.