Supervised learning (Regression)

In supervised learning, the task is to infer hidden structure from labeled data, consisting of training examples \(\{(x_n, y_n)\}\). Regression (typically) means the output \(y\) takes continuous values.

We demonstrate how to do this in Edward with an example. The script is available here.

Data

Simulate training and test sets of \(40\) data points. They consist of pairs of inputs \(\mathbf{x}_n\in\mathbb{R}^{10}\) and outputs \(y_n\in\mathbb{R}\). The outputs depend linearly on the inputs, with normally distributed noise.

import numpy as np

def build_toy_dataset(N, w, noise_std=0.1):
  # Outputs are a linear function of random inputs plus Gaussian noise.
  D = len(w)
  x = np.random.randn(N, D).astype(np.float32)
  y = np.dot(x, w) + np.random.normal(0, noise_std, size=N)
  return x, y

N = 40  # number of data points
D = 10  # number of features

w_true = np.random.randn(D)
X_train, y_train = build_toy_dataset(N, w_true)
X_test, y_test = build_toy_dataset(N, w_true)

Model

Posit the model as Bayesian linear regression (Murphy, 2012). It assumes a linear relationship between the inputs \(\mathbf{x}\in\mathbb{R}^D\) and the outputs \(y\in\mathbb{R}\).

For a set of \(N\) data points \((\mathbf{X},\mathbf{y})=\{(\mathbf{x}_n, y_n)\}\), the model posits the following distributions: \[\begin{aligned} p(\mathbf{w}) &= \text{Normal}(\mathbf{w} \mid \mathbf{0}, \sigma_w^2\mathbf{I}), \\[1.5ex] p(b) &= \text{Normal}(b \mid 0, \sigma_b^2), \\ p(\mathbf{y} \mid \mathbf{w}, b, \mathbf{X}) &= \prod_{n=1}^N \text{Normal}(y_n \mid \mathbf{x}_n^\top\mathbf{w} + b, \sigma_y^2).\end{aligned}\] The latent variables are the linear model’s weights \(\mathbf{w}\) and intercept \(b\), also known as the bias. Assume \(\sigma_w^2,\sigma_b^2\) are known prior variances and \(\sigma_y^2\) is a known likelihood variance. The mean of the likelihood is given by a linear transformation of the inputs \(\mathbf{x}_n\).
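
As an aside (this is not part of the Edward script), the prior and likelihood above are conjugate, so the exact posterior is available in closed form; it is a useful reference point for the approximate posterior computed later. Writing \(\tilde{\mathbf{X}} = [\mathbf{X}, \mathbf{1}]\) for the inputs augmented with a column of ones and \(\boldsymbol{\theta} = (\mathbf{w}, b)\), the standard result (Murphy, 2012) is \[ p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{y}) = \text{Normal}(\boldsymbol{\theta} \mid \boldsymbol{\mu}_N, \boldsymbol{\Sigma}_N), \qquad \boldsymbol{\Sigma}_N = \left(\sigma_y^{-2}\tilde{\mathbf{X}}^\top\tilde{\mathbf{X}} + \boldsymbol{\Sigma}_0^{-1}\right)^{-1}, \qquad \boldsymbol{\mu}_N = \sigma_y^{-2}\boldsymbol{\Sigma}_N\tilde{\mathbf{X}}^\top\mathbf{y}, \] where \(\boldsymbol{\Sigma}_0\) is the diagonal prior covariance with entries \(\sigma_w^2\) for the weights and \(\sigma_b^2\) for the intercept.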

Let’s build the model in Edward, fixing \(\sigma_w,\sigma_b,\sigma_y=1\).

import edward as ed
import tensorflow as tf

from edward.models import Normal

X = tf.placeholder(tf.float32, [N, D])
w = Normal(mu=tf.zeros(D), sigma=tf.ones(D))  # prior over the weights
b = Normal(mu=tf.zeros(1), sigma=tf.ones(1))  # prior over the intercept
y = Normal(mu=ed.dot(X, w) + b, sigma=tf.ones(N))  # likelihood with unit noise

Here, we define a placeholder X. During inference, we feed the observed inputs into this placeholder.
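
As a quick illustration of how the placeholder is used (a sketch, not part of the original script; it assumes the variables defined above are in scope, ed.get_session() for the TensorFlow session, and the sample tensor returned by the random variable's value() method), we can feed X_train into X and draw one sample of \(\mathbf{y}\) from the prior predictive:

sess = ed.get_session()
# Feed the training inputs into the placeholder and fetch one prior sample of y.
y_prior_sample = sess.run(y.value(), feed_dict={X: X_train})
print(y_prior_sample.shape)  # (40,)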

Inference

We now turn to inferring the posterior using variational inference. Define the variational model to be a fully factorized normal over the weights and the intercept.

qw = Normal(mu=tf.Variable(tf.random_normal([D])),
            sigma=tf.nn.softplus(tf.Variable(tf.random_normal([D]))))
qb = Normal(mu=tf.Variable(tf.random_normal([1])),
            sigma=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))

Run variational inference with the Kullback-Leibler divergence, using a default of \(1000\) iterations.

inference = ed.KLqp({w: qw, b: qb}, data={X: X_train, y: y_train})
inference.run()

In this case KLqp defaults to minimizing the \(\text{KL}(q\|p)\) divergence measure using the reparameterization gradient. For more details on inference, see the \(\text{KL}(q\|p)\) tutorial.
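
For example, one can control the number of iterations and the number of samples used to estimate the gradient by passing arguments to run (a sketch using the n_iter and n_samples arguments of Edward's KLqp; check your installed version):

inference = ed.KLqp({w: qw, b: qb}, data={X: X_train, y: y_train})
# Estimate the gradient with 5 samples per iteration, and run 1000 iterations.
inference.run(n_samples=5, n_iter=1000)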

Criticism

A standard evaluation for regression is to compute point-based metrics on held-out “testing” data. We do this by first forming the posterior predictive distribution.

# Form the posterior predictive by plugging the variational means of w and b
# into the likelihood.
y_post = Normal(mu=ed.dot(X, qw.mean()) + qb.mean(), sigma=tf.ones(N))

With this we can evaluate various point-based quantities using the posterior predictive.

print(ed.evaluate('mean_squared_error', data={X: X_test, y_post: y_test}))
> 0.012107

print(ed.evaluate('mean_absolute_error', data={X: X_test, y_post: y_test}))
> 0.0867875

The trained model makes predictions with low mean squared error (relative to the magnitude of the output).
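
As an additional informal check (a sketch assuming the variables defined above), we can compare the variational mean of the weights to the true weights used to simulate the data:

sess = ed.get_session()
# The variational means should be close to the weights that generated the data.
print(sess.run(qw.mean()))
print(w_true)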

References

Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.