## Supervised learning (Regression)

In supervised learning, the task is to infer hidden structure from labeled data, consisting of training examples $$\{(x_n, y_n)\}$$. Regression typically means the output $$y$$ takes continuous values.

We demonstrate how to do this in Edward with an example.

### Data

Simulate training and test sets of $$40$$ data points each. They consist of pairs of inputs $$\mathbf{x}_n\in\mathbb{R}^{10}$$ and outputs $$y_n\in\mathbb{R}$$. The outputs depend linearly on the inputs, with added normally distributed noise.

```python
import numpy as np


def build_toy_dataset(N, w, noise_std=0.1):
  """Simulate outputs as a noisy linear function of random inputs."""
  D = len(w)
  x = np.random.randn(N, D).astype(np.float32)
  y = np.dot(x, w) + np.random.normal(0, noise_std, size=N)
  return x, y

N = 40  # number of data points
D = 10  # number of features

w_true = np.random.randn(D)
X_train, y_train = build_toy_dataset(N, w_true)
X_test, y_test = build_toy_dataset(N, w_true)
```
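
As a quick sanity check (not part of the original script), we can confirm the shapes of the simulated arrays:

```python
# Inputs are N x D matrices; outputs are length-N vectors.
print(X_train.shape, y_train.shape)
## (40, 10) (40,)
```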

### Model

Posit the model as Bayesian linear regression (Murphy, 2012). It assumes a linear relationship between the inputs $$\mathbf{x}\in\mathbb{R}^D$$ and the outputs $$y\in\mathbb{R}$$.

For a set of $$N$$ data points $$(\mathbf{X},\mathbf{y})=\{(\mathbf{x}_n, y_n)\}$$, the model posits the following distributions:

$$\begin{aligned} p(\mathbf{w}) &= \text{Normal}(\mathbf{w} \mid \mathbf{0}, \sigma_w^2\mathbf{I}), \\ p(b) &= \text{Normal}(b \mid 0, \sigma_b^2), \\ p(\mathbf{y} \mid \mathbf{w}, b, \mathbf{X}) &= \prod_{n=1}^N \text{Normal}(y_n \mid \mathbf{x}_n^\top\mathbf{w} + b, \sigma_y^2). \end{aligned}$$

The latent variables are the linear model’s weights $$\mathbf{w}$$ and intercept $$b$$, also known as the bias. Assume $$\sigma_w^2,\sigma_b^2$$ are known prior variances and $$\sigma_y^2$$ is a known likelihood variance. The mean of the likelihood is given by a linear transformation of the inputs $$\mathbf{x}_n$$.

Let’s build the model in Edward, fixing $$\sigma_w,\sigma_b,\sigma_y=1$$.

```python
import edward as ed
import tensorflow as tf
from edward.models import Normal

X = tf.placeholder(tf.float32, [N, D])  # inputs, fed with data at inference
w = Normal(mu=tf.zeros(D), sigma=tf.ones(D))
b = Normal(mu=tf.zeros(1), sigma=tf.ones(1))
y = Normal(mu=ed.dot(X, w) + b, sigma=tf.ones(N))
```

Here, we define a placeholder X for the inputs. During inference, we feed this placeholder with the observed data.
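
Before running inference, one can also sample from the model by feeding the placeholder like any TensorFlow tensor. This is a minimal sketch, assuming Edward's `ed.get_session()` helper and the random variable's `value()` sample tensor:

```python
# Draw one sample of the outputs from the prior predictive,
# feeding the placeholder with the training inputs.
sess = ed.get_session()
y_prior_sample = sess.run(y.value(), feed_dict={X: X_train})
print(y_prior_sample.shape)
## (40,)
```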

### Inference

We now turn to inferring the posterior using variational inference. Define the variational model to be a fully factorized normal over the weights and intercept.

```python
# softplus maps the unconstrained variational parameters to
# positive standard deviations
qw = Normal(mu=tf.Variable(tf.random_normal([D])),
            sigma=tf.nn.softplus(tf.Variable(tf.random_normal([D]))))
qb = Normal(mu=tf.Variable(tf.random_normal([1])),
            sigma=tf.nn.softplus(tf.Variable(tf.random_normal([1]))))
```

Run variational inference with the Kullback-Leibler divergence, using a default of $$1000$$ iterations.

```python
inference = ed.KLqp({w: qw, b: qb}, data={X: X_train, y: y_train})
inference.run()
```

In this case KLqp defaults to minimizing the $$\text{KL}(q\|p)$$ divergence measure using the reparameterization gradient. For more details on inference, see the $$\text{KL}(q\|p)$$ tutorial.
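
The run call can also be customized. As a sketch, assuming KLqp's `n_samples` and `n_iter` keyword arguments:

```python
# Alternative: run for a fixed number of iterations, using more
# Monte Carlo samples per iteration to estimate the gradient.
inference.run(n_samples=5, n_iter=250)
```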

### Criticism

A standard evaluation for regression is to compute point-based metrics on held-out test data. We first form the posterior predictive distribution, here approximated by plugging the posterior means of the weights and intercept into the likelihood.

```python
y_post = Normal(mu=ed.dot(X, qw.mean()) + qb.mean(), sigma=tf.ones(N))
```

With this we can evaluate various point-based quantities using the posterior predictive.

```python
print(ed.evaluate('mean_squared_error', data={X: X_test, y_post: y_test}))
## 0.012107

print(ed.evaluate('mean_absolute_error', data={X: X_test, y_post: y_test}))
## 0.0867875
```

The trained model makes predictions with low mean squared error (relative to the magnitude of the output).
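
Beyond point-based metrics, we can also run a posterior predictive check. Below is a minimal sketch using `ed.ppc`, with the mean of the outputs as an (illustrative) choice of discrepancy:

```python
# Compare the mean of replicated outputs to the mean of the observed
# outputs; large discrepancies suggest model misfit.
def T(xs, zs):
  return tf.reduce_mean(xs[y_post])

ppc_stats = ed.ppc(T, data={X: X_train, y_post: y_train})
```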

### References

Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.