Code Exercise: Analyze Data With Simple Linear Regression

In this coding exercise, we're going to work with a housing database that contains the median house value for 17,000 blocks across the state of California. The database contains various attributes for each housing block, and our goal is to create a mathematical model that can accurately predict the median house value for each block.

Loading The Data

The first thing we need to do is load the data into memory. We're going to use a very handy data manipulation library called Deedle, which makes it super easy to work with complex datasets. You can install Deedle as a NuGet package.

Deedle is designed for F#, but it also works well in C#. However, to get it running we're going to need one more NuGet package: the F# Core library FSharp.Core.

You can download the housing database from the following link. Save the file as "california_housing.csv" in your project folder:

Create a new C# Console Project targeting the .NET Framework version 4.6.1 or higher, and add the following code:

var path = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "..");
path = Path.Combine(path, "..");
path = Path.Combine(path, "california_housing.csv");
var housing = Frame.ReadCsv(path, separators: ",");

This will load the file into memory as a Frame instance.
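The two ".." steps in the path code assume that the program runs from a folder two levels below the project folder, which is the usual bin\Debug layout of a classic .NET Framework console project. The following is a minimal sketch of that path logic on its own, without Deedle; the ResolveDataPath helper and the DataPathSketch class are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;

public static class DataPathSketch
{
    // Hypothetical helper: walks two folders up from baseDir and appends fileName.
    public static string ResolveDataPath(string baseDir, string fileName)
    {
        var combined = Path.Combine(baseDir, "..", "..", fileName);
        // GetFullPath collapses the ".." segments into a normalized absolute path.
        return Path.GetFullPath(combined);
    }

    public static void Main()
    {
        var path = ResolveDataPath(AppDomain.CurrentDomain.BaseDirectory, "california_housing.csv");
        Console.WriteLine(path);
        // Checking up front gives a clearer message than a failed load would.
        Console.WriteLine(File.Exists(path)
            ? "Data file found."
            : "Data file not found; check that it is saved in the project folder.");
    }
}
```

If your build output lands in a different folder depth, the number of ".." segments has to change accordingly.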

Deedle encourages you to think of your data in terms of columns. Let's grab three data columns:

// set up a few series
var total_rooms = housing["total_rooms"];
var median_house_value = housing["median_house_value"];
var median_income = housing["median_income"];

The median_house_value column is in the range from 0 to 500,000. To make these values more manageable, let's divide all values by 1,000:

// convert the house value range to thousands
median_house_value /= 1000;

Running A Simple Linear Regression

We are going to assume that there is a linear relationship between the total number of rooms in a housing block and the median house value of that same block. To test our hypothesis, we're going to run a linear regression on the data and see if we get a good fit.

We're going to use the machine learning classes in the Accord.NET library. You can find them in the Accord.MachineLearning package.

The Accord regression classes expect data in the form of arrays of double. So we need to convert our Deedle series into double arrays. The Values property is exactly what we need; it returns the values in the series as an enumerable sequence. So we get the following code:

// set up feature and label
var feature = total_rooms.Values.ToArray();
var labels = median_house_value.Values.ToArray();

This gets us both the input features (total_rooms) and the output labels (median_house_value) as arrays of double.

The next step is to pick the learning algorithm. We could use gradient descent, but since we're doing linear regression with only a single input feature, there is an even better solution that computes the best least-squares fit directly, without any iteration: the OrdinaryLeastSquares class. Here's how that works:

// train the model
var learner = new OrdinaryLeastSquares();
var model = learner.Learn(feature, labels);

This code snippet will run a linear regression on the data, using the ordinary least squares algorithm to find the optimal solution.
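For a single input feature, the least-squares line has a well-known closed form: the slope is the covariance of feature and label divided by the variance of the feature, and the intercept follows from the two means. The sketch below computes both by hand on a tiny made-up dataset. It is only an illustration of the math, not a claim about how Accord implements it internally; the SimpleOls class and its Fit method are hypothetical names.

```csharp
using System;
using System.Linq;

public static class SimpleOls
{
    // Closed-form least squares for one feature:
    //   slope     = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
    //   intercept = meanY - slope * meanX
    public static void Fit(double[] x, double[] y, out double slope, out double intercept)
    {
        if (x.Length != y.Length || x.Length < 2)
            throw new ArgumentException("x and y must have the same length (at least 2).");

        double meanX = x.Average();
        double meanY = y.Average();

        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < x.Length; i++)
        {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }

        if (sxx == 0.0)
            throw new ArgumentException("x must contain at least two distinct values.");

        slope = sxy / sxx;
        intercept = meanY - slope * meanX;
    }

    public static void Main()
    {
        // Made-up points that lie exactly on the line y = 2x + 1.
        double[] x = { 1, 2, 3, 4 };
        double[] y = { 3, 5, 7, 9 };

        double slope, intercept;
        Fit(x, y, out slope, out intercept);

        Console.WriteLine($"Slope:     {slope}");
        Console.WriteLine($"Intercept: {intercept}");
    }
}
```

Because the toy points sit exactly on a line, the fit recovers that line; on real data such as the housing set, the same formulas give the line that minimizes the sum of squared errors.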

We can access the discovered model parameters by reading the Slope and Intercept properties, like this:

// show results
Console.WriteLine($"Slope:       {model.Slope}");
Console.WriteLine($"Intercept:   {model.Intercept}");

Validating The Result

So is this a good fit? To find out, we must validate the model. We can do this by running every single feature through the model; this will yield a set of predictions. Then we can compare each prediction with the actual label, and calculate the Root Mean Squared Error (RMSE) value:

// validate the model
var predictions = model.Transform(feature);
var rmse = Math.Sqrt(new SquareLoss(labels).Loss(predictions));
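To make the metric concrete, here is a small sketch that computes the RMSE by hand: square each prediction error, average the squares, and take the square root. It assumes the usual definition of RMSE as the square root of the mean squared error; the RmseSketch class and the sample numbers are made up for illustration.

```csharp
using System;

public static class RmseSketch
{
    // Root Mean Squared Error: sqrt(mean((predicted - actual)^2)).
    public static double Rmse(double[] actual, double[] predicted)
    {
        if (actual.Length != predicted.Length || actual.Length == 0)
            throw new ArgumentException("Arrays must be non-empty and of equal length.");

        double sumOfSquares = 0.0;
        for (int i = 0; i < actual.Length; i++)
        {
            double error = predicted[i] - actual[i];
            sumOfSquares += error * error;
        }
        return Math.Sqrt(sumOfSquares / actual.Length);
    }

    public static void Main()
    {
        // Made-up labels and predictions, in thousands of dollars.
        double[] actual = { 100, 200, 300 };
        double[] predicted = { 110, 190, 300 };

        Console.WriteLine($"RMSE: {Rmse(actual, predicted):0.000}");
    }
}
```

Squaring the errors means that a few large misses raise the RMSE much more than many small ones.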

The RMSE measures the typical size of the prediction error, expressed in the same units as the labels (here, thousands of dollars). We can compare it to the range of labels to get a feel for the accuracy of the model:

var range = Math.Abs(labels.Max() - labels.Min());
Console.WriteLine($"Label range: {range}");
Console.WriteLine($"RMSE:        {rmse} {rmse/range*100:0.00}%");

Running this code gives the following result:

We get an RMSE of 114, which is more than 23% of the label range. That's not very good.

Let's plot the data and the regression line to get a better feel for the data. Accord.NET has a built-in graph library for quickly creating scatterplots and histograms. To use it, you first need to install the Accord.Controls NuGet package.

Now we need to get a little creative. Accord can work with separate x and y data arrays (corresponding nicely to our feature and labels variables), but we need to plot two data series: the labels and the model predictions. To get this to work, we need to concatenate the predictions and labels arrays into a single array. The following code sets up the combined x and y value arrays for the plot:

// generate plot arrays
var x = feature.Concat(feature).ToArray();
var y = predictions.Concat(labels).ToArray();

Finally, we need a third array to tell Accord what color to use when drawing the two series. We will generate an array with the value 1 for all predictions, and 2 for all labels:

// set up color array 
var colors1 = Enumerable.Repeat(1, labels.Length).ToArray();
var colors2 = Enumerable.Repeat(2, labels.Length).ToArray();
var c = colors1.Concat(colors2).ToArray();

And now we can generate the scatterplot:

// plot the data 
var plot = new Scatterplot("Training", "feature", "label");
plot.Compute(x, y, c);

When you run this code, you'll see this:

Our problem is that there is no clear linear relationship between the total number of rooms per block and the median house value. Our model has nothing to work with.

Let's try a different feature, for example the median_income series. Modify the code as follows:

// set up feature and label
var feature = median_income.Values.ToArray();

Run the code. Now you'll get this result:

Now we have an RMSE of 83.7 which is 17% of the label range.

The graph now looks like this:

Much better. There seems to be a linear relationship between the median income of everybody living in a housing block and the median house value of all houses in that block. Put differently: richer people buy more expensive houses.

However, we're not done yet. For some reason, the data file clips median house values at $500,000. That creates the red horizontal line in the plot, and it's affecting the slope. Since we don't know the exact house values for these income levels, it's best to just remove them from the data altogether:

// get data
var housing = Frame.ReadCsv(path, separators: ",");
housing = housing.Where(kv => ((decimal)kv.Value["median_house_value"]) < 500000);

This code snippet keeps all data points with a median house value below $500,000. Now when we run the code, we get the following result:

We get an RMSE of 73.9 which is 15% of the range of label values.
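A tiny synthetic example can show why clipped labels distort the fit. In the sketch below, the true relationship is y = 100x, but every label above 500 is clipped to 500, mimicking the cap in the data file. Fitting on the clipped data gives a flatter slope, while dropping the clipped points recovers the true one. All numbers and names here (ClippingSketch, Slope) are made up for illustration and do not come from the housing data.

```csharp
using System;
using System.Linq;

public static class ClippingSketch
{
    // Closed-form least-squares slope for a single feature.
    public static double Slope(double[] x, double[] y)
    {
        double meanX = x.Average(), meanY = y.Average();
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < x.Length; i++)
        {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        return sxy / sxx;
    }

    public static void Main()
    {
        double[] x = { 1, 2, 3, 4, 5, 6 };

        // True labels follow y = 100x; clipping caps them at 500.
        double[] clippedY = x.Select(v => Math.Min(100.0 * v, 500.0)).ToArray();

        // Keep only the points whose label lies strictly below the cap.
        int[] keep = Enumerable.Range(0, x.Length).Where(i => clippedY[i] < 500.0).ToArray();
        double[] filteredX = keep.Select(i => x[i]).ToArray();
        double[] filteredY = keep.Select(i => clippedY[i]).ToArray();

        Console.WriteLine($"Slope with clipped points:    {Slope(x, clippedY):0.00}");
        Console.WriteLine($"Slope without clipped points: {Slope(filteredX, filteredY):0.00}");
    }
}
```

Note that a strict "less than" filter also drops points that genuinely sit at the cap, which is the trade-off accepted in this exercise.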

The plot now looks like this:

Our model can now predict the median house value of a housing block from the median income of the people living in that block. With an RMSE of 73.9, the predictions are typically off by about $73,900 from the actual median value.

It's plausible that the median house value depends on the median income level per block, but there are obviously many other factors also affecting the house value. We're going to have to dig a little deeper to refine our model further.