
  Code Exercise: Analyze Data With Logistic Regression

Let's look at the median house value of our California housing data set. This is a continuous numeric-valued feature, but let's turn it into a class label instead.

We're going to train our model to identify high-value city blocks. We define 'high value' as a median house value at or above the 75th percentile of the housing market, which is around $265,000.

Let's introduce a new column to hold the class feature:

# add classification feature
housing["median_high_house_value"] = (housing.median_house_value > 265000).astype(float)

I've added a new column called median_high_house_value to the frame. The column contains the value of 1.0 for every housing block with a median value at or above $265,000, and the value 0.0 for all other housing blocks.
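If you prefer not to hard-code the $265,000 cutoff, pandas can compute the 75th percentile directly from the data. Here is a small sketch of that variation; the exact value it returns may differ slightly from $265,000:

# compute the 75th-percentile threshold from the data instead of hard-coding it
threshold = housing.median_house_value.quantile(0.75)
housing["median_high_house_value"] = (housing.median_house_value >= threshold).astype(float)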

So what would happen if we train a simple linear regressor on this data? Sure, we'll get a continuous numeric-valued label as output, but couldn't we simply interpret this value as a probability?

Well, let's give it a try. First, let's shuffle the data set:

# shuffle housing data
housing = housing.reindex(np.random.permutation(housing.index))

And create training, validation, and test partitions:

# partition into training, validation, and test sets
training = housing[0:12000]
validation = housing[12000:14500]
test = housing[14500:17000]

Now we're going to build a list of feature columns and descriptors for the model:

# set up feature column names and descriptors
feature_columns = [
    "latitude",
    "longitude",
    "housing_median_age",
    "total_rooms",
    "total_bedrooms",
    "population",
    "households",
    "median_income"]
feature_column_desc = set([tf.feature_column.numeric_column(f) for f in feature_columns])

With these columns, we can set up a model:

# set up the model
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate = 0.000001
)
optimizer = tf.contrib.estimator.clip_gradients_by_norm(optimizer, 5.0)
model = tf.estimator.LinearRegressor(
    feature_columns = feature_column_desc,
    optimizer = optimizer
)

And run the training:

# train the model
_ = model.train(
    input_fn = lambda: input_fn(
        training[feature_columns], 
        training.median_high_house_value),
    steps = 200
)
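The training call relies on an input_fn helper that isn't shown in this listing. For reference, here is a minimal sketch of what such a function could look like, built on the tf.data API; the actual helper used in this course may differ in details such as batch size and shuffling:

# minimal sketch of the input function (assumed; yours may differ)
def input_fn(features, labels, epochs = None, batch_size = 1):
    # convert the pandas features into a dict of numpy arrays
    features = {key: np.array(value) for key, value in dict(features).items()}
    # build a dataset of (features, labels) pairs, then batch and repeat it
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.batch(batch_size).repeat(epochs)
    # return the next batch of features and labels
    return dataset.make_one_shot_iterator().get_next()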

Let's print out the model weights too:

# print model weights
weight_values = [model.get_variable_value("linear/linear_model/%s/weights" % name)[0][0] 
                 for name in feature_columns]
print(weight_values)

Run a validation:

# validate the model
prediction_set = model.predict(
    input_fn = lambda: input_fn(
        validation[feature_columns], 
        validation.median_high_house_value,
        epochs = 1)
)

And get a list of predictions:

# get prediction values
prediction_values = np.array([item['predictions'][0] for item in prediction_set])


So are these predictions any good? Let's find out by putting them in a histogram:

# plot a histogram of predictions
plt.figure(figsize=(13, 8))
plt.title("Predictions histogram")
plt.xlabel("prediction value")
plt.ylabel("frequency")
plt.hist(prediction_values, bins = 50)
plt.show()


Here is the output of the program:

And the histogram looks like this:

It's a nice distribution of values, but we get some negative predictions and a long tail of predictions above 1.0.

This is going to be a problem when we try to calculate the loss. Remember: logistic regression uses Log Loss, which is defined as:

$$L = -\sum_{i=1}^{n} \left[\, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \,\right]$$

where $y_i$ is the actual label of example $i$ and $\hat{y}_i$ is the model's prediction for it.

Both negative predictions and predictions above 1.0 will cause one of the two log arguments to become negative, which is not allowed.
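A quick numeric check makes the problem concrete. This is just an illustration, not part of the exercise code, and it assumes numpy is imported as np like in the rest of the exercise:

# illustrative sketch: Log Loss breaks down for predictions outside [0, 1]
def log_loss(y, y_hat):
    # Log Loss for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(log_loss(1.0, 0.8))   # fine: roughly 0.22
print(log_loss(0.0, -0.1))  # np.log(-0.1) is undefined -> nan
print(log_loss(1.0, 1.3))   # np.log(1 - 1.3) is undefined -> nan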

We could just clip the predictions at zero or shift the entire distribution to the right, but that would be cheating. It would also make the model very brittle.

So instead, let's switch to logistic regression.


Logistic Regression

Changing the code to use logistic regression is easy. All we need to do is change the model like this:

# set up the model
....
model = tf.estimator.LinearClassifier(
    feature_columns = feature_column_desc,
    optimizer = optimizer
)

The only change is that we're now using a LinearClassifier instead of a LinearRegressor.

The rest of the code stays the same, except for the part where we validate the model:

# validate the model
probability_set = model.predict(
    input_fn = lambda: input_fn(
        validation[feature_columns], 
        validation.median_high_house_value,
        epochs = 1)
)

Note that this code is basically unchanged, except for the fact that I'm storing the validation results in probability_set, not prediction_set. I renamed the variable to highlight that the model is now returning probability values.

The code to get the list of predictions now becomes:

# get prediction values
probability_values = np.array([item['probabilities'][0] for item in probability_set])

The set now contains records keyed with 'probabilities' instead of 'predictions'. I've also renamed the prediction_values variable to probability_values for clarity.
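As an aside, the probabilities array holds one entry per class: index 0 is the probability of class 0 (a regular block) and index 1 is the probability of class 1 (a high-value block). A small variation of the line above takes the positive-class probability directly, which avoids the inversions used later on; note that it replaces the line above rather than running in addition to it, because the prediction generator can only be consumed once:

# variation: take the probability of the positive class (index 1) directly
probability_values = np.array([item['probabilities'][1] for item in probability_set])

In this lecture I'll stick with index 0 and invert the values where needed.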

When I run the modified code, I get the following output:

And the histogram now looks like this:

Much better! The values are now nicely restricted to a range from 0.0 to 1.0.


Classification Metrics

Let's collect some metrics to evaluate the predictions of this classification model.

The first thing we need to do is translate the probability values to actual true/false labels. We'll use a threshold of 0.7:

# get prediction values using a threshold of 0.7
prediction_values = [1 if value < 0.7 else 0 for value in probability_values]

Note that I'm converting a probability below 0.7 to 1, and a probability at or above 0.7 to 0. The predictions are inverted because probability_values holds the probability of class 0, so a low value actually indicates a high-value block. Inverting the predictions here (and the probabilities later on) is a quick way to account for that.

Next, I'll use the metrics.confusion_matrix function to compare the predictions and labels and calculate all metrics for us:

# get classification scores
confusion = metrics.confusion_matrix(validation.median_high_house_value, prediction_values)
tn, fp, fn, tp = confusion.ravel()

The confusion matrix is a 2x2 matrix with the actual classes along the rows and the predicted classes along the columns, so it holds the true negative, false positive, false negative, and true positive counts. The ravel function flattens the matrix row by row into a tuple of variables in exactly that order.
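A tiny toy example (not part of the exercise) shows the layout sklearn uses:

# toy example: sklearn lays the confusion matrix out as [[tn, fp], [fn, tp]]
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(metrics.confusion_matrix(y_true, y_pred).ravel())  # prints [1 1 1 2]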

All that remains is to print the variables:

# print classification scores
print("True Positives: ", tp)
print("True Negatives: ", tn)
print("False Positives:", fp)
print("False Negatives:", fn)

We can now calculate and print the TPR and FPR:

# get tpr and fpr
tpr = 1.0 * tp / (tp + fn)
fpr = 1.0 * fp / (fp + tn)
print("TPR:", tpr)
print("FPR:", fpr)

And the accuracy, precision, and recall:

# print other metrics
accuracy = 1.0 * (tp + tn) / (tp + tn + fp + fn)
precision = 1.0 * tp / (tp + fp)
recall = 1.0 * tp / (tp + fn)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall

Here is the output of the code:

We get an accuracy of 0.64, a precision of 0.37, and a recall of 0.73.

The model is correct 64% of the time. That seems pretty good, but of all the positive predictions it makes, only 37% are correct. And of all the actual high-value blocks, it correctly identifies 73%.
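As a quick sanity check, sklearn also provides these metrics out of the box. This is an extra, not part of the exercise output above, and the results should match the hand-calculated values:

# cross-check with sklearn's built-in metric functions
print("Accuracy: ", metrics.accuracy_score(validation.median_high_house_value, prediction_values))
print("Precision:", metrics.precision_score(validation.median_high_house_value, prediction_values))
print("Recall:   ", metrics.recall_score(validation.median_high_house_value, prediction_values))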


ROC And AUC

To wrap up, let's calculate the ROC curve and the AUC value.

To calculate the AUC value, we first need to evaluate our model. This will apply the model to the validation set and calculate a bunch of evaluation metrics, including the AUC:

# get the auc value
print ("Evaluating ML model...")
evaluation_set = model.evaluate(
    input_fn = lambda: input_fn(
        validation[feature_columns], 
        validation.median_high_house_value,
        epochs = 1)
)
print ("AUC:", evaluation_set["auc"])

After evaluation, the AUC value is available in the object field named "auc".

To plot the ROC curve, I can use the function metrics.roc_curve in the sklearn module:

# get the roc curve
fpr, tpr, thresholds = metrics.roc_curve(validation.median_high_house_value, 1 - probability_values)

Note how I again invert the probability values: probability_values holds the probability of class 0, so 1 - probability_values gives the probability of a positive.

Calling roc_curve will calculate the ROC curve and return three arrays for the FPR, TPR, and thresholds respectively.
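As a cross-check, the AUC can also be computed directly from these arrays; it should come out close to the value reported by the evaluate call above:

# compute the AUC from the ROC curve as a cross-check
print("AUC (sklearn):", metrics.auc(fpr, tpr))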

Finally, I need to plot the arrays like this:

# plot the roc curve
plt.title("Area under ROC curve")
plt.xlabel("FPR (1 - specificity)")
plt.ylabel("TPR (sensitivity)")
plt.plot(fpr, tpr, label="ROC curve")
plt.plot([0, 1], [0, 1], label="Random classifier")
plt.show()

Note that I add a straight line from the origin to [1, 1]. This corresponds to a classifier that produces completely random results. I want my ROC curve to be above this straight line.

Running the code produces the following output:

And the ROC curve looks like this:

We get an AUC of 0.72 which, according to the table in the previous lecture, corresponds to a classifier with "fair" predictive ability.