The Boston Housing Dataset

Objectives

  1. Analyse and explore the Boston house price data
  2. Split the data for training and testing
  3. Run a Multivariable Regression
  4. Evaluate the model's coefficients and residuals
  5. Use data transformation to improve the model performance
  6. Use the model to estimate a property price

Import Statements

Notebook Presentation

Load the Data

Understand the Boston House Price Dataset


Characteristics:

Number of Instances: 506 

Number of Attributes: 13 numeric/categorical predictive. The Median Value (attribute 14) is the target.

Attribute Information (in order):
    1. CRIM     per capita crime rate by town
    2. ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    3. INDUS    proportion of non-retail business acres per town
    4. CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    5. NOX      nitric oxides concentration (parts per 10 million)
    6. RM       average number of rooms per dwelling
    7. AGE      proportion of owner-occupied units built prior to 1940
    8. DIS      weighted distances to five Boston employment centres
    9. RAD      index of accessibility to radial highways
    10. TAX     full-value property-tax rate per $10,000
    11. PTRATIO pupil-teacher ratio by town
    12. B       1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    13. LSTAT   % lower status of the population
    14. PRICE   Median value of owner-occupied homes in $1000's

Preliminary Data Exploration

Data Cleaning - Check for Missing Values and Duplicates

Descriptive Statistics

Visualise the Features

House Prices

There is a spike in the number of homes at the very right tail at the $50,000 mark.

Distance to Employment - Length of Commute

Most homes are about 3.8 miles away from work. There are fewer and fewer homes the further out we go.

Number of Rooms

Access to Highways

RAD is an index of accessibility to roads. Better access to a highway is represented by a higher number. There's a big gap in the values of the index.

Next to the River

Understand the Relationships in the Data

Run a Pair Plot

Distance from Employment vs. Pollution

We see that pollution goes down as we go further and further out of town. This makes intuitive sense. However, even at the same distance of 2 miles to employment centres, we can get very different levels of pollution. By the same token, DIS of 9 miles and 12 miles have very similar levels of pollution.

Proportion of Non-Retail Industry vs Pollution

% of Lower Income Population vs Average Number of Rooms

In the top left corner we see that for all the homes with 8 or more rooms, LSTAT is well below 10%.

% of Lower Income Population versus Home Price

Number of Rooms versus Home Value

Again, we see those homes at the $50,000 mark all lined up at the top of the chart. Perhaps there was some sort of cap or maximum value imposed during data collection.
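
One quick way to check for a cap is to count how many observations sit exactly at the maximum. The sketch below uses a handful of made-up prices (in $1000s), not values from the dataset, and the helper name `share_at_max` is hypothetical.

```python
import numpy as np

def share_at_max(values):
    """Return (maximum, count at maximum, share of observations at maximum)."""
    arr = np.asarray(values, dtype=float)
    top = arr.max()
    count = int((arr == top).sum())
    return top, count, count / arr.size

# Hypothetical prices in $1000s, for illustration only
prices = [21.6, 34.7, 50.0, 18.9, 50.0, 24.0, 50.0, 15.2]
top, count, share = share_at_max(prices)
print(f"max={top}, count={count}, share={share:.1%}")
```

A large share of observations piled up on a single maximum value is consistent with top-coding, although it does not prove it.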

Split Training & Test Dataset
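
The notebook's own split code is not shown here. As a minimal sketch, a random split can be done with a shuffled index. The 80/20 ratio, the seed and the function name `split_train_test` are assumptions for illustration; scikit-learn's `train_test_split` does the same job.

```python
import numpy as np

def split_train_test(X, y, test_size=0.2, seed=10):
    """Shuffle row indices once, then slice them into test and train parts."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(round(len(y) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 rows, 2 features (not the Boston data)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = split_train_test(X, y)
print(X_train.shape, X_test.shape)
```

Using the same index array for features and target keeps each row paired with its own price.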

Multivariable Regression

$$ PR \hat ICE = \theta _0 + \theta _1 RM + \theta _2 NOX + \theta _3 DIS + \theta _4 CHAS + ... + \theta _{13} LSTAT $$

Run the Regression
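
To make the equation concrete, here is a minimal sketch that estimates the thetas by ordinary least squares with NumPy. It runs on synthetic data with three made-up features rather than the thirteen Boston features, and the function names are hypothetical. The notebook itself may well use a library model instead.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Return [theta_0, theta_1, ...] from an ordinary least squares fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so theta_0 acts as the intercept
    design = np.column_stack([np.ones(len(y)), X])
    theta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return theta

def predict(theta, X):
    """Apply the fitted equation: theta_0 + theta_1 * x_1 + ..."""
    X = np.asarray(X, dtype=float)
    return theta[0] + X @ theta[1:]

# Synthetic data with known coefficients (assumed values, for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([22.0, 3.0, -1.5, 0.5])
y = true_theta[0] + X @ true_theta[1:] + rng.normal(scale=0.1, size=200)

theta = fit_linear_regression(X, y)
print(np.round(theta, 2))
```

With little noise in the synthetic target, the fitted values land close to the coefficients used to generate it.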

Evaluate the Coefficients of the Model

Premium for having an extra room

Analyse the Estimated Values & Regression Residuals

Residuals

We see that the residuals have a skewness of 1.46. There could be some room for improvement here.
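
Residuals are the actual values minus the predicted values. As a rough sketch, their mean and skewness can be computed as below. The prices are made up for illustration, and the formula is the plain moment-based skewness. Some libraries apply a small-sample correction, so their figures can differ slightly.

```python
import numpy as np

def skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    arr = np.asarray(values, dtype=float)
    centred = arr - arr.mean()
    m2 = np.mean(centred ** 2)
    m3 = np.mean(centred ** 3)
    return m3 / m2 ** 1.5

# Hypothetical actual and predicted prices in $1000s (not from the dataset)
actual = np.array([24.0, 21.6, 34.7, 33.4, 50.0, 18.9])
predicted = np.array([25.1, 22.0, 33.0, 31.9, 41.5, 19.8])

residuals = actual - predicted
print(f"mean={residuals.mean():.2f}, skew={skewness(residuals):.2f}")
```

A positive skew means a few large under-predictions stretch the right tail, which is what a value like 1.46 indicates.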

Data Transformations for a Better Fit

The log prices have a skew that's closer to zero. This makes them a good candidate for use in our linear model. Perhaps using log prices will improve our regression's r-squared and our model's residuals.
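
The effect of the transformation can be sketched on synthetic data. The example below draws right-skewed values from a log-normal distribution (an assumption chosen for illustration, not a claim about the Boston prices) and compares the skew before and after taking logs.

```python
import numpy as np

def skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    arr = np.asarray(values, dtype=float)
    centred = arr - arr.mean()
    return np.mean(centred ** 3) / np.mean(centred ** 2) ** 1.5

# Synthetic right-skewed "prices" (not the Boston data)
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=3.0, sigma=0.4, size=1000)

log_prices = np.log(prices)
raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"raw skew={raw_skew:.2f}, log skew={log_skew:.2f}")

# np.exp undoes the transformation when we need prices again
recovered = np.exp(log_prices)
```

Taking logs compresses the long right tail, which pulls the skew towards zero.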

Regression using Log Prices

$$ \log (PR \hat ICE) = \theta _0 + \theta _1 RM + \theta _2 NOX + \theta_3 DIS + \theta _4 CHAS + ... + \theta _{13} LSTAT $$

This time we got an r-squared of 0.79 compared to 0.75. This looks like a promising improvement.
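
For reference, r-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is a small hypothetical helper. Note that for the log model the figure is measured on log prices, so the comparison with the first model is indicative rather than strictly like for like.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only
y_true = np.array([3.0, 3.2, 2.9, 3.5])
perfect = r_squared(y_true, y_true)
baseline = r_squared(y_true, np.full(4, y_true.mean()))
print(perfect, baseline)
```

A model that always predicts the mean scores 0, and a perfect fit scores 1.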

Evaluating Coefficients with Log Prices
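
With log prices as the target, a coefficient is no longer a dollar amount. As a brief sketch of the standard log-linear reading (a general result, not taken from the notebook's output): one extra room, with the other features held fixed, adds $\theta _1$ to the log price, which multiplies the estimated price by $e^{\theta _1}$:

$$ \frac{PR \hat ICE_{RM + 1}}{PR \hat ICE_{RM}} = e^{\theta _1} $$

For small coefficients $e^{\theta _1} - 1 \approx \theta _1$, so a coefficient of 0.05 would correspond to roughly a 5% change in price (0.05 is an illustrative value, not a fitted one).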

Regression with Log Prices & Residual Plots

Our new regression residuals have a skew of 0.09 compared to a skew of 1.46. The mean is still around 0. From both a residuals perspective and an r-squared perspective we have improved our model with the data transformation.

Compare Out of Sample Performance

By definition, the model has not been optimised for the testing data, so we should expect performance to be somewhat worse than on the training data. However, our r-squared still remains high, so we have built a useful model.

Predict a Property's Value using the Regression Coefficients

Our preferred model now has an equation that looks like this:

$$ \log (PR \hat ICE) = \theta _0 + \theta _1 RM + \theta _2 NOX + \theta_3 DIS + \theta _4 CHAS + ... + \theta _{13} LSTAT $$

The average property has the mean value for all its characteristics:

A property with an average value for all the features has a value of $20,700.
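
Because the model predicts the log of the price in $1000s, the estimate has to be exponentiated and multiplied by 1000 to get dollars. A log estimate of about 3.03, for example, corresponds to roughly $20,700. The sketch below shows the conversion with made-up coefficients and only two features. None of the numbers are the fitted values.

```python
import math

def estimate_price_dollars(intercept, coefficients, features):
    """Evaluate the log-price equation, then convert back to dollars."""
    log_estimate = intercept + sum(c * f for c, f in zip(coefficients, features))
    # Undo the log, then scale from $1000s to dollars
    return math.exp(log_estimate) * 1000

# Hypothetical intercept, coefficients and feature values (illustration only)
intercept = 3.0
coefficients = [0.1, -0.02]   # e.g. thetas for RM and LSTAT
features = [6.0, 12.0]        # e.g. 6 rooms, LSTAT of 12%

dollars = estimate_price_dollars(intercept, coefficients, features)
print(f"${dollars:,.0f}")
```

To value a different property, change the entries in `features` and leave the rest at their mean values.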

Keeping the average values for CRIM, RAD, INDUS and others, value a property with the following characteristics: