Boston housing prices
This is a self-correcting activity generated by nbgrader. Fill in any place that says YOUR CODE HERE or YOUR ANSWER HERE. Run subsequent cells to check your code.
The goal of this activity is to predict the median price (in $1,000’s) of homes given their characteristics.
The dataset used here has ethical problems: scikit-learn deprecated load_boston in version 1.0 and removed it in version 1.2, so this activity requires an earlier release. It is kept here as an example of possible issues with Machine Learning.
The Boston Housing Prices dataset is frequently used to test regression algorithms.
The dataset contains information gathered in the 1970s concerning housing in the Boston suburban area. Each house has the following features.
| Feature | Description |
|---|---|
| 0 | Per capita crime rate by town |
| 1 | Proportion of residential land zoned for lots over 25,000 sq. ft. |
| 2 | Proportion of non-retail business acres per town |
| 3 | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
| 4 | Nitric oxides concentration (parts per 10 million) |
| 5 | Average number of rooms per dwelling |
| 6 | Proportion of owner-occupied units built prior to 1940 |
| 7 | Weighted distances to five Boston employment centres |
| 8 | Index of accessibility to radial highways |
| 9 | Full-value property-tax rate per $10,000 |
| 10 | Pupil-teacher ratio by town |
| 11 | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| 12 | Lower status of the population |
Environment setup
# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()
# Import ML packages
import sklearn
print(f'scikit-learn version: {sklearn.__version__}')
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor, LinearRegression
Step 1: Loading the data
dataset = load_boston()
# Describe the dataset
print(dataset.DESCR)
# Show a sample of raw training data
df_boston = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target to DataFrame
df_boston['MEDV'] = dataset.target
# Show 10 random samples
df_boston.sample(n=10)
Step 2: Preparing the data
Question
Store input data and labels into the x and y variables respectively.
# YOUR CODE HERE
print(f'x: {x.shape}. y: {y.shape}')
assert x.shape == (506, 13)
assert y.shape == (506,)
Question
Split the data into training and test sets, with 20% of the samples reserved for testing. Store the subsets in variables named x_train/y_train and x_test/y_test.
# YOUR CODE HERE
print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')
assert x_train.shape == (404, 13)
assert y_train.shape == (404,)
assert x_test.shape == (102, 13)
assert y_test.shape == (102,)
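The 20% split described above can be sketched on synthetic data as follows. The arrays x_demo and y_demo are hypothetical stand-ins with the same shape as the real dataset, and the random_state value is an arbitrary assumption used only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same shape as the dataset (506 samples, 13 features)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

# test_size=0.2 reserves 20% of the samples for the test set
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

print(f"train: {x_tr.shape}, test: {x_te.shape}")
```

With 506 samples, scikit-learn rounds the test set size up, which is why the assertions above expect 102 test samples and 404 training samples.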
Question
Scale features by standardization while preventing information leakage from the test set. This means standardization values (mean and standard deviation) should be computed on the training set only.
# YOUR CODE HERE
mean_train = x_train.mean()
std_train = x_train.std()
print(f'mean_train: {mean_train}. std_train: {std_train}')
mean_test = x_test.mean()
std_test = x_test.std()
print(f'mean_test: {mean_test}. std_test: {std_test}')
assert np.abs(np.max(mean_train)) < 10**-6
assert np.abs(np.max(std_train - 1)) < 10**-6
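One possible way to standardize without leaking test information is sketched below with StandardScaler, using synthetic arrays (x_tr_demo and x_te_demo are hypothetical names, not part of the activity). The key point is that fit_transform is called on the training data only, while the test data goes through transform with the statistics learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical unscaled data: mean around 5, standard deviation around 3
rng = np.random.default_rng(0)
x_tr_demo = rng.normal(loc=5.0, scale=3.0, size=(404, 13))
x_te_demo = rng.normal(loc=5.0, scale=3.0, size=(102, 13))

scaler = StandardScaler()
# Learn the per-feature mean and standard deviation from the training set only
x_tr_scaled = scaler.fit_transform(x_tr_demo)
# Reuse the training statistics on the test set (no call to fit here)
x_te_scaled = scaler.transform(x_te_demo)

print(f"train mean: {x_tr_scaled.mean():.6f}, train std: {x_tr_scaled.std():.6f}")
print(f"test mean: {x_te_scaled.mean():.6f}, test std: {x_te_scaled.std():.6f}")
```

The scaled test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1, which is expected: it was scaled with the training statistics.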
Step 3: Training a model
Question
Create an SGDRegressor instance and store it into the model variable. Fit this model on the training data.
# YOUR CODE HERE
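As a rough illustration of the create-then-fit pattern, the sketch below trains an SGDRegressor on synthetic, already standardized data. The names x_demo, y_demo, and sgd_demo are hypothetical, and the hyperparameter values shown are simply the library defaults written out explicitly plus an arbitrary random_state.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Hypothetical standardized features and a noisy linear target
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(404, 13))
true_w = rng.normal(size=13)
y_demo = x_demo @ true_w + 0.1 * rng.normal(size=404)

# Create the estimator, then fit it on the training data
sgd_demo = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
sgd_demo.fit(x_demo, y_demo)

# One learned weight per feature, plus an intercept
print(f"coef shape: {sgd_demo.coef_.shape}, intercept: {sgd_demo.intercept_}")
print(f"R^2 on training data: {sgd_demo.score(x_demo, y_demo):.3f}")
```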
Step 4: Evaluating the model
Question
Compute the training and test MSE into the mse_train and mse_test variables respectively. Store the model's predictions on the test set in y_test_pred, which the plot below uses.
# YOUR CODE HERE
print(f'Training MSE: {mse_train:.2f}. Test MSE: {mse_test:.2f}')
plt.scatter(y_test, y_test_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs Predicted Prices")
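For reference, the mean squared error is the average of the squared differences between actual and predicted values. The sketch below computes it by hand and with scikit-learn's mean_squared_error on a few made-up prices (y_true_demo and y_pred_demo are hypothetical values, not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted prices (in thousands of dollars)
y_true_demo = np.array([24.0, 21.6, 34.7, 33.4])
y_pred_demo = np.array([25.0, 20.6, 33.7, 35.4])

# Manual computation: mean of the squared errors
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
# Same quantity computed by scikit-learn
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

print(f"manual: {mse_manual:.2f}, sklearn: {mse_sklearn:.2f}")
```

Since the target is expressed in thousands of dollars, the MSE is in squared thousands of dollars; taking its square root gives an error in the same unit as the prices.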
Question
Go back to step 3 and try to obtain the best possible test MSE by tweaking the SGDRegressor parameters.
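One simple way to explore parameters is to loop over a few candidate values and compare the resulting test MSE. The sketch below does this for the initial learning rate eta0 on synthetic data; the candidate values and all variable names are arbitrary assumptions, and other SGDRegressor parameters such as alpha, max_iter, or learning_rate could be varied the same way.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical standardized features and a noisy linear target
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(506, 13))
y_demo = x_demo @ rng.normal(size=13) + 0.5 * rng.normal(size=506)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

# Try a few initial learning rates and record the test MSE for each
mse_by_eta0 = {}
for eta0 in [0.0001, 0.001, 0.01]:
    candidate = SGDRegressor(eta0=eta0, max_iter=1000, random_state=42)
    candidate.fit(x_tr, y_tr)
    mse_by_eta0[eta0] = mean_squared_error(y_te, candidate.predict(x_te))

best_eta0 = min(mse_by_eta0, key=mse_by_eta0.get)
for eta0, mse in mse_by_eta0.items():
    print(f"eta0={eta0}: test MSE={mse:.3f}")
print(f"best eta0: {best_eta0}")
```

Selecting parameters on the test set makes the final test MSE optimistic; a separate validation set or cross-validation would give a more honest estimate.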
Step 5: Use another regression algorithm
Question
Create and fit a LinearRegression instance, which uses the normal equation instead of gradient descent. Compute the training and test MSE for this instance (variables mse_train_n and mse_test_n), and store its predictions on the test set in y_test_pred_n, which the plot below uses. How does it compare to the SGDRegressor in this case?
# YOUR CODE HERE
# YOUR CODE HERE
print(f'Training MSE: {mse_train_n:.2f}. Test MSE: {mse_test_n:.2f}')
plt.scatter(y_test, y_test_pred_n)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs Predicted Prices")
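To make the link with the normal equation concrete, the sketch below solves it directly with NumPy on synthetic data and compares the result with a fitted LinearRegression. All variable names are hypothetical. Note that scikit-learn's implementation relies on a least-squares solver rather than forming the normal equation literally, but both give the same closed-form solution on well-conditioned data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features and a noisy linear target with an intercept of 3
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(404, 13))
y_demo = x_demo @ rng.normal(size=13) + 3.0 + 0.1 * rng.normal(size=404)

# Prepend a column of ones so the first coefficient acts as the intercept
x_bias = np.c_[np.ones(len(x_demo)), x_demo]

# Normal equation: (X^T X) theta = X^T y, solved without an explicit inverse
theta = np.linalg.solve(x_bias.T @ x_bias, x_bias.T @ y_demo)

lin_demo = LinearRegression().fit(x_demo, y_demo)

print("intercepts match:", np.allclose(theta[0], lin_demo.intercept_))
print("coefficients match:", np.allclose(theta[1:], lin_demo.coef_))
```

Unlike gradient descent, this approach has no learning rate to tune and needs no iterations, which is convenient for a small dataset like this one.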