Sold! A House Price Prediction Pipeline with Linear Regression
“Prediction is very difficult, especially if it’s about the future.”
- Niels Bohr
Introduction
Today I’ll be creating a linear regression model to predict house sale prices, training it on a data set describing the sale of individual residential properties in Ames, Iowa from 2006 to 2010.
I’m building a linear regression workflow, creating a pipeline of functions as follows:
Data In ->
transform_features()
->select_features()
->train_and_test()
->Evaluation Out
The goal is to enable rapid iteration on different models. Once my function pipeline is set up, it should be easy to experiment by passing in different arguments and evaluating the results to determine their impact on the accuracy of my model’s house sale price predictions.
transform_features()
This is where I will do my feature engineering - imputing values and stripping out nulls, unwanted data types, columns with data leakage and incorrect values.
select_features()
I’ll use this function to select columns to include as features. I’ll look at correlation, in-column variance and at transforming nominal and ordinal data to categorical values.
train_and_test()
Here I’ll train and test the model, building it to handle holdout and k-fold cross validation.
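In code, the finished pipeline will be driven roughly like this - a minimal sketch of the intended call pattern (the real implementations are built up step by step below):
import pandas as pd
# Sketch: each stage feeds the next
data = pd.read_csv('AmesHousing.tsv', sep='\t')
transformed = transform_features(data)   # feature engineering
selected = select_features(transformed)  # feature selection
rmse = train_and_test(selected)          # train, test and return the RMSE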
Data Set
The data set contains 2930 observations and 80 explanatory variables involved in assessing home values.
- Data column descriptions here.
- Info about why the data was collected here
- The dataset itself can be downloaded from here
Let’s import the data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, KFold, train_test_split
data = pd.read_csv('AmesHousing.tsv', sep='\t')
pd.set_option('display.max_columns', 100)
data.head()
Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Year Built | Year Remod/Add | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Fireplace Qu | Garage Type | Garage Yr Blt | Garage Finish | Garage Cars | Garage Area | Garage Qual | Garage Cond | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | Sale Condition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 526301100 | 20 | RL | 141 | 31770 | Pave | nan | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 5 | 1960 | 1960 | Hip | CompShg | BrkFace | Plywood | Stone | 112 | TA | TA | CBlock | TA | Gd | Gd | BLQ | 639 | Unf | 0 | 441 | 1080 | GasA | Fa | Y | SBrkr | 1656 | 0 | 0 | 1656 | 1 | 0 | 1 | 0 | 3 | 1 | TA | 7 | Typ | 2 | Gd | Attchd | 1960 | Fin | 2 | 528 | TA | TA | P | 210 | 62 | 0 | 0 | 0 | 0 | nan | nan | nan | 0 | 5 | 2010 | WD | Normal | 215000 |
2 | 526350040 | 20 | RH | 80 | 11622 | Pave | nan | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Feedr | Norm | 1Fam | 1Story | 5 | 6 | 1961 | 1961 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | CBlock | TA | TA | No | Rec | 468 | LwQ | 144 | 270 | 882 | GasA | TA | Y | SBrkr | 896 | 0 | 0 | 896 | 0 | 0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 0 | nan | Attchd | 1961 | Unf | 1 | 730 | TA | TA | Y | 140 | 0 | 0 | 0 | 120 | 0 | nan | MnPrv | nan | 0 | 6 | 2010 | WD | Normal | 105000 |
3 | 526351010 | 20 | RL | 81 | 14267 | Pave | nan | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 6 | 1958 | 1958 | Hip | CompShg | Wd Sdng | Wd Sdng | BrkFace | 108 | TA | TA | CBlock | TA | TA | No | ALQ | 923 | Unf | 0 | 406 | 1329 | GasA | TA | Y | SBrkr | 1329 | 0 | 0 | 1329 | 0 | 0 | 1 | 1 | 3 | 1 | Gd | 6 | Typ | 0 | nan | Attchd | 1958 | Unf | 1 | 312 | TA | TA | Y | 393 | 36 | 0 | 0 | 0 | 0 | nan | nan | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
4 | 526353030 | 20 | RL | 93 | 11160 | Pave | nan | Reg | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 7 | 5 | 1968 | 1968 | Hip | CompShg | BrkFace | BrkFace | None | 0 | Gd | TA | CBlock | TA | TA | No | ALQ | 1065 | Unf | 0 | 1045 | 2110 | GasA | Ex | Y | SBrkr | 2110 | 0 | 0 | 2110 | 1 | 0 | 2 | 1 | 3 | 1 | Ex | 8 | Typ | 2 | TA | Attchd | 1968 | Fin | 2 | 522 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | 0 | 4 | 2010 | WD | Normal | 244000 |
5 | 527105010 | 60 | RL | 74 | 13830 | Pave | nan | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 5 | 5 | 1997 | 1998 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | Gd | TA | No | GLQ | 791 | Unf | 0 | 137 | 928 | GasA | Gd | Y | SBrkr | 928 | 701 | 0 | 1629 | 0 | 0 | 2 | 1 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1997 | Fin | 2 | 482 | TA | TA | Y | 212 | 34 | 0 | 0 | 0 | 0 | nan | MnPrv | nan | 0 | 3 | 2010 | WD | Normal | 189900 |
Removing Outliers
As per the special notes in the documentation, I remove the 5 outliers shown below.
# Plot the outliers
plt.figure(figsize=(8,6))
x = data.loc[data['Gr Liv Area'] > 4000, 'SalePrice']
y = data.loc[data['Gr Liv Area'] > 4000, 'Gr Liv Area']
plt.scatter(x, y)
x = data.loc[data['Gr Liv Area'] <= 4000, 'SalePrice']
y = data.loc[data['Gr Liv Area'] <= 4000, 'Gr Liv Area']
plt.scatter(x, y)
plt.xlabel('House Sale Price')
plt.ylabel('Living Area (square feet)')
plt.title('Outliers for Removal', fontsize=14, fontweight='bold')
plt.show()
print('Before:', data[data['Gr Liv Area'] > 4000].shape)
# Drop the outliers
data.drop(data[data['Gr Liv Area'] > 4000].index, inplace=True)
# Check
print('After:', data[data['Gr Liv Area'] > 4000].shape)
Before: (5, 82)
After: (0, 82)
Set Up Basic Pipeline
Let’s begin by setting up my basic pipeline of helper functions that will let me quickly iterate on different models.
transform_features()
For now, this just returns the dataframe.
select_features()
For now, this just returns the Gr Liv Area and SalePrice columns from the data dataframe.
train_and_test()
For now, this:
- splits the data dataframe into train and test
- trains a model using all numerical columns except our target SalePrice from the dataframe returned by select_features()
- tests the model on the test set and evaluates it, returning the RMSE value
A Note on RMSE
The RMSE (root mean squared error) measures the absolute fit of the model to the data - how close the model’s predicted values are to the observed values. It is the standard deviation of the prediction errors, so it tells us the average variability of our predictions above and below the line of best fit, or regression line. As such, the lower the RMSE, the closer our predictions are to the observed data, and the more accurate we can say our model is at predicting the response.
The RMSE is in the same units as our target variable, which makes for an intuitive understanding of the model’s evaluation.
e.g. if my RMSE is 50,000, I can say that on average my model is predicting roughly $50,000 above or below the observed house sale price.
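As a quick illustration of the calculation, with toy numbers rather than values from the dataset:
import numpy as np
# Toy example: three observed sale prices vs three model predictions
observed = np.array([200000, 150000, 320000])
predicted = np.array([210000, 140000, 300000])
# RMSE = square root of the mean of the squared errors
rmse = np.sqrt(np.mean((observed - predicted) ** 2))
print(rmse)  # ~14142, i.e. predictions are off by roughly $14K on average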
def transform_features(data):
return data
def select_features(data):
return data[['Gr Liv Area', 'SalePrice']]
def train_and_test(data):
train, test = train_test_split(data, test_size=0.50, random_state = 1)
target = 'SalePrice'
features_df = select_features(data)
feature_cols = features_df.select_dtypes(include=['integer', 'float']).drop('SalePrice', axis=1).columns
lr = LinearRegression()
lr.fit(train[feature_cols], train[target])
predictions = lr.predict(test[feature_cols])
mse = mean_squared_error(predictions, test[target])
rmse = np.sqrt(mse)
return rmse
# Return RMSE
train_and_test(data)
54020.48763702078
Feature Engineering
Let’s experiment with transforming our features and then, once I’m satisfied, build these transformations into my transform_features() function.
Nulls / Missing Values
First I want to get a sense of the distribution of missing values in the data. I’m going to use the following to help me out:
- Heatmap of null values
- Null count
# Build heatmap to check for nulls
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
def plot_null_matrix(df, figsize=(18,14)):
# Initiate the figure
plt.figure(figsize=figsize)
# Create a boolean dataframe based on whether values are null
df_null = df.isnull()
# Create a heatmap of the boolean dataframe: null=dark, non-null = light
sns.heatmap(~df_null, cbar=False, yticklabels=False)
plt.xticks(rotation=90, size='x-large')
plt.title('Missing Values in Housing Data \n (black = missing)', fontsize=14, fontweight='bold')
plt.savefig('ames_heatmap.png')
plt.show()
plot_null_matrix(data, figsize=(10,6))
Some columns appear almost totally null. Let’s get some quantitative data around this.
# Get percentage of nulls in a col
(data.isnull().sum().sort_values(ascending=False) / data.shape[0] *100)[:30]
Pool QC 99.623932
Misc Feature 96.410256
Alley 93.230769
Fence 80.478632
Fireplace Qu 48.615385
Lot Frontage 16.752137
Garage Qual 5.435897
Garage Yr Blt 5.435897
Garage Cond 5.435897
Garage Finish 5.435897
Garage Type 5.367521
Bsmt Exposure 2.837607
BsmtFin Type 2 2.769231
BsmtFin Type 1 2.735043
Bsmt Cond 2.735043
Bsmt Qual 2.735043
Mas Vnr Type 0.786325
Mas Vnr Area 0.786325
Bsmt Full Bath 0.068376
Bsmt Half Bath 0.068376
Garage Area 0.034188
Garage Cars 0.034188
Total Bsmt SF 0.034188
Bsmt Unf SF 0.034188
BsmtFin SF 2 0.034188
BsmtFin SF 1 0.034188
Electrical 0.034188
Exterior 2nd 0.000000
Exterior 1st 0.000000
Roof Matl 0.000000
dtype: float64
Strategies for missing values
- I set a null threshold of 5% and will drop all columns with > 5% missing values.
- I’ll be sure to include the null threshold as a parameter in my function so that I can experiment with different values for this threshold to see if it impacts model performance
- For nulls in the remaining numeric columns, I will replace missing values with the average value
- For the remaining text columns with missing values, I will drop the rows where there are missing values
1. Drop all columns with > 5% missing values
# Shape before drop
print('Before:', data.shape)
# Get cols to drop, those with missing values > 5%
nulls_over_5perc = data.isnull().sum()[data.isnull().sum() > data.shape[0]/20]
# Drop these cols
data.drop(nulls_over_5perc.index, axis=1, inplace=True)
# Shape after drop
print('After:', data.shape)
Before: (2925, 82)
After: (2925, 71)
2. For nulls in the remaining numeric columns, replace with the ‘average’ value
What statistic should I use as my average? Let’s take a look at the data.
As per the documentation, the remaining numeric columns with nulls are all on the interval or ratio scale, with continuous and discrete values. Now, what about the distribution of values?
# Get remaining null cols
remaining_nulls = data.isnull().sum()[data.isnull().sum() > 0]
# Get Numeric columns with nulls
numeric_nulls = data[remaining_nulls.index].select_dtypes(include=np.number)
cols = numeric_nulls.columns
print(cols)
# Create figure showing distributions of the columns with nulls
n = len(cols)
fig = plt.figure(figsize=(30, 5 * n))
fig.subplots_adjust(wspace=.5, hspace=.5)
fig.suptitle('Distributions of Variables With Null Values', y=.92, fontsize=30, fontweight='bold')
for r in range(n):
    # Boxplot in column 1 of the figure
    ax = fig.add_subplot(n, 2, 2 * r + 1)
    sns.boxplot(x=numeric_nulls[cols[r]])
    ax.set_xlabel('')
    ax.set_title("'" + str(cols[r]) + "' outliers", fontsize=16)
    # KDE in column 2 of the figure
    ax = fig.add_subplot(n, 2, 2 * r + 2)
    numeric_nulls[cols[r]].plot.kde()
    ax.set_xlim(numeric_nulls[cols[r]].min(), numeric_nulls[cols[r]].max())
    ax.set_title("'" + str(cols[r]) + "' shape", fontsize=16)
plt.show()
Index(['Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF',
'Total Bsmt SF', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Garage Cars',
'Garage Area'],
dtype='object')
I can see the distributions are heavily skewed, so I decide to impute using the median value, as it is more resistant to the effects of outliers.
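A quick numeric check supports this choice - for right-skewed columns the mean sits noticeably above the median:
# Compare mean and median for the numeric columns containing nulls;
# a mean well above the median indicates right skew
print(numeric_nulls.agg(['mean', 'median']).T)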
# Replace nulls with median value
data[numeric_nulls.columns] = data[numeric_nulls.columns].fillna(data[numeric_nulls.columns].median())
# Check no nulls remain for above numeric cols
data.isnull().sum().sort_values(ascending=False)[:10]
Bsmt Exposure 83
BsmtFin Type 2 81
BsmtFin Type 1 80
Bsmt Cond 80
Bsmt Qual 80
Mas Vnr Type 23
Electrical 1
SalePrice 0
Exterior 1st 0
Year Remod/Add 0
dtype: int64
3. For the remaining text columns with missing values, I drop the rows where there are missing values
# Ensure remaining cols with nulls are text cols
print(data[data.isnull().sum()[data.isnull().sum() > 0].index].dtypes)
print('\n')
print('Before:', data.shape)
# Drop rows with any nulls
data.dropna(inplace=True)
# Check
print('After:', data.shape)
data.isnull().sum().sort_values(ascending=False)[:10]
Mas Vnr Type object
Bsmt Qual object
Bsmt Cond object
Bsmt Exposure object
BsmtFin Type 1 object
BsmtFin Type 2 object
Electrical object
dtype: object
Before: (2925, 71)
After: (2817, 71)
SalePrice 0
Mas Vnr Area 0
Year Remod/Add 0
Roof Style 0
Roof Matl 0
Exterior 1st 0
Exterior 2nd 0
Mas Vnr Type 0
Exter Qual 0
Overall Cond 0
dtype: int64
Create Features
Looking at the existing columns, I see I can create two useful features that will give me:
- Age of house at sale
- Years since remodelling
1. age_at_sale
age_at_sale = data['Yr Sold'] - data['Year Built']
# Check for incorrect values
age_at_sale[age_at_sale < 0]
2. years_since_remodel
There is one row (shown in the output below) which returns a negative value - the remodel date is later than the sale date. This is either data leakage or, more likely, an error in the data. Either way I decide to drop it.
years_since_remodel = data['Yr Sold'] - data['Year Remod/Add']
# Check for incorrect values
years_since_remodel[years_since_remodel < 0]
1702 -1
dtype: int64
I drop the above incorrect row, add these two newly created columns to our data set, and drop the original two columns.
# Create new cols and drop the row with the incorrect, negative value
data['age_at_sale'] = age_at_sale
data['years_since_remodel'] = years_since_remodel
data = data.drop(years_since_remodel[years_since_remodel < 0].index)
# Check
data[data['years_since_remodel'] < 0]
# No longer need original year columns
data = data.drop(['Year Built', 'Year Remod/Add'], axis = 1)
Drop Columns
I’ll now get rid of columns which:
- are irrelevant or provide no value to the model
- will leak data about the final sale, thereby overfitting the model
A note on data leakage
Currently all rows contain data relating to the sale of the house. However, when we’re trying to predict a house sale price, we wouldn’t have access to data such as the Sale Type or Yr Sold, because the sale hasn’t happened yet. We need to make sure we don’t train our model on data that wouldn’t be available to us in a real-world scenario, otherwise we’re overfitting: building a model that will work really well in training, but won’t do as well in production.
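One lightweight way to protect against this (a sketch using a hypothetical helper, not part of the original pipeline) is to fail fast if any known sale-related columns survive the feature engineering step:
# Hypothetical guard: columns that describe the sale itself
LEAKAGE_COLS = ['Mo Sold', 'Yr Sold', 'Sale Type', 'Sale Condition']
def assert_no_leakage(df):
    # Raise if any leakage columns are still present before training
    leaked = [col for col in LEAKAGE_COLS if col in df.columns]
    assert not leaked, 'Leakage columns still present: {}'.format(leaked)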
# Drop order and PID - not useful
data = data.drop(['Order', 'PID'], axis=1)
# Drop data leakage cols i.e. cols with data about the sale
data = data.drop(['Mo Sold', 'Yr Sold', 'Sale Type', 'Sale Condition'], axis=1)
data.head()
MS SubClass | MS Zoning | Lot Area | Street | Lot Shape | Land Contour | Utilities | Lot Config | Land Slope | Neighborhood | Condition 1 | Condition 2 | Bldg Type | House Style | Overall Qual | Overall Cond | Roof Style | Roof Matl | Exterior 1st | Exterior 2nd | Mas Vnr Type | Mas Vnr Area | Exter Qual | Exter Cond | Foundation | Bsmt Qual | Bsmt Cond | Bsmt Exposure | BsmtFin Type 1 | BsmtFin SF 1 | BsmtFin Type 2 | BsmtFin SF 2 | Bsmt Unf SF | Total Bsmt SF | Heating | Heating QC | Central Air | Electrical | 1st Flr SF | 2nd Flr SF | Low Qual Fin SF | Gr Liv Area | Bsmt Full Bath | Bsmt Half Bath | Full Bath | Half Bath | Bedroom AbvGr | Kitchen AbvGr | Kitchen Qual | TotRms AbvGrd | Functional | Fireplaces | Garage Cars | Garage Area | Paved Drive | Wood Deck SF | Open Porch SF | Enclosed Porch | 3Ssn Porch | Screen Porch | Pool Area | Misc Val | SalePrice | age_at_sale | years_since_remodel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20 | RL | 31770 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 5 | Hip | CompShg | BrkFace | Plywood | Stone | 112 | TA | TA | CBlock | TA | Gd | Gd | BLQ | 639 | Unf | 0 | 441 | 1080 | GasA | Fa | Y | SBrkr | 1656 | 0 | 0 | 1656 | 1 | 0 | 1 | 0 | 3 | 1 | TA | 7 | Typ | 2 | 2 | 528 | P | 210 | 62 | 0 | 0 | 0 | 0 | 0 | 215000 | 50 | 50 |
20 | RH | 11622 | Pave | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Feedr | Norm | 1Fam | 1Story | 5 | 6 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | CBlock | TA | TA | No | Rec | 468 | LwQ | 144 | 270 | 882 | GasA | TA | Y | SBrkr | 896 | 0 | 0 | 896 | 0 | 0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 0 | 1 | 730 | Y | 140 | 0 | 0 | 0 | 120 | 0 | 0 | 105000 | 49 | 49 |
20 | RL | 14267 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 6 | Hip | CompShg | Wd Sdng | Wd Sdng | BrkFace | 108 | TA | TA | CBlock | TA | TA | No | ALQ | 923 | Unf | 0 | 406 | 1329 | GasA | TA | Y | SBrkr | 1329 | 0 | 0 | 1329 | 0 | 0 | 1 | 1 | 3 | 1 | Gd | 6 | Typ | 0 | 1 | 312 | Y | 393 | 36 | 0 | 0 | 0 | 0 | 12500 | 172000 | 52 | 52 |
20 | RL | 11160 | Pave | Reg | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 7 | 5 | Hip | CompShg | BrkFace | BrkFace | None | 0 | Gd | TA | CBlock | TA | TA | No | ALQ | 1065 | Unf | 0 | 1045 | 2110 | GasA | Ex | Y | SBrkr | 2110 | 0 | 0 | 2110 | 1 | 0 | 2 | 1 | 3 | 1 | Ex | 8 | Typ | 2 | 2 | 522 | Y | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 244000 | 42 | 42 |
60 | RL | 13830 | Pave | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 5 | 5 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | Gd | TA | No | GLQ | 791 | Unf | 0 | 137 | 928 | GasA | Gd | Y | SBrkr | 928 | 701 | 0 | 1629 | 0 | 0 | 2 | 1 | 3 | 1 | TA | 6 | Typ | 1 | 2 | 482 | Y | 212 | 34 | 0 | 0 | 0 | 0 | 0 | 189900 | 13 | 12 |
Update function transform_features()
I’ll now incorporate the above changes into my transform_features() function and run my pipeline, getting an RMSE of roughly 52.5K. Not to worry though - I still have my feature selection and training steps to go.
def transform_features(data, null_thresh=0.05):
    # Drop the outliers as advised in the documentation
data.drop(data[data['Gr Liv Area'] > 4000].index, inplace=True)
# Drop cols with missing values > our threshold (default 5%)
nulls_over_thresh = data.isnull().sum()[data.isnull().sum() > len(data)*null_thresh]
data.drop(nulls_over_thresh.index, axis=1, inplace=True)
# Impute values for remaining numeric null cols
remaining_nulls = data.isnull().sum()[data.isnull().sum() > 0]
numeric_nulls = data[remaining_nulls.index].select_dtypes(include=['float', 'integer'])
data[numeric_nulls.columns] = data[numeric_nulls.columns].fillna(data[numeric_nulls.columns].median())
    # Drop the remaining columns that still contain nulls (all text columns)
data.dropna(axis = 1, inplace=True)
# Add features
data['age_at_sale'] = data['Yr Sold'] - data['Year Built']
data['years_since_remodel'] = data['Yr Sold'] - data['Year Remod/Add']
# Drop incorrect values
data = data.drop(1702, axis=0)
# Drop unhelpful / data leakage cols
data = data.drop(['Order', 'PID', 'Mo Sold', 'Sale Type', 'Sale Condition', 'Year Built', 'Year Remod/Add'], axis=1)
return data
def select_features(data):
return data[['Gr Liv Area', 'SalePrice']]
def train_and_test(data):
train, test = train_test_split(data, test_size=0.50, random_state = 1)
# Get numeric train and test
numeric_train = train.select_dtypes(include=['integer', 'float'])
numeric_test = test.select_dtypes(include=['integer', 'float'])
# Features and target
target = 'SalePrice'
features = numeric_train.drop('SalePrice', axis=1).columns
# Model
lr = LinearRegression()
lr.fit(numeric_train[features], numeric_train[target])
predictions = lr.predict(numeric_test[features])
mse = mean_squared_error(predictions, numeric_test[target])
rmse = np.sqrt(mse)
return rmse
data = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transformed = transform_features(data)
selected = select_features(transformed)
rmse = train_and_test(selected)
rmse
52522.800606230994
Feature Selection
This is where I select which columns I’ll keep to train my model.
First I generate a correlation heatmap matrix of the numerical features in the training data to see which ones have a strong correlation with my target 'SalePrice'.
numerical_data = transformed.select_dtypes(include=['float', 'integer'])
# Correlation between 'SalePrice' and all other numeric cols
target_data_corr = numerical_data.corr().abs()['SalePrice'].sort_values(ascending=False)
# Keep only cols which have a strong corr with the target col
# arbitrary cutoff of 0.4
strong_corr_cols = target_data_corr[target_data_corr > 0.4].index
# Do pairwise correlation of all features which are strongly correlated to 'SalePrice'
strong_corrs = transformed[strong_corr_cols].corr().abs()
# Generate heatmap
plt.figure(figsize=(14,12))
sns.heatmap(strong_corrs, annot=True)
plt.title('Pairwise Correlation Heatmap', fontsize=16, fontweight='bold')
plt.show()
I need to choose a correlation coefficient threshold, and drop all numeric columns which fall below it. Here I’m going to use 0.4, but I’ll make sure to include it as a parameter in my function so I can experiment with different values.
# Get columns with weak correlation
weak_corr_cols = target_data_corr[target_data_corr < 0.4].index
weak_corr_cols
print('Before:', transformed.shape)
transformed.drop(weak_corr_cols, axis=1, inplace=True)
print('After:', transformed.shape)
Before: (2924, 59)
After: (2924, 39)
Collinearity
I can see strong correlation between the following pairs of features, so I investigate further for collinearity and repetition of information.
- Gr Liv Area and TotRms AbvGrd
- Garage Cars and Garage Area
- 1st Flr SF and Total Bsmt SF
I decide there is repetition of data between the first two pairs. I will drop TotRms AbvGrd and Garage Cars, as both Gr Liv Area and Garage Area are continuous and contain more nuanced data than their counterparts.
While there is strong correlation between 1st Flr SF and Total Bsmt SF, I decide not to drop either, as some houses might not have a basement, so the two columns carry distinct information.
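As a quick sanity check on these pairs, one could print the pairwise correlations directly before dropping anything:
# Numeric check of the suspect pairs (run before the drop below)
pairs = [('Gr Liv Area', 'TotRms AbvGrd'),
         ('Garage Cars', 'Garage Area'),
         ('1st Flr SF', 'Total Bsmt SF')]
for a, b in pairs:
    print(a, 'vs', b, ':', round(transformed[a].corr(transformed[b]), 2))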
transformed = transformed.drop(['TotRms AbvGrd', 'Garage Cars'], axis=1)
Categorical Columns
I now turn the investigation towards which features should be converted to categorical data type.
All columns that can be categorised as nominal or ordinal variables are candidates here.
Things to keep in mind:
- Columns with hundreds of unique values, when converted to dummy variables, will result in hundreds of new columns being added to the dataframe, adding complexity (see the quick illustration below)
- Columns with few unique values but low variance (e.g. 95% of values in one specific category) will not add value to my model
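To make the dummy-coding trade-off concrete, here’s what a single low-cardinality nominal column expands into (a quick illustration; the exact column names depend on the categories present):
# Dummy-code one nominal column to see how many binary columns it creates
dummies = pd.get_dummies(transformed['MS Zoning'], prefix='MS Zoning')
print(dummies.shape)  # one new binary column per unique zoning category
dummies.head()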
Let’s look at nominal and ordinal variable columns that were originally included in the data set.
# Nominal and ordinal variable columns, as per the documentation
nom_ord_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood",
"Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st",
"Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type",
"Misc Feature", "Sale Type", "Sale Condition", 'Lot Shape', 'Utilities', 'Land Slope', 'Overall Qual', 'Overall Cond', 'Exter Qual', 'Exter Cond',
'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating QC', 'Electrical',
'Kitchen Qual', 'Functional', 'Fireplace Qu', 'Garage Finish', 'Garage Qual', 'Garage Cond', 'Paved Drive',
'Pool QC', 'Fence']
Now I’ll see which of these original features remain following the transform_features() step completed above, and how many unique values each feature has.
# Select nominal and ordinal features from our transformed data set
transform_nom_ord_cols = [col for col in transformed.columns if col in nom_ord_features]
# Get unique value count for each feature
nom_cols_unique_count = transformed[transform_nom_ord_cols].apply(lambda col: len(col.value_counts())).sort_values()
nom_cols_unique_count
Street 2
Central Air 2
Paved Drive 3
Utilities 3
Land Slope 3
Lot Shape 4
Land Contour 4
Exter Qual 4
Lot Config 5
Kitchen Qual 5
Heating QC 5
Bldg Type 5
Exter Cond 5
Heating 6
Foundation 6
Roof Style 6
MS Zoning 7
Roof Matl 7
Functional 8
House Style 8
Condition 2 8
Condition 1 9
Overall Qual 10
Exterior 1st 16
Exterior 2nd 17
Neighborhood 28
dtype: int64
Note: Overall Qual, when treated as a numeric column in the correlation heatmap above, was the feature with the highest correlation to my target. It has 10 unique values and for now I am interested in including it in my model.
I decide to go with a unique category count cutoff of 10, dropping columns with more than 10 unique categories or values. I’ll include this unique category threshold as a parameter in my function so I can iterate on it later.
# Drop variables with more than 10 unique categories/values
transformed = transformed.drop(nom_cols_unique_count[nom_cols_unique_count > 10].index, axis=1)
# Check
transformed.columns
Index(['MS Zoning', 'Street', 'Lot Shape', 'Land Contour', 'Utilities',
'Lot Config', 'Land Slope', 'Condition 1', 'Condition 2', 'Bldg Type',
'House Style', 'Overall Qual', 'Roof Style', 'Roof Matl',
'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation',
'BsmtFin SF 1', 'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air',
'1st Flr SF', 'Gr Liv Area', 'Full Bath', 'Kitchen Qual', 'Functional',
'Fireplaces', 'Garage Area', 'Paved Drive', 'SalePrice', 'age_at_sale',
'years_since_remodel'],
dtype='object')
Now I want to investigate the in-feature variance, remembering that features with low variance will not add much value to the model.
I find 6 features that have one category making up more than 95% of the total values.
# Find features where one category makes up more than 94% of the total values
low_variance = []
for col in transformed:
val_counts = (transformed[col].value_counts(normalize=True) *100)
if val_counts.values[0] > 94:
low_variance.append(col)
# Display
print(low_variance)
# Confirm
for col in low_variance:
print(col)
print(transformed[col].value_counts(normalize=True)*100)
print('-----------------')
['Street', 'Utilities', 'Land Slope', 'Condition 2', 'Roof Matl', 'Heating']
Street
Pave 99.589603
Grvl 0.410397
Name: Street, dtype: float64
-----------------
Utilities
AllPub 99.897401
NoSewr 0.068399
NoSeWa 0.034200
Name: Utilities, dtype: float64
-----------------
Land Slope
Gtl 95.177839
Mod 4.274966
Sev 0.547196
Name: Land Slope, dtype: float64
-----------------
Condition 2
Norm 99.008208
Feedr 0.444596
Artery 0.170999
PosA 0.136799
PosN 0.102599
RRNn 0.068399
RRAn 0.034200
RRAe 0.034200
Name: Condition 2, dtype: float64
-----------------
Roof Matl
CompShg 98.597811
Tar&Grv 0.786594
WdShake 0.307798
WdShngl 0.205198
Membran 0.034200
Metal 0.034200
Roll 0.034200
Name: Roof Matl, dtype: float64
-----------------
Heating
GasA 98.461012
GasW 0.923393
Grav 0.307798
Wall 0.205198
OthW 0.068399
Floor 0.034200
Name: Heating, dtype: float64
-----------------
I drop these low-variance features.
# Drop low variance features
transformed = transformed.drop(low_variance, axis=1)
# Check
transformed.columns
Index(['MS Zoning', 'Lot Shape', 'Land Contour', 'Lot Config', 'Condition 1',
'Bldg Type', 'House Style', 'Overall Qual', 'Roof Style',
'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation',
'BsmtFin SF 1', 'Total Bsmt SF', 'Heating QC', 'Central Air',
'1st Flr SF', 'Gr Liv Area', 'Full Bath', 'Kitchen Qual', 'Functional',
'Fireplaces', 'Garage Area', 'Paved Drive', 'SalePrice', 'age_at_sale',
'years_since_remodel'],
dtype='object')
Now it’s time to transform our selected nominal and ordinal variables into categorical variables. Note that all of them currently have the object data type, except for Overall Qual, which is int.
# Display data types for the remaining selected nominal and ordinal variables
cat_feats = [col for col in transformed.columns if col in nom_ord_features]
transformed[cat_feats].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2924 entries, 0 to 2929
Data columns (total 17 columns):
MS Zoning 2924 non-null object
Lot Shape 2924 non-null object
Land Contour 2924 non-null object
Lot Config 2924 non-null object
Condition 1 2924 non-null object
Bldg Type 2924 non-null object
House Style 2924 non-null object
Overall Qual 2924 non-null int64
Roof Style 2924 non-null object
Exter Qual 2924 non-null object
Exter Cond 2924 non-null object
Foundation 2924 non-null object
Heating QC 2924 non-null object
Central Air 2924 non-null object
Kitchen Qual 2924 non-null object
Functional 2924 non-null object
Paved Drive 2924 non-null object
dtypes: int64(1), object(16)
memory usage: 411.2+ KB
I change the data type, then use dummy coding to convert the features into binary columns, keeping the original column name as a prefix, and append these new binary columns to the dataframe.
# Convert to categorical data type
for col in transformed[cat_feats]:
transformed[col] = transformed[col].astype('category')
# Convert to binary columns
transformed = pd.concat([transformed, pd.get_dummies(transformed.select_dtypes(include='category'))], axis=1)
transformed[cat_feats].info()
transformed.shape
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2924 entries, 0 to 2929
Data columns (total 17 columns):
MS Zoning 2924 non-null category
Lot Shape 2924 non-null category
Land Contour 2924 non-null category
Lot Config 2924 non-null category
Condition 1 2924 non-null category
Bldg Type 2924 non-null category
House Style 2924 non-null category
Overall Qual 2924 non-null category
Roof Style 2924 non-null category
Exter Qual 2924 non-null category
Exter Cond 2924 non-null category
Foundation 2924 non-null category
Heating QC 2924 non-null category
Central Air 2924 non-null category
Kitchen Qual 2924 non-null category
Functional 2924 non-null category
Paved Drive 2924 non-null category
dtypes: category(17)
memory usage: 75.4 KB
(2924, 124)
Lastly I drop the original categorical features.
# Drop original categorical cols
print('Before:', transformed.shape)
transformed.drop(cat_feats, axis=1, inplace=True)
print('After:', transformed.shape)
Before: (2924, 124)
After: (2924, 107)
Update Function select_features()
Now I’ll update my select_features() function with the changes above. I’ll include parameters for the correlation coefficient threshold and the unique category count threshold.
This time when I run my pipeline below I get a much-improved RMSE of 24,087. My model is now predicting on average with an error of $24,087 above or below the observed house sale prices.
I still have my train_and_test() function to update. Let’s see if we can improve the RMSE any further.
def transform_features(data, null_thresh=0.05):
    # Drop the outliers as advised in the documentation
data.drop(data[data['Gr Liv Area'] > 4000].index, inplace=True)
# Drop cols with missing values > our threshold (default 5%)
nulls_over_thresh = data.isnull().sum()[data.isnull().sum() > len(data)*null_thresh]
data.drop(nulls_over_thresh.index, axis=1, inplace=True)
# Impute values for remaining numeric null cols
remaining_nulls = data.isnull().sum()[data.isnull().sum() > 0]
numeric_nulls = data[remaining_nulls.index].select_dtypes(include=['float', 'integer'])
data[numeric_nulls.columns] = data[numeric_nulls.columns].fillna(data[numeric_nulls.columns].median())
    # Drop the remaining columns that still contain nulls (all text columns)
data.dropna(axis = 1, inplace=True)
# Add features
data['age_at_sale'] = data['Yr Sold'] - data['Year Built']
data['years_since_remodel'] = data['Yr Sold'] - data['Year Remod/Add']
# Drop incorrect values
data = data.drop(1702, axis=0)
# Drop unhelpful / data leakage cols
data = data.drop(['Order', 'PID', 'Mo Sold', 'Sale Type', 'Sale Condition', 'Year Built', 'Year Remod/Add'], axis=1)
return data
def select_features(data, corr_coeff_thresh, unique_cats_thresh):
    # Drop numeric features with correlation below our correlation coefficient threshold
numerical_data = data.select_dtypes(include=['float', 'integer'])
target_data_corr = numerical_data.corr().abs()['SalePrice'].sort_values(ascending=False)
weak_corr_cols = target_data_corr[target_data_corr < corr_coeff_thresh].index
data.drop(weak_corr_cols, axis=1, inplace=True)
# Drop features with repetition of data
if 'TotRms AbvGrd' in data.columns:
data = data.drop('TotRms AbvGrd', axis=1)
if 'Garage Cars' in data.columns:
data = data.drop('Garage Cars', axis=1)
    # Drop nominal and ordinal features with number of categories > the unique category threshold
nom_ord_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood",
"Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st",
"Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type",
"Misc Feature", "Sale Type", "Sale Condition", 'Lot Shape', 'Utilities', 'Land Slope', 'Overall Qual', 'Overall Cond', 'Exter Qual', 'Exter Cond',
'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating QC', 'Electrical',
'Kitchen Qual', 'Functional', 'Fireplace Qu', 'Garage Finish', 'Garage Qual', 'Garage Cond', 'Paved Drive',
'Pool QC', 'Fence']
transform_nom_ord_cols = [col for col in data.columns if col in nom_ord_features]
nom_cols_unique_count = data[transform_nom_ord_cols].apply(lambda col: len(col.value_counts())).sort_values()
data = data.drop(nom_cols_unique_count[nom_cols_unique_count > unique_cats_thresh].index, axis=1)
# Drop features with low variance
low_variance = []
for col in data:
val_counts = (data[col].value_counts(normalize=True) *100)
if val_counts.values[0] > 94:
low_variance.append(col)
data = data.drop(low_variance, axis=1)
# Transform nominal and ordinal features to categorical
cat_feats = [col for col in data.columns if col in nom_ord_features]
for col in cat_feats:
data[col] = data[col].astype('category')
    # Transform categorical features to binary and drop the original categorical features
data = pd.concat([data, pd.get_dummies(data.select_dtypes(include='category'))], axis=1)
data.drop(cat_feats, axis=1, inplace=True)
return data
def train_and_test(data):
train, test = train_test_split(data, test_size=0.50, random_state = 1)
# Get numeric train and test
numeric_train = train.select_dtypes(include=['integer', 'float'])
numeric_test = test.select_dtypes(include=['integer', 'float'])
# Features and target
target = 'SalePrice'
features = numeric_train.drop('SalePrice', axis=1).columns
# Model
lr = LinearRegression()
lr.fit(numeric_train[features], numeric_train[target])
predictions = lr.predict(numeric_test[features])
mse = mean_squared_error(predictions, numeric_test[target])
rmse = np.sqrt(mse)
return rmse
data = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transformed = transform_features(data)
selected = select_features(transformed, 0.4, 10)
rmse = train_and_test(selected)
rmse
24087.258401239073
Train and Test
Now I’ll update train_and_test() to handle both holdout validation (already included) and k-fold cross validation.
To do this I’ll include a parameter for the number of folds, k, with 0 as the default, implementing holdout validation if no argument is passed.
As k-fold cross validation requires at least 2 folds, I’ll print a custom message if the value 1 is passed in.
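Once the functions below are redefined, the same call will cover all of these modes - for example (a usage sketch, assuming the pipeline has already produced selected):
# Holdout validation (the default, k=0)
rmse_holdout = train_and_test(selected)
# k-fold cross validation, e.g. 4 folds
rmse_4fold = train_and_test(selected, k=4)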
def transform_features(data, null_thresh=0.05):
    # Drop the outliers as advised in the documentation
data.drop(data[data['Gr Liv Area'] > 4000].index, inplace=True)
# Drop cols with missing values > our threshold (default 5%)
nulls_over_thresh = data.isnull().sum()[data.isnull().sum() > len(data)*null_thresh]
data.drop(nulls_over_thresh.index, axis=1, inplace=True)
# Impute values for remaining numeric null cols
remaining_nulls = data.isnull().sum()[data.isnull().sum() > 0]
numeric_nulls = data[remaining_nulls.index].select_dtypes(include=['float', 'integer'])
data[numeric_nulls.columns] = data[numeric_nulls.columns].fillna(data[numeric_nulls.columns].median())
    # Drop the remaining columns that still contain nulls (all text columns)
data.dropna(axis = 1, inplace=True)
# Add features
data['age_at_sale'] = data['Yr Sold'] - data['Year Built']
data['years_since_remodel'] = data['Yr Sold'] - data['Year Remod/Add']
# Drop incorrect values
data = data.drop(1702, axis=0)
# Drop unhelpful / data leakage cols
data = data.drop(['Order', 'PID', 'Mo Sold', 'Sale Type', 'Sale Condition', 'Year Built', 'Year Remod/Add'], axis=1)
return data
def select_features(data, corr_coeff_thresh, unique_cats_thresh):
    # Drop numeric features with correlation below our correlation coefficient threshold
numerical_data = data.select_dtypes(include=['float', 'integer'])
target_data_corr = numerical_data.corr().abs()['SalePrice'].sort_values(ascending=False)
weak_corr_cols = target_data_corr[target_data_corr < corr_coeff_thresh].index
data.drop(weak_corr_cols, axis=1, inplace=True)
# Drop features with repetition of data
if 'TotRms AbvGrd' in data.columns:
data = data.drop('TotRms AbvGrd', axis=1)
if 'Garage Cars' in data.columns:
data = data.drop('Garage Cars', axis=1)
    # Drop nominal and ordinal features with number of categories > the unique category threshold
nom_ord_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood",
"Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st",
"Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type",
"Misc Feature", "Sale Type", "Sale Condition", 'Lot Shape', 'Utilities', 'Land Slope', 'Overall Qual', 'Overall Cond', 'Exter Qual', 'Exter Cond',
'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating QC', 'Electrical',
'Kitchen Qual', 'Functional', 'Fireplace Qu', 'Garage Finish', 'Garage Qual', 'Garage Cond', 'Paved Drive',
'Pool QC', 'Fence']
transform_nom_ord_cols = [col for col in data.columns if col in nom_ord_features]
nom_cols_unique_count = data[transform_nom_ord_cols].apply(lambda col: len(col.value_counts())).sort_values()
data = data.drop(nom_cols_unique_count[nom_cols_unique_count > unique_cats_thresh].index, axis=1)
# Drop features with low variance
low_variance = []
for col in data:
val_counts = (data[col].value_counts(normalize=True) *100)
if val_counts.values[0] > 94:
low_variance.append(col)
data = data.drop(low_variance, axis=1)
# Transform nominal and ordinal features to categorical
cat_feats = [col for col in data.columns if col in nom_ord_features]
for col in cat_feats:
data[col] = data[col].astype('category')
    # Transform categorical features to binary and drop the original categorical features
data = pd.concat([data, pd.get_dummies(data.select_dtypes(include='category'))], axis=1)
data.drop(cat_feats, axis=1, inplace=True)
return data
def train_and_test(data, k=0):
    # Define the features, target and model
target = 'SalePrice'
features = data.select_dtypes(include=['integer', 'float']).drop('SalePrice', axis=1).columns
lr = LinearRegression()
if k == 0:
train, test = train_test_split(data, test_size=0.50, random_state = 1)
# model
lr.fit(train[features], train[target])
predictions = lr.predict(test[features])
mse = mean_squared_error(predictions, test[target])
rmse = np.sqrt(mse)
return rmse
elif k == 1:
print('Oops! You need at least 2 folds for k-folds cross validation. Please try again.')
else:
kf = KFold(k, shuffle=True, random_state=1)
mses = cross_val_score(lr, data[features], data[target],
scoring='neg_mean_squared_error',
cv=kf)
rmses = np.sqrt(np.abs(mses))
print(rmses)
avg_rmse = np.mean(rmses)
return avg_rmse
data = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transformed = transform_features(data)
selected = select_features(transformed, 0.4, 10)
rmse = train_and_test(selected, 1)
rmse
Oops! You need at least 2 folds for k-folds cross validation. Please try again.
Evaluate
Now that I’ve built my pipeline I can iterate over the model, experimenting by passing in different arguments to see how they impact model accuracy.
In the below example I have set:
- the null threshold to 0.2
- the correlation coefficient threshold to 0.3
- the unique category threshold to 10
- the number of folds to 20
This gives an RMSE of just below 24,000.
It tells me that this particular iteration of my model is, on average, predicting house sale prices with a variability of roughly $24K around the observed house sale price.
Note: while experimenting it was interesting to see that passing in a unique category threshold value below 10 had a discernible impact on the evaluation. My assumption is that this is due to the model then excluding the feature Overall Qual, which had the highest correlation with the response variable.
data = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transformed = transform_features(data, 0.2)
selected = select_features(transformed, 0.3, 10)
rmse = train_and_test(selected, 20)
rmse
[19675.06279037 18288.97175461 25045.22704703 27207.57140488
25161.40698407 23172.94854065 23189.23625569 23978.1374733
24265.35102604 22030.84003668 22507.30967812 30461.71616112
19107.87484188 20488.83361378 20355.86101815 26532.62515809
24714.24995271 30466.23487708 23172.18672488 29438.47454499]
23963.00599420598
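To take the experimentation further, one could sweep over a small grid of these parameters - a minimal sketch, reloading the raw data for each combination because transform_features() and select_features() modify their input in place:
from itertools import product
# Hypothetical parameter sweep over the pipeline's thresholds and fold counts
results = {}
for null_t, corr_t, cats_t, folds in product([0.05, 0.2], [0.3, 0.4], [5, 10], [4, 10]):
    raw = pd.read_csv('AmesHousing.tsv', sep='\t')
    selected = select_features(transform_features(raw, null_t), corr_t, cats_t)
    results[(null_t, corr_t, cats_t, folds)] = train_and_test(selected, folds)
# Show the best-performing parameter combinations first
for params, rmse in sorted(results.items(), key=lambda kv: kv[1]):
    print(params, round(rmse, 2))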
Conclusion
In this project I built a pipeline of functions which engineer and select features, and train and test a Linear Regression model to predict house sale prices.
Data In ->
transform_features()
->select_features()
->train_and_test()
->Evaluation Out
In building this pipeline I used a range of techniques to:
- Find and remove outliers
- Investigate null values and employ different strategies for their removal or imputation
- Create useful new features from existing ones
- Handle features of no value and those causing data leakage
- Investigate features for correlation and collinearity
- Transform features to binary data
- Remove features with low variance
- Explore holdout and k-folds cross validation
- Discuss RMSE and data leakage
- Make explicit the logic I used to make decisions
The resulting product allows me to iterate over the model, easily passing in different arguments for a number of parameters, thus enabling rapid experimentation to see how the model responds.