- Jul 21, 2024
Linear Regression — Model Error, Residual plots and more for Ads dataset
- DevTechie
Problem Statement: In our last articles on simple and polynomial linear regression, we predicted sales from TV and newspaper spend. We saw that we were getting almost a straight line for TV spend; for Newspaper, however, the points were scattered all over the place. This raised the question: how do we know if linear regression is the right algorithm for our dataset?
In this article, we will explore how we can measure and evaluate model performance. We will use the same Ads dataset (as in the previous article) and evaluate linear regression's performance on the overall dataset, leveraging mean absolute error, mean squared error, and residual plots. In our next article, we will compare the features TV, Newspaper, and Radio separately. So, without further ado, let's get going…
Import all the important packages and libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load our file
file_path = r'/Users/Downloads/advertising.csv' # Use your own file path
df_ad_data = pd.read_csv(file_path)
df_ad_data.head()
First we will work with all the features collectively, and then separate out each feature.
Creating the X matrix to represent all the features
X = df_ad_data[["TV","Radio","Newspaper"]]
X.head()
Get the y vector column
y = df_ad_data['Sales']
We will separate out our training data and the test data on which we will measure model performance. We do not want to test the model on the same data it was trained on. Using sklearn.model_selection, we will call train_test_split to split the training and test data as below
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
Here, we keep 30% of our data as test data; this test data will be invisible to the model.
Note: Before separating out training and test data, we need to make sure the data is not sorted in any order and is random.
We are using random_state, which controls the shuffling applied to the data before the split. For more detail on the parameters of train_test_split, use help(train_test_split).
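The effect of random_state can be sketched with plain NumPy (a hypothetical illustration of the idea, not how scikit-learn implements it internally): a fixed seed makes the shuffle, and therefore the split, reproducible.

```python
import numpy as np

data = np.arange(10)  # small stand-in for our dataset's row indices

# Shuffling with the same seed twice yields the same ordering both times,
# so the resulting train/test split is reproducible.
perm1 = np.random.default_rng(101).permutation(data)
perm2 = np.random.default_rng(101).permutation(data)

n_test = int(0.3 * len(data))             # hold out 30% as test data
test, train = perm1[:n_test], perm1[n_test:]

print(np.array_equal(perm1, perm2))  # same seed, same shuffle
```

Any fixed integer works; 101 simply matches the random_state used above.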
Apply LinearRegression to fit the training data and do our predictions
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)
To evaluate model performance, we will calculate mean_absolute_error and mean_squared_error. Since mean_squared_error is in squared units, we will take its square root using NumPy, as below
from sklearn.metrics import mean_absolute_error, mean_squared_error
mean_absolute_error(y_test, test_predictions)
np.sqrt(mean_squared_error(y_test, test_predictions))
Now, when I run this I get mean_absolute_error as 1.3731200698367851 and the square root of mean_squared_error as 1.6936855180040058.
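Both metrics are simple enough to verify by hand. A minimal sketch with made-up numbers (the values below are hypothetical, not from the Ads dataset):

```python
import numpy as np

# Hypothetical actual and predicted sales, just to illustrate the formulas
y_true = np.array([14.0, 16.5, 15.2, 13.8])
y_pred = np.array([13.5, 17.0, 14.0, 15.0])

mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root mean squared error

print(mae, rmse)
```

For the same inputs, these match sklearn's mean_absolute_error and the square root of its mean_squared_error.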
So, how do we interpret this? First, what is the overall average of Sales?
df_ad_data['Sales'].mean()
Above gives 15.130500000000001 as the average value. So the mean of the overall sale values is ~15 and the mean_absolute_error is ~1.37, i.e. the error is around 9% (1.37 / 15 ≈ 0.09), meaning our predictions are off by about 9% of the average sale value. In other words, if the model predicts $100 in sales, the actual figure will typically be within about $9 of that.
Note: If we had a historical model, we would compare our mean_absolute_error with it; however, as in this article, when we are starting fresh and have no historical data, we can compare mean_absolute_error with the overall mean.
Coming to root mean_squared_error: because errors are squared before averaging, it penalizes large errors more heavily, so it tells us whether a few points are far off. Looking back at our example from the last article, there were some points that sat far from our straight line.
If we are building upon our work and improving our previous model, root mean_squared_error can help us identify whether the current model is better than the last.
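The difference between the two metrics can be sketched with two hypothetical error vectors that have the same MAE: one with uniform errors, and one where a single point is far off.

```python
import numpy as np

errors_uniform = np.array([1.0, 1.0, 1.0, 1.0])  # every point off by 1
errors_outlier = np.array([0.0, 0.0, 0.0, 4.0])  # one point far off

def mae(e):
    return np.mean(np.abs(e))

def rmse(e):
    return np.sqrt(np.mean(e ** 2))

# Same MAE (1.0 for both), but RMSE doubles in the outlier case (1.0 vs 2.0),
# because squaring weights the single large error more heavily.
print(mae(errors_uniform), rmse(errors_uniform))
print(mae(errors_outlier), rmse(errors_outlier))
```

This is why comparing both metrics between models is useful: a drop in RMSE with MAE unchanged suggests the new model handles the worst-predicted points better.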
Plotting out the residuals
Linear regression may or may not be the right choice, depending on the dataset. With the help of residual plots, we can identify whether the current dataset is a good fit for linear regression.
Residuals are y − predictions, so since we already calculated test_predictions above, we can calculate the residuals as below
test_residuals = y_test - test_predictions
Now, when we plot a scatterplot as below, we see that the data is scattered.
sns.scatterplot(x=y_test, y=test_residuals)
plt.axhline(y=0, color='r', ls='--')
Since this looks visually random and no line is forming, we can say that linear regression is a good choice for the overall dataset. In short, if the residual plot shows a pattern like a curve or a line, linear regression is not a good choice; if the data appears random on the residual plot, linear regression should work well.
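To see what a "bad" residual plot looks like, we can generate synthetic data with a truly quadratic relationship and force a straight-line fit on it (a sketch using np.polyfit as a stand-in for LinearRegression):

```python
import numpy as np

rng = np.random.default_rng(101)
x = np.linspace(-3, 3, 100)
y = x ** 2 + rng.normal(0, 0.1, size=x.size)  # quadratic data, slight noise

slope, intercept = np.polyfit(x, y, 1)        # force a straight-line fit
residuals = y - (slope * x + intercept)

# Plotted against x, these residuals form a clear parabola: positive at the
# edges, negative in the middle. That pattern says the model is wrong.
print(residuals[:10].mean() > 0, residuals[45:55].mean() < 0, residuals[-10:].mean() > 0)
```

Passing x and residuals to sns.scatterplot as above would show the parabolic pattern that signals linear regression is a poor choice.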
We can also check the distribution plot to see whether the residuals are normally distributed
sns.displot(test_residuals, bins=25, kde=True)
The plot above shows that the residuals are mainly centered around zero. But the main plot for identifying problems with the underlying data is the residual plot; there, we need to make sure we are not seeing anything parabolic.
In our next article, we will separate out the features and analyze if Linear Regression is appropriate for each feature. Stay tuned.
Please check the downloadable code for above here.
With that we have reached the end of this article.


