Jul 22, 2024

Linear Regression — Residual Plot Comparison for Ads data

DevTechie Inc

Problem statement: In our last article, we learned how to evaluate whether Linear Regression is the right choice for our dataset. In this and upcoming articles, we will dive deeper into each feature and see whether Linear Regression is the right choice for each.

If we recall from our last article, when the residual plot looks random, with no pattern forming a straight line or a parabola, we can say Linear Regression is a good choice. Let’s calculate mean_absolute_error and root_mean_squared_error, and analyze the residual plot and distribution graph, to evaluate model performance for TV and Newspaper separately. Let’s get going.

TV

Import our libraries and load our data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

file_path = r'/Users/Downloads/advertising.csv'
df_ad_data = pd.read_csv(file_path)
df_ad_data.head()

Creating X matrix for TV data

X_tv = df_ad_data[["TV"]]
X_tv.head()

Get Y Vector Column

y_tv = df_ad_data['Sales']
y_tv.head()

Separate out our training data and test data

from sklearn.model_selection import train_test_split
X_train_tv, X_test_tv, y_train_tv, y_test_tv = train_test_split(X_tv, y_tv, test_size=0.3, random_state=101)

Same as before, we keep 30% of our data as test data; the model never sees this test data during training.

Apply LinearRegression

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_tv, y_train_tv)
test_predictions_tv = model.predict(X_test_tv)

Calculate mean_absolute_error and root_mean_squared_error

from sklearn.metrics import mean_squared_error, mean_absolute_error
mean_absolute_error(y_test_tv, test_predictions_tv)
np.sqrt(mean_squared_error(y_test_tv, test_predictions_tv))

When I run this, I get a mean_absolute_error of 1.8210722841262867 and a root mean squared error of 2.322465709498783. The mean absolute error is somewhat higher than the 1.3 we got with all features combined, so TV alone slightly underperforms: the percentage of error rose from 9% to 12%. In this scenario that is not bad, but context is everything; in other scenarios it may or may not be acceptable. The root mean squared error also rose, from 1.7 to 2.3, which shows that some points for TV lie farther from the regression line. For those points, our predictions would be off by noticeably more than the average error.
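The percentage figures quoted here can be reproduced with a short calculation. The sketch below assumes the percentage is the mean absolute error divided by the mean of the actual values; that reading is an assumption, as is the helper name `error_summary`, and the numbers in the example are made up rather than taken from the ads data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def error_summary(y_true, y_pred):
    """Return MAE, RMSE, and MAE as a percentage of the mean actual value.

    Hypothetical helper; the percentage definition is an assumption.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae_pct = 100 * mae / y_true.mean()
    return {"mae": mae, "rmse": rmse, "mae_pct_of_mean": mae_pct}


# Made-up values for illustration only
summary = error_summary([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(summary)
```

With the variables from this article, the call would be `error_summary(y_test_tv, test_predictions_tv)`. RMSE is never smaller than MAE, and a widening gap between the two suggests that a few predictions miss by a large margin.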

Plotting out the residuals

As we recall from our previous article, if the points in the residual plot look scattered and random, with no straight line or curve forming, we can say that linear regression is a good choice.

So, we need to calculate test residuals for TV

test_residuals_tv = y_test_tv - test_predictions_tv

Plotting the graph

sns.scatterplot(x=y_test_tv, y=test_residuals_tv)
plt.axhline(y=0, color='r', ls='--')

Above, the data looks pretty random, and we don’t see a parabola or a straight line. We can further check the distribution plot to see whether the residuals are normally distributed.

sns.displot(test_residuals_tv, bins=25, kde=True)

The residuals do look like they are distributed around zero. So, in conclusion, we did not detect any problems with the underlying data for TV, and Linear Regression should work well for predicting sales from TV spend.
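A numeric check can complement the visual reading of the distribution plot. The sketch below is one possible approach, not part of the original analysis: it reports the mean and skewness of a residual array using `scipy.stats.skew`. The helper name `residual_shape` and the synthetic samples are assumptions for illustration.

```python
import numpy as np
from scipy.stats import skew


def residual_shape(residuals):
    """Return the mean and sample skewness of the residuals (hypothetical helper)."""
    residuals = np.asarray(residuals, dtype=float)
    return {"mean": residuals.mean(), "skew": skew(residuals)}


rng = np.random.default_rng(42)
# Synthetic stand-ins, not the ads data
symmetric = rng.normal(loc=0.0, scale=2.0, size=500)
lopsided = rng.exponential(scale=2.0, size=500) - 2.0

symmetric_shape = residual_shape(symmetric)
lopsided_shape = residual_shape(lopsided)
print(symmetric_shape)
print(lopsided_shape)
```

With the variables from this article, the call would be `residual_shape(test_residuals_tv)`. A mean and skewness both near zero are consistent with a roughly symmetric distribution around zero, though they do not prove normality on their own.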

Newspaper

Let’s repeat the same steps with newspaper and determine if Linear Regression would be a good fit for predicting sales for newspaper spend.

Calculate X and y

X_newspaper = df_ad_data[['Newspaper']]
y_newspaper = df_ad_data['Sales']

Train test split

X_train_newspaper, X_test_newspaper, y_train_newspaper, y_test_newspaper = train_test_split(X_newspaper, y_newspaper, test_size=0.3, random_state=101)

Applying Linear Regression

# Calling fit() again re-estimates the model from scratch on the newspaper data
model.fit(X_train_newspaper, y_train_newspaper)
test_predictions_newspaper = model.predict(X_test_newspaper)

Calculate mean_absolute_error and root_mean_squared_error (RMSE)

mean_absolute_error(y_test_newspaper, test_predictions_newspaper)
np.sqrt(mean_squared_error(y_test_newspaper, test_predictions_newspaper))

With the above, we get mean_absolute_error = 4.442543182198105 and root_mean_squared_error = 5.321841175671236. These numbers are quite high: the mean_absolute_error corresponds to an error of roughly 30%, compared with 12% for TV alone. Our RMSE (root_mean_squared_error) is also high in this context. It looks like Linear Regression might not be a good choice for predicting sales from newspaper spend; other underlying factors may be driving sales, and newspaper spend does not appear to correlate directly with sales.

Let’s draw the residual plot and distribution curve to learn more.

Residual plot

test_residual_newspaper = y_test_newspaper - test_predictions_newspaper

sns.scatterplot(x=y_test_newspaper, y=test_residual_newspaper)
plt.axhline(y=0, color='r', ls='--')

Here the residuals form a clear straight-line pattern instead of a random scatter. With this we can conclude that, for this particular feature, there is more to the story and Linear Regression would not be the appropriate choice for Newspaper.
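One detail to keep in mind when reading this plot: the x-axis holds the actual sales values. Since a residual is the actual value minus the prediction, a model whose predictions barely change will produce residuals that track the actual values almost one-to-one, which shows up as a straight line. The sketch below illustrates this with synthetic data (not the ads data), where the feature is unrelated to the target by construction. Plotting residuals against predicted values is a common alternative view. All variable names here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic feature that carries no information about the target
x_weak = rng.uniform(0, 100, size=200).reshape(-1, 1)
y = rng.normal(loc=15, scale=5, size=200)

weak_model = LinearRegression().fit(x_weak, y)
predicted = weak_model.predict(x_weak)  # in-sample predictions
residuals = y - predicted

# Correlation of residuals with actual values vs. with predicted values
corr_with_actual = np.corrcoef(y, residuals)[0, 1]
corr_with_predicted = np.corrcoef(predicted, residuals)[0, 1]
print(corr_with_actual, corr_with_predicted)
```

In this synthetic case the residuals are almost perfectly correlated with the actual values, while they are uncorrelated with the predictions. Either way, the weak fit itself is what the line reveals: the feature explains very little of the variation in the target.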

Let’s look at the distribution plot

sns.displot(test_residual_newspaper, bins=25, kde=True)

This further strengthens our conclusion, as we do not see a normal distribution in the residuals for the Newspaper feature.
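Since the same steps were repeated for TV and Newspaper, they could be wrapped in a small helper. The sketch below is one way to do that, under the assumption that the DataFrame has a `Sales` column and one column per channel; the function name `evaluate_feature` is hypothetical, and the demo data is synthetic rather than the ads dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate_feature(df, feature, target="Sales", test_size=0.3, random_state=101):
    """Fit a one-feature linear regression and return test metrics and residuals.

    Hypothetical helper that mirrors the steps used in this article.
    """
    X = df[[feature]]
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression()  # fresh estimator for each feature
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        "mae": mean_absolute_error(y_test, predictions),
        "rmse": np.sqrt(mean_squared_error(y_test, predictions)),
        "residuals": y_test - predictions,
    }


# Synthetic stand-in for the ads data (values are made up)
rng = np.random.default_rng(1)
demo = pd.DataFrame({"TV": rng.uniform(0, 300, size=100)})
demo["Sales"] = 7 + 0.05 * demo["TV"] + rng.normal(0, 1, size=100)

result = evaluate_feature(demo, "TV")
print(result["mae"], result["rmse"])
```

With the DataFrame from this article, `evaluate_feature(df_ad_data, 'Newspaper')` would repeat the Newspaper analysis, and the returned residuals could be passed to `sns.scatterplot` and `sns.displot` as before.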

In our next article, we will evaluate Linear Regression for Radio. The downloadable code for this article is available here.

With that we have reached the end of this article. Thank you once again for reading. Also visit us at https://www.devtechie.com