Data Science for Business Applications

Class 03 - Regression Assumptions and Potential Problems

Regression Assumptions and Outliers

Linear models are useful:

  • Prediction - predicting the response for new observations

  • Explanatory power - which variables affect the response

But issues in linear models are not uncommon:

  • They can affect the explanatory and predictive power of our model

  • They can affect our confidence in our model

  • We will look at some of the most common problems in linear regression, and how we can fix them

Regression Assumptions and Potential Problems

These issues are related to:

  • Regression model assumptions
  • Influential observations and outliers

Multiple regression assumptions

We need four things to be true for regression to work properly (a simulated example that satisfies all four is sketched after this list):

  • Linearity: \(Y\) is a linear function of the \(X\)’s (except for the prediction errors).

  • Independence: The prediction errors are independent.

  • Normality: The prediction errors are normally distributed.

  • Equal Variance: The variance of \(Y\) is the same for any value of \(X\) (“homoscedasticity”).
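  • As a minimal sketch of what these assumptions mean in practice, here is one way data satisfying all four could be simulated (sim_data is an illustrative name, not one of the course datasets):
library(tidyverse)

set.seed(123)
sim_data = tibble(
  X = runif(200, min = 0, max = 10),               # predictor values
  Y = 2 + 1.5 * X + rnorm(200, mean = 0, sd = 1)   # linear in X; errors are independent,
)                                                  # Normal, and have constant variance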

Non-Linearity

  • What would we expect to observe in a regression where there is a linear relationship?
library(tidyverse)

# Scatterplot of Y vs. X with the fitted least-squares line (no confidence band)
ggplot(linear_data, aes(x=X, y=Y)) +
  geom_point() +
  geom_smooth(method="lm", se = FALSE)

Residuals

  • Let’s plot the residuals \(r_i\), defined as \[r_i = y_i - \widehat{y}_i\] where \(\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i\), against \(x_i\)
  • Hopefully this helps us identify non-linear relationships
  • We are looking for patterns or trends in the residuals (a sketch of this plot follows this list)
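  • A minimal sketch of this plot, assuming the linear_data used in these slides (fit is an illustrative name):
fit = lm(Y ~ X, data = linear_data)

linear_data %>%
  mutate(resid = Y - fitted(fit)) %>%              # r_i = y_i - y_hat_i
  ggplot(aes(x = X, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")  # residuals should scatter around 0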

Residuals

  • Plot of the residuals
  • How can these residuals be useful for us?

Regression diagnostic plots

We’ll use regression diagnostic plots to help us evaluate some of the assumptions.

The residuals vs fitted graph plots:

  • Residuals on the \(Y\)-axis
  • Fitted values (predicted \(Y\) values) on the \(X\)-axis

This graph effectively subtracts out the linear trend between \(Y\) and the \(X\)’s, so we want to see no trend left in this graph.
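As a sketch, the same plot can be built by hand from an lm fit (here refitting the model to the linear_data used in these slides):

lm1 = lm(Y ~ X, data = linear_data)

# Residuals vs. fitted values, the same quantities autoplot() uses
tibble(fitted = fitted(lm1), resid = residuals(lm1)) %>%
  ggplot(aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")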

Regression diagnostic plot

  • To check for non-linearity we focus on the Residuals vs. Fitted plot
library(ggfortify)

# Fit the model and draw the standard regression diagnostic plots
lm1 = lm(Y ~ X, data = linear_data)
autoplot(lm1)

Regression diagnostic plot

  • From the Residuals vs. Fitted plot, we can observe that the residuals are evenly scattered around zero across the fitted values, which suggests that the linear regression model is a good fit for this data.

  • This means the model is capturing the linear relationship present in the data.

Non-Linearity Example

  • What would we expect to observe if the relationship is non-linear?
ggplot(nonlinear_data, aes(x = X, y = Y)) +
  geom_point() +
  geom_smooth(method="lm", se = FALSE)

Non-Linearity Example

  • Let’s look at the residuals for this model
  • Let’s check the residual plot

Non-Linearity Example

lm2 = lm(Y ~ X, data = nonlinear_data)
autoplot(lm2)

Non-Linearity Example

  • From the Residuals vs. Fitted plot, we can observe that the residuals are not evenly scattered around zero.

  • This indicates that our model is overpredicting for lower and higher values of \(x_i\) and underpredicting for mid-range values.

  • What are the implications in this case?

  • Worse predictions

Independence

  • Independence means that knowing the prediction error for one observation doesn’t tell you anything about the error for another observation
  • Data collected over time are usually not independent
  • We can’t use the regression diagnostic plots to assess independence
  • Instead, we have to measure the autocorrelation of the residuals (a quick check is sketched after this list)
  • We’ll get back to autocorrelation when we discuss Time Series models
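  • A quick sketch of such a check, assuming the lm1 fit from the earlier slides (acf() is base R):
# Autocorrelation of the residuals; spikes outside the dashed bands
# suggest the residuals are not independent
acf(residuals(lm1), main = "Autocorrelation of residuals")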

Normality assumption

  • When interpreting the residual standard error (RSE), we’ve used the following rule of thumb:
  • 95% of our predictions will be accurate to within plus or minus \(2\times RSE\).
  • In order for this to be true, the residuals have to be Normally distributed (a quick check is sketched after this list)
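  • A quick sketch of this check, assuming the lm1 fit to linear_data from these slides:
rse = sigma(lm1)                        # residual standard error
mean(abs(residuals(lm1)) <= 2 * rse)    # should be roughly 0.95 if the errors are Normal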

Normality example

  • We can check the distribution of the residuals
# Add the residuals from lm1 as a new column, then plot their distribution
linear_data = linear_data %>% 
  mutate(resid = residuals(lm1))

ggplot(linear_data, aes(x = resid)) + 
  geom_histogram(color = "grey", binwidth = 0.2) 

Normality example

  • But how can we judge if the residuals follow a Normal distribution?
  • The key is to look at the Normal Q-Q plot, which compares the distribution of our residuals to a perfect Normal distribution.
  • If the dots line up along an (approximately) straight line, then the Normality assumption is satisfied (a manual version of this plot is sketched after this list).
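  • As a sketch, the same Q-Q plot can be built directly with ggplot2, using the resid column added to linear_data on the previous slide:
ggplot(linear_data, aes(sample = resid)) +
  geom_qq() +        # residual quantiles vs. theoretical Normal quantiles
  geom_qq_line()     # reference line; points close to it suggest Normality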

Regression diagnostic plot

  • To check for Normality we focus on the Normal Q-Q plot
lm1 = lm(Y ~ X, data = linear_data)
autoplot(lm1)
  • In this case the normality assumption seems to be met

Normality example

  • Let’s look at different data.
  • In this case the data has non-Normal errors (one way such data could arise is sketched after this list).
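  • As a purely hypothetical sketch (this is not necessarily how the course's non_normal data was generated), right-skewed errors could be simulated like this:
set.seed(42)
skewed_sim = tibble(
  X = runif(200, 0, 10),
  Y = 2 + 1.5 * X + rexp(200, rate = 0.5)   # exponential errors are right-skewed
)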

Normality example

  • Histogram of the residuals (right-skewed)
lm3 = lm(Y ~ X, data = non_normal)

non_normal = non_normal %>% 
  mutate(resid = residuals(lm3))

ggplot(non_normal, aes(x = resid)) + 
  geom_histogram(color = "grey", binwidth = 1) 

Regression diagnostic plot

autoplot(lm3)

Interpretation of the plot

  • From the Normal Q-Q plot, we can observe that the residuals do not follow the reference line that indicates the Normal quantiles

  • This means that our model results in non-Normal residuals

  • This affects statistical tests and confidence intervals

Equal variance

  • Equal variance is also known as “homoscedasticity”
  • The variance of \(Y\) should be about the same at any \(X\) value (or combination of values for the \(X\)’s).
  • In other words, the vertical spread of the points should be the same anywhere along the \(X\)-axis.
  • If the variance is not equal across the \(X\) values, we have heteroscedasticity.
  • Heteroscedasticity leads to lower precision: estimates are further from the true population value (a simulated example is sketched after this list).
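  • A hypothetical sketch of how heteroscedastic data can arise (fan_sim is illustrative, not the course's heter_data): the error spread grows with X.
set.seed(7)
fan_sim = tibble(
  X = runif(200, 0, 10),
  Y = 2 + 1.5 * X + rnorm(200, mean = 0, sd = 0.3 * X)   # error SD increases with X
)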

Equal variance example

  • The vertical spread of the points is larger along the right side of the graph
ggplot(heter_data, aes(x = X, y = Y)) +
  geom_point() +
  geom_smooth(method="lm", se = FALSE)

Regression diagnostic plot

  • To check for homoscedasticity we focus on the Scale-Location plot
lm4 = lm(Y ~ X, data = heter_data)
autoplot(lm4)

Interpretation of the plot

  • From the Scale-Location plot, we can observe that the residuals have a fan shape, indicating that there is heteroscedasticity in the data.

  • This results in lower precision; thus, estimates are further from the true population value.

Outliers and influential observations

  • Adding a new observation with \(X\) near the mean of \(X\) doesn’t matter much even if it’s out of line with the rest of the data:
  • This point has a high residual but low leverage. RSE = 0.5504. (One way such a point could be constructed is sketched below.)
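  • As a hypothetical sketch (not the course's outlier_residual data), one way to construct such a point and see its effect on the RSE:
# Add one out-of-line point near the mean of X to the clean data
with_outlier = linear_data %>%
  select(X, Y) %>%
  add_row(X = mean(linear_data$X), Y = max(linear_data$Y) + 3)

sigma(lm(Y ~ X, data = linear_data))    # RSE without the added point
sigma(lm(Y ~ X, data = with_outlier))   # RSE typically increases, slope changes little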

Diagnostics Plot

  • We can observe the point with high residual on the Residual vs. Leverage plot
lm5 = lm(Y ~ X, data = outlier_residual)
autoplot(lm5)

High leverage

  • We can also have points with high leverage - when a point’s \(X\) value is far from the mean of \(X\)
  • This point has a low residual but high leverage. RSE = 0.2956

High leverage

  • We can observe the point with high leverage on the Residual vs. Leverage plot
lm6 = lm(Y ~ X, data = outlier_leverage)
autoplot(lm6)

Points with high influence

  • Points with high leverage and high residuals are known as influential points
  • This point has a high residual and high leverage. RSE = 0.8281

Points with high influence

  • We can observe the point with high influence on the Residual vs. Leverage plot
lm7 = lm(Y ~ X, data = outlier_influence)
autoplot(lm7)

Points with high influence

  • When a case has a very unusual \(X\) value, it has leverage — the potential to have a big impact on the regression line
  • If the case is in line with the overall trend of the regression line, it won’t be a problem
  • But when that case also has a \(Y\) value that is out of line with the trend (a large residual), it can pull the regression line toward itself
  • We need both a large residual and high leverage for an observation to be influential
  • We should be worried about these points
  • They affect the coefficients and predictions (the diagnostics sketched below quantify leverage and influence)
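  • A sketch of how these quantities can be computed, assuming the lm7 fit to outlier_influence from the earlier slide (hatvalues(), rstandard(), and cooks.distance() are base R):
outlier_influence %>%
  mutate(
    leverage  = hatvalues(lm7),         # how unusual the X value is
    std_resid = rstandard(lm7),         # standardized residual
    cooks_d   = cooks.distance(lm7)     # combines both: influence on the fit
  ) %>%
  arrange(desc(cooks_d)) %>%
  head(3)                               # the influential point should rank first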