Data Science for Business Applications

Class 04 - Modeling nonlinear relationships

Polynomial models

The data set utilities contains information on the utility bills for a house in Minnesota. We’ll focus on two variables:

  • dailyspend is the average amount of money spent on utilities (e.g. heating) for each day during the month
  • temp is the average temperature outside for that month

Polynomial models

What problems do you see here?

library(tidyverse)
ggplot(utilities, aes(x = temp, y = dailyspend)) +
  geom_point()

Polynomial models

lm1 <- lm(dailyspend ~ temp, data = utilities)
summary(lm1)

Call:
lm(formula = dailyspend ~ temp, data = utilities)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.84674 -0.50361 -0.02397  0.51540  2.44843 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.347617   0.206446   35.59   <2e-16 ***
temp        -0.096432   0.003911  -24.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8663 on 115 degrees of freedom
Multiple R-squared:  0.841, Adjusted R-squared:  0.8396 
F-statistic: 608.1 on 1 and 115 DF,  p-value: < 2.2e-16
  • Let’s interpret this relationship
  • For a one-degree (Fahrenheit) increase in temperature, predicted daily spending decreases by about 10 cents (the coefficient is \(-0.0964\) dollars per degree)
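  • For example, using the coefficients above, the predicted spending at \(\texttt{temp} = 30\) is \(7.3476 - 0.0964 \cdot 30 \approx 4.45\) dollars per day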

Polynomial models

library(ggfortify)
autoplot(lm1)
  • Linearity and homoscedasticity are violated

Polynomial models

  • We’ll use polynomial regression to fix these problems
  • If a polynomial curve (e.g., quadratic or cubic) fits the data better than a line does, we can fit a curve to the data.
  • The way we do this is by adding \(X^2\) to the model as a second predictor variable.
  • This can “fix” the linearity problem because now \(Y\) is a linear function of \(X\) and \(X^2\), resulting in: \[ Y = \beta_0 + \beta_1\cdot X + \beta_2\cdot X^2 + e \]

Polynomial models

  • We add the term I(temp^2) to the regression formula (the I() wrapper is needed because ^ has a special meaning inside R model formulas):
lm2 <- lm(dailyspend ~ temp + I(temp^2), data = utilities)
summary(lm2)

Call:
lm(formula = dailyspend ~ temp + I(temp^2), data = utilities)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.87250 -0.28048 -0.03929  0.26391  2.19117 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.4722885  0.3907892  24.239  < 2e-16 ***
temp        -0.2115553  0.0191046 -11.074  < 2e-16 ***
I(temp^2)    0.0012476  0.0002037   6.124 1.33e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7547 on 114 degrees of freedom
Multiple R-squared:  0.8803,    Adjusted R-squared:  0.8782 
F-statistic: 419.3 on 2 and 114 DF,  p-value: < 2.2e-16
  • The new squared term is estimated just like an extra predictor variable.

Polynomial models

Writing out the equation: \[ \widehat{\texttt{dailyspend}} = 9.4723 - 0.2116\cdot \texttt{temp} + 0.0012\cdot \texttt{temp}^2 \] The effect of the squared term is statistically significant:

confint(lm2)
                    2.5 %       97.5 %
(Intercept)  8.6981381712 10.246438869
temp        -0.2494014032 -0.173709160
I(temp^2)    0.0008440041  0.001651114
  • The residual standard error of the polynomial model is \(\texttt{0.75}\).
  • The residual standard error of the linear model is \(\texttt{0.87}\).
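
These RSEs can be pulled directly from the fitted models with sigma() (assuming lm1 and lm2 from above):

sigma(lm1)  # residual standard error of the linear model, about 0.87
sigma(lm2)  # residual standard error of the quadratic model, about 0.75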

Polynomial models

Adding an \(X^2\) term fits a parabola to the data (orange curve); the original linear fit is shown in light blue.

ggplot(utilities, aes(x = temp, y = dailyspend)) +
  geom_point() + 
  geom_line(aes(y = predict(lm1)), col = "lightblue") +
  geom_line(aes(y = predict(lm2)), col = "orange")
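
An alternative that draws the same quadratic curve without calling predict() is geom_smooth() with a model formula (a sketch using the same data):

ggplot(utilities, aes(x = temp, y = dailyspend)) +
  geom_point() +
  # fit and draw the quadratic curve in one step
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, col = "orange")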

Polynomial models

It solves the linearity problem

autoplot(lm2)

Polynomial models

  • What about a higher-order polynomial?
  • We could fit a cubic curve by adding an \(X^3\) term
  • Adding higher-order terms always reduces the in-sample error, but as the table below shows, the RSE barely changes past degree 2
  • Why not go nuts and fit a 7th-degree polynomial?
Degree  Name       RSE
1       linear     0.866
2       quadratic  0.754
3       cubic      0.755
4       quartic    0.755
5       quintic    0.758
6       sextic     0.761
7       septic     0.761
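
One way to compute the RSE column above (a sketch, assuming the utilities data from earlier):

# fit polynomials of degree 1 through 7 and extract each model's RSE
sapply(1:7, function(d) sigma(lm(dailyspend ~ poly(temp, d), data = utilities)))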

Polynomial models

lm7 <- lm(dailyspend ~ poly(temp, 7), data = utilities)
ggplot(utilities, aes(x = temp, y = dailyspend)) +
  geom_point() + 
  geom_line(aes(y = predict(lm7)), col = "red") +
  geom_line(aes(y = predict(lm2)), col = "orange")
  • Too high a degree creates dangers with extrapolation

Building polynomial models

Start simple: only add higher-degree terms to the extent that they give you a substantial decrease in the RSE, or help satisfy an assumption that wasn’t satisfied before.

  • You must include lower-order terms: e.g., if you add \(X^3\), you must also include \(X\) and \(X^2\)
  • Be careful about overfitting when adding higher-order terms!
  • Be particularly careful about extrapolating beyond the range of the data!
  • Mind-bender: We can think about an \(X^2\) term as an interaction of \(X\) with itself: in a parabola, the slope depends on the value of \(X\)
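
As a side note on specification, the two calls below fit the same quadratic model; poly() with raw = TRUE expands to the raw temp and temp^2 columns (without raw = TRUE, R uses orthogonal polynomials, which change the coefficients but not the fitted curve):

# two equivalent ways to specify the quadratic fit
lm(dailyspend ~ temp + I(temp^2), data = utilities)
lm(dailyspend ~ poly(temp, 2, raw = TRUE), data = utilities)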

The log transformation

  • We saw that we can use transformations to fix problems
  • Sometimes, a violation of regression assumptions can be fixed by transforming one or the other of the variables (or both).
  • When we transform a variable, we have to also transform our interpretation of the equation.

The log transformation

The log transformation is frequently useful in regression, because many nonlinear relationships are naturally exponential.

  • \(\log_b x=y\) when \(b^y=x\)
  • For example, \(\log_{10} 1000 = 3\), \(\log_{10}100 = 2\), and \(\log_{10}10 = 1\)
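
Note that in R, log() is the natural logarithm (base \(e\)); other bases have their own functions:

log(exp(1))  # natural log: returns 1
log10(1000)  # base 10: returns 3
log2(8)      # base 2: returns 3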

The log transformation

  • Anytime you need to “squash” one of the variables (logs make huge numbers not so big!)
  • Skewed data is also a good candidate for a log transformation

Moore’s Law

  • Moore’s Law was a prediction made by Gordon Moore in 1965 (!) that the number of transistors on computer chips would double every 2 years

  • This implies exponential growth, so a linear model won’t fit well (and neither will any polynomial)

ggplot(moores, aes(x = Date.of.introduction, y = Transistor.count)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Moore’s Law

lm_moore <- lm(Transistor.count ~ Date.of.introduction, data = moores)
autoplot(lm_moore)
  • A linear model is a spectacular failure

Modeling exponential growth

If \(Y = ae^{bX}\), then

\[\log(Y) = \log(a)+ bX\]

  • In other words, \(\log(Y)\) is a linear function of \(X\) when \(Y\) is an exponential function of \(X\)

  • So if we think \(Y\) is an exponential function of \(X\), predict \(\log(Y)\) as a linear function of \(X\)

Modeling exponential growth

  • Transistors does NOT have a linear relationship with year
  • \(\log(\texttt{Transistors})\) does have a linear relationship with year
ggplot(moores, aes(x = Date.of.introduction, y = log(Transistor.count))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Log-linear Model

Let’s run the regression model

options(scipen = 999)
lm_moore <- lm(log(Transistor.count) ~ Date.of.introduction, data = moores)
summary(lm_moore)

Call:
lm(formula = log(Transistor.count) ~ Date.of.introduction, data = moores)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1299 -0.3338  0.1767  0.5230  2.0626 

Coefficients:
                        Estimate  Std. Error t value            Pr(>|t|)    
(Intercept)          -681.212056   15.958165  -42.69 <0.0000000000000002 ***
Date.of.introduction    0.349154    0.007981   43.75 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.054 on 99 degrees of freedom
Multiple R-squared:  0.9508,    Adjusted R-squared:  0.9503 
F-statistic:  1914 on 1 and 99 DF,  p-value: < 0.00000000000000022

Modeling exponential growth

autoplot(lm_moore)

Interpretation of the model

Our model is \[\widehat{\log(\texttt{Transistors})} = −681.21 + 0.35 \cdot \texttt{Year}\]

Two interpretations of the slope coefficient:

  • Every year, the predicted log of the transistor count goes up by 0.35

  • More useful: Every year, the predicted number of transistors goes up by about 35% (using the approximation \(e^{\beta} \approx 1 + \beta\))

  • A constant percentage increase every year is exponential growth!
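
  • Sanity check: doubling every 2 years means \(Y = a \cdot 2^{X/2} = a\, e^{(\ln 2 / 2)X}\), so the slope should be about \(\ln(2)/2 \approx 0.347\), remarkably close to the fitted 0.349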

Interpretation of the model

  • Making predictions using the log-linear model

  • When making predictions, we have to remember that our equation gives us predictions for \(\log(\texttt{Transistors})\), not Transistors!

Example: To make a prediction for the number of transistors in 2022: \[ \log(\texttt{Transistors}) = −681.21 + 0.35(2022) = 26.49 \] But our prediction is not 26.49:

\(e^{\log(\texttt{Transistors})} = e^{26.49} \approx 319{,}492{,}616{,}196\)
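
The same prediction in R (a sketch, assuming lm_moore and the moores data from above):

# predict on the log scale, then back-transform with exp()
log_pred <- predict(lm_moore, newdata = data.frame(Date.of.introduction = 2022))
exp(log_pred)  # predicted transistor count on the original scale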

Interpretation of the model

ggplot(moores, aes(x = Date.of.introduction, y = Transistor.count)) +
  geom_point() +
  geom_line(aes(y = exp(predict(lm_moore))), col = "orange")

Interpretation of the model

Model      | Equation                                                               | Interpretation
Linear     | \(\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 X\)              | A 1-unit increase in \(X\) implies a \(\widehat{\beta}_1\)-unit increase in \(\widehat{Y}\)
Log-linear | \(\log(\widehat{Y}) = \widehat{\beta}_0 + \widehat{\beta}_1 X\)        | A 1-unit increase in \(X\) implies a ≈ \(100 \cdot \widehat{\beta}_1 \%\) increase in \(\widehat{Y}\)
Linear-log | \(\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 \log(X)\)        | A 1% increase in \(X\) implies a ≈ \(0.01 \cdot \widehat{\beta}_1\)-unit increase in \(\widehat{Y}\)
Log-log    | \(\log(\widehat{Y}) = \widehat{\beta}_0 + \widehat{\beta}_1 \log(X)\)  | A 1% increase in \(X\) implies a ≈ \(\widehat{\beta}_1 \%\) increase in \(\widehat{Y}\)

Conclusion

  • When is the log transformation useful?

  • You can transform \(X \rightarrow \log(X)\), \(Y \rightarrow \log(Y)\), or both

  • Anytime you need to “squash” one of the variables (logs make huge numbers not so big!), try transforming it with a log

  • In this case, the transistor count is skewed right, so it is a good candidate for a log transformation

  • You may need to do a little bit of trial and error to see what works best

  • Other transformations (e.g., square root, reciprocal) are possible!