The data set `utilities` contains information on the utility bills for a house in Minnesota. We'll focus on two variables:

- `dailyspend` is the average amount of money spent on utilities (e.g. heating) each day during the month
- `temp` is the average temperature outside for that month

What problems do you see here?
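Below, we fit a simple linear regression of `dailyspend` on `temp`. A minimal sketch of the call that produces the output that follows (the object name `lm_linear` is our choice; the formula and data frame match the `Call` shown):

```r
# Fit a simple linear regression of daily utility spending on temperature
lm_linear <- lm(dailyspend ~ temp, data = utilities)
summary(lm_linear)
```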
```
Call:
lm(formula = dailyspend ~ temp, data = utilities)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.84674 -0.50361 -0.02397  0.51540  2.44843 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.347617   0.206446   35.59   <2e-16 ***
temp        -0.096432   0.003911  -24.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8663 on 115 degrees of freedom
Multiple R-squared:  0.841, Adjusted R-squared:  0.8396
F-statistic: 608.1 on 1 and 115 DF,  p-value: < 2.2e-16
```
We can fix this by adding a quadratic term, `I(temp^2)`, to the regression equation:
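A minimal sketch of the corresponding call (the object name `lm_quad` is our choice):

```r
# I() makes ^ act as arithmetic squaring inside the formula
lm_quad <- lm(dailyspend ~ temp + I(temp^2), data = utilities)
summary(lm_quad)
```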
```
Call:
lm(formula = dailyspend ~ temp + I(temp^2), data = utilities)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.87250 -0.28048 -0.03929  0.26391  2.19117 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.4722885  0.3907892  24.239  < 2e-16 ***
temp        -0.2115553  0.0191046 -11.074  < 2e-16 ***
I(temp^2)    0.0012476  0.0002037   6.124 1.33e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7547 on 114 degrees of freedom
Multiple R-squared:  0.8803, Adjusted R-squared:  0.8782
F-statistic: 419.3 on 2 and 114 DF,  p-value: < 2.2e-16
```
Writing out the equation: \[ \widehat{\texttt{dailyspend}} = 9.4723 - 0.2116\cdot \texttt{temp} + 0.0012\cdot \texttt{temp}^2 \]

The effect of the extra variable is statistically significant:
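The intervals below come from `confint()`; a minimal sketch, assuming the quadratic fit is stored as `lm_quad` as in the sketch above:

```r
# 95% confidence intervals for the quadratic model's coefficients
confint(lm_quad)
```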
```
                    2.5 %       97.5 %
(Intercept)  8.6981381712 10.246438869
temp        -0.2494014032 -0.173709160
I(temp^2)    0.0008440041  0.001651114
```
Adding an \(X^2\) term fits a parabola to the data (the orange line in the sketch below), which solves the linearity problem.
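A minimal base-R sketch of such a plot, again assuming the quadratic fit is stored as `lm_quad`:

```r
# Scatterplot of the data with the fitted parabola overlaid in orange
plot(dailyspend ~ temp, data = utilities)
temp_grid <- seq(min(utilities$temp), max(utilities$temp), length.out = 200)
lines(temp_grid, predict(lm_quad, newdata = data.frame(temp = temp_grid)),
      col = "orange", lwd = 2)
```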
RSE by polynomial degree:

Degree | Name | RSE |
---|---|---|
1 | linear | 0.866 |
2 | quadratic | 0.754 |
3 | cubic | 0.755 |
4 | quartic | 0.755 |
5 | quintic | 0.758 |
6 | | 0.761 |
7 | | 0.761 |
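A sketch of how this comparison could be computed, using `poly()` with `raw = TRUE` to add higher-degree terms:

```r
# Residual standard error for polynomial fits of degree 1 through 7
sapply(1:7, function(d) {
  fit <- lm(dailyspend ~ poly(temp, d, raw = TRUE), data = utilities)
  sigma(fit)  # extract the residual standard error
})
```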
Start simple: only add higher-degree terms to the extent that they give you a substantial decrease in the RSE, or make an assumption hold that wasn't satisfied before.
The log transformation is frequently useful in regression, because many nonlinear relationships are naturally exponential.
Moore’s Law was a prediction made by Gordon Moore in 1965 (!) that the number of transistors on computer chips would double every 2 years
This implies exponential growth, so a linear model won’t fit well (and neither will any polynomial)
If \(Y = ae^{bX}\), then
\[\log(Y) = \log(a) + bX\]
In other words, \(\log(Y)\) is a linear function of \(X\) when \(Y\) is an exponential function of \(X\)
So if we think \(Y\) is an exponential function of \(X\), predict \(\log(Y)\) as a linear function of \(X\)
Let's run the regression model:

```r
options(scipen = 999)  # discourage scientific notation in printed output
lm_moore = lm(log(Transistor.count) ~ Date.of.introduction, data = moores)
summary(lm_moore)
```
```
Call:
lm(formula = log(Transistor.count) ~ Date.of.introduction, data = moores)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1299 -0.3338  0.1767  0.5230  2.0626 

Coefficients:
                        Estimate Std. Error t value            Pr(>|t|)    
(Intercept)          -681.212056  15.958165  -42.69 <0.0000000000000002 ***
Date.of.introduction    0.349154   0.007981   43.75 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.054 on 99 degrees of freedom
Multiple R-squared:  0.9508, Adjusted R-squared:  0.9503
F-statistic:  1914 on 1 and 99 DF,  p-value: < 0.00000000000000022
```
Our model is \[\widehat{\log(\texttt{Transistors})} = -681.21 + 0.35 \cdot \texttt{Year}\]
Two interpretations of the slope coefficient:

- Every year, the predicted log of the transistor count goes up by 0.35
- More useful: every year, the predicted number of transistors goes up by roughly 35%

A constant percentage increase every year is exponential growth!
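As a quick check against Moore's original claim: doubling every 2 years corresponds to a slope of \[ b = \frac{\log 2}{2} \approx 0.347 \] on the log scale, remarkably close to our estimated slope of 0.349.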
Making predictions using the log-linear model
When making predictions, we have to remember that our equation gives us predictions for \(\log(\texttt{Transistors})\), not Transistors!
Example: To make a prediction for the number of transistors in 2022: \[ \log(\texttt{Transistors}) = -681.21 + 0.35(2022) = 26.49 \] But our prediction is not 26.49:
\(e^{\log(\texttt{Transistors})} = e^{26.49} = 319{,}492{,}616{,}196\)
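In R, a minimal sketch of this two-step prediction, using the `lm_moore` fit from above (note that `predict()` uses the unrounded coefficients, so the result won't exactly match the hand calculation: with a year as large as 2022, rounding the slope to 0.35 matters):

```r
# predict() returns a prediction on the log scale; exp() undoes the log
log_pred <- predict(lm_moore, newdata = data.frame(Date.of.introduction = 2022))
exp(log_pred)  # prediction for the actual transistor count
```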
Model | Equation | Interpretation |
---|---|---|
Linear | \(\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 X\) | 1 unit increase in \(X\) implies \(\widehat{\beta}_1\) unit increase in \(\widehat{Y}\) |
Log-linear | \(\log(\widehat{Y}) = \widehat{\beta}_0 + \widehat{\beta}_1 X\) | 1 unit increase in \(X\) implies ≈ \(100 \cdot \widehat{\beta}_1 \%\) increase in \(\widehat{Y}\) |
Linear-log | \(\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 \log(X)\) | 1% increase in \(X\) implies ≈ \(0.01 \cdot \widehat{\beta}_1\) unit increase in \(\widehat{Y}\) |
Log-log | \(\log(\widehat{Y}) = \widehat{\beta}_0 + \widehat{\beta}_1 \log(X)\) | 1% increase in \(X\) implies ≈ \(\widehat{\beta}_1 \%\) increase in \(\widehat{Y}\) |
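As R formulas, the four models in the table could be specified as follows (the variables `y` and `x` and the data frame `df` are hypothetical):

```r
lm(y ~ x, data = df)            # linear
lm(log(y) ~ x, data = df)       # log-linear
lm(y ~ log(x), data = df)       # linear-log
lm(log(y) ~ log(x), data = df)  # log-log
```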
When is the log transformation useful?

- You can transform \(X \rightarrow \log(X)\), \(Y \rightarrow \log(Y)\), or both
- Anytime you need to "squash" one of the variables (logs make huge numbers not so big!), try transforming it with a log
- In this case, the transistor count is skewed right, so it is a good candidate for a log transformation
- You may need to do a little bit of trial and error to see what works best
- Other transformations are possible!