Data Science for Business Applications

Class 07 - Model building for prediction

Mastery Exam A

  • Mastery Exam A is 7-9 PM on Wednesday, October 22 or Thursday, October 23 (your choice—sign up for a time using the survey link sent out).
  • The exam covers Unit A (weeks 1-7, including today).
  • The exam will be on Canvas using Respondus.
  • You can bring 1 full-sized page of notes (whatever you want, as long as you make it yourself).
  • Two types of questions:
    • Conceptual and computational questions like the homework and quizzes
    • Open-ended questions that ask you to apply the concepts to a new business situation and write up your analysis in paragraph form
  • Practice material for the exam is posted in Canvas.

Two reasons for building regression models

  • Prediction: Predict the value of \(Y\) based on the values of the \(X\) variables (today)
  • Understanding: Understand the relationships between \(Y\) and the \(X\) variables (later in the semester)

How we evaluate a model depends on what we intend to use it for.

Evaluating forecasts

How should we evaluate a forecast?

apple <- apple %>% mutate(
  COVID = ifelse(Period >= 38 & Period <= 45, 1, 0),
  lag1 = lag(Revenue)
)
model4 <- lm(log(Revenue) ~ Period + Quarter 
                  + COVID + log(lag1), data=apple)
summary(model4)

Call:
lm(formula = log(Revenue) ~ Period + Quarter + COVID + log(lag1), 
    data = apple)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.164955 -0.047524 -0.006454  0.044278  0.149002 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  2.472570   0.387615   6.379     0.00000007169178 ***
Period       0.010331   0.002215   4.664     0.00002593255364 ***
QuarterQ2   -0.471450   0.049927  -9.443     0.00000000000197 ***
QuarterQ3   -0.487549   0.026755 -18.223 < 0.0000000000000002 ***
QuarterQ4   -0.352342   0.025851 -13.630 < 0.0000000000000002 ***
COVID        0.109412   0.029754   3.677             0.000605 ***
log(lag1)    0.413292   0.111170   3.718             0.000535 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06482 on 47 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.9716,    Adjusted R-squared:  0.968 
F-statistic: 267.9 on 6 and 47 DF,  p-value: < 0.00000000000000022

If you were an analyst, how impressed would your boss be if you claimed to be able to predict the past? (Not very — they want you to be able to predict the future!)

How should we evaluate a forecast?

  • \(R^2\) and the residual standard error are optimistic measures of predictive performance.
  • They measure the quality of predictions using the same data used to fit the model.
  • But we’re interested in how the model predicts into the future, not how well it fits the past!

We can mimic predicting on future data by holding back the most recent observations.

Fit the model to the “training” data only (blue) and test it by forecasting the most recent year of data (red).

Fitting the model to the training set

We want to split our data into a test set (last 4 quarters) and a training set (everything else).

training <- apple[1:51,]
test <- apple[52:55,]

Then we want to fit the model to the training data and test it on the test data.

model4 <- lm(log(Revenue) ~ Period + Quarter 
                  + COVID + log(lag1), data=training)

Testing the model on the test set

Now we can see how well the model predicts the test set.

predict(model4, test)
      52       53       54       55 
4.377099 4.502806 4.918158 4.558995 

These are predictions for log revenue, so we need to convert them back to revenue:

exp(predict(model4, test))
       52        53        54        55 
 79.60673  90.27005 136.75045  95.48747 

How do we compare this to the actual revenues of the last 4 quarters?

test.Rev <- test$Revenue
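
One convenient way to line them up is a small side-by-side table (a quick sketch, not from the slides; it reuses the model4 and test objects above):

# Actual vs. predicted revenue (in $ billions) for the 4 held-out quarters
data.frame(actual    = test$Revenue,
           predicted = exp(predict(model4, test)))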

Measuring prediction error

Mean squared prediction error (MSE or MSPE) lets us compare all 4 predictions to the actual revenues at once:

\[ \text{MSPE} = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat Y_i)^2 \]

(\(n\) is the size of the test set; \(n=4\) in this case)

Take the square root to get an interpretable number (in the units of \(Y\)):

\[ \text{RMSPE} = \sqrt{\text{MSPE}} \]

Interpret RMSPE just like the residual standard error (smaller is better!).
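
If you end up computing this repeatedly, a tiny helper function keeps it to one call (my own sketch, not part of the course code):

# Root mean squared prediction error: how far off our forecasts are, on average
rmspe <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# e.g. rmspe(test$Revenue, exp(predict(model4, test)))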

Calculating RMSPE

\(i\) \(Y_i\) \(\hat Y_i\) \(Y_i - \hat Y_i\) \((Y_i - \hat Y_i)^2\)
\(1\) \(85.8\) \(79.61\) \(6.19\) \(38.36\)
\(2\) \(94.9\) \(90.27\) \(4.63\) \(21.44\)
\(3\) \(124.3\) \(136.75\) \(-12.45\) \(155.01\)
\(4\) \(95.4\) \(95.49\) \(-0.09\) \(0.01\)

\[ \text{RMSPE} = \sqrt{\frac{1}{4} \sum_{i=1}^4 (Y_i - \hat Y_i)^2} = 7.33 \]

In other words, we expect our model to be off by about

round(sqrt(mean((test$Revenue - exp(predict(model4, test)))^2)), 2)
[1] 7.33

billion dollars on average when forecasting the future.

A shortcut for calculating RMSPE

# exp here only because we're predicting log(Revenue) instead of Revenue
predictions <- exp(predict(model4, test))
actuals <- test$Revenue
sqrt(mean((predictions - actuals)^2))
[1] 7.328273

Training/test splits in general

Train/test splits to measure performance

  • Randomly allocating observations makes training/testing sets representative of the full data set
  • 80/20% training/testing splits are common
  • More testing data \(\to\) More accurate estimates of prediction error
  • More training data \(\to\) Better representation of full-data fit

Predicting utility expenditures

Creating a train/test split

Let’s do the same thing for the utilities data, but by randomly splitting the data into a training and test set.

library(tidyverse)
set.seed(9529)

# Ask R to shuffle all 117 observations
randomized <- slice_sample(utilities, n = 117)

# 24 is about 20% of 117, so we'll use the first 24 rows as our test set
utilities_test <- randomized[1:24,]

# The rest are our training set
utilities_train <- randomized[25:117,]

The set.seed(9529) command ensures reproducible “random” splits. (I chose this number because it’s my EID number; use your EID number!)

Creating a train/test split

Black = training data, Red = testing data

Random splitting is important! (What would happen if we trained on fall-spring and tested on summer?)

Using training/test splits to pick between competing models

When comparing models, we want to pick the one that minimizes prediction error on the test set — we want to avoid overfitting to the training data!

When we fit and evaluate a model on the entire data set, we risk overfitting and getting an overly optimistic estimate of prediction error on new data.

A linear (1st-degree) polynomial model

model1 <- lm(dailyspend ~ temp, data=utilities)

Call:
lm(formula = dailyspend ~ temp, data = utilities)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.84674 -0.50361 -0.02397  0.51540  2.44843 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.347617   0.206446   35.59   <2e-16 ***
temp        -0.096432   0.003911  -24.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8663 on 115 degrees of freedom
Multiple R-squared:  0.841, Adjusted R-squared:  0.8396 
F-statistic: 608.1 on 1 and 115 DF,  p-value: < 2.2e-16

A quadratic (2nd-degree) polynomial model

model2 <- lm(dailyspend ~ temp + I(temp^2), data=utilities)

Call:
lm(formula = dailyspend ~ temp + I(temp^2), data = utilities)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.87250 -0.28048 -0.03929  0.26391  2.19117 

Coefficients:
              Estimate Std. Error t value     Pr(>|t|)    
(Intercept)  9.4722885  0.3907892  24.239      < 2e-16 ***
temp        -0.2115553  0.0191046 -11.074      < 2e-16 ***
I(temp^2)    0.0012476  0.0002037   6.124 0.0000000133 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7547 on 114 degrees of freedom
Multiple R-squared:  0.8803,    Adjusted R-squared:  0.8782 
F-statistic: 419.3 on 2 and 114 DF,  p-value: < 2.2e-16

A 9th-degree polynomial model

model9 <- lm(dailyspend ~ temp + I(temp^2) + I(temp^3) + I(temp^4) 
              + I(temp^5) + I(temp^6) + I(temp^7) + I(temp^8) + I(temp^9),
              data=utilities)

Call:
lm(formula = dailyspend ~ temp + I(temp^2) + I(temp^3) + I(temp^4) + 
    I(temp^5) + I(temp^6) + I(temp^7) + I(temp^8) + I(temp^9), 
    data = utilities)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.66386 -0.24156 -0.01734  0.25996  2.48031 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.144e+01  4.599e+01  -0.466    0.642
temp         8.103e+00  1.460e+01   0.555    0.580
I(temp^2)   -8.468e-01  1.905e+00  -0.445    0.658
I(temp^3)    4.153e-02  1.349e-01   0.308    0.759
I(temp^4)   -9.600e-04  5.764e-03  -0.167    0.868
I(temp^5)    4.563e-06  1.551e-04   0.029    0.977
I(temp^6)    2.603e-07  2.644e-06   0.098    0.922
I(temp^7)   -5.943e-09  2.770e-08  -0.215    0.831
I(temp^8)    5.173e-11  1.627e-10   0.318    0.751
I(temp^9)   -1.676e-13  4.095e-13  -0.409    0.683

Residual standard error: 0.7501 on 107 degrees of freedom
Multiple R-squared:  0.889, Adjusted R-squared:  0.8797 
F-statistic: 95.26 on 9 and 107 DF,  p-value: < 2.2e-16

Building the models

We can avoid overfitting to the training data by fitting the models to the training data and then testing them on the test data.

Fit the models using the training set:

model1 <- lm(dailyspend ~ temp, data=utilities_train)
model2 <- lm(dailyspend ~ temp + I(temp^2), data=utilities_train)
model9 <- lm(dailyspend ~ temp + I(temp^2) + I(temp^3) + I(temp^4) 
              + I(temp^5) + I(temp^6) + I(temp^7) + I(temp^8) + I(temp^9),
              data=utilities_train)

Get their predictions on the test set:

actuals <- utilities_test$dailyspend
predictions1 <- predict(model1, utilities_test)
predictions2 <- predict(model2, utilities_test)
predictions9 <- predict(model9, utilities_test)

Measuring performance

Now let’s calculate RMSPE on the testing data:

sqrt(mean((actuals - predictions1)^2))
[1] 1.013597
sqrt(mean((actuals - predictions2)^2))
[1] 0.8534912
sqrt(mean((actuals - predictions9)^2))
[1] 0.9482653

Model        RSE (all data)   RMSPE (training/test split)
Linear       \(0.83\)         \(1.01\)
Quadratic    \(0.73\)         \(0.85\)
9th Degree   \(0.71\)         \(0.95\)

  • The test-set RMSPEs are all worse (larger) than the in-sample RSEs of the models fit to the entire data set!
  • The test-set comparison shows that the quadratic model is the best of the three for making predictions on new data.

The problem with this approach

We might get “unlucky,” and our train/test split might not give us a representative estimate of prediction error!
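
To see how much luck matters, here is a minimal sketch (my own, assuming the utilities data from above; the seeds are arbitrary) that repeats the 80/20 split a few times and recomputes RMSPE for the linear model each time:

# Repeat the random 80/20 split with different (arbitrary) seeds and
# recompute RMSPE for dailyspend ~ temp; the estimate varies from split to split
library(tidyverse)
for (s in c(101, 202, 303)) {
  set.seed(s)
  shuffled <- slice_sample(utilities, n = nrow(utilities))
  test_s   <- shuffled[1:24, ]
  train_s  <- shuffled[25:nrow(shuffled), ]
  m        <- lm(dailyspend ~ temp, data = train_s)
  print(sqrt(mean((test_s$dailyspend - predict(m, test_s))^2)))
}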

Cross-validation

\(k\)-fold Cross-Validation

  • Averages results over \(k\) train/test splits (each fold takes a turn as the test set), so one unlucky split can't skew the estimate.
  • \(k=n\) is called leave-one-out cross-validation

Let’s work through an example with \(k=3\)

First we need to split the data into 3 random groups of roughly the same size (\(39\times 3=117\)):

fold1 <- randomized[1:39,]
fold2 <- randomized[40:78,]
fold3 <- randomized[79:117,]

Now we need to build the model 3 times:

  • Model A: test on fold1, train on fold2 and fold3
  • Model B: test on fold2, train on fold1 and fold3
  • Model C: test on fold3, train on fold1 and fold2

Building the models

modelA <- lm(dailyspend ~ temp, data=rbind(fold2, fold3))
modelB <- lm(dailyspend ~ temp, data=rbind(fold1, fold3))
modelC <- lm(dailyspend ~ temp, data=rbind(fold1, fold2))

Getting the predictions

predictionsA <- predict(modelA, fold1)
predictionsB <- predict(modelB, fold2)
predictionsC <- predict(modelC, fold3)

Calculating RMSPE

We now get 3 different RMSPE estimates:

actualsA <- fold1$dailyspend
sqrt(mean((actualsA - predictionsA)^2))
[1] 0.9898423
actualsB <- fold2$dailyspend
sqrt(mean((actualsB - predictionsB)^2))
[1] 0.745035
actualsC <- fold3$dailyspend
sqrt(mean((actualsC - predictionsC)^2))
[1] 0.8463284

Our final estimate of RMSPE is the average of these 3 estimates:

\[ \text{RMSPE} = \frac{ 0.99 + 0.75 + 0.85 }{3} = 0.86 \]
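
The same average can be computed directly from the fold-level results (a quick sketch reusing the objects defined above):

# Combine the three fold-specific RMSPEs into one cross-validated estimate
rmspeA <- sqrt(mean((actualsA - predictionsA)^2))
rmspeB <- sqrt(mean((actualsB - predictionsB)^2))
rmspeC <- sqrt(mean((actualsC - predictionsC)^2))
mean(c(rmspeA, rmspeB, rmspeC))   # about 0.86 for these folds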

Re-doing our model comparisons

Model        RSE (all data)   RMSPE (3-fold CV)   RMSPE (single train/test split)
Linear       \(0.83\)         \(0.86\)            \(1.01\)
Quadratic    \(0.73\)         \(0.77\)            \(0.85\)
9th Degree   \(0.71\)         \(0.91\)            \(0.95\)

  • Comparing RSE and CV:
    • RSE misses overfitting
    • For every model, RSE is too optimistic about how well the model will perform on new data
  • Our initial train/test split was unlucky! It overestimated the RMSPE.

Isn’t there a faster way?

In practice, you don’t have to do this all manually! The tidymodels package makes it easy to do cross-validation even for larger \(k\).

library(tidymodels)

# Randomly split the data into 5 folds
folds <- vfold_cv(utilities, v=5)

# Build a tidymodels "workflow" for a linear model
# with the formula dailyspend ~ temp, using RMSE as the target metric
workflow() %>% 
  add_model(linear_reg()) %>%
  add_formula(dailyspend ~ temp) %>%
  fit_resamples(folds, metrics=metric_set(rmse)) %>%
  collect_metrics()
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard   0.871     5  0.0857 pre0_mod0_post0
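
To compare the other candidates, only the formula needs to change; for example, the quadratic model can be evaluated on the same folds (a sketch following the same pattern, reusing the folds object above):

# Same folds, new formula: cross-validate the quadratic model for comparison
workflow() %>% 
  add_model(linear_reg()) %>%
  add_formula(dailyspend ~ temp + I(temp^2)) %>%
  fit_resamples(folds, metrics = metric_set(rmse)) %>%
  collect_metrics()

Because both models are scored on identical resamples, the resulting RMSE estimates are directly comparable.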