Data Science for Business Applications

Class 06 - Model Selection

Introduction to prediction

  • So far, we have focused mostly on explaining the effects of the predictors \(X\) through the coefficients.
  • Until now, our concern was the soundness of our model: statistical significance and how well the model fit the data (regression assumptions).
  • Today, we will focus on building models that predict outcomes with high accuracy on previously unseen data, without extrapolating.

Inference and Prediction

  • Inference \(\rightarrow\) focus on the predictors: interpretability of the model

  • Prediction \(\rightarrow\) focus on the outcome variable: accuracy of the model

Bias vs. Variance

  • Bias vs Variance trade-off

  • Variance: The amount by which the estimated function \(\hat{f}\) would change if we estimated it using a different training dataset

  • Bias: Error introduced by approximating a real-life problem with a model

  • More flexible models have a higher variance and a lower bias

  • Less flexible models have a lower variance but a higher bias

  • Validation set approach: Training and testing data

  • Balance between flexibility and accuracy

Bias vs. Variance

  • When explaining, bias matters more than variance
  • In prediction, we care about both
  • Measures of accuracy will have both bias and variance

Measures of accuracy

  • How do we measure accuracy?

  • Mean Squared Error (MSE): Can be decomposed into variance and bias terms \[ \text{MSE} = \text{Var} + \text{Bias}^2 + \text{Irreducible Error} \] where \[ \text{MSE} = \frac{1}{n} \sum_{i = 1}^n (y_i - \widehat{y}_i)^2 \] (a short R sketch follows this list)

  • Root Mean Squared Error (RMSE): Measured in the same units as the outcome \[ \text{RMSE} = \sqrt{\text{MSE}} \]

  • Other measures: Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC)
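
As a quick sketch of these formulas in R (simulated data, not from the lecture; all names here are illustrative):

# Computing MSE and RMSE by hand on simulated data
set.seed(1)
x = rnorm(100)
y = 2 + 3 * x + rnorm(100)   # linear signal plus irreducible error
fit = lm(y ~ x)
y_hat = predict(fit)         # fitted values
mse  = mean((y - y_hat)^2)   # MSE = (1/n) * sum((y_i - y_hat_i)^2)
rmse = sqrt(mse)             # RMSE, in the same units as y
c(MSE = mse, RMSE = rmse)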

Is flexibility always better?

Measures of accuracy

  • Models with increasing flexibility (linear, cubic, spline).
  • Think of a spline as, roughly, a polynomial model with a very high degree.
  • In the training data, RMSE decreases as flexibility increases.
  • The spline overfits the training data: its RMSE on the testing data is large (see the sketch below).
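
A minimal simulated illustration of this pattern (hypothetical data; a degree-15 polynomial stands in for the spline):

# Training RMSE falls with flexibility; testing RMSE eventually rises
set.seed(1)
n = 200
x = runif(n, -2, 2)
y = sin(2 * x) + rnorm(n, sd = 0.5)
dat = data.frame(x, y)
train_id = sample(n, 0.75 * n)
train = dat[train_id, ]
test  = dat[-train_id, ]

for (d in c(1, 3, 15)) {   # linear, cubic, very flexible
  fit = lm(y ~ poly(x, d), data = train)
  cat("degree", d,
      "| train RMSE:", round(sqrt(mean((train$y - predict(fit))^2)), 3),
      "| test RMSE:",  round(sqrt(mean((test$y - predict(fit, test))^2)), 3), "\n")
}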

What is churn?

  • Churn: Measure of how many customers stop using your product (e.g., cancel a subscription).
  • It is less costly to keep a customer than to bring in a new one.
  • Goal: Prevent churn.
  • Identify customers that are likely to cancel/quit/fail to renew.

Predicting “pre-churn”

  • We will predict “pre-churn”.
  • A good measure of whether someone is at risk of unsubscribing (“pre-churn”) is the number of times they have logged in during the past week.
  • The number of log-ins is recorded in the variable logins.
  • We will predict logins from the other variables in the data.
  • We have two candidate models: Simple vs. Complex.

Predicting “pre-churn”

  • Simple Model: \[ \text{logins} = \beta_0 + \beta_1 \cdot \text{succession} + \beta_2 \cdot \text{city} + \epsilon \]

  • Complex Model: \[ \text{logins} = \beta_0 + \beta_1 \cdot \text{succession} + \beta_2 \cdot \text{age} + \beta_3 \cdot \text{age}^2 + \beta_4 \cdot \text{city} + \beta_5 \cdot \text{female} + \epsilon \]

  • Can we build more complex models? Yes!

  • First we will just analyse these two.

Create Validation Sets

  • Create training and testing sets.
  • We will use 75% of the data to train the model.
  • The remaining 25% of the data we reserve for testing.
  • This split is done randomly.
  • To do so we use the libraries modelr and rsample:
library(modelr) # for common model performance metrics
library(rsample)  # for creating train/test splits

set.seed(100) # Always set seed for replication
hbo_split = initial_split(hbomax, prop = 0.75)
hbo_train = training(hbo_split)
hbo_test  = testing(hbo_split)

RMSE in training and testing data

# Simple Model
lm_simple = lm(logins ~ succession + city, data = hbo_train)

# Complex Model
lm_complex = lm(logins ~ female + city + age + I(age^2) + succession, data = hbo_train)

# Testing error for the simple model:
rmse(lm_simple, hbo_test)
[1] 2.075106
# Testing error for the complex model:
rmse(lm_complex, hbo_test)
[1] 2.080211
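For comparison, the in-sample (training) error can be computed the same way; expect the more flexible model to look better in-sample (a sketch; output not shown):

# Training error for both models
rmse(lm_simple, hbo_train)
rmse(lm_complex, hbo_train)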
  • Which model should we choose?
  • The model with the smallest out-of-sample error.
  • Out-of-sample means evaluated on the testing data.

Cross-Validation

  • To avoid relying on a single training/testing split, we can iterate over a k-fold division of our data:

[Figure: k-fold cross-validation schematic. Grey: all of the data; pink: testing fold; yellow: training folds.]

Cross-Validation

Procedure for k-fold cross-validation:

  1. Divide your data into \(K\) folds (usually \(K = 5\) or \(K = 10\)).

  2. Use fold \(k = 1\) as the testing data and folds \(k = 2, 3, \dots, K\) as the training data.

  3. Calculate the accuracy measure on the testing data, \(RMSE_k\).

  4. Repeat, letting each fold \(k\) serve once as the testing data.

  5. Average \(RMSE_k\) over all \(k\).

Main advantage: Use the entire dataset for training AND testing.
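
To make the procedure concrete, here is a sketch of 10-fold CV done by hand, using the rsample and modelr packages loaded earlier (the caret call below automates all of this):

set.seed(100)
folds = vfold_cv(hbomax, v = 10)   # step 1: divide into K = 10 folds
# steps 2-4: for each fold, train on the other K - 1 folds and
# compute RMSE on the held-out fold
rmse_k = sapply(folds$splits, function(s) {
  fit = lm(logins ~ succession + city, data = analysis(s))
  rmse(fit, assessment(s))
})
mean(rmse_k)                       # step 5: average RMSE_k over all folds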

Cross-validation with caret

  • Install (if needed) and load the library caret:
library(caret)
set.seed(100)
train.control = trainControl(method = "cv", number = 10)

lm_simple = train(logins ~ succession + city, data = hbomax,
                  method = "lm", trControl = train.control)

lm_simple
Linear Regression 

5000 samples
   2 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 4500, 4501, 4499, 4500, 4500, 4501, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  2.087314  0.6724741  1.639618

Tuning parameter 'intercept' was held constant at a value of TRUE

Stepwise selection

  • We have seen how to choose between some given models. But what if we want to test all possible models?

  • Stepwise selection: Computationally-efficient algorithm to select a model based on the data we have (subset selection).

  • Algorithm for forward stepwise selection:

  1. Start with the null model \(\mathcal{M}_0\) (no predictors).

  2. For \(k = 0, 1, \dots, p - 1\): (a) Consider all \(p - k\) models that augment \(\mathcal{M}_k\) with one additional predictor. (b) Choose the best among these \(p - k\) models and call it \(\mathcal{M}_{k+1}\).

  3. Select the single best model from \(\mathcal{M}_0, \dots, \mathcal{M}_p\) using CV.

  • Backwards stepwise follows the same procedure, but starts with the full model.

Stepwise selection and CV

set.seed(100)
# Linear Regression with Forward Selection
# Remove unsubscribe
train.control = trainControl(method = "cv", number = 10) #set up a 10-fold cv
lm.fwd = train(logins ~ . - unsubscribe, data = hbomax, method = "leapForward",
               tuneGrid = data.frame(nvmax = 1:5), trControl = train.control)
lm.fwd$results
  nvmax     RMSE    Rsquared      MAE     RMSESD  RsquaredSD      MAESD
1     1 3.643876 0.001423859 3.168804 0.05856896 0.001837302 0.08173805
2     2 3.643778 0.002541723 3.168174 0.06094142 0.003036447 0.08474783
3     3 3.186594 0.206309738 2.719227 0.62445616 0.282844240 0.59591617
4     4 2.580810 0.468546464 2.125469 0.62430763 0.278310925 0.59607716
5     5 2.087951 0.672274342 1.640141 0.04906724 0.014296583 0.04888083
  • Which one would you choose out of the 5 models? Why?
  • The model with the smallest RMSE, which is model 5.
  • Can we see this model?

Stepwise selection and CV

  • And what does that model look like?
summary(lm.fwd$finalModel)
Subset selection object
6 Variables  (and intercept)
           Forced in Forced out
X              FALSE      FALSE
female         FALSE      FALSE
city           FALSE      FALSE
age            FALSE      FALSE
succession     FALSE      FALSE
id             FALSE      FALSE
1 subsets of each size up to 5
Selection Algorithm: forward
         X   id  female city age succession
1  ( 1 ) " " " " " "    " "  " " "*"       
2  ( 1 ) " " " " " "    "*"  " " "*"       
3  ( 1 ) " " " " " "    "*"  "*" "*"       
4  ( 1 ) " " " " "*"    "*"  "*" "*"       
5  ( 1 ) "*" " " "*"    "*"  "*" "*"       
  • The selected 5-variable model has the following variables:

  • X, female, city, age, succession
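
To inspect the coefficients of that 5-variable model, the regsubsets object stored by caret can be queried directly (a sketch):

# Coefficients of the 5-predictor model chosen by forward selection
coef(lm.fwd$finalModel, id = 5)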

Conclusion

  • In prediction, everything is going to be about:

  • Bias vs Variance

  • Importance of validation sets

  • We now have methods to select models