Data Science for Business Applications

Class 09 - Building models for explanation

Understanding vs predicting

  • In a purely predictive model, we should include terms that reduce out-of-sample prediction error.
  • With lots of variables, many models will have essentially identical out-of-sample predictive error, and we can use any of them.
  • If we are trying to understand the relationships between a subset of the \(X\) variables and \(Y\), the model form is much more important.

Variables in the study

  • total_sleep_time (TST): Average daily sleep time, recorded early in the term.
  • cumulative_gpa: GPA at the beginning of the term.
  • midterm: Midterm grade, recorded midway through the term.
  • term_gpa: GPA at the end of the term.
  • study: Four studies using different cohorts at different universities:
    • Studies 2 and 3: “large public university”
    • Study 4: “private Catholic university”
    • Study 5: “private, STEM-focused university”
  • race: 1 if either parent is from an under-represented group (not White or Asian).
  • first_gen: 1 if neither parent attended any college.
  • gender: 1 if reported male, 0 otherwise.
  • height: height in inches.

Model 1: Predicting term GPA from total sleep time

model1 = lm(term_gpa ~ total_sleep_time, data=sleep)
summary(model1)

Call:
lm(formula = term_gpa ~ total_sleep_time, data = sleep)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.99561 -0.21082  0.09999  0.36996  0.71862 

Coefficients:
                 Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       2.80814    0.17241  16.287 < 0.0000000000000002 ***
total_sleep_time  0.09619    0.02570   3.742             0.000201 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4844 on 552 degrees of freedom
Multiple R-squared:  0.02474,   Adjusted R-squared:  0.02298 
F-statistic:    14 on 1 and 552 DF,  p-value: 0.0002015

confint(model1)
                      2.5 %    97.5 %
(Intercept)      2.46947087 3.1468058
total_sleep_time 0.04570081 0.1466792
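
The confidence interval above can be reproduced by hand from the summary output: estimate \(\pm\) \(t^*\) \(\times\) SE, where \(t^*\) comes from a \(t\) distribution with 552 residual degrees of freedom.

```r
# Rebuild the 95% CI for total_sleep_time from the summary output:
est <- 0.09619
se  <- 0.02570
est + c(-1, 1) * qt(0.975, df = 552) * se
# → approximately 0.0457 and 0.1467, matching confint(model1)
```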

Building a model for understanding

  • We would like to be able to conclude that more sleep causes higher GPAs, but this model will not let us do that.
  • We can, however, rule out potential confounders by adding them to the model (along with other predictive variables), while still avoiding causal language.
  • How do we decide which variables to include?

How do we decide which variables to include?

  • Theory-driven, often “pre-registered” to avoid \(p\)-hacking.
  • \(p\)-hacking: running statistical tests on a set of data until some statistically significant result arises.
  • This can happen in several ways: stopping data collection once \(p < 0.05\) is reached; analyzing many outcomes but reporting only those with \(p < 0.05\); selectively including covariates; excluding participants; and so on.
  • Data-driven, but be careful — we don’t want to overfit!
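
To see why \(p\)-hacking works, here is a hypothetical simulation (not from the study): test 20 pure-noise outcomes against the same predictor. None is truly related to \(x\), yet some tests cross \(p < 0.05\) by chance alone.

```r
# Simulate p-hacking: 20 outcomes with no true relationship to x
set.seed(1)
n <- 100
x <- rnorm(n)
# p-value for the slope in each of 20 regressions of pure noise on x
pvals <- replicate(20, summary(lm(rnorm(n) ~ x))$coefficients[2, 4])
sum(pvals < 0.05)   # expect about 1 in 20 "significant" results by chance
```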

Variables to always include

  • Confounders are variables that causally influence both total sleep time and term GPA
    • Because confounders are correlated with sleep time, the sleep-time coefficient will change if they are omitted from the model.
    • To ensure we are measuring the effect of sleep and not the confounder, they must be in the model!
  • Variables that are strongly predictive of term GPA and are (nearly) uncorrelated with total sleep time.
    • By reducing the residual standard error, these variables reduce the size of confidence intervals for the regression coefficient of interest.
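
A hypothetical simulation (not the study's data) shows the omitted-confounder problem: a confounder \(z\) (say, conscientiousness) raises both sleep time and GPA. The true sleep effect is 0.05, but leaving \(z\) out biases the estimate upward.

```r
# Confounder z influences both sleep_hrs and gpa
set.seed(1)
n <- 1000
z <- rnorm(n)
sleep_hrs <- 7 + 0.5 * z + rnorm(n)
gpa <- 1 + 0.05 * sleep_hrs + 0.3 * z + rnorm(n, sd = 0.3)
coef(lm(gpa ~ sleep_hrs))["sleep_hrs"]       # biased: roughly 0.17
coef(lm(gpa ~ sleep_hrs + z))["sleep_hrs"]   # close to the true 0.05
```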

Variables you may want to include

  • Variables that are weakly predictive of term GPA and (nearly) uncorrelated with total sleep time.
    • Is the more complex model worth the extra predictive power?
  • Variables that someone might think are confounders, even if you don’t agree (or they don’t appear to be after seeing the data).
    • This lends your analysis some credibility.

Variables you should NEVER include

  • Variables that are causally influenced by the outcome \(Y\) (term GPA).
    • These variables can introduce bias, especially if they’re also causally influenced by total sleep time (a collider).
    • This can happen by accident when constructing the final dataset (e.g., if we excluded students who dropped out at the end of the term).
  • Variables that are causally influenced by the \(X\) variable of interest (total sleep time).
    • Adjusting for these variables removes a part of the effect we’re interested in!
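
A hypothetical simulation of the post-treatment problem: suppose the midterm grade is caused by sleep, so adjusting for it removes part of the sleep effect. Here the true total effect of an extra hour of sleep is \(0.05\) (direct) \(+\ 5 \times 0.01\) (via the midterm) \(= 0.10\).

```r
# midterm sits on the causal path from sleep to term GPA
set.seed(1)
n <- 1000
sleep_hrs <- rnorm(n, mean = 7)
midterm <- 70 + 5 * sleep_hrs + rnorm(n, sd = 5)
gpa <- 1 + 0.05 * sleep_hrs + 0.01 * midterm + rnorm(n, sd = 0.3)
coef(lm(gpa ~ sleep_hrs))["sleep_hrs"]             # ~0.10, the total effect
coef(lm(gpa ~ sleep_hrs + midterm))["sleep_hrs"]   # ~0.05: part of the effect removed
```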

Report findings from multiple models:

  • Models with few or no other variables.
  • Models that adjust for theory-driven confounders and predictors.
  • Models that add data-driven confounders and predictors — specifically note these!
  • Models that add or remove subsets of potential confounders, to assess how robust the findings are.
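
A hedged sketch of that reporting workflow, using simulated data whose column names mirror the study's: fit the sleep coefficient under several adjustment sets and compare.

```r
# Simulated stand-in for the sleep data (not the real dataset)
set.seed(1)
n <- 500
sleep <- data.frame(cumulative_gpa = rnorm(n, mean = 3, sd = 0.4),
                    first_gen = rbinom(n, 1, 0.2))
sleep$total_sleep_time <- 7 + 0.3 * sleep$cumulative_gpa + rnorm(n)
sleep$term_gpa <- 1 + 0.05 * sleep$total_sleep_time +
  0.6 * sleep$cumulative_gpa - 0.1 * sleep$first_gen + rnorm(n, sd = 0.3)

# The sleep coefficient under increasingly rich adjustment sets
specs <- list(term_gpa ~ total_sleep_time,
              term_gpa ~ total_sleep_time + cumulative_gpa,
              term_gpa ~ total_sleep_time + cumulative_gpa + first_gen)
sapply(specs, function(f) coef(lm(f, data = sleep))["total_sleep_time"])
```

If the estimates stay similar across specifications (here, all near the simulated true value of 0.05), the finding is robust to the choice of adjustment set.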

So, what did we find?

  • After adjusting for cohort, cumulative GPA, and demographic variables, each additional hour of sleep is associated with an increase in predicted GPA of about 0.05 points, with a 95% CI of about [0.03, 0.11].
  • The effect estimate is robust to the specific adjustment for demographic variables hypothesized to correlate with sleep patterns and academic performance — this strengthens our argument!
  • In each case, our findings are statistically significant, and the estimated effects (and their plausible values based on the CIs) are all positive.
  • In other words, you should probably get more sleep!

Multicollinearity

Multicollinearity exists whenever two or more predictors in a regression model are moderately or highly correlated.

  • If two predictors \(X_1\) and \(X_2\) are highly correlated, it is hard to estimate the effect of changing \(X_1\) while keeping \(X_2\) constant.
  • This means the coefficients for \(X_1\) and \(X_2\) could be misleading: we will get wide confidence intervals for \(X_1\) and/or \(X_2\), and the coefficients won’t be stable (adding new data or predictors to the model could drastically change them).
  • That’s a problem if we’re using the size of a coefficient to understand the relationship between \(X_1\) (or \(X_2\)) and \(Y\).

Correlation between the response and the predictors is good, but correlation between the predictors is not!
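
A hypothetical simulation makes the standard-error inflation concrete: the same true coefficients, estimated once with correlated predictors and once with uncorrelated ones.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1 (highly correlated)
x3 <- rnorm(n)                  # independent of x1
y_corr   <- 1 + x1 + x2 + rnorm(n)
y_uncorr <- 1 + x1 + x3 + rnorm(n)
summary(lm(y_corr ~ x1 + x2))$coefficients[, "Std. Error"]    # inflated SEs
summary(lm(y_uncorr ~ x1 + x3))$coefficients[, "Std. Error"]  # much smaller SEs
```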

Extreme examples

  • Predicting your height based on the amount you weighed on one scale (\(X_1\)) and the amount you weighed on a second scale (\(X_2\)): very hard to disentangle the effects of \(X_1\) and \(X_2\).
  • Predicting your height based on your weight in pounds (\(X_1\)) and your weight in kilograms (\(X_2\)): perfect multicollinearity — impossible to disentangle the effects of \(X_1\) vs \(X_2\).
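
A quick sketch of the pounds-vs-kilograms case (simulated data): weight_kg is an exact linear function of weight_lb, so lm() cannot estimate both coefficients and reports NA for the redundant one.

```r
set.seed(1)
n <- 50
weight_lb <- rnorm(n, mean = 150, sd = 20)
weight_kg <- weight_lb / 2.20462          # exact linear function of weight_lb
height <- 40 + 0.2 * weight_lb + rnorm(n)
coef(lm(height ~ weight_lb + weight_kg))  # weight_kg coefficient is NA
```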

Example: Math SAT predicts graduation rate above and beyond acceptance rate

model1 = lm(Graduation.rate ~ Average.math.SAT + Acceptance.rate, data = colleges)
summary(model1)

Call:
lm(formula = Graduation.rate ~ Average.math.SAT + Acceptance.rate, 
    data = colleges)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.018 -10.076  -0.467   9.891  74.037 

Coefficients:
                   Estimate Std. Error t value             Pr(>|t|)    
(Intercept)      -18.680894   6.703758  -2.787              0.00547 ** 
Average.math.SAT   0.155968   0.009455  16.496 < 0.0000000000000002 ***
Acceptance.rate    0.016856   0.040725   0.414              0.67908    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.45 on 725 degrees of freedom
  (574 observations deleted due to missingness)
Multiple R-squared:  0.3104,    Adjusted R-squared:  0.3085 
F-statistic: 163.2 on 2 and 725 DF,  p-value: < 0.00000000000000022

But not after adding verbal SAT?!?

model2 = lm(Graduation.rate ~ Average.math.SAT + Average.verbal.SAT + Acceptance.rate, data = colleges)
summary(model2)

Call:
lm(formula = Graduation.rate ~ Average.math.SAT + Average.verbal.SAT + 
    Acceptance.rate, data = colleges)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.487  -9.230  -0.139   8.968  73.071 

Coefficients:
                    Estimate Std. Error t value       Pr(>|t|)    
(Intercept)        -27.93587    6.67769  -4.183 0.000032235072 ***
Average.math.SAT     0.03063    0.02145   1.428          0.154    
Average.verbal.SAT   0.15736    0.02433   6.468 0.000000000183 ***
Acceptance.rate      0.02072    0.03963   0.523          0.601    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.03 on 724 degrees of freedom
  (574 observations deleted due to missingness)
Multiple R-squared:  0.3481,    Adjusted R-squared:  0.3454 
F-statistic: 128.9 on 3 and 724 DF,  p-value: < 0.00000000000000022