We would like to be able to conclude that more sleep causes higher GPAs, but this model will not let us do that.
We can, however, rule out potential confounders by adding them to the model (along with other predictive variables), while still avoiding causal language.
How do we decide which variables to include?
Theory-driven, often “pre-registered” to avoid \(p\)-hacking.
\(p\)-hacking: running statistical tests on a set of data until some statistically significant results arise.
This can happen in several ways: stopping data collection once \(p < 0.05\) is reached, analyzing many outcomes but reporting only those with \(p < 0.05\), selectively including covariates, excluding participants, and so on.
Data-driven, but be careful — we don’t want to overfit!
Variables to always include
Confounders are variables that causally influence both total sleep time and term GPA.
Because confounders are correlated with sleep time, omitting them from the model will change (bias) the sleep time coefficient.
To ensure we are measuring the effect of sleep and not the confounder, confounders must be in the model! (The simulation sketch below illustrates the bias from leaving one out.)
Variables that are strongly predictive of term GPA and are (nearly) uncorrelated with total sleep time.
By reducing the residual standard error, these variables reduce the size of confidence intervals for the regression coefficient of interest.
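A minimal simulated sketch of why confounders must be included (all variable names here are made up for illustration, not taken from the study):

set.seed(1)
n          = 500
motivation = rnorm(n)                                  # confounder: drives both sleep and GPA
sleep      = 7 + 0.5 * motivation + rnorm(n)
gpa        = 3 + 0.05 * sleep + 0.3 * motivation + rnorm(n, sd = 0.3)

coef(lm(gpa ~ sleep))                # omitting the confounder biases the sleep coefficient upward
coef(lm(gpa ~ sleep + motivation))   # close to the true value of 0.05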
Variables you may want to include
Variables that are weakly predictive of term GPA and (nearly) uncorrelated with total sleep time.
Is the more complex model worth the extra predictive power?
Variables that someone might think are confounders, even if you don’t agree (or they don’t appear to be after seeing the data).
This lends your analysis some credibility.
Variables you should NEVER include
Variables that are causally influenced by the outcome \(Y\) (term GPA).
These variables can introduce bias, especially if they’re also causally influenced by total sleep time (a collider); a simulation sketch below shows how this bias can arise.
This can happen by accident when constructing the final dataset (e.g., if we excluded students who dropped out at the end of the term).
Variables that are causally influenced by the \(X\) variable of interest (total sleep time).
Adjusting for these variables removes a part of the effect we’re interested in!
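A minimal simulated sketch of collider bias (again with made-up variable names): here sleep has no effect on GPA at all, yet adjusting for a variable influenced by both creates a spurious association.

set.seed(2)
n      = 500
sleep  = rnorm(n)
gpa    = rnorm(n)                              # independent of sleep by construction
stress = -0.7 * sleep - 0.7 * gpa + rnorm(n)   # collider: influenced by both sleep and GPA

coef(lm(gpa ~ sleep))             # near zero, as it should be
coef(lm(gpa ~ sleep + stress))    # a nonzero "effect" appears once we condition on the collider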
Report findings from a range of models:
Models with few or no other variables.
Models that adjust for theory-driven confounders and predictors.
Models that add data-driven confounders and predictors — specifically note these!
Models that add or remove subsets of potential confounders, to assess how robust the findings are (a sketch of such a model sequence appears below).
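A hypothetical sketch of such a model-building sequence for the sleep study (the data frame and column names are assumptions for illustration, not the study’s actual code):

m_base   = lm(term_gpa ~ total_sleep_time, data = sleep_study)
m_theory = update(m_base, . ~ . + cohort + cumulative_gpa)           # theory-driven adjustments
m_demo   = update(m_theory, . ~ . + demographic1 + demographic2)     # placeholder demographic covariates
m_robust = update(m_demo, . ~ . - cohort)                            # robustness check: drop one adjustment

# Compare the sleep coefficient and its CI across models
sapply(list(m_base, m_theory, m_demo, m_robust),
       function(m) coef(m)["total_sleep_time"])
confint(m_demo)["total_sleep_time", ]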
So, what did we find?
After adjusting for cohort, cumulative GPA, and demographic variables, each additional hour of sleep is associated with an increase in predicted GPA of about 0.05 points, with a 95% CI of about [0.03, 0.11].
The effect estimate is robust to the specific choice of demographic adjustment variables hypothesized to correlate with both sleep patterns and academic performance, which strengthens our argument!
In each case, our findings are statistically significant, and the estimated effects (and their plausible values based on the CIs) are all positive.
In other words, you should probably get more sleep!
Multicollinearity
Multicollinearity exists whenever 2+ predictors in a regression model are moderately or highly correlated.
If two predictors \(X_1\) and \(X_2\) are highly correlated, it is hard to estimate the effect of changing \(X_1\) while keeping \(X_2\) constant.
This means the coefficients for \(X_1\) and \(X_2\) could be misleading: we will get wide confidence intervals for \(X_1\) and/or \(X_2\), and the coefficients won’t be stable (adding new data or predictors to the model could drastically change them).
That’s a problem if we’re using the size of a coefficient to understand the relationship between \(X_1\) (or \(X_2\)) and \(Y\).
Correlation between the response and the predictors is good, but correlation between the predictors is not!
Extreme examples
Predicting your height based on the amount you weighed on one scale (\(X_1\)) and the amount you weighed on a second scale (\(X_2\)): very hard to disentangle the effects of \(X_1\) and \(X_2\).
Predicting your height based on your weight in pounds (\(X_1\)) and your weight in kilograms (\(X_2\)): perfect multicollinearity — impossible to disentangle the effects of \(X_1\) vs \(X_2\).
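A simulated sketch of the pounds vs. kilograms case (made-up data): because one predictor is an exact linear function of the other, lm() cannot estimate both coefficients and drops one.

set.seed(3)
weight_lb = rnorm(100, mean = 150, sd = 20)
weight_kg = weight_lb / 2.2046                       # exact linear function of weight_lb
height    = 150 + 0.1 * weight_lb + rnorm(100, sd = 5)

summary(lm(height ~ weight_lb + weight_kg))          # weight_kg's coefficient is reported as NA (singularity)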
Example: Math SAT predicts graduation rate above and beyond acceptance rate
model1 = lm(Graduation.rate ~ Average.math.SAT + Acceptance.rate, data = colleges)
summary(model1)
Call:
lm(formula = Graduation.rate ~ Average.math.SAT + Acceptance.rate,
data = colleges)
Residuals:
Min 1Q Median 3Q Max
-44.018 -10.076 -0.467 9.891 74.037
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.680894 6.703758 -2.787 0.00547 **
Average.math.SAT 0.155968 0.009455 16.496 < 0.0000000000000002 ***
Acceptance.rate 0.016856 0.040725 0.414 0.67908
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.45 on 725 degrees of freedom
(574 observations deleted due to missingness)
Multiple R-squared: 0.3104, Adjusted R-squared: 0.3085
F-statistic: 163.2 on 2 and 725 DF, p-value: < 0.00000000000000022
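The code for the second model is not shown in the original; based on the Call line in the output below, it was presumably fit like this (the object name model2 is an assumption):

model2 = lm(Graduation.rate ~ Average.math.SAT + Average.verbal.SAT + Acceptance.rate,
            data = colleges)
summary(model2)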
Call:
lm(formula = Graduation.rate ~ Average.math.SAT + Average.verbal.SAT +
Acceptance.rate, data = colleges)
Residuals:
Min 1Q Median 3Q Max
-49.487 -9.230 -0.139 8.968 73.071
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -27.93587 6.67769 -4.183 0.000032235072 ***
Average.math.SAT 0.03063 0.02145 1.428 0.154
Average.verbal.SAT 0.15736 0.02433 6.468 0.000000000183 ***
Acceptance.rate 0.02072 0.03963 0.523 0.601
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.03 on 724 degrees of freedom
(574 observations deleted due to missingness)
Multiple R-squared: 0.3481, Adjusted R-squared: 0.3454
F-statistic: 128.9 on 3 and 724 DF, p-value: < 0.00000000000000022
How can we detect multicollinearity?
One common diagnostic is the variance inflation factor (VIF). The VIF for variable \(X_j\) is:
\[\text{VIF}(\beta_j) = \frac{1}{1 - R_j^2},\]
where \(R_j^2\) is the \(R^2\) in a regression predicting \(X_j\) from the other \(X\) variables.
\(\text{VIF}(\beta_j) = 1\) when \(R_j^2 = 0\); i.e., the \(j\)th predictor is completely uncorrelated with the other predictors.
Higher VIF for variable \(X_j\) means that variable is more closely related to the other \(X\) variables (more multicollinearity).
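For example, the VIF for Average.math.SAT in the colleges model could be computed by hand following this formula (a sketch; small differences from vif() can arise if the auxiliary regression uses a slightly different set of complete cases):

# Regress Average.math.SAT on the other predictors, then apply VIF = 1 / (1 - R^2)
aux     = lm(Average.math.SAT ~ Average.verbal.SAT + Acceptance.rate, data = colleges)
r2_math = summary(aux)$r.squared
1 / (1 - r2_math)    # VIF for Average.math.SAT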
How can we detect multicollinearity?
To calculate VIF for each predictor, you would have to run one regression for each predictor. But you don’t have to run all of these regressions by hand!
model = lm(Graduation.rate ~ Average.math.SAT + Average.verbal.SAT + Acceptance.rate, data = colleges)
# install.packages("car")
library(car)
vif(model)
There are two general strategies for dealing with multicollinearity:
Drop one of the variables with a high VIF, and rerun to see if the VIFs have improved. (Just like we drop one of the dummy variables when putting a categorical variable in the model!) A sketch of this approach appears below.
Combine the variables that correlate into a composite variable. (Combined SAT score = Math + Verbal)
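A sketch of strategy 1 using the colleges example (which SAT variable to drop is a judgment call; dropping the math score is shown here):

# Strategy 1: drop one of the highly correlated SAT variables and re-check the VIFs
model_drop = lm(Graduation.rate ~ Average.verbal.SAT + Acceptance.rate, data = colleges)
vif(model_drop)    # uses vif() from the car package, loaded above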
Dealing with multicollinearity
When we combine math and verbal SAT into a single variable, VIFs are now close to 1.
# Average.combined.SAT is assumed to be constructed as Average.math.SAT + Average.verbal.SAT
model = lm(Graduation.rate ~ Average.combined.SAT + Acceptance.rate, data = colleges)
vif(model)
When can we safely ignore multicollinearity?
When the high collinearity is among \(X\)’s that are used strictly for adjustment, not interpretation, and the coefficient we want to interpret corresponds to a variable with low multicollinearity.
When multicollinearity comes from how we construct \(X\): e.g., adding polynomial terms (like last week!), turning categories into dummy variables, or adding interactions.
When we are just trying to predict using relatively few \(X\)’s, and only predicting at “typical” \(X\) values.