Data Science for Business Applications

Class 02 - Categorical Variables

Categorical predictors with 2 categories

What’s the impact of gender on student evaluations?

Incorporating gender into the multiple regression model

The model is given by: \[ \texttt{eval} = \beta_0 + \beta_1 \cdot \texttt{beauty} + \beta_2 \cdot \texttt{gender} + \epsilon \] Where we have that:

  • \(\texttt{eval}\): is the response variable - (numerical)

  • \(\texttt{beauty}\): is a predictor - (numerical)

  • \(\texttt{gender}\): is a predictor - (categorical) - two groups:

    • female
    • male

Incorporating gender into the multiple regression model

  • Gender is a categorical variable (male or female in this data set) so we can’t use it as-is as a predictor.
  • Idea: Recode gender into the quantitative variable 1 = male, 0 = female. In practice this choice is arbitrary.
  • R does this for us!
  • If you put a categorical variable into a model, R orders the groups alphabetically: the first group is coded as “0” (the reference group) and the next as “1”.
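To see the recoding described above, here is a minimal sketch on a small made-up data frame. The `toy` data and its values are hypothetical and are not the `profs` data used in these slides. `model.matrix()` shows the 0/1 column that `lm()` would build, and `relevel()` is one way to change which group is coded 0.

```r
# Hypothetical toy data (NOT the profs data set from the slides)
toy <- data.frame(
  eval   = c(4.1, 3.8, 4.5, 3.6, 4.0, 4.3),
  beauty = c(0.2, -0.5, 1.0, -1.1, 0.1, 0.6),
  gender = c("male", "female", "male", "female", "female", "male")
)

# Converting to a factor orders the levels alphabetically: female, then male
toy$gender <- factor(toy$gender)
levels(toy$gender)

# The design matrix lm() would use: the gendermale column is
# 1 for male rows and 0 for female rows
X <- model.matrix(~ beauty + gender, data = toy)
X

# relevel() makes a different group the reference (coded 0)
toy$gender_ref_male <- relevel(toy$gender, ref = "male")
levels(toy$gender_ref_male)
```

Because the first level is the reference, the coefficient is labeled `gendermale`: it measures how males differ from females.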

Run the Regression Model

options(scipen = 999)
model1 <- lm(eval ~ beauty + gender, data=profs)
summary(model1)

Call:
lm(formula = eval ~ beauty + gender, data = profs)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.87196 -0.36913  0.03493  0.39919  1.03237 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  3.88377    0.03866  100.47 < 0.0000000000000002 ***
beauty       0.14859    0.03195    4.65           0.00000434 ***
gendermale   0.19781    0.05098    3.88              0.00012 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5373 on 460 degrees of freedom
Multiple R-squared:  0.0663,    Adjusted R-squared:  0.06224 
F-statistic: 16.33 on 2 and 460 DF,  p-value: 0.0000001407

Regression model using gender

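A multiple regression predicting evaluation score from beauty and gender effectively fits two parallel regression lines, one for female professors and one for male professors. Both lines share the same beauty slope (0.149), and the male line sits 0.198 points higher (the gendermale coefficient).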
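As a worked check, plugging the estimates from the summary output into the model gives the two lines explicitly. Setting \(\texttt{gender} = 0\) (female) and \(\texttt{gender} = 1\) (male):

\[ \text{female:} \quad \widehat{\texttt{eval}} = 3.88377 + 0.14859 \cdot \texttt{beauty} \]

\[ \text{male:} \quad \widehat{\texttt{eval}} = (3.88377 + 0.19781) + 0.14859 \cdot \texttt{beauty} = 4.08158 + 0.14859 \cdot \texttt{beauty} \]

The slopes are identical, so the lines never cross. The vertical gap between them is \(\beta_2 = 0.19781\) at every value of beauty.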

Categorical predictors with 3+ categories

Is there a generation gap?

  • The variable generation is either silent (born before 1945), boomer (born 1945-1964), or genx (born 1965 or later).
  • Is generation a significant predictor of evaluations above and beyond gender and beauty?
  • To answer this, we need a model that includes as predictors all of gender, beauty, and generation.
  • But we can’t just create a variable that is 0 for the silent generation, 1 for baby boomers, and 2 for gen X—why not?
  • Solution is to pick a “reference category” and create dummy variables for the other categories.
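A minimal sketch of why the 0/1/2 recoding is too restrictive, using made-up numbers. The `toy` data and the names `gen_code`, `fit_numeric`, and `fit_dummy` are hypothetical. A single numeric code gets a single slope, which forces the silent-to-boomer difference to equal the boomer-to-genx difference. Dummy variables estimate each difference separately.

```r
# Hypothetical toy data: group means are 4.1 (boomer), 3.9 (genx), 3.1 (silent)
toy <- data.frame(
  eval       = c(4.0, 4.2, 4.1, 3.9, 4.0, 3.8, 3.2, 3.0, 3.1),
  generation = rep(c("boomer", "genx", "silent"), each = 3)
)

# Numeric recoding: silent = 0, boomer = 1, genx = 2
codes <- c(silent = 0, boomer = 1, genx = 2)
toy$gen_code <- unname(codes[toy$generation])

# One slope: every one-step change in gen_code shifts the prediction equally
fit_numeric <- lm(eval ~ gen_code, data = toy)

# Dummy coding: one coefficient per non-reference category
fit_dummy <- lm(eval ~ factor(generation), data = toy)

coef(fit_numeric)  # intercept + a single slope
coef(fit_dummy)    # intercept + two separate differences from boomer
```

In the dummy-coded fit the intercept is the boomer mean, and the other two coefficients are the genx and silent differences from it. The numeric coding cannot represent these unequal gaps.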

OK boomer

Let’s arbitrarily pick boomers as a reference category:

Category      genx   silent
Boomers       0      0
Gen Xers      1      0
Silent Gens   0      1

R will do this automatically when you add a categorical variable with 3+ categories to a regression! By default it uses the alphabetically first category as the reference, which here happens to be boomer.
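The dummy table above can be reproduced with `model.matrix()`. This is a sketch on a hypothetical three-row data frame with one professor per generation, not the `profs` data. The `relevel()` call shows one way to choose a different reference category.

```r
# Hypothetical example: one row per generation
toy <- data.frame(generation = factor(c("boomer", "genx", "silent")))

# The first level is the reference category
levels(toy$generation)

# Columns generationgenx and generationsilent; the boomer row is 0 0
D <- model.matrix(~ generation, data = toy)
D

# Choosing silent as the reference category instead
toy$generation2 <- relevel(toy$generation, ref = "silent")
D2 <- model.matrix(~ generation2, data = toy)
colnames(D2)
```

Changing the reference category changes which comparisons the coefficients report, but not the fitted values of the model.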

Run the Regression Model

model2 <- lm(eval ~ beauty + generation, data=profs)
summary(model2)

Call:
lm(formula = eval ~ beauty + generation, data = profs)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.82584 -0.36581  0.06317  0.42027  1.07809 

Coefficients:
                 Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       4.02537    0.03201 125.742 < 0.0000000000000002 ***
beauty            0.13491    0.03360   4.015            0.0000694 ***
generationgenx   -0.06807    0.06181  -1.101                0.271    
generationsilent -0.08225    0.07889  -1.043                0.298    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5455 on 459 degrees of freedom
Multiple R-squared:  0.03983,   Adjusted R-squared:  0.03355 
F-statistic: 6.346 on 3 and 459 DF,  p-value: 0.0003192

Analysis

All else equal (i.e., among professors with the same beauty score, since model2 includes beauty and generation but not gender):

  • Gen X professors are predicted to get scores that are 0.07 points below those of boomers.
  • Silent gen professors are predicted to get scores that are 0.08 points below those of boomers.
  • Neither difference is statistically significant: the p-values are 0.271 for Gen X and 0.298 for the silent generation, both well above 0.05.
  • In other words: once beauty is accounted for, this model shows no significant evidence of a generation gap.
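As a worked check, the estimates from the model2 output translate into one fitted line per generation. All three lines share the beauty slope and differ only in their intercepts:

\[ \text{boomer:} \quad \widehat{\texttt{eval}} = 4.02537 + 0.13491 \cdot \texttt{beauty} \]

\[ \text{genx:} \quad \widehat{\texttt{eval}} = (4.02537 - 0.06807) + 0.13491 \cdot \texttt{beauty} = 3.95730 + 0.13491 \cdot \texttt{beauty} \]

\[ \text{silent:} \quad \widehat{\texttt{eval}} = (4.02537 - 0.08225) + 0.13491 \cdot \texttt{beauty} = 3.94312 + 0.13491 \cdot \texttt{beauty} \]

Each dummy coefficient is the vertical gap between that generation's line and the boomer line.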