This is what we covered in previous classes:
- Simple and Multiple Regression
- Categorical Variables and Interactions
- Residual Analysis
- Nonlinear Transformations
- Time Series
- Model Selection
Today we will introduce a new model.
The OkCupid data set contains information about 59826 profiles from users of the OkCupid online dating service. Suppose we want to predict the sex of a user based on their height. (In this data set, everyone is classified as male or female.)
\[ \widehat{\text{sex}} = \widehat{\beta}_{0} + \widehat{\beta}_{1} \cdot \text{height} \]
But sex is categorical (male or female), so regular linear regression won't work here.
One idea is to convert sex to a quantitative dummy variable: male = 1, female = 0. But what would it mean to predict male = 0.7 (or 1.2)? Instead of predicting whether someone is male, let's predict the probability that they are male.
In logistic regression, one level of \(Y\) is always called “success” and the other called “failure.” Since \(Y = 1\) for males, in our setup we have designated males as “success.” (You could also set \(Y = 1\) for females and call females “success.”)
Let’s fit a curve that is always between 0 and 1.
To fit the logistic regression model, we first need to understand the difference between odds and probability and how the two are related.
When something has “even (1/1) odds,” the probability of success is 1/2.
When something has “2/1 odds,” the probability of success is 2/3.
When something has “3/2 odds,” the probability of success is 3/5.
In general, the odds of something happening are \(p/(1 − p)\).
where \(p\) is a probability between zero and one.
You can transform odds to probability: \[ \text{Odds} = \frac{3}{2} = \frac{3/(3+2)}{2/(3+2)} = \frac{3/5}{2/5} = \frac{p}{1-p} \]
If the odds are between zero and one, they are not in your favor: \((1-p) > p\).
Let's explore this relation!
Probability \(p\) | Odds \(p/(1 − p)\) | Log odds \(\log(p/(1 − p))\) |
---|---|---|
0 | 0 | \(-\infty\) |
0.25 | 0.33 | −1.10 |
0.5 | 1 | 0 |
0.75 | 3 | 1.10 |
0.8 | 4 | 1.39 |
0.9 | 9 | 2.20 |
0.95 | 19 | 2.94 |
1 | \(\infty\) | \(\infty\) |
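As a quick check, here is a small R sketch that reproduces the finite rows of this table:

```r
# Compute odds and log odds for a few probabilities
p <- c(0.25, 0.5, 0.75, 0.8, 0.9, 0.95)
odds <- p / (1 - p)
round(data.frame(p, odds, log.odds = log(odds)), 2)
```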
Logistic regression models the log odds of success as a linear function of \(X\): \[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \cdot X \]
This fits an "S-shaped" curve to the data (we'll see what it looks like later).
By making this choice, the predicted probability is guaranteed to stay between 0 and 1, and the coefficients have a natural interpretation in terms of odds.
Let’s try it!
In R, we fit a logistic regression with the glm() function (generalized linear models), specifying family = binomial.
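A minimal sketch of the fitting code (assuming sex is coded "m"/"f", as in the output below):

```r
library(dplyr)

# Create the 0/1 dummy variable: 1 = male, 0 = female
okcupid <- okcupid %>% mutate(male = ifelse(sex == "m", 1, 0))

# family = binomial tells glm() to fit a logistic regression
model <- glm(male ~ height, family = binomial, data = okcupid)
summary(model)
```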
```
Call:
glm(formula = male ~ height, family = binomial, data = okcupid)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -44.448609   0.357510  -124.3   <2e-16 ***
height        0.661904   0.005293   125.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 80654  on 59825  degrees of freedom
Residual deviance: 44637  on 59824  degrees of freedom
AIC: 44641

Number of Fisher Scoring iterations: 6
```
The logistic regression output tells us that our prediction is \[ \log(\text{odds}) = \log\left(\frac{\widehat{p(\text{male})}}{1-\widehat{p(\text{male})}}\right) = −44.45 + 0.66 \cdot \text{height} \]
To get the probability, we have to solve for \(\widehat{p(\text{male})}\).
The probability of being male given height is: \[
\widehat{p(\text{male})} = \frac{\exp(−44.45 + 0.66 \cdot \text{height})}{1+ \exp(−44.45 + 0.66 \cdot \text{height})}
\] where \(\exp()\) is the exponential function \(e^x\).
Let \(\widehat{p} = \widehat{p(\text{male})}\), and \(\exp(X\widehat{\beta}) = \exp(−44.45 + 0.66 \cdot \text{height})\):
\[ \begin{eqnarray} \log\left(\frac{\widehat{p}}{1-\widehat{p}}\right) &=& X\widehat{\beta} \\ \exp\left(\log\left(\frac{\widehat{p}}{1-\widehat{p}}\right)\right) &=& \exp(X\widehat{\beta}) \\ \frac{\widehat{p}}{1-\widehat{p}} &=& \exp(X\widehat{\beta})\\ \widehat{p} &=& \exp(X\widehat{\beta})\cdot (1-\widehat{p})\\ \widehat{p} &=&\exp(X\widehat{\beta}) - \exp(X\widehat{\beta}) \cdot \widehat{p} \\ \widehat{p}\cdot\left(1 + \exp(X\widehat{\beta})\right) &=& \exp(X\widehat{\beta}) \\ \widehat{p} &=& \frac{\exp(X\widehat{\beta})}{1 + \exp(X\widehat{\beta})} \end{eqnarray} \]
Let's interpret the coefficient on height. Our prediction equation is:
\[ \log(\text{odds}) = \log\left(\frac{\widehat{p(\text{male})}}{1-\widehat{p(\text{male})}}\right) = −44.45 + 0.66 \cdot \text{height} \]
Let’s start with some basic, but not particularly useful, interpretations:
When height = 0, we predict that the log odds will be −44.45, so the probability of male is predicted to be very close to 0%.
When height increases by 1 inch, we predict that the log odds of being male will increase by 0.66.
Instead of log odds, it is more natural to interpret the effect in terms of odds.
Let's rewrite the prediction equation as: \[ \text{predicted odds of male} = \exp(−44.45 + 0.66 \cdot \text{height}) \]
Increasing height by 1 inch will multiply the odds by \(\exp(0.66) = 1.94\); i.e., increase the odds by 94%.
In summary, \[(\exp(\widehat{\beta}) - 1)\times 100 = \text{percentage change in odds}.\]
Increasing height by 2 inches will multiply the odds by \(\exp(2\cdot0.66) = 3.76\); i.e., increase the odds by 276%.
Odds equal to 1 indicate a one-to-one (even) chance.
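These multipliers can be read straight off the fitted model object from above; a small sketch:

```r
# Odds multiplier for one extra inch of height
exp(coef(model)["height"])               # about 1.94
# Percent change in odds per inch
(exp(coef(model)["height"]) - 1) * 100   # about 94%
```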
For example, for a user who is 69 inches tall, the predicted probability of being male (using the unrounded coefficients) is: \[ \widehat{p(\text{male})} = \frac{\exp(−44.45 + 0.66 \cdot 69)}{1+ \exp(−44.45 + 0.66 \cdot 69)} = 0.77 \]
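In R, the same number comes from the inverse-logit function plogis(), again using the model fitted above:

```r
# Inverse logit by hand and via the built-in plogis()
xb <- coef(model)[1] + coef(model)[2] * 69
exp(xb) / (1 + exp(xb))
plogis(xb)   # same value, about 0.77
```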
Adding another predictor: can we do better?
Just like with a linear regression model, we can add additional predictors to the model.
Our interpretation of the coefficients in multiple logistic regression is similar to multiple linear regression, in the sense that each coefficient represents the predicted effect of one \(X\) on \(Y\), holding the other \(X\) variables constant.
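For instance, a sketch of a two-predictor model, assuming (hypothetically) that the data also contain an age column:

```r
# Hypothetical additional predictor: age
model2 <- glm(male ~ height + age, family = binomial, data = okcupid)
summary(model2)
```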
We could use our model to make a prediction of sex based on the probability.
Suppose we say that our prediction is: \[ \text{Prediction} = \begin{cases} \text{male}, & \text{if $\widehat{P(\text{male})} \geq 0.5$}, \\ \text{female}, & \text{if $\widehat{P(\text{male})} < 0.5$}. \\ \end{cases} \]
Given a threshold of 0.5, we can now compute the fraction of individuals, both male and female, whose sex we correctly predicted.
This is known as the accuracy of the model.
We can use the xtabs function to compute the accuracy: we add the number of correctly predicted individuals for both males and females, and divide by the total number of observations.
```r
# Predict "m" when the fitted probability of male is at least 0.5
okcupid <- okcupid %>%
  mutate(predict.sex = ifelse(predict(model, type = "response") >= 0.5, "m", "f"))

# Cross-tabulate predictions against actual sex, with row/column totals
xtabs(~ predict.sex + sex, okcupid) %>%
  addmargins()
```
```
           sex
predict.sex     f     m   Sum
        f   19466  5494 24960
        m    4623 30243 34866
        Sum 24089 35737 59826
```
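From this table, the accuracy is the sum of the diagonal (correct predictions) over the total:
\[ \text{Accuracy} = \frac{19466 + 30243}{59826} \approx 0.83 \]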
The table produced by xtabs is also called a confusion matrix:

 | Actual failure | Actual success |
---|---|---|
Model predicts failure | True negative | False negative |
Model predicts success | False positive | True positive |
Suppose that Amazon is trying to build a model to predict which customers will buy a certain product:
Suppose that 0.01% of people are customers of this product (the product is really expensive / high revenue).
A "null" or "no-brainer" model that predicts that no one is a customer will be 99.99% accurate.
But the revenue coming from that model would be zero.
Accuracy alone can therefore be misleading; a model can make two different kinds of prediction errors:
False positive: predicting someone is a customer when they really are not
False negative: predicting someone is not a customer when they really are
The rates of these two errors give us a better idea of the predictive power of our model.
The false positive rate is the proportion of actual failures where the model predicted success.
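From the confusion matrix above (with male as "success"):
\[ \text{False positive rate} = \frac{4623}{24089} \approx 0.19 \]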
The false negative rate is the proportion of actual successes where the model predicted failure.
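Similarly:
\[ \text{False negative rate} = \frac{5494}{35737} \approx 0.15 \]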
How do we reduce the false positive or false negative rate? One option is to change the cutoff, for example from 0.5 to 0.8:
```r
# Raise the cutoff: predict "m" only when the fitted probability is at least 0.8
okcupid <- okcupid %>%
  mutate(predict.sex = ifelse(predict(model, type = "response") >= 0.8, "m", "f"))

xtabs(~ predict.sex + sex, okcupid) %>%
  addmargins()
```
```
           sex
predict.sex     f     m   Sum
        f   22650 12938 35588
        m    1439 22799 24238
        Sum 24089 35737 59826
```
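With this cutoff, the false positive rate drops to \(1439/24089 \approx 0.06\), while the false negative rate rises to \(12938/35737 \approx 0.36\).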
We can decrease the false positive rate, but at the expense of increasing the false negative rate.
Or we can decrease the false negative rate, but at the expense of increasing the false positive rate.
We might choose a cutoff probability other than 50% based on our assessment of the relative costs of the two different kinds of errors.
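As a quick sketch, we can tabulate both error rates for a few candidate cutoffs, reusing the model and data from above (and assuming no missing heights, so predictions align with the data rows):

```r
# Error rates at several cutoffs
p.hat <- predict(model, type = "response")
for (cut in c(0.3, 0.5, 0.8)) {
  pred <- ifelse(p.hat >= cut, "m", "f")
  fpr <- mean(pred[okcupid$sex == "f"] == "m")  # actual failures predicted as success
  fnr <- mean(pred[okcupid$sex == "m"] == "f")  # actual successes predicted as failure
  cat(sprintf("cutoff %.1f: FPR %.3f, FNR %.3f\n", cut, fpr, fnr))
}
```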
In a logistic regression model, the response variable is binary, taking values of either zero or one.
The model estimates the log odds of the event associated with a response of one.
The model’s effects are interpreted in terms of odds.
Predictions are expressed as probabilities.
The performance of the model is evaluated based on its prediction accuracy and its false positive and false negative rates.