A bank that extends lines of credit has a sample of 100 customers with information about whether each customer’s loan account is in good standing, along with information about each customer:
- Age: The age of the customer
- Other_Credit: Whether the customer has any other lines of credit (Yes, No)

The bank is looking to build a model to predict Status, because they would like to be able to predict which future customers are most likely to later have accounts in bad standing (so they can avoid approving those customers for lines of credit!).
Status is a categorical variable - transform it into a dummy variable by creating a new variable Bad that is 1 if Status is Bad and 0 if Status is Good.
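A minimal sketch of this transformation in R is shown here. The data frame name `banco` is taken from the model call later in this section; the four rows are hypothetical stand-ins, not the bank's data.

```r
# Hypothetical stand-in for the bank's data; only Status matters here
banco <- data.frame(Status = c("Good", "Bad", "Bad", "Good"))

# Bad is 1 when Status is "Bad" and 0 when Status is "Good"
banco$Bad <- ifelse(banco$Status == "Bad", 1, 0)
banco
```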
Why Bad? It doesn’t really matter which level is coded as 1, but Bad makes sense here because the bank is looking to predict which customers will later have accounts in bad standing.
The model below predicts Bad from Age and Other_Credit. The odds of a Bad status are defined as \[
\text{odds}(\text{Bad}) = \frac{p(\text{Bad})}{1-p(\text{Bad})},
\] where \(p(\text{Bad})\) is the probability of having a Bad status.
Call:
glm(formula = Bad ~ Age * Other_Credit, family = "binomial",
data = banco)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 18.1579 4.8547 3.740 0.000184 ***
Age -0.5871 0.1586 -3.701 0.000215 ***
Other_CreditYes -12.1921 5.2995 -2.301 0.021414 *
Age:Other_CreditYes 0.4076 0.1712 2.382 0.017241 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.63 on 99 degrees of freedom
Residual deviance: 62.96 on 96 degrees of freedom
AIC: 70.96
Number of Fisher Scoring iterations: 7
We have the following equation for predicting the log odds of having a Bad status:
\[ \log\left(\frac{p(\text{Bad})}{1-p(\text{Bad})}\right) = 18.16 -0.59 \cdot \text{Age} -12.19\cdot \text{Other_Credit(Yes)} + \\ 0.41 \cdot \text{Age} \times \text{Other_Credit(Yes)} \]
The odds model, with coefficients rounded to two decimal places as above, is given by
\[ \text{odds}(\text{Bad}) = \frac{p(\text{Bad})}{1-p(\text{Bad})} = \exp(18.16 -0.59 \cdot \text{Age} -12.19\cdot \text{Other_Credit(Yes)} + \\ 0.41 \cdot \text{Age} \times \text{Other_Credit(Yes)}) \]
Next, we give the interpretation of these coefficients.
Intercept: Setting Age = 0 and Other_Credit(Yes) = 0 (the baseline, no other credit) results in \(\log(\text{odds}) = 18.16\). Thus, the odds of having a Bad status under these conditions are \(\text{odds} = \exp(18.16) = 77,052,688\). This result has no practical meaning, since the bank cannot have account holders of age zero.
Age: To interpret the effect of Age alone, we set Other_Credit(Yes) = 0. Thus, for a one-unit increase in Age (each additional year of age), the odds of having a Bad status decrease by 44.57% (\((\exp(-0.59)-1)\cdot 100 = -44.57\)) when the account holder has no other credit.
Other_Credit: When Age = 0, the additional effect of having other credit, Other_Credit(Yes) = 1, compared to account holders without other credit (baseline) is \((\exp(-12.19)-1)\cdot100 \approx -99.9995\), that is, a decrease of nearly 100% in the odds of having a Bad status relative to account holders of age zero with no other credit. Like the intercept, this comparison at Age = 0 has no practical meaning.
Age\(\times\)Other_Credit: For each one-unit increase in Age, the multiplicative change in the odds of having a Bad status is \(\exp(0.41) = 1.5068\) times as large among account holders with other credit as among those without, that is, \((\exp(0.41)-1)\cdot 100 = 50.68\), or 50.68%, larger.
In total, the effect of a one-unit increase in Age for account holders with other credit is \((\exp(-0.59+0.41)-1)\cdot100 = -16.47\), or a decrease of 16.47% in the odds of having a Bad status for each additional year of age.
This, in turn, means that the odds of having a Bad status decrease faster with age for account holders without other credit than for those who do have other credit.
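The percent changes quoted in these interpretations can be checked with a few lines of R. This is a sketch that uses the coefficients rounded to two decimals, as in the equations above; the vector `b` and the helper `pct_change` are names introduced here for illustration.

```r
# Coefficients rounded to two decimals, as in the fitted equations
b <- c(intercept = 18.16, age = -0.59, other_yes = -12.19, age_other = 0.41)

# Percent change in the odds implied by a log-odds coefficient
pct_change <- function(beta) (exp(beta) - 1) * 100

no_credit   <- pct_change(b[["age"]])                     # per year, no other credit
with_credit <- pct_change(b[["age"]] + b[["age_other"]])  # per year, with other credit
interaction <- pct_change(b[["age_other"]])               # ratio of the two yearly multipliers

round(c(no_credit = no_credit, with_credit = with_credit,
        interaction = interaction), 2)
```

Using the unrounded estimates from the model summary would shift these figures slightly.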
As with the linear model, we end up with two models. The odds model for account holders with other credit, Other_Credit(Yes) = 1, is
\[
\text{odds}(\text{Bad}) = \exp(5.97 -0.18 \cdot \text{Age})
\] and when Other_Credit(Yes) = 0 (no other credit) \[
\text{odds}(\text{Bad}) = \exp(18.16 -0.59 \cdot \text{Age})
\]
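Both equations follow from substituting the value of Other_Credit(Yes) into the fitted log-odds equation and collecting terms, using the rounded coefficients:

\[
\begin{aligned}
\text{Other_Credit(Yes)} = 1:&\quad \log(\text{odds}(\text{Bad})) = (18.16 - 12.19) + (-0.59 + 0.41)\cdot\text{Age} = 5.97 - 0.18\cdot\text{Age} \\
\text{Other_Credit(Yes)} = 0:&\quad \log(\text{odds}(\text{Bad})) = 18.16 - 0.59\cdot\text{Age}
\end{aligned}
\]

Exponentiating each side gives the corresponding odds model.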
Suppose we have a new account for which we don’t know the status. The only information we have is the account holder’s age, Age = 35, and that this person has another line of credit. What are the predicted log odds, odds, and probability of this person having a Bad status?
Predicted log odds:
\[ \log\left(\text{odds}(\text{Bad})\right) = 5.97 -0.18 \cdot 35 = -0.33 \]
Predicted odds:
\[ \text{odds}(\text{Bad}) = \exp(5.97 -0.18 \cdot 35) = 0.72 \]
Predicted probability:
\[ p(\text{Bad}) = \frac{\text{odds}(\text{Bad})}{1+\text{odds}(\text{Bad})} = \frac{0.72}{1.72} = 0.42 \]
The predicted probability of a 35-year-old account holder with other credit having a Bad status is 0.42, or around 42%. Each of these values can also be obtained using the predict function.
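The following sketch shows one way the predict function might be used for this customer. The bank's data are not reproduced in this section, so the code simulates a stand-in data set from the rounded coefficients (an assumption made purely for illustration); its predictions will therefore be close to, but not identical to, the hand calculation. The names `new_customer`, `log_odds`, `odds`, and `prob` are introduced here.

```r
set.seed(1)  # simulated stand-in data, NOT the bank's sample
n <- 200
banco <- data.frame(
  Age = round(runif(n, 20, 60)),
  Other_Credit = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
# Generate Bad from the rounded coefficients reported in the text
eta <- 18.16 - 0.59 * banco$Age +
  (banco$Other_Credit == "Yes") * (-12.19 + 0.41 * banco$Age)
banco$Bad <- rbinom(n, 1, plogis(eta))

fit <- glm(Bad ~ Age * Other_Credit, family = "binomial", data = banco)

# A 35-year-old customer with another line of credit
new_customer <- data.frame(
  Age = 35,
  Other_Credit = factor("Yes", levels = c("No", "Yes"))
)

log_odds <- predict(fit, newdata = new_customer, type = "link")      # log odds
odds     <- exp(log_odds)                                            # odds
prob     <- predict(fit, newdata = new_customer, type = "response")  # probability

# Hand calculation from the rounded coefficients, for comparison
hand_prob <- plogis(5.97 - 0.18 * 35)
```

Here `type = "link"` returns predictions on the log-odds scale and `type = "response"` returns probabilities; `plogis` is the inverse of the log-odds transformation.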
How do we know whether this model is effective at making predictions, given that the measures commonly used for linear models, such as the RSE, \(R^2\), and the RMSE, are not available here?
First, we obtain the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
The accuracy is given by: \[ \frac{\text{(TP+TN)}}{\text{Total}} = \frac{43+41}{100} = 0.86 \]
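As a sketch of how these counts and the accuracy might be computed in R, the example below uses a small set of hypothetical outcomes and predicted probabilities (not the bank's data). The 0.5 cutoff for classifying a customer as Bad is an assumption; the text does not state which cutoff was used.

```r
# Hypothetical actual outcomes (1 = Bad) and predicted probabilities
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)
prob   <- c(0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.8, 0.3)

# Assumed cutoff: classify as Bad when the predicted probability exceeds 0.5
predicted <- as.integer(prob > 0.5)

TP <- sum(predicted == 1 & actual == 1)  # predicted Bad, actually Bad
TN <- sum(predicted == 0 & actual == 0)  # predicted Good, actually Good
FP <- sum(predicted == 1 & actual == 0)  # predicted Bad, actually Good
FN <- sum(predicted == 0 & actual == 1)  # predicted Good, actually Bad

accuracy <- (TP + TN) / length(actual)
```

For a fitted model, `prob` would come from `predict(..., type = "response")` and `actual` from the Bad column.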
We evaluate the model by comparing its accuracy to that of the “no-brainer” method, which always predicts the most common Status group.
The model’s accuracy of 86% is an improvement over the 50% obtained by always predicting the most common Status group. In this case the two groups are equally common, so either one can serve as the baseline.
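The baseline accuracy can be computed directly from the group counts. The sketch below assumes a hypothetical Status vector with the same even split of 100 customers that the comparison above implies.

```r
# Hypothetical Status column with an even split of 100 customers
status <- rep(c("Good", "Bad"), each = 50)

# Accuracy of always predicting the most common group
baseline_accuracy <- max(table(status)) / length(status)
baseline_accuracy
```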
To measure the model’s accuracy on out-of-sample data, we can use methods such as cross-validation.
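A minimal sketch of k-fold cross-validation for this kind of model is given below, written in base R. It again relies on simulated stand-in data generated from the rounded coefficients, and the choices of five folds and a 0.5 cutoff are assumptions made for illustration.

```r
set.seed(42)  # simulated stand-in data, NOT the bank's sample
n <- 200
dat <- data.frame(
  Age = round(runif(n, 20, 60)),
  Other_Credit = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
eta <- 18.16 - 0.59 * dat$Age +
  (dat$Other_Credit == "Yes") * (-12.19 + 0.41 * dat$Age)
dat$Bad <- rbinom(n, 1, plogis(eta))

k <- 5                                    # assumed number of folds
fold <- sample(rep(1:k, length.out = n))  # random fold label for each row

# Fit on k - 1 folds, then measure accuracy on the held-out fold
fold_accuracy <- sapply(1:k, function(i) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit <- glm(Bad ~ Age * Other_Credit, family = "binomial", data = train)
  pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred == test$Bad)
})

cv_accuracy <- mean(fold_accuracy)  # average out-of-sample accuracy
```

Each row is used for testing exactly once, so `cv_accuracy` estimates how the model would perform on customers it was not fitted to.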
