Data Science for Business Applications

Class 11 - Random Forest

Review on Decision Trees

  • What kinds of models do we know how to make?
  • Classification or Regression
  • We can use decision trees for either kind of \(Y\) variable.
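
As a quick sketch, the same rpart() call fits either kind of tree depending on the type of \(Y\) (this assumes the rpart package and the titanic_age data frame used later in these slides):

library(rpart)
# Classification tree: survived is a categorical outcome
tree_class = rpart(survived ~ sex + passengerClass, data = titanic_age, method = "class")
# Regression tree: age is a numeric outcome
tree_reg = rpart(age ~ sex + passengerClass, data = titanic_age, method = "anova")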

Pros & Cons of Decision Trees

  • Main Advantages:
    • Easy to interpret and explain (you can plot them!)
    • Mirrors human decision-making
    • Can handle qualitative predictors (without the need for dummy variables)
  • Main disadvantages:
    • Accuracy not as high as other methods
    • Very sensitive to the training data (i.e., prone to overfitting)
    • Biased toward dominant classes

What can we do to improve?

  • Improve accuracy?

  • Less sensitive to random changes in the data?

  • More relevance for low-frequency classifications?

  • Solution: Bagging - Bootstrap Aggregation

    • Bootstrap - quantify sampling variability by resampling from the sample.
    • In bagging, we use the bootstrap on the data and then aggregate the results.

Bootstrap & Resampling

  • Sampling with replacement: each case selected for the sample is put back before the next draw, so it can be selected again.

  • Every bootstrapped sample may have its own pattern of ties and omissions.
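
A minimal sketch of one bootstrap resample in R (assuming the titanic_age data frame used later in these slides; the variable names are just illustrative):

# Draw row indices with replacement: some rows repeat, others are omitted
n = nrow(titanic_age)
boot_rows = sample(n, replace = TRUE)
boot_sample = titanic_age[boot_rows, ]
# Rows never drawn are "out of bag" for this resample
oob_rows = setdiff(1:n, boot_rows)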

Bagging

  • Aggregate the results from the method applied to each bootstrap sample.
  • This is also known as an ensemble: combining the results of multiple models.

Bagging & Ensembling

  • You want to predict on new test data.
  • For classification, take the majority vote!
  • For regression, take the mean!
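
A rough sketch of bagging by hand with classification trees, assuming rpart and the titanic_age data (randomForest automates all of this for us):

library(rpart)
B = 100   # number of bootstrap samples
n = nrow(titanic_age)
votes = matrix(NA_character_, nrow = n, ncol = B)
for (b in 1:B) {
  boot_rows = sample(n, replace = TRUE)   # bootstrap the rows
  tree = rpart(survived ~ adult + sex + passengerClass,
               data = titanic_age[boot_rows, ], method = "class")
  votes[, b] = as.character(predict(tree, titanic_age, type = "class"))
}
# Aggregate: majority vote across the B trees for each passenger
pred = apply(votes, 1, function(v) names(which.max(table(v))))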

Random Forests

  • Randomly sampling the rows of my training data reduces the variation in my predictions that is due to the randomness of the sample.

  • We saw last week: sometimes the importance of individual variables can change from dataset to dataset, and therefore from model to model.

  • What if we randomly sample the columns (variables) as well as the rows? (See the sketch below.)
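
In randomForest this column sampling is controlled by the mtry argument: the number of variables randomly tried at each split. A minimal sketch (rf_cols is just an illustrative name; titanic_age as before):

library(randomForest)
# Try 2 randomly chosen variables at each split instead of the default floor(sqrt(p))
rf_cols = randomForest(survived ~ adult + sex + passengerClass,
                       data = titanic_age, mtry = 2)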

Random Forests

  • Consider the Titanic data

  • We again predict whether the passenger survived (yes, no), a categorical variable

  • Classification model

  • Before, I had sex as the variable defining the first split.

  • In my new trees, what if I only offer passengerClass and adult as candidate variables?

Random Forests

Let’s build a Random Forest (RF) in R

library(randomForest)
# Run the model
model1 = randomForest(survived ~ adult + sex + passengerClass, data = titanic_age)
# Show details on the model
print(model1)

Call:
 randomForest(formula = survived ~ adult + sex + passengerClass,      data = titanic_age) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 1

        OOB estimate of  error rate: 20.55%
Confusion matrix:
     no yes class.error
no  580  39  0.06300485
yes 176 251  0.41217799
  • In an instant, we built 500 trees.
  • We only had 3 variables, so by default the RF tries just \(\lfloor\sqrt{3}\rfloor = 1\) randomly chosen variable at each split as it grows each tree.
  • OOB (Out-of-Bag) error rate: 20.55%
  • Equivalently, OOB accuracy: (580+251)/1046 = 0.795, so the error rate is 1 − 0.795 ≈ 20.5%
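
We can check that arithmetic directly: the OOB confusion counts are stored in the fitted model (a minimal sketch):

cm = model1$confusion[, 1:2]   # drop the class.error column
sum(diag(cm)) / sum(cm)        # OOB accuracy, about 0.795
1 - sum(diag(cm)) / sum(cm)    # OOB error rate, about 0.2055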

Classification Error estimates

plot(model1)
  • The plot shows how the errors change as more trees are generated.
    • OOB (black line)
    • Among passengers who did not survive, the error in prediction (red line)
    • Among passengers who did survive, the error in prediction (green line)
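
The three curves come from model1$err.rate, which has one column per line (OOB, no, yes). A legend can be added after plotting, e.g.:

plot(model1)
legend("topright", legend = colnames(model1$err.rate), col = 1:3, lty = 1:3)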

Feature Importance

  • Feature (or variable) importance is an output from random forests.
  • In this case, sex was by far the most helpful variable for splitting the trees.
  • Adult was the least important variable in terms of predicting survival.
importance(model1)
               MeanDecreaseGini
adult                  6.241061
sex                   93.624871
passengerClass        34.567394
  • The Mean Decrease in Gini: measures how much each variable contributes to reducing uncertainty when the trees are built in the random forest.

  • A higher value means the variable is more important for making accurate classifications.

Feature Importance

  • Variable importance can also be displayed in a plot.
varImpPlot(model1, sort = TRUE)

Random Forest for Regression

  • In this model we’ll now predict the age of the passengers given class, sex, and whether or not they survived.
# Run the model
model2 = randomForest(age ~ sex + passengerClass + survived, data = titanic_age)
# Show details on the model
print(model2)

Call:
 randomForest(formula = age ~ sex + passengerClass + survived,      data = titanic_age) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 172.4933
                    % Var explained: 16.89
  • The MSE is 172.49, so RMSE = \(\sqrt{172.49}\) ≈ 13.13
  • Variance explained: 16.89% (similar to the \(R^2\))
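
Both numbers are also stored in the fitted object, with one entry per tree (a quick sketch):

tail(model2$mse, 1)         # final MSE, about 172.49
sqrt(tail(model2$mse, 1))   # RMSE, about 13.13
tail(model2$rsq, 1) * 100   # % variance explained, about 16.89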

Regression Error estimates

  • The plot displays the MSE in relation to the number of trees generated.
plot(model2)

Random Forest for Regression

  • Variable importance for this model
varImpPlot(model2, sort = TRUE)

Takeaways on Random Forests

  • Improved accuracy on new data

  • Less sensitive to random changes in the data.

  • More relevance for low-frequency classifications.

  • Random forests give me a lot of the benefits of decision trees with fewer downsides!