Random Forest and GBM Vignette

Jeremy Werner

2019-04-07

Tree-based models are amazing. Here’s a very simple vignette demonstrating how to use Artisanal Machine Learning’s Random Forest and GBM models.

Read Abalone Data

library(ArtisanalMachineLearning)

This example uses the ‘Abalone’ dataset that I grabbed from the UCI Machine Learning Repository here: https://archive.ics.uci.edu/ml/datasets/abalone, where we try to predict the age of an abalone from its measured physical characteristics.
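The packaged abalone_data object bundles the predictors with the response. If you want to rebuild it from the raw UCI file, a minimal sketch follows; the file URL, the column names, and the numeric encoding of sex are assumptions about how the package prepared the data, not its actual source.

# Sketch: reconstruct abalone_data from the raw UCI file (assumed layout).
raw = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data",
               header=FALSE,
               col.names=c("sex", "length", "diameter", "height",
                           "weight.w", "weight.s", "weight.v", "weight.sh", "rings"))
abalone_data = list(data=cbind(sex=as.numeric(factor(raw$sex)), raw[, 2:8]),
                    response=raw$rings)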

dim(abalone_data$data)
## [1] 4177    8
summary(abalone_data$data)
##       sex            length         diameter          height      
##  Min.   :1.000   Min.   :0.075   Min.   :0.0550   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150  
##  Median :2.000   Median :0.545   Median :0.4250   Median :0.1400  
##  Mean   :2.053   Mean   :0.524   Mean   :0.4079   Mean   :0.1395  
##  3rd Qu.:3.000   3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650  
##  Max.   :3.000   Max.   :0.815   Max.   :0.6500   Max.   :1.1300  
##     weight.w         weight.s         weight.v        weight.sh     
##  Min.   :0.0020   Min.   :0.0010   Min.   :0.0005   Min.   :0.0015  
##  1st Qu.:0.4415   1st Qu.:0.1860   1st Qu.:0.0935   1st Qu.:0.1300  
##  Median :0.7995   Median :0.3360   Median :0.1710   Median :0.2340  
##  Mean   :0.8287   Mean   :0.3594   Mean   :0.1806   Mean   :0.2388  
##  3rd Qu.:1.1530   3rd Qu.:0.5020   3rd Qu.:0.2530   3rd Qu.:0.3290  
##  Max.   :2.8255   Max.   :1.4880   Max.   :0.7600   Max.   :1.0050

This semi-large data set has 4,177 rows, eight numeric columns, and a numeric response (the ring count, a proxy for age) that takes integer values from 1 to 29.
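A quick way to confirm the response range yourself:

# The response is integer-valued; check its range:
range(abalone_data$response)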

Random Forest Model

random_forest = aml_random_forest(data=abalone_data$data, 
                                  response=abalone_data$response, 
                                  b=200,                 # number of bootstrapped trees
                                  m=6,                   # features sampled at each split
                                  evaluation_criterion=sum_of_squares, 
                                  min_obs=5,             # minimum observations per node
                                  max_depth=16,          # maximum depth of each tree
                                  verbose=FALSE)

Now that we have a random forest model, let’s simply verify that it’s fitting a better-than-garbage model on the training data.

random_forest_predictions = predict_all(random_forest, abalone_data$data, n_trees=200)

MSE on training data

sum((abalone_data$response - random_forest_predictions)^2) / length(abalone_data$response)
## [1] 3.235341
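For context, a constant model that always predicts the mean response has a training MSE equal to the variance of the response. This quick baseline check is not part of the original vignette:

# Training MSE of always predicting the mean response, as a baseline:
mean((abalone_data$response - mean(abalone_data$response))^2)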

[Figure: comparison of predicted and actual values for the Random Forest model]

Not bad! This model is clearly picking up some signal. Let’s try out a small GBM now just for kicks.

GBM Model

gbm = aml_gbm(abalone_data$data, 
              abalone_data$response, 
              learning_rate=.1,    # shrinkage applied to each tree’s contribution
              n_trees=50,          # number of boosting iterations
              evaluation_criterion=sum_of_squares, 
              min_obs=10,          # minimum observations per node
              max_depth=4,         # shallow trees, as is typical for boosting
              verbose=FALSE)
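For intuition, squared-error gradient boosting generally works as in the sketch below. This is a generic illustration built on the rpart package, not aml_gbm’s actual internals:

# Generic squared-error boosting loop (conceptual sketch, not aml_gbm's source):
library(rpart)
boost_sketch = function(data, y, n_trees, learning_rate, max_depth) {
    prediction = rep(mean(y), length(y))
    for (i in seq_len(n_trees)) {
        residuals = y - prediction                    # pseudo-residuals for squared error
        tree = rpart(residuals ~ ., data=cbind(data, residuals),
                     control=rpart.control(maxdepth=max_depth))
        prediction = prediction + learning_rate * predict(tree, data)
    }
    prediction
}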

As with the RF model, let’s see if this picks up any signal at all on the training data.

gbm_predictions = predict_all(gbm, abalone_data$data, n_trees=50)

MSE on training data

sum((abalone_data$response - gbm_predictions)^2) / length(abalone_data$response)
## [1] 7.357425
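Since predict_all accepts an n_trees argument, we can also sketch how the training error evolves as boosting stages are added; this assumes predict_all truncates the ensemble at the requested number of trees:

# Training MSE using progressively more boosting stages (assumed behavior):
sapply(c(10, 20, 30, 40, 50), function(k) {
    preds = predict_all(gbm, abalone_data$data, n_trees=k)
    mean((abalone_data$response - preds)^2)
})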

[Figure: comparison of predicted and actual values for the GBM model]

Conclusion

The RF model is outperforming the GBM, but the GBM is significantly smaller and the author didn’t spend much time tuning the hyperparameters. ¯\_(ツ)_/¯

Also, this illustration only looks at statistics on the training data set, so we can’t draw any strong conclusions about out-of-sample performance. The author simply wanted to demonstrate that these hand-crafted models produce better-than-trash results, and that has been achieved.
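If you want to go one step further, a holdout evaluation is straightforward with the same API. The split below is a sketch; the seed, the split size, and the reuse of the earlier hyperparameters are arbitrary choices:

# Sketch: refit the random forest on a training split and score a holdout set.
set.seed(1)
test_idx = sample(nrow(abalone_data$data), 1000)

holdout_rf = aml_random_forest(data=abalone_data$data[-test_idx, ],
                               response=abalone_data$response[-test_idx],
                               b=200, m=6, evaluation_criterion=sum_of_squares,
                               min_obs=5, max_depth=16, verbose=FALSE)

holdout_predictions = predict_all(holdout_rf, abalone_data$data[test_idx, ], n_trees=200)
mean((abalone_data$response[test_idx] - holdout_predictions)^2)    # holdout MSE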