Tree-based models are amazing. Here’s a simple vignette demonstrating how to use the Random Forest and GBM models from ArtisanalMachineLearning.
library(ArtisanalMachineLearning)
This example uses the ‘Abalone’ dataset that I grabbed from the internet here: https://archive.ics.uci.edu/ml/datasets/abalone. The goal is to predict the age of abalones from measured physical characteristics.
dim(abalone_data$data)
## [1] 4177 8
summary(abalone_data$data)
## sex length diameter height
## Min. :1.000 Min. :0.075 Min. :0.0550 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.450 1st Qu.:0.3500 1st Qu.:0.1150
## Median :2.000 Median :0.545 Median :0.4250 Median :0.1400
## Mean :2.053 Mean :0.524 Mean :0.4079 Mean :0.1395
## 3rd Qu.:3.000 3rd Qu.:0.615 3rd Qu.:0.4800 3rd Qu.:0.1650
## Max. :3.000 Max. :0.815 Max. :0.6500 Max. :1.1300
## weight.w weight.s weight.v weight.sh
## Min. :0.0020 Min. :0.0010 Min. :0.0005 Min. :0.0015
## 1st Qu.:0.4415 1st Qu.:0.1860 1st Qu.:0.0935 1st Qu.:0.1300
## Median :0.7995 Median :0.3360 Median :0.1710 Median :0.2340
## Mean :0.8287 Mean :0.3594 Mean :0.1806 Mean :0.2388
## 3rd Qu.:1.1530 3rd Qu.:0.5020 3rd Qu.:0.2530 3rd Qu.:0.3290
## Max. :2.8255 Max. :1.4880 Max. :0.7600 Max. :1.0050
This moderately sized data set has eight numeric columns and an integer-valued response ranging from 1 to 29.
random_forest = aml_random_forest(data=abalone_data$data,
response=abalone_data$response,
b=200,
m=6,
evaluation_criterion=sum_of_squares,
min_obs=5,
max_depth=16,
verbose=FALSE)
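To build intuition for the `b` and `m` arguments, here is a hypothetical sketch (in plain base R, not the package’s internal code) of what each of the `b = 200` trees gets to see: a bootstrap resample of the rows, and a random subset of `m = 6` of the 8 predictors.

```r
set.seed(42)
n <- 4177  # rows in the abalone data
p <- 8     # predictor columns

boot_rows      <- sample(n, n, replace = TRUE)  # rows drawn with replacement
feature_subset <- sample(p, 6)                  # the m features one tree considers

# A bootstrap resample contains roughly 63.2% of the distinct rows on average
length(unique(boot_rows)) / n
```

Rows left out of a given resample are the tree’s “out-of-bag” observations, which is what makes averaging over many such trees reduce variance.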
Now that we have a random forest model, let’s simply verify that it’s fitting a better-than-garbage model on the training data.
random_forest_predictions = predict_all(random_forest, abalone_data$data, n_trees=200)
sum((abalone_data$response - random_forest_predictions)^2) / length(abalone_data$response)
## [1] 3.235341
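The quantity computed above is just the mean squared error on the training data. For readability it can be wrapped in a tiny base-R helper (this helper is my own addition, not part of the package):

```r
# Mean squared error between observed and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

mse(c(1, 2, 3), c(1, 2, 5))  # (0 + 0 + 4) / 3 = 1.3333...
```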
Not bad! This model is clearly picking up some signal. Let’s try out a small GBM now just for kicks.
gbm = aml_gbm(abalone_data$data,
abalone_data$response,
learning_rate=.1,
n_trees=50,
evaluation_criterion=sum_of_squares,
min_obs=10,
max_depth=4,
verbose=FALSE)
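To illustrate what `learning_rate` and `n_trees` are doing, here is a toy sketch of the boosting idea on synthetic data (my own illustration, not `aml_gbm`’s actual internals): each round fits a weak learner to the current residuals, and the prediction moves a small fraction of the way toward that fit.

```r
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

# Weak learner: a one-split "stump" at the median of x,
# predicting the mean residual on each side
stump <- function(x, r) {
  split <- median(x)
  left  <- mean(r[x <= split])
  right <- mean(r[x > split])
  function(x_new) ifelse(x_new <= split, left, right)
}

pred <- rep(mean(y), length(y))  # start from the global mean
learning_rate <- 0.1
for (i in 1:50) {                     # 50 boosting rounds, like n_trees = 50
  f <- stump(x, y - pred)             # fit the residuals
  pred <- pred + learning_rate * f(x) # shrunken update
}

mean((y - pred)^2)  # training MSE, well below the variance of y
```

The shrinkage factor is why a GBM needs many rounds: each tree only corrects a tenth of the remaining error, which trades training speed for robustness.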
As with the RF model, let’s see whether this picks up any signal at all on the training data.
gbm_predictions = predict_all(gbm, abalone_data$data, n_trees=50)
sum((abalone_data$response - gbm_predictions)^2) / length(abalone_data$response)
## [1] 7.357425
The RF model outperforms the GBM, but the GBM is much smaller, and the author didn’t spend much time tuning the hyperparameters ¯\_(ツ)_/¯
Also, all of these statistics are computed on the training data, so no strong conclusions can be drawn. The author simply wanted to demonstrate that these hand-crafted models produce better-than-trash results, and that has been achieved.