Classification using Generalized Linear Models, Gradient Boosting Machines, Random Forests and Deep Learning in H2O

This tutorial demonstrates classification modeling in H2O using generalized linear models (GLM), gradient boosting machines (GBM), random forests and Deep Learning. It requires an installation of the H2O R package and its dependencies.

Load the H2O R package and start a local H2O cluster

We will begin this tutorial by starting a local H2O cluster using the default heap size and as many CPU threads as the operating system will allow.

library(h2o)
h2oServer <- h2o.init(nthreads = -1)

# Helper that removes the temporary "Last.value" keys H2O creates for
# intermediate results, keeping the cluster's key-value store tidy.
rmLastValues <- function(pattern = "Last.value.")
{
  keys <- h2o.ls(h2oServer, pattern = pattern)$Key
  if (!is.null(keys))
    h2o.rm(h2oServer, keys)
  invisible(keys)
}

Load the training and testing data into the H2O key-value store

This tutorial uses a 0.1% sample of the Person-Level 2013 Public Use Microdata Sample (PUMS) from the United States Census Bureau, with 75% of that sample assigned to the training data set and 25% to the test data set. This data set is intended to be an update to the UCI Adult Data Set.

datadir <- "/data"
pumsdir <- file.path(datadir, "h2o-training", "pums2013")
trainfile <- "adult_2013_train.csv.gz"
testfile  <- "adult_2013_test.csv.gz"
adult_2013_train <- h2o.importFile(h2oServer,
                                   path = file.path(pumsdir, trainfile),
                                   key = "adult_2013_train", sep = ",")
adult_2013_test <- h2o.importFile(h2oServer,
                                  path = file.path(pumsdir, testfile),
                                  key = "adult_2013_test", sep = ",")
dim(adult_2013_train)
dim(adult_2013_test)

For the purposes of validation, we will create a single-column data set containing only the target variable TOP2_WAGP from the test data set.

actual_top2_wagp <- h2o.assign(adult_2013_test[, "TOP2_WAGP"],
                               key = "actual_top2_wagp")
rmLastValues()

Our data set also contains eight columns that use integer codes to represent categorical levels, so we will coerce them to factors after reading the data.

for (j in c("COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX", "POBP")) {
  adult_2013_train[[j]] <- as.factor(adult_2013_train[[j]])
  adult_2013_test[[j]]  <- as.factor(adult_2013_test[[j]])
}
rmLastValues()
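To confirm the coercion took effect, we can optionally glance at one of the converted columns (shown here for COW; any of the eight would do) and check that it is now summarized as categorical rather than numeric.

summary(adult_2013_train[, "COW"])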

Fit a basic logistic regression model

top2_wagp_glm_relp <- h2o.glm(x = "RELP", y = "TOP2_WAGP",
                              data = adult_2013_train,
                              key  = "top2_wagp_glm_relp",
                              family = "binomial",
                              lambda = 0)
top2_wagp_glm_relp
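The fitted coefficients can be pulled out of the model slot with coef(), as we will also do later when examining the grid search results.

coef(top2_wagp_glm_relp@model)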

Generate performance metrics from a single model

We can create detailed performance metrics using the h2o.performance function that was inspired by the performance function from Tobias Sing's ROCR package, which is hosted on the Comprehensive R Archive Network (CRAN).

Binary classifier performance metrics

pred_top2_wagp_glm_relp <- h2o.predict(top2_wagp_glm_relp, adult_2013_test)
pred_top2_wagp_glm_relp

prob_top2_wagp_glm_relp <- h2o.assign(pred_top2_wagp_glm_relp[, 3L],
                                      key = "prob_top2_wagp_glm_relp")
rmLastValues()

f1_top2_wagp_glm_relp <- h2o.performance(prob_top2_wagp_glm_relp,
                                         actual_top2_wagp,
                                         measure = "F1")
f1_top2_wagp_glm_relp
class(f1_top2_wagp_glm_relp)
getClassDef("H2OPerfModel")

Plot Receiver Operating Characteristic (ROC) curve and find its Area Under the Curve (AUC)

A receiver operating characteristic (ROC) curve is a graph of the true positive rate (recall) against the false positive rate (1 - specificity) for a binary classifier as the probability cutoff is varied.
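In terms of confusion matrix counts, these two rates are $TPR = TP / (TP + FN)$ and $FPR = FP / (FP + TN)$.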

plot(f1_top2_wagp_glm_relp, type = "cutoffs", col = "blue")
plot(f1_top2_wagp_glm_relp, type = "roc", col = "blue", typ = "b")

We can extract the area under the ROC curve from our model object to see how close it is to the ideal of 1.

f1_top2_wagp_glm_relp@model$auc

The Gini coefficient, not to be confused with the Gini impurity splitting metric used in decision tree based algorithms such as h2o.randomForest, is a linear rescaling of AUC defined by $Gini = 2 * AUC - 1$.

f1_top2_wagp_glm_relp@model$gini
2 * f1_top2_wagp_glm_relp@model$auc - 1

Another metric we could have used to determine a cutoff is lift, the ratio of the probability of a positive outcome given the model's prediction to the baseline probability of a positive outcome in the population. It is produced by the h2o.gains function, which was inspired by Craig Rolling's gains package hosted on CRAN.

h2o.gains(actual_top2_wagp, prob_top2_wagp_glm_relp)

Making class predictions based on a probability cutoff

Using the cutoff that results in the largest harmonic mean of precision and recall (F1 score), we can now classify our predictions on the test data set.
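As a reminder, the F1 score is defined as $F1 = 2 * precision * recall / (precision + recall)$.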

class(f1_top2_wagp_glm_relp@model)
names(f1_top2_wagp_glm_relp@model)
class_top2_wagp_glm_relp <-
  prob_top2_wagp_glm_relp > f1_top2_wagp_glm_relp@model$best_cutoff
class_top2_wagp_glm_relp <- h2o.assign(class_top2_wagp_glm_relp,
                                       key = "class_top2_wagp_glm_relp")
rmLastValues()

Generate a confusion matrix from a single model

To understand how well our model fits the test data set, we can examine the results of the h2o.confusionMatrix function for a tabular display of the predicted versus actual values.

h2o.confusionMatrix(class_top2_wagp_glm_relp, actual_top2_wagp)

Remove temporary objects from R and the H2O cluster

rm(pred_top2_wagp_glm_relp,
   prob_top2_wagp_glm_relp,
   class_top2_wagp_glm_relp)
rmLastValues()

Fit an elastic net logistic regression model across a grid of parameter settings

Now that we are familiar with H2O model fitting in R, we can fit more sophisticated models involving a larger set of predictors.

addpredset <- c("COW", "MAR", "INDP", "RAC1P", "SEX", "POBP", "AGEP",
                "WKHP", "LOG_CAPGAIN", "LOG_CAPLOSS")

In the context of elastic net regularization, we need to search the parameter space defined by the mixing parameter alpha and the shrinkage parameter lambda. To aid us in this search, H2O can produce a grid of models for all combinations of a discrete set of parameters.
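For reference, the elastic net penalty is commonly written as $\lambda \left( \alpha \|\beta\|_1 + \frac{1 - \alpha}{2} \|\beta\|_2^2 \right)$, so alpha controls the mix of lasso and ridge penalties while lambda controls the overall amount of shrinkage.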

We will use different methods for specifying the alpha and lambda values as they are dependent upon one another. For the alpha parameter, we will specify five values ranging from 0 (ridge) to 1 (lasso) in increments of 0.25. For lambda, we will turn on an automated lambda search by setting lambda_search = TRUE and request 10 lambda values by setting nlambda = 10.

top2_wagp_glm_grid <- h2o.glm(x = c("RELP_SCHL", addpredset),
                              y = "TOP2_WAGP",
                              data = adult_2013_train,
                              key  = "top2_wagp_glm_grid",
                              family = "binomial",
                              lambda_search = TRUE,
                              nlambda = 10,
                              return_all_lambda = TRUE,
                              alpha = c(0, 0.25, 0.5, 0.75, 1))

We now have an object of class H2OGLMGrid that contains a list of H2OGLMModelList objects, one for each alpha value, each of which holds the models fit along that alpha's lambda path.

class(top2_wagp_glm_grid)
getClassDef("H2OGLMGrid")

class(top2_wagp_glm_grid@model[[1L]])
getClassDef("H2OGLMModelList")

length(top2_wagp_glm_grid@model[[1L]]@models)
class(top2_wagp_glm_grid@model[[1L]]@models[[1L]])

Currently the h2o package does not contain any helper functions for extracting models of interest, so we have to explore the model object and choose the model we like best.

top2_wagp_glm_grid@model[[1L]]@models[[1L]]@model$params$alpha # ridge
top2_wagp_glm_grid@model[[2L]]@models[[1L]]@model$params$alpha
top2_wagp_glm_grid@model[[3L]]@models[[1L]]@model$params$alpha
top2_wagp_glm_grid@model[[4L]]@models[[1L]]@model$params$alpha
top2_wagp_glm_grid@model[[5L]]@models[[1L]]@model$params$alpha  # lasso

top2_wagp_glm_grid_f1 <-
  sapply(top2_wagp_glm_grid@model,
         function(x)
           sapply(x@models, function(y)
             h2o.performance(h2o.predict(y, adult_2013_test)[, 3L],
                             actual_top2_wagp,
                             measure = "F1")@model$error
           ))
top2_wagp_glm_grid_f1
top2_wagp_glm_grid_f1 == min(top2_wagp_glm_grid_f1)
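Rather than eyeballing the matrix of errors, a small convenience sketch (not a function in the h2o package) can locate the minimum programmatically. It assumes every alpha value produced the same number of lambda values, so that top2_wagp_glm_grid_f1 is a matrix with one column per alpha setting and one row per lambda setting; if the lambda paths differ in length, the sapply result is a list and this indexing would need adjusting.

best_idx <- which(top2_wagp_glm_grid_f1 == min(top2_wagp_glm_grid_f1),
                  arr.ind = TRUE)[1L, ]
best_idx  # "row" indexes the lambda value, "col" indexes the alpha value
top2_wagp_glm_grid@model[[best_idx[["col"]]]]@models[[best_idx[["row"]]]]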

top2_wagp_glm_best <- top2_wagp_glm_grid@model[[4L]]@models[[7L]]

prob_top2_wagp_glm_best <- h2o.predict(top2_wagp_glm_best, adult_2013_test)[, 3L]
prob_top2_wagp_glm_best <- h2o.assign(prob_top2_wagp_glm_best,
                                      key = "prob_top2_wagp_glm_best")
rmLastValues()

f1_top2_wagp_glm_best <- h2o.performance(prob_top2_wagp_glm_best,
                                         actual_top2_wagp,
                                         measure = "F1")

plot(f1_top2_wagp_glm_best, type = "cutoffs", col = "blue")
plot(f1_top2_wagp_glm_best, type = "roc", col = "blue")

We can find the number of non-zero coefficients in our best logistic regression model to gain some understanding about its overall complexity.

table(coef(top2_wagp_glm_best@model) != 0)
nzcoefs <- coef(top2_wagp_glm_best@model)
nzcoefs <- names(nzcoefs)[nzcoefs != 0]
nzcoefs <- unique(sub("\\..*$", "", nzcoefs))
setdiff(c("RELP_SCHL", addpredset), nzcoefs) # all preds had non-zero coefs

Fit a gradient boosting machine binomial regression model

Given that not all relationships can be reduced to a linear combination of terms, we can compare the GLM results with those of a gradient (tree) boosting machine. As with the final GLM exploration, we will fit a grid of GBM models by varying the number of trees and the shrinkage rate, and select the best model with respect to the test data set.

top2_wagp_gbm_grid <- h2o.gbm(x = c("RELP", "SCHL", addpredset),
                              y = "TOP2_WAGP",
                              data = adult_2013_train,
                              key  = "top2_wagp_gbm_grid",
                              distribution = "multinomial",
                              n.trees = c(10, 20, 40),
                              shrinkage = c(0.05, 0.1, 0.2),
                              validation = adult_2013_test,
                              importance = TRUE)
top2_wagp_gbm_grid

class(top2_wagp_gbm_grid)
slotNames(top2_wagp_gbm_grid)
class(top2_wagp_gbm_grid@model)
length(top2_wagp_gbm_grid@model)
class(top2_wagp_gbm_grid@model[[1L]])
top2_wagp_gbm_best <- top2_wagp_gbm_grid@model[[1L]]
top2_wagp_gbm_best

h2o.performance(h2o.predict(top2_wagp_glm_best, adult_2013_test)[, 3L],
                actual_top2_wagp, measure = "F1")@model$error
h2o.performance(h2o.predict(top2_wagp_gbm_best, adult_2013_test)[, 3L],
                actual_top2_wagp, measure = "F1")@model$error

Fit a random forest classifier

We will fit a single random forest model with 200 trees of maximum depth 10.

top2_wagp_forest <- h2o.randomForest(x = c("RELP", "SCHL", addpredset),
                                     y = "TOP2_WAGP",
                                     data = adult_2013_train,
                                     key  = "top2_wagp_forest",
                                     classification = TRUE,
                                     depth = 10,
                                     ntree = 200,
                                     validation = adult_2013_test,
                                     seed = 8675309,
                                     type = "BigData")
top2_wagp_forest

Fit a deep learning classifier

Lastly we will fit a single Deep Learning model with default settings (more details about Deep Learning follow later) and compare the errors across the four model types.

top2_wagp_dl <- h2o.deeplearning(x = c("RELP", "SCHL", addpredset),
                                 y = "TOP2_WAGP",
                                 data = adult_2013_train,
                                 key  = "top2_wagp_dl",
                                 classification = TRUE,
                                 validation = adult_2013_test)
top2_wagp_dl

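Finally, we recompute the F1-optimal classification errors on the test data set for all four model types so we can compare them side by side.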

h2o.performance(h2o.predict(top2_wagp_glm_best, adult_2013_test)[, 3L],
                actual_top2_wagp, measure = "F1")@model$error
h2o.performance(h2o.predict(top2_wagp_gbm_best, adult_2013_test)[, 3L],
                actual_top2_wagp, measure = "F1")@model$error
h2o.performance(h2o.predict(top2_wagp_forest,   adult_2013_test)[, 3L],
                actual_top2_wagp, measure = "F1")@model$error
h2o.performance(h2o.predict(top2_wagp_dl,       adult_2013_test)[, 3L],
                actual_top2_wagp, measure = "F1")@model$error