Feature Engineering on the Adult dataset

This tutorial shows feature engineering on the Adult dataset. This file is both valid R and markdown code.

Start H2O and load the Adult data

Initialize the H2O server and import the Adult dataset.

library(h2o)
h2oServer <- h2o.init(nthreads=-1)
homedir <- "/data/h2o-training/adult/"
TRAIN = "adult.gz"
data_hex <- h2o.importFile(h2oServer, path = paste0(homedir,TRAIN), header = F, sep = ' ', key = 'data_hex')

We manually assign column names since they are missing in the original file.

colnames(data_hex) <- c("age","workclass","fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week","native-country","income")
summary(data_hex)

We will try to predict whether income is <=50K or >50K.

summary(data_hex$income)
response = "income"

First, we source a few helper functions that allow us to quickly compare a multitude of binomial classification models, in particular the h2o.fit() and h2o.leaderBoard() functions. Note that these specific functions require variable importances and N-fold cross-validation to be enabled.

source("~/h2o-training/tutorials/advanced/binaryClassificationHelper.R.md")

We then add a simple helper function that splits a frame into train/valid/test pieces, trains a GLM and a GBM model with 2-fold cross-validation, and returns the best model after printing a leaderboard. For more accurate cross-validation estimates, increase N_FOLDS.

N_FOLDS = 2

h2o.trainModels <- function(frame) {
  # split the data into train/valid/test
  random <- h2o.runif(frame, seed = 123456789)
  train_hex <- h2o.assign(frame[random < .8,], "train_hex")
  valid_hex <- h2o.assign(frame[random >= .8 & random < .9,], "valid_hex")
  test_hex  <- h2o.assign(frame[random >= .9,], "test_hex")

  predictors <- colnames(frame)[-match(response,colnames(frame))]

  # multi-model comparison with N-fold cross-validation
  data = list(x=predictors, y=response, train=train_hex, valid=valid_hex, nfolds=N_FOLDS)
  models <- c(
    h2o.fit(h2o.glm, data, glmparams),
    h2o.fit(h2o.gbm, data, gbmparams)
  )
  best_model <- h2o.leaderBoard(models, test_hex, match(response,colnames(frame)))

  h2o.rm(h2oServer, grep(pattern = "Last.value", x = h2o.ls(h2oServer)$Key, value = TRUE))
  best_model
}
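The split inside the helper relies on one uniform random number per row. As a rough illustration of the same idea outside H2O, here is a minimal base-R sketch; the toy data frame and its columns are assumptions standing in for data_hex, and runif() plays the role of h2o.runif().

```r
# Base-R sketch of an 80/10/10 split driven by one uniform draw per row.
# The toy frame below is a hypothetical stand-in for data_hex.
set.seed(123456789)
n <- 10000
frame <- data.frame(id = seq_len(n), x = rnorm(n))

random <- runif(nrow(frame))                       # one value in [0, 1) per row
train <- frame[random < 0.8, ]                     # roughly 80% of the rows
valid <- frame[random >= 0.8 & random < 0.9, ]     # roughly 10%
test  <- frame[random >= 0.9, ]                    # roughly 10%

# The three pieces are disjoint and together cover every row.
stopifnot(nrow(train) + nrow(valid) + nrow(test) == n)
round(c(train = nrow(train), valid = nrow(valid), test = nrow(test)) / n, 2)
```

Because the thresholds partition [0, 1), every row lands in exactly one piece, and fixing the seed makes the split reproducible across runs.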

Baseline performance on original dataset

For simplicity, we use default parameters (no grid search parameter tuning) to establish baseline performance numbers for this dataset.

glmparams <- list(family="binomial", variable_importances=T, use_all_factor_levels=T)
gbmparams <- list(importance=TRUE)

best_model <- h2o.trainModels(data_hex)

Both GLM and GBM do well on this dataset, with validation AUC values above 0.90:

GLM: 0.9028564
GBM: 0.9009924

According to GBM, the most important columns are marital-status, relationship, capital-gain, education-num and age.
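For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following base-R sketch computes it via the rank-sum identity; the function name auc and the toy scores are illustrative assumptions, not part of the helper file sourced earlier.

```r
# AUC via the rank-sum (Mann-Whitney) identity.
# `auc` is a hypothetical helper written for illustration only.
auc <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)   # ties receive average ranks
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.1, 0.4, 0.35, 0.8)   # toy predicted probabilities
labels <- c(0, 0, 1, 1)            # toy true classes
auc(scores, labels)                # 0.75: 3 of the 4 positive/negative pairs are ordered correctly
```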

Feature engineering

The following section shows ways to create new derived features. We'll need this simple append function, as we're going to add new columns to our dataset.

h2o.append <- function(frame, col) {
  appended_frame <- h2o.assign(cbind(frame, col), "appended_frame")
  appended_frame
}

1. Turn age into a factor

The feature age is an integer value, but GLM, for example, will have a tough time predicting income from age with a purely linear relationship. GBM should be able to carve out these non-linear dependencies by itself, though it might need more trees or a deeper interaction depth than the default values.

data_hex <- h2o.append(data_hex, as.factor(data_hex$age))
colnames(data_hex)
summary(data_hex)
best_model <- h2o.trainModels(data_hex)

GLM clearly benefited from this. We see that ages 18, 19 and 20 are among the most important predictors for income, and we get the following validation AUC values:

GLM: 0.9066634
GBM: 0.9009924

For fun, let's look at the largest positive and negative normalized coefficients:

head(sort(best_model@model$normalized_coefficients,decreasing=T),5)
head(sort(best_model@model$normalized_coefficients,decreasing=F),5)
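Why a factor helps a linear model can be shown on synthetic data with base R's glm(). This is only a sketch: the data below is simulated under an assumed non-monotonic relationship between age and the outcome, and it uses stats::glm rather than h2o.glm.

```r
# Simulated data: the outcome is most likely around age 45 (an assumption
# made purely for illustration, not a property measured on the Adult data).
set.seed(42)
n <- 5000
age <- sample(18:70, n, replace = TRUE)
p <- plogis(0.5 - ((age - 45) / 15)^2)
y <- rbinom(n, 1, p)

fit_linear <- glm(y ~ age, family = binomial)           # a single slope for age
fit_factor <- glm(y ~ factor(age), family = binomial)   # one coefficient per age level

# Lower deviance means a better in-sample fit.
c(linear = deviance(fit_linear), factor = deviance(fit_factor))
```

The factor model can represent any shape over the observed ages, so its in-sample deviance can never be worse than that of the linear model. Whether the extra flexibility generalizes still has to be checked on held-out data, which is what the validation AUC in this tutorial is for.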

2. Same for capital-gain/loss and work hours per week

data_hex <- h2o.append(data_hex, as.factor(data_hex$'hours-per-week'))
data_hex <- h2o.append(data_hex, as.factor(data_hex$'capital-gain'))
data_hex <- h2o.append(data_hex, as.factor(data_hex$'capital-loss'))
colnames(data_hex)
summary(data_hex)
best_model <- h2o.trainModels(data_hex)

With all these new factor levels as predictors, GLM gets a nice boost:

GLM: 0.9285384
GBM: 0.9021225
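To see why converting a numeric column to a factor adds so many predictors, it helps to look at how a factor expands into indicator columns. The sketch below uses base R's model.matrix() on a toy vector; it is loosely analogous to what a GLM sees when all factor levels are used, and is not a description of H2O's internal encoding.

```r
# Toy stand-in for an hours-per-week column.
hours <- factor(c(20, 40, 40, 60, 40))

# "- 1" drops the intercept so that every level gets its own 0/1 column.
X <- model.matrix(~ hours - 1)

dim(X)         # 5 rows, 3 columns: one column per distinct level
colnames(X)
rowSums(X)     # each row has exactly one indicator set
```

A column with many distinct values therefore turns into many sparse predictors, each of which can receive its own coefficient.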

Let's give GBM a shot at beating GLM by using better parameters:

gbmparams <- list(importance=TRUE, n.trees=50, interaction.depth=10)
best_model <- h2o.trainModels(data_hex)

Ok, now both algorithms reach similar validation AUC values:

GLM: 0.9285384
GBM: 0.9286973

data_hex$'capital-gain'   <- log(1+data_hex$'capital-gain')
data_hex$'capital-loss'   <- log(1+data_hex$'capital-loss')
data_hex
best_model <- h2o.trainModels(data_hex)

We see that the training AUC for GLM improves slightly, from 0.9269056070 to 0.9269586507. Intuition: monetary amounts are often heavily right-skewed (roughly exponentially distributed), and the log transform brings them back to a more linear scale. Note that the validation AUC drops, likely due to statistical noise on a small dataset. We are clearly close to the limit of this dataset. GBM didn't benefit from this transform; it seems to be able to split up the original integer space well enough on its own.
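The effect of the log transform on a skewed monetary column can be illustrated in base R. The data here is simulated (mostly zeros plus a heavy right tail) as a hypothetical stand-in for capital-gain, and the skewness function is a simple moment estimate written for this sketch.

```r
# Simulated stand-in for capital-gain: 900 zeros and 100 log-normal amounts.
set.seed(1)
gain <- c(rep(0, 900), round(rlnorm(100, meanlog = 8, sdlog = 1.5)))

# Simple moment-based skewness estimate (hypothetical helper).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# log1p(x) computes log(1 + x) and maps 0 to 0, so zero gains stay at zero.
transformed <- log1p(gain)

c(raw = skewness(gain), log = skewness(transformed))
```

The transformed column has a much shorter right tail, which is friendlier to a model that fits a single linear coefficient per numeric predictor.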

frame <- data_hex
random <- h2o.runif(frame, seed = 123456789)
train_hex <- h2o.assign(frame[random < .8,], "train_hex")
valid_hex <- h2o.assign(frame[random >= .8 & random < .9,], "valid_hex")
test_hex  <- h2o.assign(frame[random >= .9,], "test_hex")

predictors <- colnames(frame)[-match(response,colnames(frame))]

# multi-model comparison with N-fold cross-validation
data = list(x=predictors, y=response, train=train_hex, valid=valid_hex, nfolds=N_FOLDS)
models <- c(
  h2o.fit(h2o.deeplearning, data, list())
)
best_model <- h2o.leaderBoard(models, test_hex, match(response,colnames(frame)))