R with H2O

This tutorial demonstrates basic data import, manipulation, and summarization within an H2O cluster from R. It requires the h2o R package and its dependencies.

Load the H2O R package and start a local H2O cluster

A connection to an H2O cluster is established through the h2o.init function from the h2o package. For this training exercise, we will use a local H2O cluster running on the default port, 54321. We will also use the default cluster memory size and set nthreads = -1 to make all CPUs available to the H2O cluster.

library(h2o)
h2oServer <- h2o.init(nthreads = -1)

Download Data

This tutorial uses a 10% sample of the Person-Level 1% 2013 Public Use Microdata Sample (PUMS) from the United States Census Bureau, making it a Person-Level 0.1% 2013 PUMS.

wget https://s3.amazonaws.com/h2o-training/pums2013/adult_2013_full.csv.gz

Load data into the key-value store in the H2O cluster

We will use the h2o.importFile function to read the data into the H2O key-value store.

datadir <- "~/Downloads"
csvfile <- "adult_2013_full.csv.gz"
adult_2013_full <- h2o.importFile(h2oServer,
                                  path = file.path(datadir, csvfile),
                                  key = "adult_2013_full", sep = ",")

The key argument to the h2o.importFile function sets the name of the data set in the H2O key-value store. If the key argument is not supplied, the data will reside in the H2O key-value store under a machine-generated name.

The output of the h2o.ls function shows the size of the object held by the adult_2013_full key in the H2O key-value store.

kvstore <- h2o.ls(h2oServer)
kvstore
kvstore$Bytesize[kvstore$Key == "adult_2013_full"] / 1024^2
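The size lookup above is ordinary data frame indexing followed by a conversion from bytes to mebibytes. A minimal sketch in base R, assuming a listing with Key and Bytesize columns like the one used above; the keys and sizes here are made up for illustration and do not come from a real cluster:

```r
# Hypothetical key listing: made-up keys and byte counts, not real h2o.ls output.
kvstore <- data.frame(
  Key = c("adult_2013_full", "Last.value.0"),
  Bytesize = c(52428800, 1048576),
  stringsAsFactors = FALSE
)

# Select the row for one key, then divide by 1024^2 to convert bytes to MiB.
size_mib <- kvstore$Bytesize[kvstore$Key == "adult_2013_full"] / 1024^2
size_mib
```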

Examine the proxy object for the H2O resident data

The resulting adult_2013_full object is of class H2OParsedData, which implements methods commonly associated with native R data.frame objects.

class(adult_2013_full)
dim(adult_2013_full)
head(colnames(adult_2013_full), 50)

Create an up-to-date UCI Adult Data Set

In the interest of familiarity, we will create a data set similar to the UCI Adult Data Set from the University of California Irvine (UCI) Machine Learning Repository. In particular, we want to extract the age of person (AGEP), class of worker (COW), educational attainment (SCHL), marital status (MAR), industry employed (INDP), relationship (RELP), race (RAC1P), sex (SEX), interest/dividends/net rental income over the past 12 months (INTP), usual hours worked per week over the past 12 months (WKHP), place of birth (POBP), and wages/salary income over the past 12 months (WAGP).

nms <- c("AGEP", "COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX",
         "INTP", "WKHP", "POBP", "WAGP")
adult_2013 <- adult_2013_full[!is.na(adult_2013_full$WAGP) &
                              adult_2013_full$WAGP > 0, nms]
h2o.ls(h2oServer)
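The row filter above combines a missing-value check with a positivity check. The same idiom can be tried on an ordinary R data frame; this is a sketch on made-up rows (the ST column and all values are hypothetical), and it only shows the base R behavior, not what happens inside the H2O cluster:

```r
# Made-up person records; WAGP is missing for one row and zero for another.
pums <- data.frame(
  AGEP = c(34, 8, 51, 29),
  WAGP = c(42000, NA, 0, 18500),
  ST   = c(6, 6, 36, 48)
)
nms <- c("AGEP", "WAGP")

# Keep rows with a known, positive wage and only the columns named in nms.
# In base R, !is.na() matters: NA > 0 is NA, and indexing by NA yields an NA row.
workers <- pums[!is.na(pums$WAGP) & pums$WAGP > 0, nms]
workers
```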

Although we created an object in R called adult_2013, there is no value with that key in the H2O key-value store. To make it easier to track our data set, we will copy its value to the adult_2013 key using the h2o.assign function and then use the h2o.rm function to delete all the machine-generated keys with the prefix Last.value that served as intermediary objects.

adult_2013 <- h2o.assign(adult_2013, key = "adult_2013")
h2o.ls(h2oServer)

rmLastValues <- function(pattern = "Last.value.")
{
  # Find the temporary keys; remove them only if at least one was found
  keys <- h2o.ls(h2oServer, pattern = pattern)$Key
  if (length(keys) > 0)
    h2o.rm(h2oServer, keys)
  invisible(keys)
}
rmLastValues()

kvstore <- h2o.ls(h2oServer)
kvstore
kvstore$Bytesize[kvstore$Key == "adult_2013"] / 1024^2

Summarize the 2013 update of the UCI Adult Data Set

As mentioned above, an R proxy object to an H2O data set implements several methods commonly associated with R data.frame objects including the summary function to obtain column-level summaries and the dim function to get the row and column count.

summary(adult_2013)
dim(adult_2013)

As with R data.frame objects, individual columns within an H2O data set can be summarized using methods commonly associated with R vector objects. For example, the quantile function in R is used to find sample quantiles at the probability values specified in the probs argument.

centiles <- quantile(adult_2013$WAGP, probs = seq(0, 1, by = 0.01))
centiles
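To see the shape of what quantile returns, here is a small sketch on a plain R vector. The wage values are made up, and deciles are used instead of centiles to keep the output short; the numbers follow base R's default quantile definition and are not taken from the PUMS data:

```r
# Eight made-up wage values.
wages <- c(12000, 25000, 31000, 47000, 52000, 68000, 90000, 150000)

# One quantile per probability, so 0, 0.1, ..., 1 gives 11 named values.
deciles <- quantile(wages, probs = seq(0, 1, by = 0.1))
deciles

# The first and last entries are the sample minimum and maximum.
deciles[c(1, length(deciles))]
```

The result is a named numeric vector, so an individual quantile can be picked by position, which is how the centiles object is indexed later in this tutorial.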

The use of the $ operator to extract the column WAGP from the adult_2013 data set generated new Last.value keys that we will clean up in the interest of maintaining a tidy key-value store.

h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)

Derive columns: capital gain and capital loss columns

The original UCI Adult Data Set contains columns for capital gain and capital loss, which can be extracted from the INTP column within the Person-Level PUMS data set. We will derive these two columns using the ifelse function, where the test condition is whether the INTP column is positive (for capital gain) or negative (for capital loss). If the condition is met, the value is INTP (capital gain) or -INTP (capital loss); otherwise it is 0. If we were just interested in measuring the magnitude of either a loss or a gain, we could have used the abs function.

capgain <- ifelse(adult_2013$INTP > 0, adult_2013$INTP, 0)
caploss <- ifelse(adult_2013$INTP < 0, - adult_2013$INTP, 0)
adult_2013$CAPGAIN <- capgain
adult_2013$CAPLOSS <- caploss
adult_2013 <- adult_2013[,- match("INTP", colnames(adult_2013))]
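The gain/loss split can be checked on a plain numeric vector. This sketch uses made-up INTP values and base R's ifelse; it illustrates the logic only and does not touch the H2O data set:

```r
# Made-up interest/dividend/rental income values, including a zero.
intp <- c(-500, 0, 1200, -30, 75)

# Positive values count as gains, negative values as losses (sign flipped).
capgain <- ifelse(intp > 0, intp, 0)
caploss <- ifelse(intp < 0, -intp, 0)

# Each row has at most one nonzero entry, so the two columns
# recover the original value and its magnitude.
data.frame(intp, capgain, caploss,
           net = capgain - caploss,
           magnitude = capgain + caploss)
```

Because gain minus loss reproduces INTP and gain plus loss equals abs(INTP), no information is lost when the INTP column is dropped.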

Now that we have the capital gain and loss columns, we can assign our new data set to the adult_2013 key and remove all the temporary keys from the H2O key-value store.

adult_2013 <- h2o.assign(adult_2013, key = "adult_2013")

h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)

Derive columns: log transformations for income variables

The UCI Adult Data Set was originally created to predict whether a person's income in the early 1990s exceeds $50,000 per year. Given that incomes are right-skewed, transforming these measures to a log scale tends to make them more conducive to use in predictive modeling.

adult_2013$LOG_CAPGAIN <- log(adult_2013$CAPGAIN + 1L)
adult_2013$LOG_CAPLOSS <- log(adult_2013$CAPLOSS + 1L)
adult_2013$LOG_WAGP    <- log(adult_2013$WAGP    + 1L)
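Adding 1 before taking the log keeps zero-valued rows defined, since log(0) is -Inf. A short base R sketch on made-up capital gain values shows the effect; log1p is a base R function that computes the same quantity:

```r
# Made-up capital gains, including a zero and one very large value.
capgain <- c(0, 75, 1200, 250000)

# log(x + 1) maps 0 to 0 and compresses the long right tail.
log_capgain <- log(capgain + 1)
log_capgain

# Without the offset, the zero row would become -Inf.
log(capgain)[1]

# Base R's log1p gives the same result as log(x + 1).
all.equal(log_capgain, log1p(capgain))
```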

Now that we have the log-transformed columns, we can remove all the temporary keys from the H2O key-value store.

h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)

Create cross-tabulations of original and derived categorical variables

We will begin an analysis of wages by exploring the pairwise relationships between wage groups subdivided into percentiles and the variables we will use as predictors in our statistical models. In the code below, the h2o.cut function creates the wage groups and the h2o.table function performs the cross-tabulations.

cutpoints <- centiles
cutpoints[1L] <- 0
adult_2013$CENT_WAGP <- h2o.cut(adult_2013$WAGP, cutpoints)
adult_2013$TOP2_WAGP <- adult_2013$WAGP > centiles[99L]

centcounts <- h2o.table(adult_2013["CENT_WAGP"], return.in.R = TRUE)
round(100 * centcounts/sum(centcounts), 2)

top2counts <- h2o.table(adult_2013["TOP2_WAGP"], return.in.R = TRUE)
round(100 * top2counts/sum(top2counts), 2)

relpxtabs <- h2o.table(adult_2013[c("RELP", "TOP2_WAGP")], return.in.R = TRUE)
relpxtabs
round(100 * relpxtabs/rowSums(relpxtabs), 2)

schlxtabs <- h2o.table(adult_2013[c("SCHL", "TOP2_WAGP")], return.in.R = TRUE)
schlxtabs

round(100 * schlxtabs/rowSums(schlxtabs), 2)
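The binning and tabulation steps can be previewed with base R's cut and table functions on a small made-up vector. This sketch uses quartiles rather than centiles and assumes nothing about h2o.cut beyond what the code above shows; in base R, cut builds intervals that are open on the left, which is why the first cutpoint is lowered:

```r
# Eight made-up wage values.
wages <- c(12000, 25000, 31000, 47000, 52000, 68000, 90000, 150000)
cutpoints <- quantile(wages, probs = seq(0, 1, by = 0.25))

# With the minimum as the first break, the smallest wage falls outside
# every (lower, upper] interval and becomes NA.
unadjusted <- cut(wages, cutpoints)
sum(is.na(unadjusted))

# Lowering the first break to 0 puts the minimum inside the first interval.
cutpoints[1L] <- 0
quartile <- cut(wages, cutpoints)

# Counts per group, then percentages of the total.
counts <- table(quartile)
round(100 * counts / sum(counts), 2)
```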

Perform a key-value store clean up.

h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)

Coerce integer columns to factor (categorical) columns

As with standard R integer vectors, integer columns in H2O can be converted to a categorical type using an as.factor method. For our data set we have 8 columns that use integer codes to represent categorical levels.

for (j in c("COW", "SCHL", "MAR", "INDP", "RELP", "RAC1P", "SEX", "POBP"))
  adult_2013[[j]] <- as.factor(adult_2013[[j]])
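The same loop works on an ordinary R data frame, which makes it easy to see what the conversion changes. The codes below are made up, and the sketch shows base R behavior only:

```r
# Made-up records: SEX and MAR are integer codes, AGEP is a true numeric value.
people <- data.frame(
  SEX  = c(1L, 2L, 2L, 1L),
  MAR  = c(1L, 5L, 3L, 1L),
  AGEP = c(34L, 51L, 29L, 62L)
)

# Convert only the coded columns; AGEP stays numeric.
for (j in c("SEX", "MAR"))
  people[[j]] <- as.factor(people[[j]])

# Each distinct code becomes a level, sorted in increasing order.
sapply(people, class)
levels(people$MAR)
```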

Perform a key-value store clean up.

h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)

Create pairwise interaction terms for linear modeling

While some modeling approaches, such as gradient boosting machines (GBM), random forests, and deep learning, are able to derive interactions between terms during the model training stage, other modeling approaches, such as generalized linear models (GLM), require interactions to be user-defined inputs. We will use the h2o.interaction function to generate a new column in our data set that pairs relationship (RELP) with educational attainment (SCHL) to form a new column labeled RELP_SCHL that we will "column bind" to our data set using a cbind method.

inter_2013 <- h2o.interaction(adult_2013, factors = c("RELP", "SCHL"),
                              pairwise = TRUE, max_factors = 10000,
                              min_occurrence = 10)
adult_2013 <- cbind(adult_2013, inter_2013)
adult_2013 <- h2o.assign(adult_2013, key = "adult_2013")
colnames(adult_2013)
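The idea of pairing two categorical columns can be illustrated with base R's interaction function on ordinary factors. This is a local sketch with made-up codes, not a reproduction of h2o.interaction; in particular, it does not model the max_factors or min_occurrence arguments:

```r
# Made-up relationship and education codes for five people.
relp <- factor(c(0, 0, 1, 2, 1))
schl <- factor(c(21, 16, 21, 16, 21))

# Each observed (RELP, SCHL) pair becomes one level of a new factor;
# drop = TRUE discards pairs that never occur in the data.
relp_schl <- interaction(relp, schl, sep = "_", drop = TRUE)

as.character(relp_schl)
nlevels(relp_schl)
```

A linear model given this column can fit a separate coefficient for each observed pair, which is what a user-defined interaction term provides.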

Now that we have derived a few sets of variables, we can examine the H2O key-value store to ensure we have the expected objects.

h2o.ls(h2oServer)
rmLastValues()

kvstore <- h2o.ls(h2oServer)
kvstore
kvstore$Bytesize[kvstore$Key == "adult_2013"] / 1024^2

Generate group by aggregates

In addition to cross-tabulations, we can create more detailed group-by aggregates using the h2o.ddply function, which was inspired by the ddply function from Hadley Wickham's plyr package, hosted on the Comprehensive R Archive Network (CRAN).

wagpSummary <- function(frame) {
  cbind(N      = length(frame$WAGP),
        Min    = min(frame$WAGP),
        Median = median(frame$WAGP),
        Max    = max(frame$WAGP),
        Mean   = mean(frame$WAGP),
        StdDev = sd(frame$WAGP))
}
h2o.addFunction(h2oServer, wagpSummary)
statsByGroup <- as.data.frame(h2o.ddply(adult_2013, "RELP", wagpSummary))
colnames(statsByGroup)[-1L] <- c("N", "Min", "Median", "Max", "Mean", "StdDev")
statsByGroup <- statsByGroup[order(statsByGroup$Median, decreasing = TRUE), ]
rownames(statsByGroup) <- NULL
statsByGroup
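The split-apply-combine pattern behind this step can be sketched in base R without a cluster. The data below are made up, and split, lapply, and rbind stand in for h2o.ddply; this is an illustration of the pattern, not of how H2O computes the result:

```r
# Made-up wages for three relationship groups.
wages <- data.frame(
  RELP = c(0, 1, 1, 2, 2, 2),
  WAGP = c(20000, 30000, 50000, 40000, 60000, 80000)
)

# Summary of one group's wages, with the statistics in a fixed, named order.
wagpSummary <- function(x) {
  c(N = length(x), Min = min(x), Median = median(x),
    Max = max(x), Mean = mean(x), StdDev = sd(x))
}

# Split WAGP by group, summarize each piece, and stack the rows.
stats <- do.call(rbind, lapply(split(wages$WAGP, wages$RELP), wagpSummary))
statsByGroup <- data.frame(RELP = rownames(stats), stats, row.names = NULL)

# Sort groups from highest to lowest median wage.
statsByGroup <- statsByGroup[order(statsByGroup$Median, decreasing = TRUE), ]
rownames(statsByGroup) <- NULL
statsByGroup
```

Naming the statistics inside the summary function keeps the column labels tied to the values they describe. A group with a single row has an undefined standard deviation, which base R reports as NA.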

Create training and test data sets to use during modeling

As a final step in an exploration of H2O basics in R, we will create a 75% / 25% split, where the larger data set will be used for training a model and the smaller data set will be used for testing the usefulness of the model. We will achieve this by using the h2o.runif function to generate random uniforms over [0, 1] for each row and using those random values to determine the split designation for that row.

rand <- h2o.runif(adult_2013, seed = 1185)
adult_2013_train <- adult_2013[rand <= 0.75, ]
adult_2013_train <- h2o.assign(adult_2013_train, key = "adult_2013_train")
adult_2013_test <- adult_2013[rand  > 0.75, ]
adult_2013_test <- h2o.assign(adult_2013_test, key = "adult_2013_test")

Now check to make sure the sizes of the resulting data sets meet expectations.

nrow(adult_2013)
nrow(adult_2013_train)
nrow(adult_2013_test)
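The same splitting logic can be tried locally with base R's runif. This sketch uses a made-up data frame of 1000 row ids and reuses the seed value from above purely for illustration; the random draws will not match those produced by h2o.runif:

```r
set.seed(1185)

# Made-up data set with one row per id.
n <- 1000
df <- data.frame(id = seq_len(n))

# One uniform draw per row decides which side of the split the row lands on.
rand  <- runif(n)
train <- df[rand <= 0.75, , drop = FALSE]
test  <- df[rand >  0.75, , drop = FALSE]

# Every row lands in exactly one set, and about 75% land in train.
nrow(train) + nrow(test) == n
nrow(train) / n
```

Because each row is assigned independently, the split is only approximately 75% / 25%, which is why the row counts are worth checking.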

Perform a key-value store clean up.

h2o.ls(h2oServer)
rmLastValues()
h2o.ls(h2oServer)