Python on H2O

Similar to the R API, the H2O Python module provides access to the H2O JVM, its objects, its machine-learning algorithms, and its modeling-support (basic munging and feature generation) capabilities. The Python integration is only available in the new generation of H2O, because the REST endpoints it calls differ from those in the older version of H2O.

The H2O Python module is not intended as a replacement for other popular machine-learning modules such as scikit-learn, pylearn2, and their ilk. This module is a complementary interface to a modeling engine, intended to make the transition of models from development to production as seamless as possible. Additionally, it is designed to bring H2O to a wider audience of data and machine-learning devotees who work exclusively with Python (rather than R, Scala, or Java, the other popular interfaces that H2O supports) and who want another tool for building applications or doing data munging in a fast, scalable environment without any extra mental anguish about threads and parallelism.

This tutorial demonstrates basic data import, manipulation, and summarization within an H2O cluster from Python. It requires an installation of the h2o Python module and its dependencies.

Setup and Installation

Prerequisites: Python 2.7 and NumPy 1.9.2

First install the dependencies the H2O module is built on, and then install the H2O module itself.

pip install requests
pip install tabulate

# Remove any preexisting H2O module.
pip uninstall h2o
# Next, use pip to install this version of the H2O Python module.
pip install http://h2o-release.s3.amazonaws.com/h2o-dev/master/1109/Python/h2o-0.3.0.1109-py2.py3-none-any.whl    
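
To confirm the installation succeeded, importing the module from the command line is a quick sanity check (it only verifies that the package and its dependencies resolve):

# Verify that the freshly installed module imports cleanly.
python -c "import h2o"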

Instantiate an H2O Cluster

The H2O JVM sports a web server, so all communication occurs over a socket (specified by an IP address and a port) via a series of REST calls (see connection.py for the REST-layer implementation and details). There is a single active connection to the H2O JVM at any one time, and this handle is stashed away out of sight in a singleton instance of H2OConnection (the global __H2OConn__).

import h2o
h2o.init()
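
By default, init() attempts to connect to an H2O instance at localhost:54321 (and, depending on the build, may launch one locally if none is found). To attach to a cluster running on another machine, the IP address and port can be passed explicitly. A minimal sketch, with a hypothetical address you would replace with your own:

# Sketch: connect to an already-running remote cluster instead of a local one.
h2o.init(ip="192.168.1.10", port=54321)  # hypothetical address; 54321 is H2O's default port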

Get data

This tutorial uses NYC's public Citibike dataset, importing one month of trip data across 300+ stations into H2O. We then build a model that predicts bike demand and usage, which could be used to load-balance the Citibike inventory. The data can be downloaded from the Citibike-NYC website.

wget https://s3.amazonaws.com/tripdata/201310-citibike-tripdata.zip

Load data into the key-value store in the H2O cluster

We will use the h2o.import_frame function to read the data into the H2O key-value store. H2O parses the compressed zip file directly; there is no need to unpack it first.

import os
# Expand "~" on the client so H2O receives an absolute path.
small_test = os.path.expanduser("~/Downloads/201310-citibike-tripdata.zip")
data = h2o.import_frame(path=small_test)
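
import_frame performs a server-side read, so the path must be visible to the machine running H2O (true here, since the cluster is local). If the file instead lived only on the client, pushing it over the connection is the usual alternative; a minimal sketch, assuming h2o.upload_file is available in this build:

# Sketch: upload a client-local file to the cluster instead of importing server-side.
data = h2o.upload_file(path=small_test)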

Examine the proxy object for the H2O-resident data

The resulting data object is of class H2OFrame, which implements methods such as describe, dim, and head. The frame itself stays in the cluster; the Python object is only a lightweight handle to it.

data.describe()
data.dim()
data.head()
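
Individual columns can be selected by name and inspected the same way; nothing is copied into Python until explicitly requested. A small sketch using methods already shown above:

# Sketch: select one column by name and peek at it; the data stays in the cluster.
data["start station name"].show()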

Derive columns

In this step we take the start time, which H2O stores as milliseconds since the epoch, divide by the number of milliseconds in a day, and floor the result to get a day index for each trip. We then bind the new column to the original H2OFrame.

starttime = data["starttime"]
msPerDay = 1000*60*60*24  # milliseconds per day, since H2O times are in ms since the epoch
data["Days"] = (starttime/msPerDay).floor()
data.describe()
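
The arithmetic can be sanity-checked in plain Python. Since times are milliseconds since the epoch, integer division by the number of milliseconds per day yields a day index (a standalone illustration, independent of the H2O cluster):

import time
msPerDay = 1000 * 60 * 60 * 24       # milliseconds in one day
now_ms = int(time.time() * 1000)     # current time in ms since the epoch
print now_ms // msPerDay             # number of whole days elapsed since 1970-01-01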

Data munging and aggregations

Once we have the Days column we can group by it. The goal is to count the number of bike starts per station per day. This reduces the dataset from about 1,000,000 rows to roughly 10,000 rows (the cardinality of Days times the cardinality of start station name).

ddplycols = ["Days", "start station name"]
bpd = h2o.ddply(data[ddplycols], ddplycols, "(%nrow)")  # Count rows (bike starts) per group
bpd["C1"]._name = "bikes"  # Rename the count column from its generic name "C1"
bpd.describe()
bpd.dim()

Data exploration and quantiles

To gauge the distribution of bike counts in the dataset, run quantile on the bikes column. The counts vary wildly, from stations with fewer than ten checkouts in a day to busy stations in the heart of NYC with hundreds of checkouts in a day.

print "Quantiles of bikes-per-day"
bpd["bikes"].quantile().show()

The dataset so far has only three features, so we add Month and DayOfWeek columns to the frame, hoping to capture patterns such as stations near the financial district being more active on workdays while stations near parks are more active on weekends.

ms = bpd["Days"]*msPerDay  # Convert the day index back to milliseconds since the epoch
bpd["Month"]     = ms.month()
bpd["DayOfWeek"] = ms.dayOfWeek()
print "Bikes-Per-Day"
bpd.describe()

Create training and test data sets to use during modeling

As a final step in this exploration of H2O basics in Python, we create a train/test/holdout split: the largest subset is used to train each model, and the smaller subsets are used to evaluate it. We achieve this with the runif function, which generates a random uniform value over [0, 1] for each row; those values determine each row's split assignment.

Then we build GBM (h2o.gbm), Random Forest (h2o.random_forest), and GLM (h2o.glm) regression models predicting bike counts. Once the models are built, we score each one on the different subsets of the data and report its performance using the model_performance method.

def split_fit_predict(data):
  # Classic Test/Train split
  r = data['Days'].runif()   # Random UNIForm numbers, one per row
  train = data[  r  < 0.6]
  test  = data[(0.6 <= r) & (r < 0.9)]
  hold  = data[ 0.9 <= r ]
  print "Training data has",train.ncol(),"columns and",train.nrow(),"rows, test has",test.nrow(),"rows, holdout has",hold.nrow()

  # Run GBM
  gbm = h2o.gbm(x           =train.drop("bikes"),
                y           =train     ["bikes"],
                validation_x=test .drop("bikes"),
                validation_y=test      ["bikes"],
                ntrees=500, # 500 works well
                max_depth=6,
                learn_rate=0.1)

  # Run DRF
  drf = h2o.random_forest(x =train.drop("bikes"),
                y           =train     ["bikes"],
                validation_x=test .drop("bikes"),
                validation_y=test      ["bikes"],
                ntrees=500, # 500 works well
                max_depth=50)

  # Run GLM
  glm = h2o.glm(x           =train.drop("bikes"),
                y           =train     ["bikes"],
                validation_x=test .drop("bikes"),
                validation_y=test      ["bikes"],
                dropNA20Cols=True)
  #glm.show()


  # ----------
  # Score on the train, test, and holdout sets & report
  train_r2_gbm = gbm.model_performance(train).r2()
  test_r2_gbm  = gbm.model_performance(test ).r2()
  hold_r2_gbm  = gbm.model_performance(hold ).r2()
  print "GBM R2 TRAIN=",train_r2_gbm,", R2 TEST=",test_r2_gbm,", R2 HOLDOUT=",hold_r2_gbm

  train_r2_drf = drf.model_performance(train).r2()
  test_r2_drf  = drf.model_performance(test ).r2()
  hold_r2_drf  = drf.model_performance(hold ).r2()
  print "DRF R2 TRAIN=",train_r2_drf,", R2 TEST=",test_r2_drf,", R2 HOLDOUT=",hold_r2_drf

  train_r2_glm = glm.model_performance(train).r2()
  test_r2_glm  = glm.model_performance(test ).r2()
  hold_r2_glm  = glm.model_performance(hold ).r2()
  print "GLM R2 TRAIN=",train_r2_glm,", R2 TEST=",test_r2_glm,", R2 HOLDOUT=",hold_r2_glm
  # --------------

split_fit_predict(bpd)

Here we see an R^2 of about 0.92 for GBM and 0.72 for GLM on the test set. In other words, given just the station, the month, and the day of week, the GBM model explains roughly 90% of the variance in bike-trip starts.