Data Science Flow from H2O's Web Interface

Step 1: Import Data

The airlines data set imported here is a subset of the data made available by RITA and contains a mix of numeric and factor columns. In this tutorial, we build several classification models that predict flight delays, compare them, and score data with a selected model.

Step 2: Data Summary

  • On the data inspect page, navigate to Summary, which you can also reach through Data > Summary
  • Click Submit to get a summary of all the columns in the data:
    • Numeric Columns: Min, Max, and Quantiles
    • Factor Columns: Counts of each factor, Cardinality, NAs
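
To make the summary fields concrete, here is a minimal sketch in plain Python of the statistics listed above. It is not H2O's implementation: the function names, the sample values, and the use of None for a missing value are assumptions made for illustration, and H2O may compute its quantiles differently.

```python
import statistics
from collections import Counter

def summarize_numeric(values):
    """Min, max, and quartiles of a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    q1, median, q3 = statistics.quantiles(present, n=4)
    return {
        "min": min(present),
        "max": max(present),
        "quartiles": (q1, median, q3),
        "nas": len(values) - len(present),
    }

def summarize_factor(values):
    """Count of each level, number of distinct levels, and number of NAs."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "counts": dict(counts),
        "cardinality": len(counts),
        "nas": len(values) - len(present),
    }

# Made-up sample values, not rows from the airlines data set.
distance = [100, 200, None, 400, 500]
carrier = ["AA", "UA", "AA", None, "DL"]

print(summarize_numeric(distance))
print(summarize_factor(carrier))
```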

Step 3: Split Data into Test and Training Sets
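
The later steps train on allyears2k_shuffled_part0.hex and score on allyears2k_shuffled_part1.hex. As a rough sketch of what a shuffled split does, the plain-Python function below shuffles the rows and cuts them into two parts. The function name, the 75/25 ratio, and the seed are assumptions for illustration, not values taken from this tutorial or from H2O.

```python
import random

def shuffle_split(rows, train_fraction=0.75, seed=42):
    """Shuffle rows reproducibly, then cut them into a training and a test part."""
    shuffled = list(rows)                      # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)      # seeded generator gives a repeatable order
    cut = int(len(shuffled) * train_fraction)  # index where the training part ends
    return shuffled[:cut], shuffled[cut:]

# Stand-in rows; a real data set would hold one record per flight.
rows = list(range(20))
part0, part1 = shuffle_split(rows)
print(len(part0), len(part1))
```

Shuffling before cutting matters when the file is ordered (for example by year), because an unshuffled cut would put different time periods into the two parts.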

Step 4: Build a GLM model

  • Go to Model > Generalized Linear Model
  • Input for source: allyears2k_shuffled_part0.hex
  • Select for response: IsDepDelayed
  • For ignored columns: select all columns (Ctrl+A), then Ctrl+click to deselect Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, Origin, Dest, and Distance
  • Select for family: binomial
  • Check the options use all factor levels and variable importances
  • Click Submit to start the job
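
For intuition about the binomial family, the sketch below shows how a fitted binomial GLM with a logit link turns one row into a probability. The coefficient values, the indicator column name UniqueCarrier.UA, and the function name are hypothetical, chosen only for illustration; they are not output from H2O.

```python
import math

def predict_delay_probability(features, coefficients, intercept):
    """Binomial GLM with a logit link: p = 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    linear = intercept + sum(coefficients[name] * value
                             for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients, NOT values fitted by H2O.
# A factor column such as UniqueCarrier enters the model as 0/1 indicator columns.
coefficients = {"Distance": 0.0002, "UniqueCarrier.UA": 0.3}
flight = {"Distance": 500.0, "UniqueCarrier.UA": 1.0}

p = predict_delay_probability(flight, coefficients, intercept=-1.0)
print(round(p, 3))
```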

Step 5: Build a 50-Tree GBM model

  • Go to Model > Gradient Boosting Machine
  • Input for source: allyears2k_shuffled_part0.hex
  • Select for response: IsDepDelayed
  • For ignored columns: select all columns (Ctrl+A), then Ctrl+click to deselect Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, Origin, Dest, and Distance
  • Click Submit to start the job

Step 6: Build a simpler 5-Tree GBM model

  • Go to Model > Gradient Boosting Machine
  • Input for source: allyears2k_shuffled_part0.hex
  • Select for response: IsDepDelayed
  • For ignored columns: select all columns (Ctrl+A), then Ctrl+click to deselect Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, Origin, Dest, and Distance
  • Input for ntrees: 5
  • Click Submit to start the job
  • On the model output page, click the JSON tab.
  • On the model output page, click the JAVA tab.

Step 7: Build Deep Learning models with a grid search

  • Go to Model > Deep Learning
  • Input for source: allyears2k_shuffled_part0.hex
  • Select for response: IsDepDelayed
  • For ignored columns: select all columns (Ctrl+A), then Ctrl+click to deselect Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, Origin, Dest, and Distance
  • Input for hidden: (10,10), (20,20,20)
  • Click Submit to start the job

The models are sorted by error rate. Scroll all the way to the right to select the first model in the list.

Step 8: Multimodel Scoring Engine

  • Navigate to Score > Multi model Scoring (beta)
  • Select the data set allyears2k.hex, scroll to the compatible models, and click VIEW THESE MODELS...
  • Select all the models in the left-hand task bar.
  • Click SCORE..., select allyears2k_shuffled_part1.hex, and click OK

The tabular view gives a side-by-side comparison of all the models.

Creating Visualizations

  • Navigate to the ADVANCED tab to see the overlaid ROC curves
  • Click ADD VISUALIZATION...
  • For the X-Axis Field choose Training Time (ms)
  • For the Y-Axis Field choose AUC

Examine the new graph. Weigh each gain in accuracy against the extra time it takes to train the model. Select the model with your desired level of accuracy, copy its model key, and proceed to Step 9.
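
The trade-off plotted above can also be sketched in code. The example below computes AUC with the rank formulation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counting half) and then picks the highest-AUC model within a training-time budget. The model keys, AUC values, and training times are made-up placeholders, and pick_model is a hypothetical helper, not part of H2O.

```python
def auc(labels, scores):
    """Area under the ROC curve as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def pick_model(models, max_training_ms):
    """Return the highest-AUC model whose training time fits the budget."""
    affordable = [m for m in models if m["training_ms"] <= max_training_ms]
    return max(affordable, key=lambda m: m["auc"])

# Toy labels (1 = delayed) and predicted scores, for illustration only.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2]
print(auc(labels, scores))

# Placeholder numbers, NOT results from this tutorial.
models = [
    {"key": "gbm_5_trees", "auc": 0.70, "training_ms": 400},
    {"key": "gbm_50_trees", "auc": 0.74, "training_ms": 3000},
]
print(pick_model(models, max_training_ms=1000)["key"])
```

With a generous budget the larger model wins; with a tight one the smaller model may be the better choice, which is the judgment the graph is meant to support.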

Step 9: Create Frame with Predicted Values

  • Navigate back to Home Page > Score > Predict
  • Input for model: paste the model key you copied in Step 8
  • Input for data: allyears2k_shuffled_part1.hex
  • Input for prediction: pred

Step 10: Export Predicted Values as CSV

To export the predicted values, or any other frame:

  • Navigate to Data > Export Files
  • Input for src key: pred
  • Input for path: /data/h2o-training/airlines/pred.csv

Step 11: Save a model for use later

  • Navigate to Data > View All
  • Choose to filter by the model key
  • Click Save Model
  • Input for path: /data/h2o-training/airlines/50TreesGBMmodel
  • Click Submit

Errors? Download and send us the log files!

Step 12: Shut down your H2O instance

  • Go to Admin > Shutdown

Extra Bonus: Reload that saved model

In an active H2O session:

  • Navigate to Load Model
  • Input for path: /data/h2o-training/airlines/50TreesGBMmodel
  • Click Submit