Data Science Flow from H2O's Web Interface
You can follow along with our video tutorial:
Step 1: Import Data
The airlines data set we are importing is a subset of the data made available by RITA, with a mix of numeric and factor columns. In this tutorial, we build several classification models that predict flight delays, compare them, and score data with a selected model.
- Navigate to Data > Import File
- Input into path
/data/h2o-training/airlines/allyears2k.csv
and hit Submit
- Click the nfs link
C:\data\h2o-training\airlines\allyears2k.csv
- Scroll down the page to get a preview of your data before clicking Submit again.
Step 2: Data Summary
- On the data inspect page, navigate to Summary, which you can also reach through Data > Summary
- Click Submit to get a summary of all the columns in the data:
- Numeric Columns: Min, Max, and Quantiles
- Factor Columns: Counts of each factor, Cardinality, NAs
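The per-column statistics listed above can be sketched in plain Python to show what the summary page reports. This is only an illustration, not how H2O computes it: the function names and sample values are hypothetical, missing values are assumed to be represented as `None`, and the quantile method used here may differ from the one H2O applies.

```python
import statistics
from collections import Counter


def summarize_numeric(values):
    """Min, max, and quartiles of a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    # method="inclusive" treats the data as the full population (an assumption).
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "min": min(present),
        "max": max(present),
        "quartiles": (q1, median, q3),
        "nas": len(values) - len(present),
    }


def summarize_factor(values):
    """Count of each level, cardinality, and number of NAs of a factor column."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "counts": dict(counts),
        "cardinality": len(counts),
        "nas": len(values) - len(present),
    }


# Hypothetical sample values, not taken from allyears2k.csv.
distance = [100, 200, 300, 400, 500]
carrier = ["AA", "UA", None, "AA", "WN"]

print(summarize_numeric(distance))
print(summarize_factor(carrier))
```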
Step 3: Split Data into Test and Training Sets
- Navigate back to the data inspect page: Data > View All > allyears2k.hex > Split Frame
- Select shuffle and click Submit
- Select allyears2k_shuffled_part0.hex for the training frame
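Conceptually, a shuffled split reorders the rows at random and then cuts the result into two parts. The sketch below illustrates the idea in plain Python; the function name, the 0.75 ratio, and the seed are assumptions made for the example and are not taken from the Split Frame page.

```python
import random


def shuffle_split(rows, ratio=0.75, seed=42):
    """Shuffle rows, then cut them into a training part and a test part.

    ratio and seed are illustrative assumptions, not H2O settings.
    """
    shuffled = list(rows)                  # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))                    # stand-in for the rows of a frame
part0, part1 = shuffle_split(rows)
print(len(part0), len(part1))
```

Here `part0` plays the role of the training frame and `part1` the held-out frame used for scoring in later steps.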
Step 4: Build a GLM model
- Go to Model > Generalized Linear Model
- Input for source:
allyears2k_shuffled_part0.hex
- Select for response:
IsDepDelayed
- Select to ignore all columns (Ctrl+A) except for Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, Origin, Dest, and Distance (Ctrl)
- Select for family:
binomial
- Check use all factor levels and variable importances
- Hit Submit to start the job
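The "ignore everything except" selection used in this step amounts to a set difference between the frame's columns and the predictors you keep. A minimal sketch follows; the `header` list is a hypothetical, shortened column list assumed for illustration, and the function name is not part of H2O.

```python
# Predictors kept in this tutorial, plus the response column.
KEEP = ["Year", "Month", "DayofMonth", "DayOfWeek",
        "UniqueCarrier", "Origin", "Dest", "Distance"]
RESPONSE = "IsDepDelayed"


def ignored_columns(all_columns, keep=KEEP, response=RESPONSE):
    """Return the columns left selected in the ignore list, in frame order."""
    wanted = set(keep) | {response}
    return [c for c in all_columns if c not in wanted]


# Hypothetical, shortened header; the real frame has more columns.
header = ["Year", "Month", "DayofMonth", "DayOfWeek", "DepTime",
          "UniqueCarrier", "ArrDelay", "DepDelay", "Origin", "Dest",
          "Distance", "IsArrDelayed", "IsDepDelayed"]

print(ignored_columns(header))
```

Keeping only these predictors matters because a column such as a recorded departure delay would reveal the response directly.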
Step 5: Build a 50-Tree GBM model
- Go to Model > Gradient Boosting Machine
- Input for source:
allyears2k_shuffled_part0.hex
- Select for response:
IsDepDelayed
- Select to ignore all columns (Ctrl+A) except for Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, Origin, Dest, and Distance (Ctrl)
- Click Submit to start the job
Step 6: Build a simpler 5-Tree GBM model
- Go to Model > Gradient Boosting Machine
- Input for source:
allyears2k_shuffled_part0.hex
- Select for response:
IsDepDelayed
- Select to ignore all columns (Ctrl+A) except for Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, Origin, Dest, and Distance (Ctrl)
- Input for ntrees:
5
- Click Submit to start the job
- On the model output page, click the JSON tab.
- On the model output page, click the JAVA tab.
Step 7: Deep Learning with Model Grid Search
- Go to Model > Deep Learning
- Input for source:
allyears2k_shuffled_part0.hex
- Select for response:
IsDepDelayed
- Select to ignore all columns (Ctrl+A) except for Year, Month, DayofMonth, DayOfWeek, UniqueCarrier, Origin, Dest, and Distance (Ctrl)
- Input for hidden:
(10,10), (20,20,20)
- Click Submit to start the job
The models are sorted by error rate. Scroll all the way to the right to select the first model on the list.
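To make the grid idea concrete: each parenthesized group in the hidden field describes one network (its hidden layer sizes), and one model is trained per group. The sketch below parses such a specification and ranks results by error in plain Python. Both function names are hypothetical, and the parsing is an assumption about how the field can be read, not H2O's own parser.

```python
import ast


def parse_hidden_grid(spec):
    """Turn a string like "(10,10), (20,20,20)" into a list of layer-size tuples."""
    parsed = ast.literal_eval(f"[{spec}]")
    # "(10)" evaluates to a bare int, so wrap single values into 1-tuples.
    return [layers if isinstance(layers, tuple) else (layers,)
            for layers in parsed]


def rank_by_error(errors):
    """Order grid entries from lowest to highest error.

    errors maps each layer-size tuple to an error rate supplied by the caller.
    """
    return sorted(errors, key=errors.get)


grid = parse_hidden_grid("(10,10), (20,20,20)")
for layers in grid:
    print(f"{len(layers)} hidden layers with sizes {layers}")
```

With the value from this step, the grid holds two candidate networks: one with two hidden layers of 10 units and one with three hidden layers of 20 units.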
Step 8: Multimodel Scoring Engine
- Navigate to Score > Multi model Scoring (beta)
- Select data set
allyears2k.hex
and scroll to the compatible models and select VIEW THESE MODELS...
- Select all the models on the left-hand task bar.
- Click SCORE... and select
allyears2k_shuffled_part1.hex
and click OK
The tabular view of the models allows a side-by-side comparison of all the models.
Creating Visualizations
- Navigate to the ADVANCED tab to see overlaid ROC curves
- Click ADD VISUALIZATION...
- For the X-Axis Field choose
Training Time (ms)
- For the Y-Axis Field choose
AUC
Examine the new graph you created. Weigh the gain in accuracy against the time taken to train each model. Select the model with your desired level of accuracy, then copy the model key and proceed to Step 9.
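For readers who want to see what the plotted quantities mean, here is a minimal sketch in plain Python. It computes AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score, and then applies one possible selection rule: the fastest model whose AUC is within a tolerance of the best. The function names, the dictionary keys, and the tolerance are assumptions for illustration; H2O may compute AUC differently, and the scoring page does not prescribe a selection rule.

```python
def auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pick_fastest_within(models, tolerance=0.01):
    """Return the key of the fastest model whose AUC is within tolerance of the best.

    models is a list of dicts with the hypothetical keys "key", "train_ms", "auc".
    """
    best = max(m["auc"] for m in models)
    eligible = [m for m in models if best - m["auc"] <= tolerance]
    return min(eligible, key=lambda m: m["train_ms"])["key"]


# Tiny hand-checkable example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A tighter tolerance favors accuracy, and a looser one favors shorter training time.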
Step 9: Create Frame with Predicted Values
- Navigate back to Home Page > Score > Predict
- Input for model: paste the model key you got from Step 8
- Input for data:
allyears2k_shuffled_part1.hex
- Input for prediction:
pred
Step 10: Export Predicted Values as CSV
- Inspect the prediction frame
- Select Download as CSV
or export any frame:
- Navigate to Data > Export Files
- Input for src key:
pred
- Input for path:
/data/h2o-training/airlines/pred.csv
Step 11: Save a model for use later
- Navigate to Data > View All
- Choose to filter by the model key
- Click Save Model
- Input for path:
/data/h2o-training/airlines/50TreesGBMmodel
- Click Submit
Errors? Download and send us the log files!
- Navigate to Admin > Inspect Log
- Hit Download all logs
Step 12: Shut down your H2O instance
- Go to Admin > Shutdown
Extra Bonus: Reload that saved model
In an active H2O session:
- Navigate to Load Model
- Input for path:
/data/h2o-training/airlines/50TreesGBMmodel
- Hit Submit