1. Define Spark Context
sc
<pyspark.context.SparkContext at 0x102cea1d0>
2. Start H2O Context
from pysparkling import *
sc
hc= H2OContext(sc).start()
Warning: Version mismatch. H2O is version 3.6.0.2, but the python package is version 3.7.0.99999.
H2O cluster uptime: |
2 seconds 217 milliseconds |
H2O cluster version: |
3.6.0.2 |
H2O cluster name: |
sparkling-water-nidhimehta |
H2O cluster total nodes: |
2 |
H2O cluster total memory: |
3.83 GB |
H2O cluster total cores: |
16 |
H2O cluster allowed cores: |
16 |
H2O cluster healthy: |
True |
H2O Connection ip: |
172.16.2.98 |
H2O Connection port: |
54329 |
3. Define H2O Context
hc
H2OContext: ip=172.16.2.98, port=54329
4. Import H2O Python library
import h2o
5. View all available H2O Python functions
#dir(h2o)
6. Parse Chicago Crime dataset into H2O
column_type = ['Numeric','String','String','Enum','Enum','Enum','Enum','Enum','Enum','Enum','Numeric','Numeric','Numeric','Numeric','Enum','Numeric','Numeric','Numeric','Enum','Numeric','Numeric','Enum']
f_crimes = h2o.import_file(path ="../data/chicagoCrimes10k.csv",col_types =column_type)
print(f_crimes.shape)
f_crimes.summary()
Parse Progress: [##################################################] 100%
(9999, 22)
| ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | Beat | District | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
type | int | string | string | enum | enum | enum | enum | enum | enum | enum | int | int | int | int | enum | int | int | int | enum | real | real | enum |
mins | 21735.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 111.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1100317.0 | 1814255.0 | 2015.0 | 0.0 | 41.64507243 | -87.906463888 | 0.0 |
mean | 9931318.73737 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.292829282928 | 0.152315231523 | 1159.61806181 | 11.3489885128 | 22.9540954095 | 37.4476447645 | NaN | 1163880.59815 | 1885916.14984 | 2015.0 | NaN | 41.8425652247 | -87.6741405221 | NaN |
maxs | 9962898.0 | NaN | NaN | 6517.0 | 212.0 | 26.0 | 198.0 | 90.0 | 1.0 | 1.0 | 2535.0 | 25.0 | 50.0 | 77.0 | 24.0 | 1205069.0 | 1951533.0 | 2015.0 | 32.0 | 42.022646183 | -87.524773286 | 8603.0 |
sigma | 396787.564221 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.455083515588 | 0.35934414686 | 695.76029875 | 6.94547493301 | 13.6495661144 | 21.2748762223 | NaN | 16496.4493681 | 31274.0163199 | 0.0 | NaN | 0.0860186579358 | 0.0600357970653 | NaN |
zeros | 0 | 0 | 0 | 3 | 16 | 11 | 933 | 19 | 7071 | 8476 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 603 | 0 | 0 | 1 |
missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 162 | 0 | 0 | 0 | 162 | 162 | 0 | 0 | 162 | 162 | 162 |
0 | 9955810.0 | HY144797 | 02/08/2015 11:43:40 PM | 081XX S COLES AVE | 1811 | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | STREET | true | false | 422.0 | 4.0 | 7.0 | 46.0 | 18 | 1198273.0 | 1851626.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.747693646 | -87.549035389 | (41.747693646, -87.549035389) |
1 | 9955861.0 | HY144838 | 02/08/2015 11:41:42 PM | 118XX S STATE ST | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | true | true | 522.0 | 5.0 | 34.0 | 53.0 | 08B | 1178335.0 | 1826581.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.679442289 | -87.622850758 | (41.679442289, -87.622850758) |
2 | 9955801.0 | HY144779 | 02/08/2015 11:30:22 PM | 002XX S LARAMIE AVE | 2026 | NARCOTICS | POSS: PCP | SIDEWALK | true | false | 1522.0 | 15.0 | 29.0 | 25.0 | 18 | 1141717.0 | 1898581.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.87777333 | -87.755117993 | (41.87777333, -87.755117993) |
3 | 9956197.0 | HY144787 | 02/08/2015 11:30:23 PM | 006XX E 67TH ST | 1811 | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | STREET | true | false | 321.0 | nan | 6.0 | 42.0 | 18 | nan | nan | 2015.0 | 02/15/2015 12:43:39 PM | nan | nan | |
4 | 9955846.0 | HY144829 | 02/08/2015 11:30:58 PM | 0000X S MAYFIELD AVE | 0610 | BURGLARY | FORCIBLE ENTRY | APARTMENT | false | false | 1513.0 | 15.0 | 29.0 | 25.0 | 05 | 1137239.0 | 1899372.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.880025548 | -87.771541324 | (41.880025548, -87.771541324) |
5 | 9955835.0 | HY144778 | 02/08/2015 11:30:21 PM | 010XX W 48TH ST | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | false | true | 933.0 | 9.0 | 3.0 | 61.0 | 08B | 1169986.0 | 1873019.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.807059405 | -87.65206589 | (41.807059405, -87.65206589) |
6 | 9955872.0 | HY144822 | 02/08/2015 11:27:24 PM | 015XX W ARTHUR AVE | 1320 | CRIMINAL DAMAGE | TO VEHICLE | STREET | false | false | 2432.0 | 24.0 | 40.0 | 1.0 | 14 | 1164732.0 | 1943222.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.999814056 | -87.669342967 | (41.999814056, -87.669342967) |
7 | 21752.0 | HY144738 | 02/08/2015 11:26:12 PM | 060XX W GRAND AVE | 0110 | HOMICIDE | FIRST DEGREE MURDER | STREET | true | false | 2512.0 | 25.0 | 37.0 | 19.0 | 01A | 1135910.0 | 1914206.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.920755683 | -87.776067514 | (41.920755683, -87.776067514) |
8 | 9955808.0 | HY144775 | 02/08/2015 11:20:33 PM | 001XX W WACKER DR | 0460 | BATTERY | SIMPLE | OTHER | false | false | 122.0 | 1.0 | 42.0 | 32.0 | 08B | 1175384.0 | 1902088.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.886707818 | -87.631396356 | (41.886707818, -87.631396356) |
9 | 9958275.0 | HY146732 | 02/08/2015 11:15:36 PM | 001XX W WACKER DR | 0460 | BATTERY | SIMPLE | HOTEL/MOTEL | false | false | 122.0 | 1.0 | 42.0 | 32.0 | 08B | 1175384.0 | 1902088.0 | 2015.0 | 02/15/2015 12:43:39 PM | 41.886707818 | -87.631396356 | (41.886707818, -87.631396356) |
7. Look at the distribution of the IUCR column
f_crimes["IUCR"].table()
IUCR | Count |
0110 | 16 |
0261 | 2 |
0263 | 2 |
0265 | 5 |
0266 | 2 |
0281 | 41 |
0291 | 3 |
0312 | 18 |
0313 | 20 |
031A | 136 |
8. Look at the distribution of the Arrest column
f_crimes["Arrest"].table()
Arrest | Count |
false | 7071 |
true | 2928 |
9. Modify column names to replace blank spaces with underscores
col_names = map(lambda s: s.replace(' ', '_'), f_crimes.col_names)
f_crimes.set_names(col_names)
ID | Case_Number | Date | Block | IUCR | Primary_Type | Description | Location_Description | Arrest | Domestic | Beat | District | Ward | Community_Area | FBI_Code | X_Coordinate | Y_Coordinate | Year | Updated_On | Latitude | Longitude | Location |
9.95581e+06 | HY144797 | 02/08/2015 11:43:40 PM | 081XX S COLES AVE | 1811 | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | STREET | true | false | 422 | 4 | 7 | 46 | 18 | 1.19827e+06 | 1.85163e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.7477 | -87.549 | (41.747693646, -87.549035389) |
9.95586e+06 | HY144838 | 02/08/2015 11:41:42 PM | 118XX S STATE ST | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | true | true | 522 | 5 | 34 | 53 | 08B | 1.17834e+06 | 1.82658e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.6794 | -87.6229 | (41.679442289, -87.622850758) |
9.9558e+06 | HY144779 | 02/08/2015 11:30:22 PM | 002XX S LARAMIE AVE | 2026 | NARCOTICS | POSS: PCP | SIDEWALK | true | false | 1522 | 15 | 29 | 25 | 18 | 1.14172e+06 | 1.89858e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.8778 | -87.7551 | (41.87777333, -87.755117993) |
9.9562e+06 | HY144787 | 02/08/2015 11:30:23 PM | 006XX E 67TH ST | 1811 | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | STREET | true | false | 321 | nan | 6 | 42 | 18 | nan | nan | 2015 | 02/15/2015 12:43:39 PM | nan | nan | |
9.95585e+06 | HY144829 | 02/08/2015 11:30:58 PM | 0000X S MAYFIELD AVE | 0610 | BURGLARY | FORCIBLE ENTRY | APARTMENT | false | false | 1513 | 15 | 29 | 25 | 05 | 1.13724e+06 | 1.89937e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.88 | -87.7715 | (41.880025548, -87.771541324) |
9.95584e+06 | HY144778 | 02/08/2015 11:30:21 PM | 010XX W 48TH ST | 0486 | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | false | true | 933 | 9 | 3 | 61 | 08B | 1.16999e+06 | 1.87302e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.8071 | -87.6521 | (41.807059405, -87.65206589) |
9.95587e+06 | HY144822 | 02/08/2015 11:27:24 PM | 015XX W ARTHUR AVE | 1320 | CRIMINAL DAMAGE | TO VEHICLE | STREET | false | false | 2432 | 24 | 40 | 1 | 14 | 1.16473e+06 | 1.94322e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.9998 | -87.6693 | (41.999814056, -87.669342967) |
21752 | HY144738 | 02/08/2015 11:26:12 PM | 060XX W GRAND AVE | 0110 | HOMICIDE | FIRST DEGREE MURDER | STREET | true | false | 2512 | 25 | 37 | 19 | 01A | 1.13591e+06 | 1.91421e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.9208 | -87.7761 | (41.920755683, -87.776067514) |
9.95581e+06 | HY144775 | 02/08/2015 11:20:33 PM | 001XX W WACKER DR | 0460 | BATTERY | SIMPLE | OTHER | false | false | 122 | 1 | 42 | 32 | 08B | 1.17538e+06 | 1.90209e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.8867 | -87.6314 | (41.886707818, -87.631396356) |
9.95828e+06 | HY146732 | 02/08/2015 11:15:36 PM | 001XX W WACKER DR | 0460 | BATTERY | SIMPLE | HOTEL/MOTEL | false | false | 122 | 1 | 42 | 32 | 08B | 1.17538e+06 | 1.90209e+06 | 2015 | 02/15/2015 12:43:39 PM | 41.8867 | -87.6314 | (41.886707818, -87.631396356) |
10. Set time zone to UTC for date manipulation
h2o.set_timezone("Etc/UTC")
11. Refine the date column
def refine_date_col(data, col, pattern):
data[col] = data[col].as_date(pattern)
data["Day"] = data[col].day()
data["Month"] = data[col].month() # Since H2O indexes from 0
data["Year"] = data[col].year()
data["WeekNum"] = data[col].week()
data["WeekDay"] = data[col].dayOfWeek()
data["HourOfDay"] = data[col].hour()
# Create weekend and season cols
data["Weekend"] = (data["WeekDay"] == "Sun" or data["WeekDay"] == "Sat").ifelse(1, 0)[0]
data["Season"] = data["Month"].cut([0, 2, 5, 7, 10, 12], ["Winter", "Spring", "Summer", "Autumn", "Winter"])
refine_date_col(f_crimes, "Date", "%m/%d/%Y %I:%M:%S %p")
f_crimes = f_crimes.drop("Date")
12. Parse Census data into H2O
f_census = h2o.import_file("../data/chicagoCensus.csv",header=1)
## Update column names in the table
col_names = map(lambda s: s.strip().replace(' ', '_'), f_census.col_names)
f_census.set_names(col_names)
f_census = f_census[1:78,:]
print(f_census.dim)
#f_census.summary()
Parse Progress: [##################################################] 100%
[77, 9]
13. Parse Weather data into H2O
f_weather = h2o.import_file("../data/chicagoAllWeather.csv")
f_weather = f_weather[1:]
print(f_weather.dim)
#f_weather.summary()
Parse Progress: [##################################################] 100%
[5162, 6]
14. Look at all the null entires in the Weather table
f_weather[f_weather["meanTemp"].isna()]
month | day | year | maxTemp | meanTemp | minTemp |
6 | 19 | 2008 | nan | nan | nan |
9 | 23 | 2008 | nan | nan | nan |
9 | 24 | 2008 | nan | nan | nan |
9 | 25 | 2008 | nan | nan | nan |
9 | 26 | 2008 | nan | nan | nan |
9 | 27 | 2008 | nan | nan | nan |
9 | 28 | 2008 | nan | nan | nan |
9 | 29 | 2008 | nan | nan | nan |
9 | 30 | 2008 | nan | nan | nan |
3 | 4 | 2009 | nan | nan | nan |
15. Look at the help on as_h2o_frame
hc.as_spark_frame?
f_weather
H2OContext: ip=172.16.2.98, port=54329
month | day | year | maxTemp | meanTemp | minTemp |
1 | 1 | 2001 | 23 | 14 | 6 |
1 | 2 | 2001 | 18 | 12 | 6 |
1 | 3 | 2001 | 28 | 18 | 8 |
1 | 4 | 2001 | 30 | 24 | 19 |
1 | 5 | 2001 | 36 | 30 | 21 |
1 | 6 | 2001 | 33 | 26 | 19 |
1 | 7 | 2001 | 34 | 28 | 21 |
1 | 8 | 2001 | 26 | 20 | 14 |
1 | 9 | 2001 | 23 | 16 | 10 |
1 | 10 | 2001 | 34 | 26 | 19 |
16. Copy data frames to Spark from H2O
df_weather = hc.as_spark_frame(f_weather,)
df_census = hc.as_spark_frame(f_census)
df_crimes = hc.as_spark_frame(f_crimes)
17. Look at the weather data as parsed in Spark
(only showing top 2 rows)
df_weather.show(2)
+-----+---+----+-------+--------+-------+
|month|day|year|maxTemp|meanTemp|minTemp|
+-----+---+----+-------+--------+-------+
| 1| 1|2001| 23| 14| 6|
| 1| 2|2001| 18| 12| 6|
+-----+---+----+-------+--------+-------+
18. Join columns from Crime, Census and Weather DataFrames in Spark
## Register DataFrames as tables in SQL context
sqlContext.registerDataFrameAsTable(df_weather, "chicagoWeather")
sqlContext.registerDataFrameAsTable(df_census, "chicagoCensus")
sqlContext.registerDataFrameAsTable(df_crimes, "chicagoCrime")
crimeWithWeather = sqlContext.sql("""SELECT
a.Year, a.Month, a.Day, a.WeekNum, a.HourOfDay, a.Weekend, a.Season, a.WeekDay,
a.IUCR, a.Primary_Type, a.Location_Description, a.Community_Area, a.District,
a.Arrest, a.Domestic, a.Beat, a.Ward, a.FBI_Code,
b.minTemp, b.maxTemp, b.meanTemp,
c.PERCENT_AGED_UNDER_18_OR_OVER_64, c.PER_CAPITA_INCOME, c.HARDSHIP_INDEX,
c.PERCENT_OF_HOUSING_CROWDED, c.PERCENT_HOUSEHOLDS_BELOW_POVERTY,
c.PERCENT_AGED_16__UNEMPLOYED, c.PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA
FROM chicagoCrime a
JOIN chicagoWeather b
ON a.Year = b.year AND a.Month = b.month AND a.Day = b.day
JOIN chicagoCensus c
ON a.Community_Area = c.Community_Area_Number""")
19. Print the crimeWithWeather
data table from Spark
crimeWithWeather.show(2)
+----+-----+---+-------+---------+-------+------+-------+----+-----------------+--------------------+--------------+--------+------+--------+----+----+--------+-------+-------+--------+--------------------------------+-----------------+--------------+--------------------------+--------------------------------+---------------------------+--------------------------------------------+
|Year|Month|Day|WeekNum|HourOfDay|Weekend|Season|WeekDay|IUCR| Primary_Type|Location_Description|Community_Area|District|Arrest|Domestic|Beat|Ward|FBI_Code|minTemp|maxTemp|meanTemp|PERCENT_AGED_UNDER_18_OR_OVER_64|PER_CAPITA_INCOME|HARDSHIP_INDEX|PERCENT_OF_HOUSING_CROWDED|PERCENT_HOUSEHOLDS_BELOW_POVERTY|PERCENT_AGED_16__UNEMPLOYED|PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA|
+----+-----+---+-------+---------+-------+------+-------+----+-----------------+--------------------+--------------+--------+------+--------+----+----+--------+-------+-------+--------+--------------------------------+-----------------+--------------+--------------------------+--------------------------------+---------------------------+--------------------------------------------+
|2015| 1| 23| 4| 22| 0|Winter| Fri|143A|WEAPONS VIOLATION| ALLEY| 31| 12| true| false|1234| 25| 15| 29| 31| 30| 32.6| 16444| 76| 9.600000000000001| 25.8| 15.8| 40.7|
|2015| 1| 23| 4| 19| 0|Winter| Fri|4625| OTHER OFFENSE| SIDEWALK| 31| 10| true| false|1034| 25| 26| 29| 31| 30| 32.6| 16444| 76| 9.600000000000001| 25.8| 15.8| 40.7|
+----+-----+---+-------+---------+-------+------+-------+----+-----------------+--------------------+--------------+--------+------+--------+----+----+--------+-------+-------+--------+--------------------------------+-----------------+--------------+--------------------------+--------------------------------+---------------------------+--------------------------------------------+
only showing top 2 rows
20. Copy table from Spark to H2O
hc.as_h2o_frame?
crimeWithWeatherHF = hc.as_h2o_frame(crimeWithWeather,framename="crimeWithWeather")
H2OContext: ip=172.16.2.98, port=54329
crimeWithWeatherHF.summary()
| Year | Month | Day | WeekNum | HourOfDay | Weekend | Season | WeekDay | IUCR | Primary_Type | Location_Description | Community_Area | District | Arrest | Domestic | Beat | Ward | FBI_Code | minTemp | maxTemp | meanTemp | PERCENT_AGED_UNDER_18_OR_OVER_64 | PER_CAPITA_INCOME | HARDSHIP_INDEX | PERCENT_OF_HOUSING_CROWDED | PERCENT_HOUSEHOLDS_BELOW_POVERTY | PERCENT_AGED_16__UNEMPLOYED | PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA |
type | int | int | int | int | int | int | string | string | string | string | string | int | int | string | string | int | int | string | int | int | int | real | int | int | real | real | real | real |
mins | 2015.0 | 1.0 | 1.0 | 4.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | 111.0 | 1.0 | NaN | -2.0 | 15.0 | 7.0 | 13.5 | 8201.0 | 1.0 | 0.3 | 3.3 | 4.7 | 2.5 |
mean | 2015.0 | 1.41944194419 | 17.6839683968 | 5.18081808181 | 13.6319631963 | 0.159115911591 | NaN | NaN | NaN | NaN | NaN | 37.4476447645 | 11.3489885128 | NaN | NaN | 1159.61806181 | 22.9540954095 | NaN | 17.699669967 | 31.7199719972 | 24.9408940894 | 35.0596759676 | 25221.3057306 | 54.4786478648 | 5.43707370737 | 24.600750075 | 16.8288328833 | 21.096639664 |
maxs | 2015.0 | 2.0 | 31.0 | 6.0 | 23.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | 77.0 | 25.0 | NaN | NaN | 2535.0 | 50.0 | NaN | 29.0 | 43.0 | 36.0 | 51.5 | 88669.0 | 98.0 | 15.8 | 56.5 | 35.9 | 54.8 |
sigma | 0.0 | 0.493492406787 | 11.1801043358 | 0.738929830409 | 6.47321735807 | 0.365802434041 | NaN | NaN | NaN | NaN | NaN | 21.2748762223 | 6.94547493301 | NaN | NaN | 695.76029875 | 13.6495661144 | NaN | 8.96118136438 | 6.93809913472 | 7.46302527062 | 7.95653388237 | 18010.0446225 | 29.3247456472 | 3.75289588494 | 10.1450570661 | 7.58926327988 | 11.3868817911 |
zeros | 0 | 0 | 0 | 0 | 374 | 8408 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 162 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 2015.0 | 1.0 | 24.0 | 4.0 | 22.0 | 0.0 | Winter | Sat | 2820 | OTHER OFFENSE | APARTMENT | 31.0 | 10.0 | false | false | 1034.0 | 25.0 | 26 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
1 | 2015.0 | 1.0 | 24.0 | 4.0 | 21.0 | 0.0 | Winter | Sat | 1310 | CRIMINAL DAMAGE | RESTAURANT | 31.0 | 12.0 | true | false | 1233.0 | 25.0 | 14 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
2 | 2015.0 | 1.0 | 24.0 | 4.0 | 18.0 | 0.0 | Winter | Sat | 1750 | OFFENSE INVOLVING CHILDREN | RESIDENCE | 31.0 | 12.0 | false | true | 1235.0 | 25.0 | 20 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
3 | 2015.0 | 1.0 | 24.0 | 4.0 | 18.0 | 0.0 | Winter | Sat | 0460 | BATTERY | OTHER | 31.0 | 10.0 | false | false | 1023.0 | 25.0 | 08B | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
4 | 2015.0 | 1.0 | 24.0 | 4.0 | 13.0 | 0.0 | Winter | Sat | 0890 | THEFT | CURRENCY EXCHANGE | 31.0 | 10.0 | false | false | 1023.0 | 25.0 | 06 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
5 | 2015.0 | 1.0 | 24.0 | 4.0 | 9.0 | 0.0 | Winter | Sat | 0560 | ASSAULT | OTHER | 31.0 | 12.0 | false | false | 1234.0 | 25.0 | 08A | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
6 | 2015.0 | 1.0 | 24.0 | 4.0 | 8.0 | 0.0 | Winter | Sat | 0486 | BATTERY | RESIDENCE | 31.0 | 12.0 | true | true | 1235.0 | 25.0 | 08B | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
7 | 2015.0 | 1.0 | 24.0 | 4.0 | 1.0 | 0.0 | Winter | Sat | 0420 | BATTERY | SIDEWALK | 31.0 | 10.0 | false | false | 1034.0 | 25.0 | 04B | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
8 | 2015.0 | 1.0 | 24.0 | 4.0 | 0.0 | 0.0 | Winter | Sat | 1320 | CRIMINAL DAMAGE | PARKING LOT/GARAGE(NON.RESID.) | 31.0 | 9.0 | false | false | 912.0 | 11.0 | 14 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
9 | 2015.0 | 1.0 | 31.0 | 5.0 | 23.0 | 0.0 | Winter | Sat | 0820 | THEFT | SIDEWALK | 31.0 | 12.0 | false | false | 1234.0 | 25.0 | 06 | 19.0 | 36.0 | 28.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
21. Assign column types to the CrimeWeatherHF
data table in H2O
crimeWithWeatherHF["Season"]= crimeWithWeatherHF["Season"].asfactor()
crimeWithWeatherHF["WeekDay"]= crimeWithWeatherHF["WeekDay"].asfactor()
crimeWithWeatherHF["IUCR"]= crimeWithWeatherHF["IUCR"].asfactor()
crimeWithWeatherHF["Primary_Type"]= crimeWithWeatherHF["Primary_Type"].asfactor()
crimeWithWeatherHF["Location_Description"]= crimeWithWeatherHF["Location_Description"].asfactor()
crimeWithWeatherHF["Arrest"]= crimeWithWeatherHF["Arrest"].asfactor()
crimeWithWeatherHF["Domestic"]= crimeWithWeatherHF["Domestic"].asfactor()
crimeWithWeatherHF["FBI_Code"]= crimeWithWeatherHF["FBI_Code"].asfactor()
crimeWithWeatherHF["Season"]= crimeWithWeatherHF["Season"].asfactor()
crimeWithWeatherHF.summary()
| Year | Month | Day | WeekNum | HourOfDay | Weekend | Season | WeekDay | IUCR | Primary_Type | Location_Description | Community_Area | District | Arrest | Domestic | Beat | Ward | FBI_Code | minTemp | maxTemp | meanTemp | PERCENT_AGED_UNDER_18_OR_OVER_64 | PER_CAPITA_INCOME | HARDSHIP_INDEX | PERCENT_OF_HOUSING_CROWDED | PERCENT_HOUSEHOLDS_BELOW_POVERTY | PERCENT_AGED_16__UNEMPLOYED | PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA |
type | int | int | int | int | int | int | enum | enum | enum | enum | enum | int | int | enum | enum | int | int | enum | int | int | int | real | int | int | real | real | real | real |
mins | 2015.0 | 1.0 | 1.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 111.0 | 1.0 | 0.0 | -2.0 | 15.0 | 7.0 | 13.5 | 8201.0 | 1.0 | 0.3 | 3.3 | 4.7 | 2.5 |
mean | 2015.0 | 1.41944194419 | 17.6839683968 | 5.18081808181 | 13.6319631963 | 0.159115911591 | 0.0 | NaN | NaN | NaN | NaN | 37.4476447645 | 11.3489885128 | 0.292829282928 | 0.152315231523 | 1159.61806181 | 22.9540954095 | NaN | 17.699669967 | 31.7199719972 | 24.9408940894 | 35.0596759676 | 25221.3057306 | 54.4786478648 | 5.43707370737 | 24.600750075 | 16.8288328833 | 21.096639664 |
maxs | 2015.0 | 2.0 | 31.0 | 6.0 | 23.0 | 1.0 | 0.0 | 6.0 | 212.0 | 26.0 | 90.0 | 77.0 | 25.0 | 1.0 | 1.0 | 2535.0 | 50.0 | 24.0 | 29.0 | 43.0 | 36.0 | 51.5 | 88669.0 | 98.0 | 15.8 | 56.5 | 35.9 | 54.8 |
sigma | 0.0 | 0.493492406787 | 11.1801043358 | 0.738929830409 | 6.47321735807 | 0.365802434041 | 0.0 | NaN | NaN | NaN | NaN | 21.2748762223 | 6.94547493301 | 0.455083515588 | 0.35934414686 | 695.76029875 | 13.6495661144 | NaN | 8.96118136438 | 6.93809913472 | 7.46302527062 | 7.95653388237 | 18010.0446225 | 29.3247456472 | 3.75289588494 | 10.1450570661 | 7.58926327988 | 11.3868817911 |
zeros | 0 | 0 | 0 | 0 | 374 | 8408 | 9999 | 1942 | 16 | 11 | 19 | 0 | 0 | 7071 | 8476 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 162 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 2015.0 | 1.0 | 24.0 | 4.0 | 22.0 | 0.0 | Winter | Sat | 2820 | OTHER OFFENSE | APARTMENT | 31.0 | 10.0 | false | false | 1034.0 | 25.0 | 26 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
1 | 2015.0 | 1.0 | 24.0 | 4.0 | 21.0 | 0.0 | Winter | Sat | 1310 | CRIMINAL DAMAGE | RESTAURANT | 31.0 | 12.0 | true | false | 1233.0 | 25.0 | 14 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
2 | 2015.0 | 1.0 | 24.0 | 4.0 | 18.0 | 0.0 | Winter | Sat | 1750 | OFFENSE INVOLVING CHILDREN | RESIDENCE | 31.0 | 12.0 | false | true | 1235.0 | 25.0 | 20 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
3 | 2015.0 | 1.0 | 24.0 | 4.0 | 18.0 | 0.0 | Winter | Sat | 0460 | BATTERY | OTHER | 31.0 | 10.0 | false | false | 1023.0 | 25.0 | 08B | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
4 | 2015.0 | 1.0 | 24.0 | 4.0 | 13.0 | 0.0 | Winter | Sat | 0890 | THEFT | CURRENCY EXCHANGE | 31.0 | 10.0 | false | false | 1023.0 | 25.0 | 06 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
5 | 2015.0 | 1.0 | 24.0 | 4.0 | 9.0 | 0.0 | Winter | Sat | 0560 | ASSAULT | OTHER | 31.0 | 12.0 | false | false | 1234.0 | 25.0 | 08A | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
6 | 2015.0 | 1.0 | 24.0 | 4.0 | 8.0 | 0.0 | Winter | Sat | 0486 | BATTERY | RESIDENCE | 31.0 | 12.0 | true | true | 1235.0 | 25.0 | 08B | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
7 | 2015.0 | 1.0 | 24.0 | 4.0 | 1.0 | 0.0 | Winter | Sat | 0420 | BATTERY | SIDEWALK | 31.0 | 10.0 | false | false | 1034.0 | 25.0 | 04B | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
8 | 2015.0 | 1.0 | 24.0 | 4.0 | 0.0 | 0.0 | Winter | Sat | 1320 | CRIMINAL DAMAGE | PARKING LOT/GARAGE(NON.RESID.) | 31.0 | 9.0 | false | false | 912.0 | 11.0 | 14 | 29.0 | 43.0 | 36.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
9 | 2015.0 | 1.0 | 31.0 | 5.0 | 23.0 | 0.0 | Winter | Sat | 0820 | THEFT | SIDEWALK | 31.0 | 12.0 | false | false | 1234.0 | 25.0 | 06 | 19.0 | 36.0 | 28.0 | 32.6 | 16444.0 | 76.0 | 9.6 | 25.8 | 15.8 | 40.7 |
22. Split final H2O data table into train test and validation sets
ratios = [0.6,0.2]
frs = crimeWithWeatherHF.split_frame(ratios,seed=12345)
train = frs[0]
train.frame_id = "Train"
valid = frs[2]
valid.frame_id = "Validation"
test = frs[1]
test.frame_id = "Test"
23. Import Model Builders from H2O Python
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
24. Inspect the availble GBM parameters
H2OGradientBoostingEstimator?
25. Define Predictors
predictors = crimeWithWeatherHF.names[:]
response = "Arrest"
predictors.remove(response)
26. Create a Simple GBM model to Predict Arrests
model_gbm = H2OGradientBoostingEstimator(ntrees =50,
max_depth =6,
learn_rate =0.1,
#nfolds =2,
distribution ="bernoulli")
model_gbm.train(x =predictors,
y ="Arrest",
training_frame =train,
validation_frame=valid
)
27. Create a Simple Deep Learning model to Predict Arrests
model_dl = H2ODeepLearningEstimator(variable_importances=True,
loss ="Automatic")
model_dl.train(x =predictors,
y ="Arrest",
training_frame =train,
validation_frame=valid)
gbm Model Build Progress: [##################################################] 100%
deeplearning Model Build Progress: [##################################################] 100%
28. Print confusion matrices for the training and validation datasets
print(model_gbm.confusion_matrix(train = True))
print(model_gbm.confusion_matrix(valid = True))
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.335827722991:
|
false |
true |
Error |
Rate |
false |
4125.0 |
142.0 |
0.0333 |
(142.0/4267.0) |
true |
251.0 |
1504.0 |
0.143 |
(251.0/1755.0) |
Total |
4376.0 |
1646.0 |
0.0653 |
(393.0/6022.0) |
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.432844055866:
|
false |
true |
Error |
Rate |
false |
1362.0 |
61.0 |
0.0429 |
(61.0/1423.0) |
true |
150.0 |
443.0 |
0.253 |
(150.0/593.0) |
Total |
1512.0 |
504.0 |
0.1047 |
(211.0/2016.0) |
print(model_gbm.auc(train=True))
print(model_gbm.auc(valid=True))
model_gbm.plot(metric="AUC")
0.974667176776
0.92596751276
29. Print variable importances
model_gbm.varimp(True)
|
variable |
relative_importance |
scaled_importance |
percentage |
0 |
IUCR |
4280.939453 |
1.000000e+00 |
8.234218e-01 |
1 |
Location_Description |
487.323059 |
1.138355e-01 |
9.373466e-02 |
2 |
WeekDay |
55.790558 |
1.303232e-02 |
1.073109e-02 |
3 |
HourOfDay |
55.419220 |
1.294557e-02 |
1.065967e-02 |
4 |
PERCENT_AGED_16__UNEMPLOYED |
34.422894 |
8.040967e-03 |
6.621107e-03 |
5 |
Beat |
31.468222 |
7.350775e-03 |
6.052788e-03 |
6 |
PERCENT_HOUSEHOLDS_BELOW_POVERTY |
29.103352 |
6.798356e-03 |
5.597915e-03 |
7 |
PER_CAPITA_INCOME |
26.233143 |
6.127894e-03 |
5.045841e-03 |
8 |
PERCENT_AGED_UNDER_18_OR_OVER_64 |
24.077402 |
5.624327e-03 |
4.631193e-03 |
9 |
Day |
23.472567 |
5.483041e-03 |
4.514855e-03 |
... |
... |
... |
... |
... |
15 |
maxTemp |
11.300793 |
2.639793e-03 |
2.173663e-03 |
16 |
Community_Area |
10.252146 |
2.394835e-03 |
1.971960e-03 |
17 |
HARDSHIP_INDEX |
10.116072 |
2.363049e-03 |
1.945786e-03 |
18 |
Domestic |
9.294327 |
2.171095e-03 |
1.787727e-03 |
19 |
District |
8.304654 |
1.939914e-03 |
1.597367e-03 |
20 |
minTemp |
6.243027 |
1.458331e-03 |
1.200822e-03 |
21 |
WeekNum |
4.230102 |
9.881246e-04 |
8.136433e-04 |
22 |
FBI_Code |
2.363182 |
5.520241e-04 |
4.545486e-04 |
23 |
Month |
0.000018 |
4.187325e-09 |
3.447935e-09 |
24 |
Weekend |
0.000000 |
0.000000e+00 |
0.000000e+00 |
25 rows × 4 columns
30. Inspect Deep Learning model output
model_dl
Model Details
=============
H2ODeepLearningEstimator : Deep Learning
Model Key: DeepLearning_model_python_1446861372065_4
Status of Neuron Layers: predicting Arrest, 2-class classification, bernoulli distribution, CrossEntropy loss, 118,802 weights/biases, 1.4 MB, 72,478 training samples, mini-batch size 1
|
layer |
units |
type |
dropout |
l1 |
l2 |
mean_rate |
rate_RMS |
momentum |
mean_weight |
weight_RMS |
mean_bias |
bias_RMS |
|
1 |
390 |
Input |
0.0 |
|
|
|
|
|
|
|
|
|
|
2 |
200 |
Rectifier |
0.0 |
0.0 |
0.0 |
0.1 |
0.3 |
0.0 |
-0.0 |
0.1 |
-0.0 |
0.1 |
|
3 |
200 |
Rectifier |
0.0 |
0.0 |
0.0 |
0.1 |
0.2 |
0.0 |
-0.0 |
0.1 |
0.8 |
0.2 |
|
4 |
2 |
Softmax |
|
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
0.4 |
-0.0 |
0.0 |
ModelMetricsBinomial: deeplearning
Reported on train data.
MSE: 0.0737426129728
R^2: 0.642891439669
LogLoss: 0.242051500943
AUC: 0.950131166302
Gini: 0.900262332604
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.343997370612:
|
false |
true |
Error |
Rate |
false |
4003.0 |
264.0 |
0.0619 |
(264.0/4267.0) |
true |
358.0 |
1397.0 |
0.204 |
(358.0/1755.0) |
Total |
4361.0 |
1661.0 |
0.1033 |
(622.0/6022.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric |
threshold |
value |
idx |
max f1 |
0.3 |
0.8 |
195.0 |
max f2 |
0.2 |
0.9 |
278.0 |
max f0point5 |
0.7 |
0.9 |
86.0 |
max accuracy |
0.5 |
0.9 |
149.0 |
max precision |
1.0 |
1.0 |
0.0 |
max absolute_MCC |
0.3 |
0.7 |
195.0 |
max min_per_class_accuracy |
0.2 |
0.9 |
247.0 |
ModelMetricsBinomial: deeplearning
** Reported on validation data. **
MSE: 0.0843305429737
R^2: 0.593831388139
LogLoss: 0.280203809486
AUC: 0.930515181213
Gini: 0.861030362427
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.493462351545:
|
false |
true |
Error |
Rate |
false |
1361.0 |
62.0 |
0.0436 |
(62.0/1423.0) |
true |
158.0 |
435.0 |
0.2664 |
(158.0/593.0) |
Total |
1519.0 |
497.0 |
0.1091 |
(220.0/2016.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric |
threshold |
value |
idx |
max f1 |
0.5 |
0.8 |
137.0 |
max f2 |
0.1 |
0.8 |
303.0 |
max f0point5 |
0.7 |
0.9 |
82.0 |
max accuracy |
0.7 |
0.9 |
91.0 |
max precision |
1.0 |
1.0 |
0.0 |
max absolute_MCC |
0.7 |
0.7 |
91.0 |
max min_per_class_accuracy |
0.2 |
0.8 |
236.0 |
Scoring History:
|
timestamp |
duration |
training_speed |
epochs |
samples |
training_MSE |
training_r2 |
training_logloss |
training_AUC |
training_classification_error |
validation_MSE |
validation_r2 |
validation_logloss |
validation_AUC |
validation_classification_error |
|
2015-11-06 17:57:05 |
0.000 sec |
None |
0.0 |
0.0 |
nan |
nan |
nan |
nan |
nan |
nan |
nan |
nan |
nan |
nan |
|
2015-11-06 17:57:09 |
2.899 sec |
2594 rows/sec |
1.0 |
6068.0 |
0.1 |
0.3 |
0.6 |
0.9 |
0.1 |
0.1 |
0.3 |
0.6 |
0.9 |
0.1 |
|
2015-11-06 17:57:15 |
9.096 sec |
5465 rows/sec |
7.3 |
43742.0 |
0.1 |
0.6 |
0.3 |
0.9 |
0.1 |
0.1 |
0.6 |
0.3 |
0.9 |
0.1 |
|
2015-11-06 17:57:19 |
12.425 sec |
6571 rows/sec |
12.0 |
72478.0 |
0.1 |
0.6 |
0.2 |
1.0 |
0.1 |
0.1 |
0.6 |
0.3 |
0.9 |
0.1 |
Variable Importances:
variable |
relative_importance |
scaled_importance |
percentage |
Domestic.false |
1.0 |
1.0 |
0.0 |
Primary_Type.NARCOTICS |
0.9 |
0.9 |
0.0 |
IUCR.0860 |
0.8 |
0.8 |
0.0 |
FBI_Code.18 |
0.8 |
0.8 |
0.0 |
IUCR.4625 |
0.7 |
0.7 |
0.0 |
--- |
--- |
--- |
--- |
Location_Description.missing(NA) |
0.0 |
0.0 |
0.0 |
Primary_Type.missing(NA) |
0.0 |
0.0 |
0.0 |
FBI_Code.missing(NA) |
0.0 |
0.0 |
0.0 |
WeekDay.missing(NA) |
0.0 |
0.0 |
0.0 |
Domestic.missing(NA) |
0.0 |
0.0 |
0.0 |
31. Predict on the test set using the GBM model
predictions = model_gbm.predict(test)
predictions.show()
predict | false | true |
false | 0.946415 | 0.0535847 |
false | 0.862165 | 0.137835 |
false | 0.938661 | 0.0613392 |
false | 0.870186 | 0.129814 |
false | 0.980488 | 0.0195118 |
false | 0.972006 | 0.0279937 |
false | 0.990995 | 0.00900489 |
true | 0.0210692 | 0.978931 |
false | 0.693061 | 0.306939 |
false | 0.992097 | 0.00790253 |
test_performance = model_gbm.model_performance(test)
test_performance
ModelMetricsBinomial: gbm
** Reported on test data. **
MSE: 0.0893676876445
R^2: 0.57094394422
LogLoss: 0.294019576922
AUC: 0.922152238508
Gini: 0.844304477016
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.365461652105:
|
false |
true |
Error |
Rate |
false |
1297.0 |
84.0 |
0.0608 |
(84.0/1381.0) |
true |
153.0 |
427.0 |
0.2638 |
(153.0/580.0) |
Total |
1450.0 |
511.0 |
0.1209 |
(237.0/1961.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric |
threshold |
value |
idx |
max f1 |
0.4 |
0.8 |
158.0 |
max f2 |
0.1 |
0.8 |
295.0 |
max f0point5 |
0.7 |
0.9 |
97.0 |
max accuracy |
0.6 |
0.9 |
112.0 |
max precision |
1.0 |
1.0 |
0.0 |
max absolute_MCC |
0.6 |
0.7 |
112.0 |
max min_per_class_accuracy |
0.2 |
0.8 |
235.0 |
33. Create Plots of Crime type vs Arrest Rate and Proportion of reported Crime
# Create table to report Crimetype, Arrest count per crime, total reported count per Crime
sqlContext.registerDataFrameAsTable(df_crimes, "df_crimes")
allCrimes = sqlContext.sql("""SELECT Primary_Type, count(*) as all_count FROM df_crimes GROUP BY Primary_Type""")
crimesWithArrest = sqlContext.sql("SELECT Primary_Type, count(*) as crime_count FROM chicagoCrime WHERE Arrest = 'true' GROUP BY Primary_Type")
sqlContext.registerDataFrameAsTable(crimesWithArrest, "crimesWithArrest")
sqlContext.registerDataFrameAsTable(allCrimes, "allCrimes")
crime_type = sqlContext.sql("Select a.Primary_Type as Crime_Type, a.crime_count, b.all_count \
FROM crimesWithArrest a \
JOIN allCrimes b \
ON a.Primary_Type = b.Primary_Type ")
crime_type.show(12)
+--------------------+-----------+---------+
| Crime_Type|crime_count|all_count|
+--------------------+-----------+---------+
| OTHER OFFENSE| 183| 720|
| WEAPONS VIOLATION| 96| 118|
| DECEPTIVE PRACTICE| 25| 445|
| BURGLARY| 14| 458|
| BATTERY| 432| 1851|
| ROBBERY| 17| 357|
| MOTOR VEHICLE THEFT| 17| 414|
| PROSTITUTION| 106| 106|
| CRIMINAL DAMAGE| 76| 1003|
| KIDNAPPING| 1| 7|
| GAMBLING| 3| 3|
|LIQUOR LAW VIOLATION| 12| 12|
+--------------------+-----------+---------+
only showing top 12 rows
34. Copy Crime_type table from Spark to H2O
crime_typeHF = hc.as_h2o_frame(crime_type,framename="crime_type")
35. Create Additional columns Arrest_rate and Crime_propotion
crime_typeHF["Arrest_rate"] = crime_typeHF["crime_count"]/crime_typeHF["all_count"]
crime_typeHF["Crime_proportion"] = crime_typeHF["all_count"]/crime_typeHF["all_count"].sum()
crime_typeHF["Crime_Type"] = crime_typeHF["Crime_Type"].asfactor()
# h2o.assign(crime_typeHF,crime_type)
crime_typeHF.frame_id = "Crime_type"
crime_typeHF
Crime_Type | crime_count | all_count | Arrest_rate | Crime_proportion |
OTHER OFFENSE | 183 | 720 | 0.254167 | 0.0721226 |
WEAPONS VIOLATION | 96 | 118 | 0.813559 | 0.0118201 |
DECEPTIVE PRACTICE | 25 | 445 | 0.0561798 | 0.0445758 |
BURGLARY | 14 | 458 | 0.0305677 | 0.045878 |
BATTERY | 432 | 1851 | 0.233387 | 0.185415 |
ROBBERY | 17 | 357 | 0.047619 | 0.0357608 |
MOTOR VEHICLE THEFT | 17 | 414 | 0.0410628 | 0.0414705 |
PROSTITUTION | 106 | 106 | 1 | 0.0106181 |
CRIMINAL DAMAGE | 76 | 1003 | 0.0757727 | 0.100471 |
KIDNAPPING | 1 | 7 | 0.142857 | 0.000701192 |
hc
H2OContext: ip=172.16.2.98, port=54329
36. Plot in Flow
plot (g) -> g(
g.rect(
g.position "Crime_Type", "Arrest_rate"
g.fillColor g.value 'blue'
g.fillOpacity g.value 0.75
)
g.rect(
g.position "Crime_Type", "Crime_proportion"
g.fillColor g.value 'red'
g.fillOpacity g.value 0.65
)
g.from inspect "data", getFrame "Crime_type"
)
#hc.stop()