In this notebook you can reproduce the work done for the LDP project by the group: Zubair Hussain, Lance Villareul, Jack Manjimela and Tafadzwa Goremusandu. Please run all the code snippets in the order they appear in the notebook, and make sure the required packages are installed, or you will encounter errors.
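
As a quick check before running anything else, the sketch below lists which of the packages loaded later in this notebook are not yet installed. The package names are taken from the library() calls further down; the variable names required and missing_pkgs are illustrative only.

```r
# Packages loaded later in this notebook via library().
required <- c("caTools", "randomForest", "caret", "rpart", "e1071")

# installed.packages() lists what is available in the current library paths.
missing_pkgs <- setdiff(required, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment the next line to install them:
  # install.packages(missing_pkgs)
} else {
  message("All required packages are installed.")
}
```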

The following code imports the finalised dataset for this project. Please make sure you select the finalised dataset when running the code, or there will be issues further on. The finalised dataset is the one that was randomly sampled down to 50,000 records.

data <- read.csv(file.choose())
str(data)
'data.frame':   50000 obs. of  67 variables:
 $ Accident_Index                             : Factor w/ 30228 levels "2.01E+100","2.01E+101",..: 5 29409 13347 10208 5 5 5 4137 1419 24645 ...
 $ Location_Easting_OSGR                      : int  576040 446860 518200 543500 459450 463210 455200 529180 479760 328910 ...
 $ Location_Northing_OSGR                     : int  167480 357343 183200 185360 450260 316530 300490 191900 407890 325920 ...
 $ Longitude                                  : num  0.5282 -1.3014 -0.2972 0.0682 -1.0957 ...
 $ Latitude                                   : num  51.4 53.1 51.5 51.5 53.9 ...
 $ Police_Force                               : int  46 31 1 1 12 33 33 1 16 22 ...
 $ Accident_Severity                          : int  0 0 0 0 0 0 0 0 1 0 ...
 $ Number_of_Vehicles                         : int  2 2 3 3 2 2 2 2 3 3 ...
 $ Number_of_Casualties                       : int  1 2 2 1 1 1 1 1 6 1 ...
 $ Date                                       : Factor w/ 4017 levels "01/01/05","01/01/06",..: 1272 1606 522 2952 2015 2885 3006 2 3389 3938 ...
 $ Day_of_Week                                : int  4 6 6 6 1 5 3 1 2 2 ...
 $ Time                                       : Factor w/ 1419 levels "00:01","00:02",..: 506 580 680 1070 1125 472 680 214 733 770 ...
 $ Local_Authority_.District.                 : int  544 340 28 17 189 362 360 32 232 286 ...
 $ Local_Authority_.Highway.                  : Factor w/ 206 levels "E06000001","E06000002",..: 35 143 96 116 14 138 138 101 13 51 ...
 $ X1st_Road_Class                            : int  3 3 3 3 6 3 3 3 1 3 ...
 $ X1st_Road_Number                           : int  2 38 4005 406 0 46 563 406 180 483 ...
 $ Road_Type                                  : int  3 3 6 3 6 3 3 3 3 6 ...
 $ Speed_limit                                : int  30 40 30 70 30 70 50 40 70 60 ...
 $ Junction_Detail                            : int  0 3 3 0 0 6 3 6 0 3 ...
 $ Junction_Control                           : int  -1 2 4 -1 -1 4 2 2 -1 4 ...
 $ X2nd_Road_Class                            : int  -1 6 4 -1 -1 5 3 3 -1 6 ...
 $ X2nd_Road_Number                           : int  0 0 456 0 0 6202 563 109 0 0 ...
 $ Pedestrian_Crossing.Human_Control          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Pedestrian_Crossing.Physical_Facilities    : int  4 5 0 0 0 0 0 5 0 0 ...
 $ Light_Conditions                           : int  1 1 1 1 1 1 1 4 1 1 ...
 $ Weather_Conditions                         : int  1 2 1 1 1 8 1 2 1 1 ...
 $ Road_Surface_Conditions                    : int  1 2 1 1 1 2 1 2 1 2 ...
 $ Special_Conditions_at_Site                 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Carriageway_Hazards                        : int  0 0 0 0 0 0 0 0 3 0 ...
 $ Urban_or_Rural_Area                        : int  1 1 1 1 2 2 1 1 2 2 ...
 $ Did_Police_Officer_Attend_Scene_of_Accident: int  1 1 1 1 1 1 1 1 1 1 ...
 $ LSOA_of_Accident_Location                  : Factor w/ 19853 levels "","E01000001",..: 9058 15928 265 1998 7540 14608 14551 841 7443 16442 ...
 $ Vehicle_Reference.x                        : int  1 1 3 1 1 2 1 2 2 3 ...
 $ Casualty_Reference                         : int  1 1 1 1 1 1 1 1 6 1 ...
 $ Casualty_Class                             : int  1 1 1 1 1 1 2 2 2 1 ...
 $ Sex_of_Casualty                            : int  2 1 1 1 1 1 1 1 1 2 ...
 $ Age_of_Casualty                            : int  22 31 35 30 21 35 7 30 1 17 ...
 $ Age_Band_of_Casualty                       : int  5 6 6 6 5 6 2 6 1 4 ...
 $ Casualty_Severity                          : int  0 0 0 0 0 0 0 0 1 0 ...
 $ Pedestrian_Location                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Pedestrian_Movement                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Car_Passenger                              : int  0 0 0 0 0 0 2 2 2 0 ...
 $ Bus_or_Coach_Passenger                     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Pedestrian_Road_Maintenance_Worker         : int  0 0 -1 -1 -1 -1 -1 -1 -1 0 ...
 $ Casualty_Type                              : int  9 9 9 9 1 9 9 9 9 9 ...
 $ Casualty_Home_Area_Type                    : int  1 1 -1 1 -1 1 1 1 1 3 ...
 $ Vehicle_Reference.y                        : int  2 1 1 2 1 1 1 1 1 1 ...
 $ Vehicle_Type                               : int  9 9 9 9 1 9 9 9 9 9 ...
 $ Towing_and_Articulation                    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Vehicle_Manoeuvre                          : int  3 18 4 18 18 9 18 18 18 18 ...
 $ Vehicle_Location.Restricted_Lane           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Junction_Location                          : int  0 8 1 0 0 6 1 8 0 8 ...
 $ Skidding_and_Overturning                   : int  0 0 0 0 0 0 0 0 0 1 ...
 $ Hit_Object_in_Carriageway                  : int  0 0 0 0 4 0 0 0 0 0 ...
 $ Vehicle_Leaving_Carriageway                : int  0 0 0 0 0 0 0 0 1 1 ...
 $ Hit_Object_off_Carriageway                 : int  0 0 0 0 0 0 0 0 6 0 ...
 $ X1st_Point_of_Impact                       : int  2 3 2 2 1 0 1 1 1 0 ...
 $ Was_Vehicle_Left_Hand_Drive.               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Journey_Purpose_of_Driver                  : int  6 1 15 15 2 1 15 15 15 6 ...
 $ Sex_of_Driver                              : int  1 1 1 2 1 1 2 1 2 1 ...
 $ Age_of_Driver                              : int  65 31 -1 22 21 51 31 31 32 38 ...
 $ Age_Band_of_Driver                         : int  9 6 -1 5 5 8 6 6 6 7 ...
 $ Engine_Capacity_.CC.                       : int  1896 1956 -1 998 -1 3222 1149 -1 2197 -1 ...
 $ Propulsion_Code                            : int  2 2 -1 1 -1 2 1 -1 1 -1 ...
 $ Age_of_Vehicle                             : int  12 3 -1 1 -1 2 10 -1 7 -1 ...
 $ Driver_IMD_Decile                          : int  7 -1 -1 2 -1 8 4 5 6 -1 ...
 $ Driver_Home_Area_Type                      : int  1 1 -1 1 -1 3 1 1 3 -1 ...

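The following code converts our categorical variables to factors. Every integer column is converted to a factor first, and the columns that should stay numeric (for example Number_of_Vehicles and Age_of_Driver) are then converted back to integers.

# flag every column currently stored as integer
variable_factor <- lapply(data, class) == 'integer'
# convert the flagged columns to factors
data[,variable_factor] <- lapply(data[, variable_factor], as.factor)
# convert the numeric columns back to integer
# note: as.integer() on a factor returns its level codes, not the original values
data[,c(6,8,9,16,37,61,63,65)] <- lapply(data[,c(6,8,9,16,37,61,63,65)], as.integer)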
str(data)
'data.frame':   50000 obs. of  67 variables:
 $ Accident_Index                             : Factor w/ 30228 levels "2.01E+100","2.01E+101",..: 5 29409 13347 10208 5 5 5 4137 1419 24645 ...
 $ Location_Easting_OSGR                      : Factor w/ 30306 levels "75000","119980",..: 28052 16785 23405 26082 18395 18787 17823 24634 19833 4999 ...
 $ Location_Northing_OSGR                     : Factor w/ 31818 levels "18180","18990",..: 6252 19294 8131 8370 26442 17125 16014 9143 23317 17555 ...
 $ Longitude                                  : num  0.5282 -1.3014 -0.2972 0.0682 -1.0957 ...
 $ Latitude                                   : num  51.4 53.1 51.5 51.5 53.9 ...
 $ Police_Force                               : int  32 19 1 1 9 21 21 1 12 16 ...
 $ Accident_Severity                          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
 $ Number_of_Vehicles                         : int  2 2 3 3 2 2 2 2 3 3 ...
 $ Number_of_Casualties                       : int  1 2 2 1 1 1 1 1 6 1 ...
 $ Date                                       : Factor w/ 4017 levels "01/01/05","01/01/06",..: 1272 1606 522 2952 2015 2885 3006 2 3389 3938 ...
 $ Day_of_Week                                : Factor w/ 7 levels "1","2","3","4",..: 4 6 6 6 1 5 3 1 2 2 ...
 $ Time                                       : Factor w/ 1419 levels "00:01","00:02",..: 506 580 680 1070 1125 472 680 214 733 770 ...
 $ Local_Authority_.District.                 : Factor w/ 416 levels "1","2","3","4",..: 302 172 28 17 110 189 187 32 122 150 ...
 $ Local_Authority_.Highway.                  : Factor w/ 206 levels "E06000001","E06000002",..: 35 143 96 116 14 138 138 101 13 51 ...
 $ X1st_Road_Class                            : Factor w/ 6 levels "1","2","3","4",..: 3 3 3 3 6 3 3 3 1 3 ...
 $ X1st_Road_Number                           : int  3 39 1969 407 1 47 563 407 181 483 ...
 $ Road_Type                                  : Factor w/ 6 levels "1","2","3","6",..: 3 3 4 3 4 3 3 3 3 4 ...
 $ Speed_limit                                : Factor w/ 6 levels "20","30","40",..: 2 3 2 6 2 6 4 3 6 5 ...
 $ Junction_Detail                            : Factor w/ 10 levels "-1","0","1","2",..: 2 5 5 2 2 7 5 7 2 5 ...
 $ Junction_Control                           : Factor w/ 6 levels "-1","0","1","2",..: 1 4 6 1 1 6 4 4 1 6 ...
 $ X2nd_Road_Class                            : Factor w/ 7 levels "-1","1","2","3",..: 1 7 5 1 1 6 4 4 1 7 ...
 $ X2nd_Road_Number                           : Factor w/ 2775 levels "-1","0","1","2",..: 2 2 444 2 2 2381 540 111 2 2 ...
 $ Pedestrian_Crossing.Human_Control          : Factor w/ 4 levels "-1","0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ Pedestrian_Crossing.Physical_Facilities    : Factor w/ 7 levels "-1","0","1","4",..: 4 5 2 2 2 2 2 5 2 2 ...
 $ Light_Conditions                           : Factor w/ 5 levels "1","4","5","6",..: 1 1 1 1 1 1 1 2 1 1 ...
 $ Weather_Conditions                         : Factor w/ 10 levels "-1","1","2","3",..: 2 3 2 2 2 9 2 3 2 2 ...
 $ Road_Surface_Conditions                    : Factor w/ 6 levels "-1","1","2","3",..: 2 3 2 2 2 3 2 3 2 3 ...
 $ Special_Conditions_at_Site                 : Factor w/ 9 levels "-1","0","1","2",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Carriageway_Hazards                        : Factor w/ 7 levels "-1","0","1","2",..: 2 2 2 2 2 2 2 2 5 2 ...
 $ Urban_or_Rural_Area                        : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 2 1 1 2 2 ...
 $ Did_Police_Officer_Attend_Scene_of_Accident: Factor w/ 4 levels "-1","1","2","3": 2 2 2 2 2 2 2 2 2 2 ...
 $ LSOA_of_Accident_Location                  : Factor w/ 19853 levels "","E01000001",..: 9058 15928 265 1998 7540 14608 14551 841 7443 16442 ...
 $ Vehicle_Reference.x                        : Factor w/ 39 levels "1","2","3","4",..: 1 1 3 1 1 2 1 2 2 3 ...
 $ Casualty_Reference                         : Factor w/ 57 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 6 1 ...
 $ Casualty_Class                             : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 2 2 2 1 ...
 $ Sex_of_Casualty                            : Factor w/ 3 levels "-1","1","2": 3 2 2 2 2 2 2 2 2 3 ...
 $ Age_of_Casualty                            : int  24 33 37 32 23 37 9 32 3 19 ...
 $ Age_Band_of_Casualty                       : Factor w/ 12 levels "-1","1","2","3",..: 6 7 7 7 6 7 3 7 2 5 ...
 $ Casualty_Severity                          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
 $ Pedestrian_Location                        : Factor w/ 11 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Pedestrian_Movement                        : Factor w/ 10 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Car_Passenger                              : Factor w/ 4 levels "-1","0","1","2": 2 2 2 2 2 2 4 4 4 2 ...
 $ Bus_or_Coach_Passenger                     : Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Pedestrian_Road_Maintenance_Worker         : Factor w/ 4 levels "-1","0","1","2": 2 2 1 1 1 1 1 1 1 2 ...
 $ Casualty_Type                              : Factor w/ 20 levels "0","1","2","3",..: 8 8 8 8 2 8 8 8 8 8 ...
 $ Casualty_Home_Area_Type                    : Factor w/ 4 levels "-1","1","2","3": 2 2 1 2 1 2 2 2 2 4 ...
 $ Vehicle_Reference.y                        : Factor w/ 46 levels "1","2","3","4",..: 2 1 1 2 1 1 1 1 1 1 ...
 $ Vehicle_Type                               : Factor w/ 20 levels "-1","1","2","3",..: 8 8 8 8 2 8 8 8 8 8 ...
 $ Towing_and_Articulation                    : Factor w/ 7 levels "-1","0","1","2",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Vehicle_Manoeuvre                          : Factor w/ 19 levels "-1","1","2","3",..: 4 19 5 19 19 10 19 19 19 19 ...
 $ Vehicle_Location.Restricted_Lane           : Factor w/ 11 levels "-1","0","1","2",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Junction_Location                          : Factor w/ 10 levels "-1","0","1","2",..: 2 10 3 2 2 8 3 10 2 10 ...
 $ Skidding_and_Overturning                   : Factor w/ 7 levels "-1","0","1","2",..: 2 2 2 2 2 2 2 2 2 3 ...
 $ Hit_Object_in_Carriageway                  : Factor w/ 13 levels "-1","0","1","2",..: 2 2 2 2 5 2 2 2 2 2 ...
 $ Vehicle_Leaving_Carriageway                : Factor w/ 10 levels "-1","0","1","2",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ Hit_Object_off_Carriageway                 : Factor w/ 13 levels "-1","0","1","2",..: 2 2 2 2 2 2 2 2 8 2 ...
 $ X1st_Point_of_Impact                       : Factor w/ 6 levels "-1","0","1","2",..: 4 5 4 4 3 2 3 3 3 2 ...
 $ Was_Vehicle_Left_Hand_Drive.               : Factor w/ 3 levels "-1","1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ Journey_Purpose_of_Driver                  : Factor w/ 8 levels "-1","1","2","3",..: 7 2 8 8 3 2 8 8 8 7 ...
 $ Sex_of_Driver                              : Factor w/ 3 levels "1","2","3": 1 1 1 2 1 1 2 1 2 1 ...
 $ Age_of_Driver                              : int  64 30 1 21 20 50 30 30 31 37 ...
 $ Age_Band_of_Driver                         : Factor w/ 12 levels "-1","1","2","3",..: 10 7 1 6 6 9 7 7 7 8 ...
 $ Engine_Capacity_.CC.                       : int  386 396 1 161 1 632 194 1 454 1 ...
 $ Propulsion_Code                            : Factor w/ 8 levels "-1","1","2","3",..: 3 3 1 2 1 3 2 1 2 1 ...
 $ Age_of_Vehicle                             : int  13 4 1 2 1 3 11 1 8 1 ...
 $ Driver_IMD_Decile                          : Factor w/ 11 levels "-1","1","2","3",..: 8 1 1 3 1 9 5 6 7 1 ...
 $ Driver_Home_Area_Type                      : Factor w/ 4 levels "-1","1","2","3": 2 2 1 2 1 4 2 2 4 1 ...
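
In the output above, the columns converted back with as.integer no longer show their original values (Age_of_Driver, for example, now starts 64 30 1 rather than 65 31 -1). This is consistent with as.integer returning a factor's internal level codes. The sketch below reproduces the effect on toy values and shows one common way to recover the original numbers, assuming that was the intent; the variable names are illustrative only.

```r
# Toy values echoing the first Age_of_Driver entries shown earlier.
age <- c(65L, 31L, -1L, 22L)

# as.factor() sorts the distinct values into levels: "-1" "22" "31" "65".
age_factor <- as.factor(age)

# Direct conversion returns the position of each value among the levels.
codes <- as.integer(age_factor)

# Going through character first recovers the original numbers.
restored <- as.integer(as.character(age_factor))

print(codes)     # 4 3 1 2
print(restored)  # 65 31 -1 22
```

Applied to the notebook, this would mean replacing as.integer in the final lapply call with function(x) as.integer(as.character(x)). The outputs and model results shown below were produced without that change.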

The following code creates two subsets of our original data, one accident-specific and the other casualty-specific, so that the analysis that follows requires less computing power.

data_accidents <- data[,c(7,8,9,17,18,20,23,24,25,26,27,28,29,30,37,61)]
data_casualties <- data[,c(8,9,18,36,37,39,40,41,42,45)]
str(data_accidents)
'data.frame':   50000 obs. of  16 variables:
 $ Accident_Severity                      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
 $ Number_of_Vehicles                     : int  2 2 3 3 2 2 2 2 3 3 ...
 $ Number_of_Casualties                   : int  1 2 2 1 1 1 1 1 6 1 ...
 $ Road_Type                              : Factor w/ 6 levels "1","2","3","6",..: 3 3 4 3 4 3 3 3 3 4 ...
 $ Speed_limit                            : Factor w/ 6 levels "20","30","40",..: 2 3 2 6 2 6 4 3 6 5 ...
 $ Junction_Control                       : Factor w/ 6 levels "-1","0","1","2",..: 1 4 6 1 1 6 4 4 1 6 ...
 $ Pedestrian_Crossing.Human_Control      : Factor w/ 4 levels "-1","0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ Pedestrian_Crossing.Physical_Facilities: Factor w/ 7 levels "-1","0","1","4",..: 4 5 2 2 2 2 2 5 2 2 ...
 $ Light_Conditions                       : Factor w/ 5 levels "1","4","5","6",..: 1 1 1 1 1 1 1 2 1 1 ...
 $ Weather_Conditions                     : Factor w/ 10 levels "-1","1","2","3",..: 2 3 2 2 2 9 2 3 2 2 ...
 $ Road_Surface_Conditions                : Factor w/ 6 levels "-1","1","2","3",..: 2 3 2 2 2 3 2 3 2 3 ...
 $ Special_Conditions_at_Site             : Factor w/ 9 levels "-1","0","1","2",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Carriageway_Hazards                    : Factor w/ 7 levels "-1","0","1","2",..: 2 2 2 2 2 2 2 2 5 2 ...
 $ Urban_or_Rural_Area                    : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 2 1 1 2 2 ...
 $ Age_of_Casualty                        : int  24 33 37 32 23 37 9 32 3 19 ...
 $ Age_of_Driver                          : int  64 30 1 21 20 50 30 30 31 37 ...
str(data_casualties)
'data.frame':   50000 obs. of  10 variables:
 $ Number_of_Vehicles  : int  2 2 3 3 2 2 2 2 3 3 ...
 $ Number_of_Casualties: int  1 2 2 1 1 1 1 1 6 1 ...
 $ Speed_limit         : Factor w/ 6 levels "20","30","40",..: 2 3 2 6 2 6 4 3 6 5 ...
 $ Sex_of_Casualty     : Factor w/ 3 levels "-1","1","2": 3 2 2 2 2 2 2 2 2 3 ...
 $ Age_of_Casualty     : int  24 33 37 32 23 37 9 32 3 19 ...
 $ Casualty_Severity   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
 $ Pedestrian_Location : Factor w/ 11 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Pedestrian_Movement : Factor w/ 10 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Car_Passenger       : Factor w/ 4 levels "-1","0","1","2": 2 2 2 2 2 2 4 4 4 2 ...
 $ Casualty_Type       : Factor w/ 20 levels "0","1","2","3",..: 8 8 8 8 2 8 8 8 8 8 ...
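
Selecting columns by position breaks silently if the column order of the CSV ever changes. A minimal sketch of selecting by name instead, shown on a toy data frame: the column names come from the str output above, while the values and the names toy, wanted and toy_accidents are made up for illustration.

```r
# Toy stand-in with a handful of the real column names (values are made up).
toy <- data.frame(
  Accident_Index     = c("a", "b", "c"),
  Accident_Severity  = factor(c(0, 0, 1)),
  Number_of_Vehicles = c(2L, 2L, 3L),
  Speed_limit        = factor(c(30, 40, 30))
)

# Columns to keep, identified by name rather than by position.
wanted <- c("Accident_Severity", "Number_of_Vehicles", "Speed_limit")

# Fail early if a name is misspelled or missing, then subset by name.
stopifnot(all(wanted %in% names(toy)))
toy_accidents <- toy[, wanted]

print(names(toy_accidents))
```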

The following code splits our accident-related dataset into training and test datasets using the caTools package. We set the seed so that the same results can be reproduced further on.
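
The split code itself does not appear in this section. The sketch below assumes it mirrors the casualty split shown later in the notebook (sample.split from caTools with SplitRatio = 0.8, which matches the 40,000 training rows reported below). The seed of 123 is also an assumption that the original does not confirm. The helper split_by_outcome and the toy data frame are hypothetical and only there to make the example runnable on its own.

```r
library(caTools)  # provides sample.split()

# Hypothetical helper: split a data frame so that the class ratio of the
# outcome column is preserved in both parts.
split_by_outcome <- function(df, outcome, ratio = 0.8, seed = 123) {
  set.seed(seed)  # assumed seed; the original value is not shown
  keep <- sample.split(df[[outcome]], SplitRatio = ratio)
  list(training = df[keep, , drop = FALSE],
       test     = df[!keep, , drop = FALSE])
}

# Toy stand-in for data_accidents (values are made up).
toy <- data.frame(
  Accident_Severity  = factor(rep(c(0, 1), times = c(80, 20))),
  Number_of_Vehicles = rep(1:4, 25)
)

parts <- split_by_outcome(toy, "Accident_Severity")
print(nrow(parts$training))
print(nrow(parts$test))
```

In the notebook, calling split_by_outcome(data_accidents, "Accident_Severity") and assigning the two list elements to training and test would give the objects used in the rest of this section.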

str(training)
'data.frame':   40000 obs. of  16 variables:
 $ Accident_Severity                      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
 $ Number_of_Vehicles                     : int  2 2 3 2 2 3 1 2 3 3 ...
 $ Number_of_Casualties                   : int  1 2 2 1 1 1 1 2 5 3 ...
 $ Road_Type                              : Factor w/ 6 levels "1","2","3","6",..: 3 3 4 3 3 4 3 4 4 4 ...
 $ Speed_limit                            : Factor w/ 6 levels "20","30","40",..: 2 3 2 6 4 5 6 2 2 2 ...
 $ Junction_Control                       : Factor w/ 6 levels "-1","0","1","2",..: 1 4 6 6 4 6 6 4 6 6 ...
 $ Pedestrian_Crossing.Human_Control      : Factor w/ 4 levels "-1","0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ Pedestrian_Crossing.Physical_Facilities: Factor w/ 7 levels "-1","0","1","4",..: 4 5 2 2 2 2 2 4 2 2 ...
 $ Light_Conditions                       : Factor w/ 5 levels "1","4","5","6",..: 1 1 1 1 1 1 2 2 1 1 ...
 $ Weather_Conditions                     : Factor w/ 10 levels "-1","1","2","3",..: 2 3 2 9 2 2 2 2 5 2 ...
 $ Road_Surface_Conditions                : Factor w/ 6 levels "-1","1","2","3",..: 2 3 2 3 2 3 2 2 2 5 ...
 $ Special_Conditions_at_Site             : Factor w/ 9 levels "-1","0","1","2",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Carriageway_Hazards                    : Factor w/ 7 levels "-1","0","1","2",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Urban_or_Rural_Area                    : Factor w/ 3 levels "1","2","3": 1 1 1 2 1 2 2 1 1 1 ...
 $ Age_of_Casualty                        : int  24 33 37 37 9 19 22 1 29 41 ...
 $ Age_of_Driver                          : int  64 30 1 50 30 37 27 27 42 30 ...

The following code fits a random forest model to our training accident dataset, using 100 trees. Please ensure you have installed the randomForest package before running the following code. You can install the package by running install.packages("randomForest") in the console.

library(randomForest)
set.seed(123)
classifier_acc = randomForest(Accident_Severity~., 
                              data = training, ntree = 100)
classifier_acc

Call:
 randomForest(formula = Accident_Severity ~ ., data = training,      ntree = 100) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 3

        OOB estimate of  error rate: 15.57%
Confusion matrix:
      0   1 class.error
0 33249 374  0.01112334
1  5855 522  0.91814333

The following code will allow us to examine the importance of each variable.

imp_acc <- importance(classifier_acc)
imp_acc_plot <- varImpPlot(classifier_acc)

imp_acc
                                        MeanDecreaseGini
Number_of_Vehicles                             485.84967
Number_of_Casualties                           649.71850
Road_Type                                      232.01233
Speed_limit                                    367.35355
Junction_Control                               308.53693
Pedestrian_Crossing.Human_Control               17.73569
Pedestrian_Crossing.Physical_Facilities        190.76429
Light_Conditions                               239.59845
Weather_Conditions                             318.39407
Road_Surface_Conditions                        224.82693
Special_Conditions_at_Site                     101.19578
Carriageway_Hazards                             82.71466
Urban_or_Rural_Area                            137.22752
Age_of_Casualty                               1204.34574
Age_of_Driver                                 1145.81724
imp_acc_plot
                                        MeanDecreaseGini
Number_of_Vehicles                             485.84967
Number_of_Casualties                           649.71850
Road_Type                                      232.01233
Speed_limit                                    367.35355
Junction_Control                               308.53693
Pedestrian_Crossing.Human_Control               17.73569
Pedestrian_Crossing.Physical_Facilities        190.76429
Light_Conditions                               239.59845
Weather_Conditions                             318.39407
Road_Surface_Conditions                        224.82693
Special_Conditions_at_Site                     101.19578
Carriageway_Hazards                             82.71466
Urban_or_Rural_Area                            137.22752
Age_of_Casualty                               1204.34574
Age_of_Driver                                 1145.81724
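
The importance table is easier to read once it is ranked. The sketch below copies the MeanDecreaseGini values from the printed output above into a named vector (the names gini, ranked and share are illustrative only), sorts it and expresses each value as a share of the total.

```r
# MeanDecreaseGini values copied from the printed output above.
gini <- c(
  Number_of_Vehicles                      = 485.84967,
  Number_of_Casualties                    = 649.71850,
  Road_Type                               = 232.01233,
  Speed_limit                             = 367.35355,
  Junction_Control                        = 308.53693,
  Pedestrian_Crossing.Human_Control       = 17.73569,
  Pedestrian_Crossing.Physical_Facilities = 190.76429,
  Light_Conditions                        = 239.59845,
  Weather_Conditions                      = 318.39407,
  Road_Surface_Conditions                 = 224.82693,
  Special_Conditions_at_Site              = 101.19578,
  Carriageway_Hazards                     = 82.71466,
  Urban_or_Rural_Area                     = 137.22752,
  Age_of_Casualty                         = 1204.34574,
  Age_of_Driver                           = 1145.81724
)

# Rank from most to least important.
ranked <- sort(gini, decreasing = TRUE)

# Share of the total decrease in Gini impurity, in percent.
share <- round(100 * ranked / sum(ranked), 1)

print(head(ranked, 5))
print(head(share, 5))
```

With the fitted model in memory, sort(imp_acc[, "MeanDecreaseGini"], decreasing = TRUE) gives the same ranking directly.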

In the following code we apply our classifier to the accident test dataset and examine the performance of the random forest. We use the caret package to produce the confusion matrix and other statistics for analysing model performance. Please ensure that the caret package is installed before running the following code.

library(caret)
y_pred_acc <- predict(classifier_acc, newdata = test[-1], type = 'class')
cm_acc <- table(test[,1], y_pred_acc)
cm_acc <- confusionMatrix(cm_acc)
cm_acc
Confusion Matrix and Statistics

   y_pred_acc
       0    1
  0 8318   88
  1 1455  139
                                          
               Accuracy : 0.8457          
                 95% CI : (0.8385, 0.8527)
    No Information Rate : 0.9773          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1176          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8511          
            Specificity : 0.6123          
         Pos Pred Value : 0.9895          
         Neg Pred Value : 0.0872          
             Prevalence : 0.9773          
         Detection Rate : 0.8318          
   Detection Prevalence : 0.8406          
      Balanced Accuracy : 0.7317          
                                          
       'Positive' Class : 0               
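
When reading these statistics, note that the table was built as table(test[,1], y_pred_acc), so its rows are the actual classes and its columns the predictions. To my knowledge, caret's confusionMatrix treats the rows of a table as predictions and the columns as the reference, so the figures above describe the transposed problem. The sketch below recomputes the main metrics directly from the printed counts with actual classes as the reference, keeping class 0 as the positive class; the variable names are illustrative only.

```r
# Counts copied from the printed table: rows are actual classes (test[, 1]),
# columns are predicted classes (y_pred_acc). matrix() fills column by column.
cm <- matrix(c(8318, 1455, 88, 139), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Share of all test rows that were classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of actual class 0 rows predicted as 0.
sensitivity <- cm["0", "0"] / sum(cm["0", ])

# Specificity: share of actual class 1 rows predicted as 1.
specificity <- cm["1", "1"] / sum(cm["1", ])

balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy), 4))
```

In the notebook, confusionMatrix(data = y_pred_acc, reference = test[, 1]) should give caret the intended orientation directly.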
                                          

We now create the decision tree classifier on our training accident dataset. Once the classifier is built on the training set, we apply it to the test set and analyse the results. Please ensure that you have installed the rpart package before running the following code.

library(rpart)
set.seed(123)
classifier_tree <- rpart(formula = Accident_Severity ~ .,
                        data = training)
#predicting test set results
y_pred_tree <- predict(classifier_tree, newdata = test[-1], type = 'class')
#confusion matrix for decision tree model
cm_tree <- table(test[,1], y_pred_tree)
cm_tree <- confusionMatrix(cm_tree)
cm_tree
Confusion Matrix and Statistics

   y_pred_tree
       0    1
  0 8406    0
  1 1594    0
                                          
               Accuracy : 0.8406          
                 95% CI : (0.8333, 0.8477)
    No Information Rate : 1               
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0               
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.8406          
            Specificity :     NA          
         Pos Pred Value :     NA          
         Neg Pred Value :     NA          
             Prevalence : 1.0000          
         Detection Rate : 0.8406          
   Detection Prevalence : 0.8406          
      Balanced Accuracy :     NA          
                                          
       'Positive' Class : 0               
                                          

We now create the naive Bayes classifier on our training accident dataset. Once the classifier is built on the training set, we apply it to the test set and analyse the results. Please ensure that you have installed the e1071 package before running the following code.

library(e1071)
set.seed(123)
classifier_naive <- naiveBayes(formula = Accident_Severity ~ .,
                               data = training)

#predicting test set results - naiveBayes
y_pred_naive <- predict(classifier_naive, newdata = test[-1])

#making confusion matrix - naiveBayes
cm_naive <- table(test[, 1], y_pred_naive)
cm_naive <- confusionMatrix(cm_naive)
cm_naive

We also create an SVM classifier with a radial kernel on our training accident dataset, using the svm function from the e1071 package loaded above. Once the classifier is built on the training set, we apply it to the test set and analyse the results.

set.seed(1234)
classifier_svm_radial <- svm(formula = Accident_Severity ~ .,
                      data = training,
                      type = 'C-classification',
                      kernel = 'radial')

y_pred_svm_acc_radial <- predict(classifier_svm_radial, newdata = test[-1])
cm_svm_acc_radial <- table(test[, 1], y_pred_svm_acc_radial)
confusionMatrix(cm_svm_acc_radial)

Now we move on to our second question and focus on building classifiers for casualty severity. The following code splits our casualty-related dataset into training and test datasets using the caTools package. We set the seed so that the same results can be reproduced further on.

library(caTools)
set.seed(123)
split1 = sample.split(data_casualties$Casualty_Severity, SplitRatio = 0.8)
training_cas <- subset(data_casualties, split1 == TRUE)
test_cas <- subset(data_casualties, split1 == FALSE)
str(training_cas)
'data.frame':   40000 obs. of  10 variables:
 $ Number_of_Vehicles  : int  2 2 3 2 2 3 3 1 2 3 ...
 $ Number_of_Casualties: int  1 2 2 1 1 6 1 1 2 5 ...
 $ Speed_limit         : Factor w/ 6 levels "20","30","40",..: 2 3 2 6 4 6 5 6 2 2 ...
 $ Sex_of_Casualty     : Factor w/ 3 levels "-1","1","2": 3 2 2 2 2 2 3 2 2 2 ...
 $ Age_of_Casualty     : int  24 33 37 37 9 3 19 22 1 29 ...
 $ Casualty_Severity   : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 1 1 ...
 $ Pedestrian_Location : Factor w/ 11 levels "0","1","2","3",..: 1 1 1 1 1 1 1 6 1 1 ...
 $ Pedestrian_Movement : Factor w/ 10 levels "0","1","2","3",..: 1 1 1 1 1 1 1 4 1 1 ...
 $ Car_Passenger       : Factor w/ 4 levels "-1","0","1","2": 2 2 2 2 4 4 2 2 3 2 ...
 $ Casualty_Type       : Factor w/ 20 levels "0","1","2","3",..: 8 8 8 8 8 8 8 1 8 8 ...

The following code fits a random forest model to our training casualty dataset, using 100 trees. Please ensure you have installed the randomForest package before running the following code. You can install the package by running install.packages("randomForest") in the console.

library(randomForest)
set.seed(123)
classifier1_cas = randomForest(x = training_cas[-6],
                           y = training_cas$Casualty_Severity,
                           ntree = 100)
imp_cas <- importance(classifier1_cas)
imp_cas_plot <- varImpPlot(classifier1_cas)