As an expert data scientist, you have been hired by the Mayor of Tempe to conduct a study and make recommendations on ways to reduce traffic injuries in the city. You will use the crash data from the city’s open data portal and your data wrangling skills to look for patterns that help us understand the causes of traffic accidents in the city, and might suggest some ways to reduce injuries and fatalities.
Consider the following questions:
You will use the following packages for this lab:
library( dplyr ) # data wrangling
library( pander ) # formatting output
library( ggmap ) # grab map tiles
library( viridis ) # color palette for maps
library( ggplot2 ) # fancy graphics
library( ggthemes ) # fancy themes for ggplots
In this lab you will use a traffic accidents dataset from the Tempe Open Data Portal:
URL <- "https://github.com/DS4PS/Data-Science-Class/blob/master/DATA/TempeTrafficAccidents.rds?raw=true"
dat <- readRDS(gzcon(url( URL )))
head( dat )
## Incidentid DateTime Year StreetName CrossStreet
## 1 2579417 1/10/12 9:04 2012 Baseline Rd Price Rd
## 2 2582044 1/5/12 17:24 2012 Rural Rd Playa Del Norte
## 3 2582996 1/16/12 19:08 2012 Rio Salado Pkwy State Route 101 Exit 51 J-Ramp
## 4 2584311 1/27/12 14:41 2012 Rio Salado Pkwy State Route 101 Exit 51 J-Ramp
## 5 2584437 1/10/12 13:41 2012 Scottsdale Rd State Route 202 Exit 7 P-Ramp
## 6 2584439 1/9/12 17:49 2012 Priest Dr 12th St
## Distance JunctionRelation Totalinjuries Totalfatalities
## 1 -24.816 Intersection Related Interchange 0 0
## 2 -796.224 Not Junction Related 0 0
## 3 0.000 Intersection Interchange 0 0
## 4 76.032 Not Junction Related 1 0
## 5 -40.128 Entrance Exit Ramp Interchange 0 0
## 6 70.224 Not Junction Related 0 0
## Injuryseverity Collisionmanner Lightcondition
## 1 No Injury Rear End Daylight
## 2 No Injury Rear End Dusk
## 3 No Injury ANGLE (Front To Side)(Other Than Left Turn) Dark Lighted
## 4 Possible Injury Rear End Daylight
## 5 No Injury Rear End Daylight
## 6 No Injury Sideswipe Same Direction Dark Lighted
## Weather SurfaceCondition Unittype_One Age_Drv1 Gender_Drv1
## 1 Clear Dry Driver 43 Male
## 2 Clear Dry Driver 19 Male
## 3 Clear Dry Driver 26 Male
## 4 Cloudy Dry Driver 29 Male
## 5 Clear Dry Driver 255
## 6 Clear Unknown Driver 255
## Traveldirection_One Unitaction_One Violation1_Drv1
## 1 East Going Straight Ahead Inattention Distraction
## 2 North Going Straight Ahead Speed To Fast For Conditions
## 3 North Making Left Turn Made Improper Turn
## 4 West Going Straight Ahead Speed To Fast For Conditions
## 5 North Going Straight Ahead Speed To Fast For Conditions
## 6 Unknown Unknown Unknown
## AlcoholUse_Drv1 DrugUse_Drv1 Unittype_Two Age_Drv2 Gender_Drv2
## 1 No Apparent Influence No Apparent Influence Driver 62 Male
## 2 No Apparent Influence No Apparent Influence Driver 28 Male
## 3 No Apparent Influence No Apparent Influence Driver 24 Male
## 4 No Apparent Influence No Apparent Influence Driver 18 Female
## 5 No Apparent Influence No Apparent Influence Driver 26 Male
## 6 No Apparent Influence No Apparent Influence Driver 31 Female
## Traveldirection_Two Unitaction_Two Violation1_Drv2
## 1 East Going Straight Ahead No Improper Action
## 2 North Stopped In Trafficway No Improper Action
## 3 West Going Straight Ahead No Improper Action
## 4 West Stopped In Trafficway No Improper Action
## 5 North Stopped In Trafficway No Improper Action
## 6 South Going Straight Ahead No Improper Action
## AlcoholUse_Drv2 DrugUse_Drv2 Latitude Longitude
## 1 No Apparent Influence No Apparent Influence 33.37845 -111.8926
## 2 No Apparent Influence No Apparent Influence 33.43184 -111.9263
## 3 No Apparent Influence No Apparent Influence 33.42931 -111.8903
## 4 No Apparent Influence No Apparent Influence 33.42931 -111.8900
## 5 No Apparent Influence No Apparent Influence 33.43533 -111.9262
## 6 No Apparent Influence No Apparent Influence 33.41633 -111.9608
The dataset contains the following variables:
column | type | label | description |
---|---|---|---|
Incidentid | numeric | Incident ID | Unique incident ID number assigned by Arizona Department of Transportation (ADOT). |
DateTime | timestamp | Date Time | Date and time that the crash occurred. |
Year | numeric | Year | Year that the crash occurred. |
StreetName | text | Street Name | The street that the crash occurred on. |
CrossStreet | text | Cross-street | The nearest intersecting street or road. |
Distance | numeric | Distance from Intersection | The distance, in feet, that the crash occurred from the cross-street. |
JunctionRelation | text | Junction Relation | The location of the crash in relation to a junction, either an intersection or connection between a driveway and a roadway. |
Totalinjuries | numeric | Total Injuries | Total number of persons with non-fatal injuries involved in the crash. |
Totalfatalities | numeric | Total Fatalities | Total number of persons with fatal injuries involved in the crash. |
Injuryseverity | text | Injury Severity | The highest severity of injury of all persons involved in the crash. |
Collisionmanner | text | Collision Manner | Identifies the manner in which two vehicles initially came into contact. |
Lightcondition | text | Lighting Conditions | The type/level of light that existed at the time of the crash. |
Weather | text | Weather | The prevailing (most significant) atmospheric conditions that existed at the time of the crash. |
SurfaceCondition | text | Surface Condition | The roadway surface condition at the time and place of a crash. |
Unittype_One | text | Unit Type One | Driver, Passenger, Pedestrian, Pedalcyclist or Driverless. |
Age_Drv1 | numeric | ||
Gender_Drv1 | text | ||
Traveldirection_One | text | Travel Direction | The direction the unit was traveling before the incident occurred. |
Unitaction_One | text | Unit Action One | The maneuver, or last action, of the unit before the crash. |
Violation1_Drv1 | text | Violation One | The main violation/behavior of the unit that contributed to the crash. |
AlcoholUse_Drv1 | text | Alcohol Use 1 | Indicates whether alcohol was a contributing factor in the crash or not. |
DrugUse_Drv1 | text | Drug Use 1 | Indicates whether drug use was a contributing factor in the crash or not. |
Unittype_Two | text | Unit Type Two | Driver, Passenger, Pedestrian, Pedalcyclist or Driverless. |
Age_Drv2 | numeric | ||
Gender_Drv2 | text | ||
Traveldirection_Two | text | Travel Direction Two | The direction the unit was traveling before the incident occurred. |
Unitaction_Two | text | Unit Action Two | The maneuver, or last action, of the unit before the crash. |
Violation1_Drv2 | text | Violation Two | The main violation/behavior of the unit that contributed to the crash. |
AlcoholUse_Drv2 | text | Alcohol Use 2 | Indicates whether alcohol was a contributing factor in the crash or not. |
DrugUse_Drv2 | text | Drug Use 2 | Indicates whether drug use was a contributing factor in the crash or not. |
Latitude | numeric | Latitude | Used to specify the precise location of the crash. |
Longitude | numeric | Longitude | Used to specify the precise location of the crash. |
date.vec <- strptime( dat$DateTime, format="%m/%d/%y %H:%M" )
dat$hour <- format( date.vec, format="%H" )
dat$month <- format( date.vec, format="%b" )
dat$day <- format( date.vec, format="%a" )
dat$day365 <- format( date.vec, format="%j" )
dat$week <- format( date.vec, format="%V" )
Correct the order of categorical variables by making them ordered factors:
table( dat$day ) %>% pander()
Fri | Mon | Sat | Sun | Thu | Tue | Wed |
---|---|---|---|---|---|---|
5006 | 4094 | 3044 | 2145 | 4814 | 4656 | 4711 |
# correct order of days
dat$day <- factor( dat$day, levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun") )
table( dat$day ) %>% pander()
Mon | Tue | Wed | Thu | Fri | Sat | Sun |
---|---|---|---|---|---|---|
4094 | 4656 | 4711 | 4814 | 5006 | 3044 | 2145 |
Create 12-hour format and order the times correctly:
dat$hour12 <- format( date.vec, format="%l %p" )
table( dat$hour12 ) %>% head() %>% pander()
1 AM | 1 PM | 2 AM | 2 PM | 3 AM | 3 PM |
---|---|---|---|---|---|
289 | 1832 | 362 | 1928 | 190 | 2301 |
# set the levels so they are in the correct order
time.levels <-
c( "12 AM", " 1 AM", " 2 AM", " 3 AM", " 4 AM", " 5 AM",
" 6 AM", " 7 AM", " 8 AM", " 9 AM", "10 AM", "11 AM",
"12 PM", " 1 PM", " 2 PM", " 3 PM", " 4 PM", " 5 PM",
" 6 PM", " 7 PM", " 8 PM", " 9 PM", "10 PM", "11 PM" )
dat$hour12 <- factor( dat$hour12, levels=time.levels )
table( dat$hour12 ) %>% head() %>% pander()
12 AM | 1 AM | 2 AM | 3 AM | 4 AM | 5 AM |
---|---|---|---|---|---|
388 | 289 | 362 | 190 | 125 | 300 |
age.labels <- paste0( "Age ", c(16,18,25,35,45,55,65,75), "-", c(18,25,35,45,55,65,75,100) )
age.labels
## [1] "Age 16-18" "Age 18-25" "Age 25-35" "Age 35-45" "Age 45-55"
## [6] "Age 55-65" "Age 65-75" "Age 75-100"
dat$age <- cut( dat$Age_Drv1, breaks=c(16,18,25,35,45,55,65,75,100), labels=age.labels )
In this lab you will practice your logical statements, data verbs (dplyr functions), and recipes to conduct analysis looking for types of accidents that cause serious injury. You will need to pay attention to the difference between counts of events, and severity of events.
We will define “harm” as any accident that causes at least one injury OR fatality. You need to define a new variable in your dataset that indicates whether accidents caused harm or not.
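A minimal sketch of how such a variable could be defined, using a small made-up data frame (`toy` and its values are illustrative only, not taken from the Tempe data). It relies on the same logical statement used in the plotting examples later in this lab.

```r
# toy data: four hypothetical crashes
toy <- data.frame(
  Totalinjuries   = c( 0, 2, 0, 1 ),
  Totalfatalities = c( 0, 0, 1, 0 ) )

# harm is TRUE when a crash has at least one injury OR fatality
toy$harm <- toy$Totalinjuries > 0 | toy$Totalfatalities > 0

toy$harm          # logical vector, one value per crash
sum( toy$harm )   # count of harmful crashes
mean( toy$harm )  # proportion of crashes that are harmful
```

Because logical values behave like 0 and 1 in arithmetic, `sum()` gives a count of events and `mean()` gives a rate.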
For each question, write down your data recipe or pseudocode.
You can create a new R Markdown file, or download the LAB-05 RMD template:
Practice writing logical statements and basic data recipes for the following:
dat %>% group_by( factor ) %>% summarize( my.stat = formula or logical statement )
day | n | injuries | fatalities | harm.rate |
---|---|---|---|---|
Mon | 4094 | 1644 | 13 | 0.3 |
Tue | 4656 | 2056 | 8 | 0.32 |
Wed | 4711 | 2144 | 9 | 0.31 |
Thu | 4814 | 2204 | 10 | 0.33 |
Fri | 5006 | 2103 | 12 | 0.3 |
Sat | 3044 | 1192 | 11 | 0.26 |
Sun | 2145 | 866 | 6 | 0.27 |
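A table of this shape can be built with the group_by and summarize recipe. The sketch below runs on invented toy data and assumes that the injuries and fatalities columns are sums and that harm.rate is the proportion of crashes with any injury or fatality; the `toy` data frame and its values are hypothetical.

```r
library( dplyr )

# toy data: five hypothetical crashes on two days
toy <- data.frame(
  day             = c( "Mon", "Mon", "Tue", "Tue", "Tue" ),
  Totalinjuries   = c( 0, 2, 0, 1, 3 ),
  Totalfatalities = c( 0, 0, 1, 0, 0 ) )

tab <-
  toy %>%
  group_by( day ) %>%
  summarize( n          = n(),                     # number of crashes
             injuries   = sum( Totalinjuries ),    # total people injured
             fatalities = sum( Totalfatalities ),  # total people killed
             harm.rate  = mean( Totalinjuries > 0 | Totalfatalities > 0 ) )

tab
```

Note that n counts events while harm.rate describes their severity, which is the distinction this lab asks you to keep in mind.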
Which age group has the largest number of accidents at 7am?
You can use the dplyr count() function for this (or use table() in core R if you are old school like that).
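As a rough illustration of the two approaches, here is a sketch on made-up data (the `toy` data frame is hypothetical; in the lab you would use the `hour` and `age` variables created above).

```r
library( dplyr )

# toy data: five hypothetical crashes
toy <- data.frame(
  hour = c( "07", "07", "07", "08", "08" ),
  age  = c( "Age 18-25", "Age 18-25", "Age 25-35", "Age 18-25", "Age 35-45" ) )

# dplyr: returns a data frame with one row per age group, largest first
counts <-
  toy %>%
  filter( hour == "07" ) %>%
  count( age, sort=TRUE )
counts

# core R: returns a table object for the same subset
table( toy$age[ toy$hour == "07" ] )
```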
As a public health expert specializing in traffic accidents, you need to think about how to best target traffic accidents to reduce harm.
Should we focus on the volume of traffic accidents, or the types of accidents that are most likely to cause harm?
Calculate each of these four descriptive statistics above as a function of the 24 hours of the day, and either print a table with times and counts/rates, or plot a graph of the statistics as a function of time similar to the examples above.
# example plotting code
plot( as.numeric(d2$hour), d2$ave.num.injuries,
pch=19, type="b", cex=2, bty="n",
xlab="Hour of the Day",
ylab="Ave. Number of Passengers Hurt",
main="Average Injuries or Fatalities Per Harmful Crash")
Reflection point: as the analyst your job is to translate research questions into the models that are presented to decision-makers. You will likely receive a broad objective like identifying patterns in traffic accidents, and you will often have a lot of flexibility in how you operationalize the question.
How might strategies for addressing traffic injuries change if you switch from an emphasis on the accidents that are most common to accidents that are most likely to cause harm? These are subtle nuances in how to calculate simple statistics (counts and proportions) over groups, but they can have a big impact on policy-making!
Report your tables or graphs. You don’t have to include a written response to the reflection point.
Use at most two variables in the dataset to define a group structure and identify:
The most dangerous accident to be involved in (highest rate of harm).
For example, your groups could be teen-agers (group 1: age) that rear-end another driver (group 2: collision type), or drunk-drivers (group 1: alcohol) that hit pedestrians (group 2: driver type), or men (group 1: gender) on Labor Day (group 2: date). Calculate rates of harm for each type of accident, and identify the most harmful case.
There is one constraint: there must be at least five cases of the accident in the dataset. For example, any type of accident involving a 95-year-old likely occurs only once in the dataset since these drivers are rare. If the driver was injured then that accident type will have a 100% harm rate, but it is unlikely to be an accurate representation of the true harm rate because we are generalizing from a single observation.
You can use any variables from the dataset, but you are limited to groups constructed from two variables. Report the most dangerous accident type (most likely to cause harm, i.e. injury or death) that you can identify.
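One way to respect the five-case constraint is to compute the group size alongside the harm rate and filter on it before ranking. The sketch below uses invented data; `group1`, `group2`, and `harm` are placeholder names standing in for whichever two variables and harm indicator you choose.

```r
library( dplyr )

# toy data: three hypothetical accident types of different sizes
toy <- data.frame(
  group1 = c( rep("A", 6), rep("B", 5), "C" ),
  group2 = c( rep("x", 6), rep("y", 5), "z" ),
  harm   = c( TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,   # A-x: 2 of 6 harmful
              TRUE, TRUE, TRUE, FALSE, FALSE,           # B-y: 3 of 5 harmful
              TRUE ) )                                  # C-z: 1 of 1 harmful

ranked <-
  toy %>%
  group_by( group1, group2 ) %>%
  summarize( n = n(), harm.rate = mean( harm ), .groups = "drop" ) %>%
  filter( n >= 5 ) %>%            # drop types with fewer than five cases
  arrange( desc( harm.rate ) )    # most harmful type first

ranked
```

Without the filter, the single C-z case would rank first with a 100% harm rate, which is exactly the problem described above.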
# Working with Dates
So far we have worked with character, numeric, logical, and categorical (factor) vectors.
We need to introduce a new type of vector class for this lab, a date variable. Dates are complicated because they must function simultaneously as categorical variables (months of the year) and numeric variables capable of arithmetic (time that passes between two dates). Furthermore, we often want to convert between idiosyncratic date representations, such as a day of a specific month to a day of the week.
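A short sketch of this dual behavior, using two arbitrary dates chosen for illustration (the names `start.date` and `end.date` are not part of the lab data):

```r
# two dates stored as Date objects
start.date <- as.Date( "2012-01-10" )
end.date   <- as.Date( "2012-03-01" )

# numeric behavior: time elapsed between two dates, in days
gap <- as.numeric( difftime( end.date, start.date, units="days" ) )
gap

# categorical behavior: convert a calendar date to a weekday or a month
format( start.date, "%A" )   # weekday name in the current locale
format( start.date, "%m" )   # month as a two-digit string
```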
They are also complicated because they can be represented in many ways:
When you first read a dataset, they are typically loaded as character vectors:
head( dat$DateTime )
## [1] "1/10/12 9:04 " "1/5/12 17:24 " "1/16/12 19:08 " "1/27/12 14:41 "
## [5] "1/10/12 13:41 " "1/9/12 17:49 "
We can convert dates stored as characters to a special date object by specifying the format using codes understood by the strptime() function:
date.vec <- strptime( dat$DateTime, format="%m/%d/%y %H:%M" )
head( date.vec )
## [1] "2012-01-10 09:04:00 EST" "2012-01-05 17:24:00 EST"
## [3] "2012-01-16 19:08:00 EST" "2012-01-27 14:41:00 EST"
## [5] "2012-01-10 13:41:00 EST" "2012-01-09 17:49:00 EST"
Now R will recognize that the variable is a date, not a string, and it will be able to do complex day and time manipulations. Note that the format= argument above requires you to tell R what each value represents. In this case, %m represents month, %d represents day, %y represents the two-digit year, %H represents the hour (00-23), and %M represents the minute. In the original data the date parts are separated by forward slashes and the time parts by a colon, so those characters are included in the format argument.
We need to be explicit because dates can be stored as DD-MM-YYYY, MM-DD-YY, YYYY-MM-DD, or any other number of formats. The format argument tells R how to read the date.
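To make this concrete, the sketch below parses one calendar date written in each of the three layouts mentioned above. The date strings are made up for illustration.

```r
# the same calendar date written three different ways
x1 <- strptime( "27-01-2012", format="%d-%m-%Y" )   # DD-MM-YYYY
x2 <- strptime( "01-27-12",   format="%m-%d-%y" )   # MM-DD-YY
x3 <- strptime( "2012-01-27", format="%Y-%m-%d" )   # YYYY-MM-DD

# all three parse to the same date
format( x1, "%Y-%m-%d" )
format( x2, "%Y-%m-%d" )
format( x3, "%Y-%m-%d" )

# a format that does not match the string returns NA rather than an error
bad <- strptime( "2012-01-27", format="%m/%d/%y" )
bad
```

The silent NA in the last case is a common source of bugs, so it is worth checking for missing values after converting a date column.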
We can now use the format() function to specify how we want the date represented using many common styles. Note that format() will return a character vector, not another date class.
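A quick check of the classes involved, using a single made-up timestamp in the same layout as the crash data:

```r
x <- strptime( "1/10/12 9:04", format="%m/%d/%y %H:%M" )
class( x )                       # a date-time class (POSIXlt)

h <- format( x, format="%H" )
class( h )                       # character, no longer a date

as.numeric( h )                  # convert to a number before doing arithmetic
```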
format( head( date.vec ), format="%H" ) # hour of day 0-23
## [1] "09" "17" "19" "14" "13" "17"
format( head( date.vec ), format="%I" ) # hour of day 1-12
## [1] "09" "05" "07" "02" "01" "05"
format( head( date.vec ), format="%p" ) # AM or PM
## [1] "AM" "PM" "PM" "PM" "PM" "PM"
format( head( date.vec ), format="%m" ) # month 1-12
## [1] "01" "01" "01" "01" "01" "01"
format( head( date.vec ), format="%b" ) # abbreviated month Jan, Feb, etc
## [1] "Jan" "Jan" "Jan" "Jan" "Jan" "Jan"
format( head( date.vec ), format="%A" ) # day of the week Monday, Tuesday, etc.
## [1] "Tuesday" "Thursday" "Monday" "Friday" "Tuesday" "Monday"
format( head( date.vec ), format="%a" ) # abbreviated day of the week Mon, Tue, etc.
## [1] "Tue" "Thu" "Mon" "Fri" "Tue" "Mon"
We can apply a wide range of formatting options to dates:
%a: Abbreviated weekday name in the current locale on this platform. (Also matches full name on input: in some locales there are no abbreviations of names.)
%A: Full weekday name in the current locale. (Also matches abbreviated name on input.)
%b: Abbreviated month name in the current locale on this platform. (Also matches full name on input: in some locales there are no abbreviations of names.)
%B: Full month name in the current locale. (Also matches abbreviated name on input.)
%c: Date and time. Locale-specific on output, “%a %b %e %H:%M:%S %Y” on input.
%C: Century (00–99): the integer part of the year divided by 100.
%d: Day of the month as decimal number 01–31.
%D: Date format such as %m/%d/%y: the C99 standard says it should be that exact format, but not all OS’s comply.
%e: Day of the month as decimal number 1–31, with a leading space for a single-digit number.
%F: Equivalent to %Y-%m-%d the ISO 8601 date format.
%g: The last two digits of the week-based year. Accepted but ignored on input.
%G: The week-based year as a decimal number. Accepted but ignored on input.
%h: Equivalent to %b.
%H: Hours as decimal number 00–23. As a special exception strings such as 24:00:00 are accepted for input, since ISO 8601 allows these.
%I: Hours as decimal number 01–12.
%j: Day of year as decimal number 001–366.
%m: Month as decimal number 01–12.
%M: Minute as decimal number 00–59.
%n: Newline on output, arbitrary whitespace on input.
%p: AM/PM indicator in the locale. Used in conjunction with %I and not with %H. An empty string in some locales (and the behaviour is undefined if used for input in such a locale). Some platforms accept %P for output, which uses a lower-case version: others will output P.
%r: The 12-hour clock time (using the locale’s AM or PM). Only defined in some locales.
%R: Equivalent to %H:%M.
%S: Second as integer 00–61, allowing for up to two leap-seconds (but POSIX-compliant implementations will ignore leap seconds).
%t: Tab on output, arbitrary whitespace on input.
%T: Equivalent to %H:%M:%S.
%u: Weekday as a decimal number 1–7, Monday is 1.
%U: Week of the year as decimal number 00–53 using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.
%V: Week of the year as decimal number 01–53 as defined in ISO 8601. If the week (starting on Monday) containing 1 January has four or more days in the new year, then it is considered week 1. Otherwise, it is the last week of the previous year, and the next week is week 1. (Accepted but ignored on input.)
%w: Weekday as decimal number 0–6, Sunday is 0.
%W: Week of the year as decimal number 00–53 using Monday as the first day of week (and typically with the first Monday of the year as day 1 of week 1). The UK convention.
%x: Date. Locale-specific on output, “%y/%m/%d” on input.
%X: Time. Locale-specific on output, “%H:%M:%S” on input.
%y: Year without century 00–99. On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19 – that is the behaviour specified by the 2004 and 2008 POSIX standards, but they do also say ‘it is expected that in a future version the default century inferred from a 2-digit year will change’.
%Y: Year with century. Note that whereas there was no zero in the original Gregorian calendar, ISO 8601:2004 defines it to be valid (interpreted as 1BC): see https://en.wikipedia.org/wiki/0_(year). Note that the standards also say that years before 1582 in its calendar should only be used with agreement of the parties involved. For input, only years 0:9999 are accepted.
%z: Signed offset in hours and minutes from UTC, so -0800 is 8 hours behind UTC. Values up to +1400 are accepted as from R 3.1.1: previous versions only accepted up to +1200. (Standard only for output.)
%Z: (Output only.) Time zone abbreviation as a character string (empty if not available). This may not be reliable when a time zone has changed abbreviations over the years.
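The three week-of-year codes (%U, %W, %V) can disagree near the start and end of a year. The sketch below compares them for three dates around New Year 2012, when January 1 fell on a Sunday. The `weeks` data frame is only for display.

```r
# dates around the start of 2012
x <- as.Date( c( "2011-12-31", "2012-01-01", "2012-01-02" ) )

weeks <- data.frame(
  date = format( x, "%Y-%m-%d" ),
  day  = format( x, "%u" ),    # weekday number, Monday is 1
  U    = format( x, "%U" ),    # US convention, weeks start on Sunday
  W    = format( x, "%W" ),    # UK convention, weeks start on Monday
  V    = format( x, "%V" ) )   # ISO 8601 week

weeks
```

Whichever code you choose, use it consistently, since the same crash can land in a different week under each convention.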
date.vec <- strptime( dat$DateTime, format="%m/%d/%y %H:%M" )
dat$hour <- format( date.vec, format="%H" )
dat$month <- format( date.vec, format="%b" )
dat$day <- format( date.vec, format="%a" )
dat$day365 <- format( date.vec, format="%j" )
dat$week <- format( date.vec, format="%V" )
# set the levels so they are in the correct order
time.levels <-
c( "12 AM", " 1 AM", " 2 AM", " 3 AM", " 4 AM", " 5 AM",
" 6 AM", " 7 AM", " 8 AM", " 9 AM", "10 AM", "11 AM",
"12 PM", " 1 PM", " 2 PM", " 3 PM", " 4 PM", " 5 PM",
" 6 PM", " 7 PM", " 8 PM", " 9 PM", "10 PM", "11 PM" )
dat$hour12 <- format( date.vec, format="%l %p" )
dat$hour12 <- factor( dat$hour12, levels=time.levels )
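Note that %l is not among the format codes listed above, and it may not be available for output on every platform. If the hour12 values come back as NA or do not match time.levels, one possible workaround (an assumption on my part, not the lab's method) is to build the same labels from %I, which is in the list, by swapping its leading zero for a space. The timestamps below are made up.

```r
x <- strptime( c( "1/10/12 9:04", "1/5/12 17:24", "1/9/12 0:30" ),
               format="%m/%d/%y %H:%M" )

# %I gives a zero-padded 12-hour clock hour; replace a leading zero with a space
hour12 <- sub( "^0", " ", format( x, format="%I %p" ) )
hour12
```

The resulting strings use the same space-padded layout as time.levels, so they can be passed to factor() in the same way.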
qmplot( Longitude, Latitude, data=dat, geom = "blank",
zoom = 13, maptype = "toner-background", darken = .1 ) +
stat_density_2d( aes(fill = ..level..), geom = "polygon", alpha=0.3, color = NA) +
scale_fill_viridis( ) +
facet_wrap( ~ hour12, ncol=6, nrow=4)
d2 <-
dat %>%
filter( as.numeric(week) <= 52 ) %>%
count( week )
plot( as.numeric(d2$week), d2$n, pch=19, type="b", cex=2, bty="n",
xlab="Week", ylab="Number of Crashes",
main="Cumulative Crashes by Week of the Year: 2012-2018" )
d2 <-
dat %>%
filter( as.numeric(week) <= 52 ) %>%
group_by( week ) %>%
summarize( harm = mean( Totalinjuries > 0 | Totalfatalities > 0 ) )
plot( as.numeric(d2$week), d2$harm, pch=19, type="b", cex=2, bty="n",
xlab="Weeks in the Year", ylab="Rate of Harm",
main="Proportion of Crashes that Result in Harm Across Weeks")
abline( h=mean(d2$harm), col="gray", lty=2, lwd=2 )
To be more precise, we can calculate the average number of crashes per week within a given year, which is a more intuitive and actionable statistic than the sum across all years in the dataset:
d2 <-
dat %>%
filter( as.numeric(week) <= 52 ) %>%
group_by( Year ) %>%
count( week ) %>%
group_by( week ) %>%
summarize( ave.crashes.per.week = mean(n) )
plot( as.numeric(d2$week), d2$ave.crashes.per.week,
pch=19, type="b", cex=2, bty="n",
xlab="Week", ylab="Number of Crashes",
main="Ave Crashes by Week of Year" )
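The difference between the pooled total and the per-year average is easier to see on a tiny example. The sketch below uses invented data and writes the first step as count( Year, week ), which should give the same counts as grouping by Year and then counting weeks.

```r
library( dplyr )

# toy data: one row per crash, two years, two weeks
toy <- data.frame(
  Year = c( 2012, 2012, 2012, 2013, 2012, 2013, 2013, 2013, 2013 ),
  week = c( "01", "01", "01", "01", "02", "02", "02", "02", "02" ) )

# pooled total: crashes per week summed across all years
totals <- toy %>% count( week )
totals

# step 1: count crashes per week within each year
# step 2: average those yearly counts for each week
averages <-
  toy %>%
  count( Year, week ) %>%
  group_by( week ) %>%
  summarize( ave.crashes.per.week = mean(n) )
averages
```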
Some of the categorical variables are hard to work with because they have levels that are rare or hard to interpret.
count( dat, Collisionmanner ) %>% arrange(n) %>% pander()
Collisionmanner | n |
---|---|
10 | 3 |
Rear To Rear | 56 |
Rear To Side | 174 |
Sideswipe Opposite Direction | 189 |
Unknown | 345 |
Head On | 348 |
Other | 971 |
Single Vehicle | 1737 |
Sideswipe Same Direction | 3565 |
ANGLE (Front To Side)(Other Than Left Turn) | 4686 |
Left Turn | 5395 |
Rear End | 11001 |
dat$Collisionmanner <- recode( dat$Collisionmanner,
"ANGLE (Front To Side)(Other Than Left Turn)"="Angle" )
dat$Collisionmanner <- recode( dat$Collisionmanner, "Sideswipe Same Direction"="Lane Change" )
drop.these <- c("Unknown","10","Rear To Side","Rear To Rear","Sideswipe Opposite Direction","Other")
dat <- filter( dat, ! ( Collisionmanner %in% drop.these ) )
dat$Collisionmanner <- factor( dat$Collisionmanner )
count( dat, Collisionmanner ) %>% arrange(n) %>% pander()
Collisionmanner | n |
---|---|
Head On | 348 |
Single Vehicle | 1737 |
Lane Change | 3565 |
Angle | 4686 |
Left Turn | 5395 |
Rear End | 11001 |
Patterns in types of crashes by time of day:
table( dat$hour, dat$Collisionmanner ) %>% pander()
Hour | Angle | Head On | Left Turn | Rear End | Lane Change | Single Vehicle |
---|---|---|---|---|---|---|
00 | 71 | 4 | 59 | 83 | 50 | 89 |
01 | 34 | 8 | 32 | 78 | 28 | 87 |
02 | 40 | 7 | 23 | 87 | 34 | 136 |
03 | 21 | 8 | 13 | 35 | 7 | 92 |
04 | 18 | 6 | 23 | 17 | 5 | 50 |
05 | 49 | 7 | 63 | 73 | 39 | 49 |
06 | 127 | 9 | 117 | 213 | 82 | 57 |
07 | 286 | 19 | 359 | 616 | 180 | 69 |
08 | 258 | 13 | 279 | 623 | 192 | 60 |
09 | 212 | 11 | 180 | 360 | 174 | 63 |
10 | 217 | 6 | 188 | 398 | 153 | 51 |
11 | 250 | 12 | 246 | 531 | 203 | 58 |
12 | 333 | 14 | 284 | 729 | 210 | 56 |
13 | 347 | 21 | 325 | 738 | 233 | 54 |
14 | 353 | 19 | 334 | 764 | 295 | 64 |
15 | 379 | 24 | 420 | 998 | 303 | 79 |
16 | 397 | 23 | 582 | 1200 | 323 | 87 |
17 | 427 | 36 | 619 | 1285 | 337 | 90 |
18 | 289 | 22 | 421 | 811 | 236 | 76 |
19 | 186 | 19 | 249 | 437 | 143 | 63 |
20 | 135 | 26 | 182 | 316 | 111 | 83 |
21 | 114 | 9 | 177 | 294 | 98 | 74 |
22 | 76 | 13 | 130 | 192 | 76 | 66 |
23 | 67 | 12 | 90 | 123 | 53 | 84 |
d3 <- data.frame( table( dat$hour, dat$Collisionmanner ) )
ggplot( d3, aes( x=as.numeric(as.character(Var1)), y=Freq, fill=Var2 ) ) +
geom_area( position='fill' ) +
scale_fill_brewer( type="qual" ) +
xlab("Time of Day (hours)") + ylab("Proportion of Accidents")
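When a table is converted with data.frame( table(...) ), its row and column labels are stored as factors (Var1 and Var2). A small sketch of why that matters when the labels are hours: calling as.numeric() directly on a factor returns its internal level codes rather than the labels. The `hours` vector is made up for illustration.

```r
hours <- factor( c( "00", "01", "02", "23" ) )

as.numeric( hours )                   # internal level codes: 1 2 3 4
as.numeric( as.character( hours ) )   # the hour labels as numbers: 0 1 2 23
```

Converting to character first keeps an hour axis on the 0 to 23 scale instead of shifting it to 1 to 24.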
We have a wide range of driver ages:
summary( dat$Age_Drv1 ) %>% pander()
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
---|---|---|---|---|---|---|
2 | 22 | 31 | 43.64 | 51 | 255 | 360 |
# remove meaningless ages
dat$Age_Drv1[ dat$Age_Drv1 > 99 ] <- NA
dat$Age_Drv1[ dat$Age_Drv1 < 16 ] <- NA
dat %>%
filter( ! is.na(Age_Drv1) ) %>%
count( Age_Drv1 ) %>%
ggplot( aes( x=Age_Drv1, y=n ) ) +
geom_point(size=3) + geom_line() +
theme_fivethirtyeight() +
ggtitle("Crash Count by Age") +
xlab("Age")
This many ages will make our analysis complicated, so it is better to convert the numeric age variable into a categorical age-group variable. We will use the cut() function for this, which accepts a numeric variable and group cut points (the breaks= argument), then returns the proper group label for each age.
dat$age <- cut( dat$Age_Drv1, breaks=c(16,18,25,35,45,55,65,75,100) )
barplot( table(dat$age) )
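One detail of cut() is worth checking: by default each interval is open on the left and closed on the right, so a value equal to the lowest break is not assigned to any group. The sketch below uses a few made-up ages to show the default behavior and the include.lowest=TRUE option; whether 16-year-old drivers should be kept is a judgment call for your analysis.

```r
ages   <- c( 16, 17, 18, 19, 25, 100 )
breaks <- c( 16, 18, 25, 35, 45, 55, 65, 75, 100 )

# default: age 16 falls outside the first interval (16,18] and becomes NA
default.groups <- cut( ages, breaks=breaks )
default.groups

# include.lowest=TRUE closes the first interval on the left: [16,18]
inclusive.groups <- cut( ages, breaks=breaks, include.lowest=TRUE )
inclusive.groups
```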
These group labels are a little awkward, so let’s improve them a bit by creating our own:
```r
age.labels <- paste0( "Age ", c(16,18,25,35,45,55,65,75), "-", c(18,25,35,45,55,65,75,100) )
dat$age <- cut( dat$Age_Drv1, breaks=c(16,18,25,35,45,55,65,75,100), labels=age.labels )
barplot( table(dat$age) )
```
<img src="lab-05-instructions_files/figure-html/unnamed-chunk-29-1.png" width="864" style="display: block; margin: auto;" />
We can now analyze some trends by age group.
```r
d3 <-
dat %>%
count( hour, age )
d3 <- na.omit( d3 )
qplot( data=d3, x=as.numeric(as.character(hour)), y=n ) +
geom_line( size=0.8, color="firebrick4" ) +
geom_point( size=3, color="darkred" ) +
facet_wrap( ~ age, ncol=4 ) +
xlab("Time of Day (24hrs)") +
ylab("Number of Accidents") +
ggtitle("Number of Accidents by Time and Age Group") +
# theme_minimal()
theme_wsj( base_size=10, color="gray" )
```
<img src="lab-05-instructions_files/figure-html/unnamed-chunk-30-1.png" width="960" style="display: block; margin: auto;" />
<br><br>
# How to Submit
Use the following instructions to submit your assignment, which may vary depending on your course's platform.
When you have completed your assignment, click the “Knit” button to render your `.RMD` file into a `.HTML` report.
Perform the following depending on your course’s platform:
* **Canvas:** Upload both your `.RMD` and `.HTML` files to the appropriate link
* **Blackboard or iCollege:** Compress your `.RMD` and `.HTML` files in a `.ZIP` file and upload to the appropriate link
`.HTML` files are preferred but not allowed by all platforms.
<br>
### Before You Submit
Remember to ensure the following before submitting your assignment.
1. Name your files using this format: **Lab-##-LastName.rmd** and **Lab-##-LastName.html**
2. Show your code solutions and write out your answers in the body text
3. Do not show excessive output; truncate your output, e.g. with function `head()`
4. Follow appropriate styling conventions, e.g. spaces after commas, etc.
5. Above all, ensure that your conventions are consistent
See [Google's R Style Guide](https://google.github.io/styleguide/Rguide.xml) for examples of common conventions.
`.RMD` files are knit into `.HTML` and other formats procedurally, or line by line.

* Functions such as `install.packages()` or `setwd()` are bound to cause errors in knitting
* Packages must be loaded with `library()` in a previous chunk before their functions are used

**If All Else Fails:** If you cannot determine and fix the errors in a code chunk that’s preventing you from knitting your document, add `eval = FALSE` inside the brackets of `{r}` at the beginning of a chunk to ensure that R does not attempt to evaluate it, that is: `{r eval = FALSE}`. This will prevent an erroneous chunk of code from halting the knitting process.