Once you have loaded raw data into R’s memory, you will often have a need to do various data transformations. For example, you might have a variable called gender in your raw data, taking values 0 and 1. This is a bad way of coding gender - what does 0 mean? You might know that males are coded as 0, but if you hand off your data to someone else, they will have no idea. So recode it as “male” and “female”. Similarly, suppose you have dates in your data and you want to know what day of the week each date correspond to. You then need to define a new variable with this information.
This dataset contains information on every single trip taken with a yellow New York City taxi cab in the month of June, 2015. This is over 12 million trips! You can download a zip file with a 5% sample of the full data and code here, If you want the full data with all 12 million trips (256MB file), you can get that here. You can download data in raw format for other months of the year here.
In the following we will use the 5% sample. We start by loading the data:
taxi <- readRDS('data/yellow_trip_2015-06_small.rds')
We will be using the dplyr package for transformations so let’s load this package and then use the glimpse command to peek at the data:
library(dplyr)
glimpse(taxi)
## Observations: 616,619
## Variables: 18
## $ VendorID (int) 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1...
## $ tpep_pickup_datetime (time) 2015-06-01 07:05:49, 2015-06-07 04:17:2...
## $ tpep_dropoff_datetime (time) 2015-06-01 07:16:11, 2015-06-07 04:27:2...
## $ passenger_count (int) 1, 5, 2, 1, 2, 1, 5, 1, 4, 1, 5, 1, 6, 1...
## $ trip_distance (dbl) 1.70, 2.68, 1.17, 3.39, 0.69, 0.80, 1.49...
## $ pickup_longitude (dbl) -73.96334, -73.94919, -73.98258, -73.991...
## $ pickup_latitude (dbl) 40.76592, 40.71407, 40.77238, 40.75027, ...
## $ RateCodeID (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ store_and_fwd_flag (chr) "N", "N", "N", "N", "N", "N", "N", "N", ...
## $ dropoff_longitude (dbl) -73.98391, -73.90588, -73.99293, -73.996...
## $ dropoff_latitude (dbl) 40.75550, 40.70319, 40.75834, 40.71296, ...
## $ payment_type (int) 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1...
## $ fare_amount (dbl) 9.0, 10.5, 7.5, 14.5, 6.5, 6.0, 8.5, 53....
## $ extra (dbl) 0.0, 0.5, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0, ...
## $ mta_tax (dbl) 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, ...
## $ tip_amount (dbl) 1.00, 0.00, 0.00, 2.00, 0.00, 0.00, 1.86...
## $ tolls_amount (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ total_amount (dbl) 10.80, 11.80, 8.30, 17.80, 8.30, 7.30, 1...
So there is a total of 616619 trips in the 5% sample and 18 variables in the data. As far as raw data goes this is in a pretty good condition. However, it still needs a few tweaks before we can start using it for analytics.
Let’s start by adding the week day of each trip. Day of week is not found in the original data. This only contains the date of each trip. For example, the first 5 trips were made on dates
taxi$tpep_pickup_datetime[1:5]
## [1] "2015-06-01 07:05:49 UTC" "2015-06-07 04:17:25 UTC"
## [3] "2015-06-21 18:11:47 UTC" "2015-06-14 20:54:44 UTC"
## [5] "2015-06-24 17:19:30 UTC"
We want to know what the day of week these dates correspond to. There are different ways to do this but when it comes to working with date and time functions it is a huge advantage to use functions from the lubridate package. Let’s load it:
## use install.package("lubridate") first if you have not yet installed this package on your R system
library(lubridate)
This library contains a neat function wday that returns the weekday of a date. We can then use this together with the mutate function to create a new variable in the taxi data:
taxi.new <- mutate(taxi,weekday = wday(tpep_pickup_datetime,label=TRUE,abbr=TRUE))
Ok - what is going on here? This command “mutates” the orginal taxi data and turns it into a new data frame called taxi.new. This data frame will have 19 variables. In this case “mutate” simply means adding another variable to the data called weekday. The options to the wday function tells R that we wish text labels for each observation and that weekday abbreviations are to be used. With this done you can now easily count number of trips per weekday:
table(taxi.new$weekday)
##
## Sun Mon Tues Wed Thurs Fri Sat
## 75030 95128 100787 84369 86344 87838 87123
Suppose we also wanted a variable capturing hour of the day in addition to weekday. We can use the hour command for this:
taxi.new <- mutate(taxi,
weekday = wday(tpep_pickup_datetime,label=TRUE,abbr=TRUE),
hour.trip.start = hour(tpep_pickup_datetime))
This will create another variable capturing hour of the day of each trip. The first 5 trips were made at the following hours:
taxi.new$hour.trip.start[1:5]
## [1] 7 4 18 20 17
The hour function returns hour of day as a numeric variable. For later use it might be better to define as a factor. R calls categorical variables for “factors” and it is usually a good idea to force categorical variables to be factors:
taxi.new <- mutate(taxi,
weekday = wday(tpep_pickup_datetime,label=TRUE,abbr=TRUE),
hour.trip.start = factor(hour(tpep_pickup_datetime)))
The number of trips by hour of day is
table(taxi.new$hour.trip.start)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 24264 17646 12743 9151 6943 6637 14299 23338 28572 28856 28153 29232
## 12 13 14 15 16 17 18 19 20 21 22 23
## 30085 29603 30592 28578 25182 30046 36109 37915 34861 36971 35448 31395
You can keep adding further statements in the mutate function.
Find out what the data transformations transf1-transf4 do to the original data. Give them names that are meaningful (i.e., replace the names transf1-transf4 with a new name that matches the meaning of the transformation). You probably want to use the R help function to aide you. For example help(difftime) will give you help on the difftime function.
taxi.new <- mutate(taxi,
weekday = wday(tpep_pickup_datetime,label=TRUE,abbr=TRUE),
hour.trip.start = factor(hour(tpep_pickup_datetime)),
transf1 = factor(mday(tpep_dropoff_datetime)),
transf2 = as.numeric(difftime(tpep_dropoff_datetime,tpep_pickup_datetime)/60),
transf3 = ifelse(transf2 > 1, trip_distance/(transf2/60), -1),
transf4 = cut(payment_type,
breaks = c(0,1,2,5),
labels=c('Credit Card','Cash','Other')))