Data Transformations

Once you have loaded raw data into R’s memory, you will often have a need to do various data transformations. For example, you might have a variable called gender in your raw data, taking values 0 and 1. This is a bad way of coding gender - what does 0 mean? You might know that males are coded as 0, but if you hand off your data to someone else, they will have no idea. So recode it as “male” and “female”. Similarly, suppose you have dates in your data and you want to know what day of the week each date correspond to. You then need to define a new variable with this information.

Example: New York Taxi Trip Data

WWW This dataset contains information on every single trip taken with a yellow New York City taxi cab in the month of June, 2015. This is over 12 million trips! You can download a zip file with a 5% sample of the full data and code here, If you want the full data with all 12 million trips (256MB file), you can get that here. You can download data in raw format for other months of the year here.

In the following we will use the 5% sample. We start by loading the data:

taxi <- readRDS('data/yellow_trip_2015-06_small.rds')

We will be using the dplyr package for transformations so let’s load this package and then use the glimpse command to peek at the data:

library(dplyr)
glimpse(taxi)

## Observations: 616,619
## Variables: 18
## $ VendorID              (int) 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1...
## $ tpep_pickup_datetime  (time) 2015-06-01 07:05:49, 2015-06-07 04:17:2...
## $ tpep_dropoff_datetime (time) 2015-06-01 07:16:11, 2015-06-07 04:27:2...
## $ passenger_count       (int) 1, 5, 2, 1, 2, 1, 5, 1, 4, 1, 5, 1, 6, 1...
## $ trip_distance         (dbl) 1.70, 2.68, 1.17, 3.39, 0.69, 0.80, 1.49...
## $ pickup_longitude      (dbl) -73.96334, -73.94919, -73.98258, -73.991...
## $ pickup_latitude       (dbl) 40.76592, 40.71407, 40.77238, 40.75027, ...
## $ RateCodeID            (int) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ store_and_fwd_flag    (chr) "N", "N", "N", "N", "N", "N", "N", "N", ...
## $ dropoff_longitude     (dbl) -73.98391, -73.90588, -73.99293, -73.996...
## $ dropoff_latitude      (dbl) 40.75550, 40.70319, 40.75834, 40.71296, ...
## $ payment_type          (int) 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1...
## $ fare_amount           (dbl) 9.0, 10.5, 7.5, 14.5, 6.5, 6.0, 8.5, 53....
## $ extra                 (dbl) 0.0, 0.5, 0.0, 0.5, 1.0, 0.5, 0.0, 0.0, ...
## $ mta_tax               (dbl) 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, ...
## $ tip_amount            (dbl) 1.00, 0.00, 0.00, 2.00, 0.00, 0.00, 1.86...
## $ tolls_amount          (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ total_amount          (dbl) 10.80, 11.80, 8.30, 17.80, 8.30, 7.30, 1...

So there is a total of 616619 trips in the 5% sample and 18 variables in the data. As far as raw data goes this is in a pretty good condition. However, it still needs a few tweaks before we can start using it for analytics.

Let’s start by adding the week day of each trip. Day of week is not found in the original data. This only contains the date of each trip. For example, the first 5 trips were made on dates

taxi$tpep_pickup_datetime[1:5]

## [1] "2015-06-01 07:05:49 UTC" "2015-06-07 04:17:25 UTC"
## [3] "2015-06-21 18:11:47 UTC" "2015-06-14 20:54:44 UTC"
## [5] "2015-06-24 17:19:30 UTC"

We want to know what the day of week these dates correspond to. There are different ways to do this but when it comes to working with date and time functions it is a huge advantage to use functions from the lubridate package. Let’s load it:

## use install.package("lubridate") first if you have not yet installed this package on your R system
library(lubridate)

This library contains a neat function wday that returns the weekday of a date. We can then use this together with the mutate function to create a new variable in the taxi data:

taxi.new <- mutate(taxi,weekday = wday(tpep_pickup_datetime,label=TRUE,abbr=TRUE))

Ok - what is going on here? This command “mutates” the orginal taxi data and turns it into a new data frame called taxi.new. This data frame will have 19 variables. In this case “mutate” simply means adding another variable to the data called weekday. The options to the wday function tells R that we wish text labels for each observation and that weekday abbreviations are to be used. With this done you can now easily count number of trips per weekday:

table(taxi.new$weekday)

## 
##    Sun    Mon   Tues    Wed  Thurs    Fri    Sat 
##  75030  95128 100787  84369  86344  87838  87123

Suppose we also wanted a variable capturing hour of the day in addition to weekday. We can use the hour command for this:

taxi.new <- mutate(taxi,
                   weekday = wday(tpep_pickup_datetime,label=TRUE,abbr=TRUE),
                   hour.trip.start = hour(tpep_pickup_datetime))

This will create another variable capturing hour of the day of each trip. The first 5 trips were made at the following hours:

taxi.new$hour.trip.start[1:5]

## [1]  7  4 18 20 17

The hour function returns hour of day as a numeric variable. For later use it might be better to define as a factor. R calls categorical variables for “factors” and it is usually a good idea to force categorical variables to be factors:

taxi.new <- mutate(taxi,
                   weekday = wday(tpep_pickup_datetime,label=TRUE,abbr=TRUE),
                   hour.trip.start = factor(hour(tpep_pickup_datetime)))

The number of trips by hour of day is

table(taxi.new$hour.trip.start)

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 24264 17646 12743  9151  6943  6637 14299 23338 28572 28856 28153 29232 
##    12    13    14    15    16    17    18    19    20    21    22    23 
## 30085 29603 30592 28578 25182 30046 36109 37915 34861 36971 35448 31395

You can keep adding further statements in the mutate function.

Exercise

Find out what the data transformations transf1-transf4 do to the original data. Give them names that are meaningful (i.e., replace the names transf1-transf4 with a new name that matches the meaning of the transformation). You probably want to use the R help function to aide you. For example help(difftime) will give you help on the difftime function.

taxi.new <- mutate(taxi,
                   weekday = wday(tpep_pickup_datetime,label=TRUE,abbr=TRUE),                             
                   hour.trip.start = factor(hour(tpep_pickup_datetime)),                                  
                   transf1 = factor(mday(tpep_dropoff_datetime)),                                             
                   transf2 = as.numeric(difftime(tpep_dropoff_datetime,tpep_pickup_datetime)/60),   
                   transf3 = ifelse(transf2 > 1, trip_distance/(transf2/60), -1),
                   transf4 = cut(payment_type, 
                             breaks = c(0,1,2,5), 
                             labels=c('Credit Card','Cash','Other')))