Data Transformations

Once you have loaded raw data into R’s memory, you will often have a need to do various data transformations. For example, you might have a variable called gender in your raw data, taking values 0 and 1. This is a bad way of coding gender - what does 0 mean? You might know that males are coded as 0, but if you hand off your data to someone else, they will have no idea. So recode it as “male” and “female”. Similarly, suppose you have dates in your data and you want to know what day of the week each date correspond to. You then need to define a new variable with this information.

Example: New York Taxi Trip Data

WWW This dataset contains information on every single trip taken with a yellow New York City taxi cab in the month of June, 2015. This is over 12 million trips! You can download a zip file with a 5% sample of the full data and code here, If you want the full data with all 12 million trips (256MB file), you can get that here. You can download data in raw format for other months of the year here.

In the following we will use the 5% sample. We start by loading the usual base library tidyverse, a couple of other helpful libraries that we will rely on below and the data:

library(tidyverse)
library(lubridate)  # for working with dates and times 
library(forcats)    # for recoding factor values 
taxi <- read_rds('data/yellow_tripdata_2015-06_small.rds')

Use the glimpse command to peek at the data:

glimpse(taxi)

## Observations: 616,247
## Variables: 19
## $ VendorID              <int> 2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2...
## $ tpep_pickup_datetime  <dttm> 2015-06-12 12:51:09, 2015-06-25 06:13:5...
## $ tpep_dropoff_datetime <dttm> 2015-06-12 13:02:49, 2015-06-25 06:36:3...
## $ passenger_count       <int> 1, 1, 5, 1, 2, 1, 1, 4, 5, 1, 1, 1, 1, 6...
## $ trip_distance         <dbl> 1.58, 5.30, 4.47, 0.88, 3.80, 1.35, 0.40...
## $ pickup_longitude      <dbl> -73.97024, -73.99834, -73.99641, -73.995...
## $ pickup_latitude       <dbl> 40.75240, 40.75673, 40.72044, 40.74418, ...
## $ RateCodeID            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1...
## $ store_and_fwd_flag    <chr> "N", "N", "N", "N", "N", "N", "N", "N", ...
## $ dropoff_longitude     <dbl> -73.95473, -73.92249, -73.99150, -74.003...
## $ dropoff_latitude      <dbl> 40.76952, 40.75463, 40.76827, 40.73352, ...
## $ payment_type          <int> 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2...
## $ fare_amount           <dbl> 9.0, 20.0, 17.5, 7.5, 14.5, 9.5, 3.5, 5....
## $ extra                 <dbl> 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.5, 0.5, ...
## $ mta_tax               <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, ...
## $ tip_amount            <dbl> 1.96, 4.15, 2.00, 0.00, 0.00, 0.00, 0.00...
## $ tolls_amount          <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00...
## $ improvement_surcharge <dbl> 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, ...
## $ total_amount          <dbl> 11.76, 24.95, 20.30, 8.30, 15.80, 10.30,...

So there is a total of 616247 trips in the 5% sample and 19 variables in the data. As far as raw data goes this is in a pretty good condition. However, it still needs a few tweaks before we can start using it for analytics.

Let’s start by adding the week day of each trip. Day of week is not found in the original data. This only contains the date of each trip. For example, the first 5 trips were made on dates

taxi$tpep_pickup_datetime[1:5]

## [1] "2015-06-12 12:51:09 EDT" "2015-06-25 06:13:56 EDT"
## [3] "2015-06-07 10:39:18 EDT" "2015-06-26 11:26:23 EDT"
## [5] "2015-06-22 01:03:54 EDT"

We want to know what the day of week these dates correspond to. There are different ways to do this but when it comes to working with date and time functions it is a huge advantage to use functions from the lubridate package. This library contains a neat function wday that returns the weekday of a date. We can then use this together with the mutate function to create a new variable in the taxi data:

taxi.new <- mutate(taxi,weekday = wday(tpep_pickup_datetime,label=TRUE,abbr=TRUE))

Ok - what is going on here? This command “mutates” the orginal taxi data and turns it into a new data frame called taxi.new. This data frame will have 19 variables. In this case “mutate” simply means adding another variable to the data called weekday. The options to the wday function tells R that we wish text labels for each observation and that weekday abbreviations are to be used. With this done you can now easily count number of trips per weekday:

table(taxi.new$weekday)

## 
##    Sun    Mon   Tues    Wed  Thurs    Fri    Sat 
##  75080  95052 100643  84497  86718  87448  86809

Suppose we also wanted a variable capturing hour of the day in addition to weekday. We can use the hour command for this:

taxi.new <- mutate(taxi,
                   weekday = wday(tpep_pickup_datetime,label=TRUE,abbr=TRUE),
                   hour.trip.start = hour(tpep_pickup_datetime))

This will create another variable capturing hour of the day of each trip. The first 5 trips were made at the following hours:

taxi.new$hour.trip.start[1:5]

## [1] 12  6 10 11  1

The hour function returns hour of day as a numeric variable. For later use it might be better to define it as a factor. R calls categorical variables for “factors” and it is usually a good idea to force categorical variables to be factors:

taxi.new <- mutate(taxi,
                   weekday = wday(tpep_pickup_datetime,label=TRUE,abbr=TRUE),
                   hour.trip.start = factor(hour(tpep_pickup_datetime)))

The number of trips by hour of day is

table(taxi.new$hour.trip.start)

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 24623 17663 12829  8969  6964  6639 14341 23116 28581 28900 28162 28858 
##    12    13    14    15    16    17    18    19    20    21    22    23 
## 29966 29675 30828 28601 25115 29722 36354 37689 34733 36771 35536 31612

You can keep adding further statements in the mutate function.