Raw data is typically not available as R data files. If your data has been extracted from a database, you will most likely receive it as flat text files, e.g., csv files. Data recording web activity is often stored in a format called json. In this section we will look at how to easily read data files into R’s memory and how to store them as R databases.
In the following we will use some data from the NFL as example data. You can download all data and code for this lesson as a zip folder here.
If your data is already stored in an R database file, you can read it in with the load command:
FileName <- 'c:/mydata/2013_nfl.rda'
load(FileName)
This will load the data frames (or other R objects) contained in the R database “2013_nfl.rda”. Note that this database may contain one or several data frames. There are several advantages to storing your data as an R database: They are highy compressed files and R will read them into memory very fast. If you have obtained your data in another data format, it is recommended to start your analysis by converting all data to an R database (see below for how you do this).
Note that in the example above we provided the full file path and name when loading the data. If the file you are loading is located in the current working directory, then you can simply write the file name:
FileName <- '2013_nfl.rda'
load(FileName)
Don’t know your current working directory? Simply execute the command
getwd()
in the R console. If the file is located in a subdirectory of your current work directory called - for example - “data” - then you can write
FileName <- 'data/2013_nfl.rda'
load(FileName)
It is good practice to start out a data analytics project by defining a new folder on your local drive, e.g., “SuperDuperAnalytics”. Then set the working directory to be this folder. Then you can have subfolders called “data”, “code”, “results” etc.
One feature of the load command that is sometimes annoying, is that you don’t know the names of the R objects, e.g., data frames, that you are loading until after you have read them into memory. This can create problems if you are reading in multiple .rda files in sequence and some of them have objects with identical names. An alternative is to save and read R data using files where there is only one single object rather than multiple objects. If R data has been saved using the saveRDS command (see below for more on this), you can read it in as
nfl.data <- readRDS('c:/mydata/2013_nfl.rds')
The advantage here is that you have control over what to call the R data frame when loaded in memory.
It is straightforward to read a csv file into R’s memory:
FileName <- 'c:/mydata/NFL/data/2013 NFL Play-by-Play Data.csv'
nfl.data <- read.csv(FileName)
If the first line of the csv file contains column names, R will automatically assign these as variable names in the resulting data.frame:
names(nfl.data)
## [1] "Date"
## [2] "Tm"
## [3] "Opp"
## [4] "Quarter"
## [5] "Time"
## [6] "Down"
## [7] "ToGo"
## [8] "Side.of.Field"
## [9] "Yard.Marker"
## [10] "Tm.Score"
## [11] "Opp.Score"
## [12] "Detail"
## [13] "Yds"
## [14] "Play.Type"
## [15] "Pass.Result"
## [16] "Pass.Distance"
## [17] "Pass.Location"
## [18] "Run.Location"
## [19] "Turnover"
## [20] "Fumble"
## [21] "Interception"
## [22] "Passer"
## [23] "Intended.Receiver"
## [24] "Receiver"
## [25] "Targeted.Receiver"
## [26] "Rusher"
## [27] "Touchdown"
## [28] "First.Down"
## [29] "Time.Under"
## [30] "Score.Differential"
## [31] "Absolute.Score.Differential"
## [32] "Game.Week"
## [33] "Team.Game.Location"
## [34] "Rush.Attempt"
## [35] "Pass.Attempt"
## [36] "Reception"
## [37] "Interception.Thrown"
## [38] "Target"
## [39] "Fumble.Count"
## [40] "Sack"
## [41] "Len"
## [42] "Yard.Line"
## [43] "Playmaker"
## [44] "Touches"
## [45] "Play.Attempts"
## [46] "Playmaker.Position"
## [47] "Yards.Gained"
## [48] "Play.Location"
## [49] "Touchdown.Count"
## [50] "Play.Percent.of.Goal"
## [51] "First.Down.Count"
## [52] "Conversion.Count"
## [53] "Goal.To.Go"
## [54] "Success.Count"
## [55] "Penalty.Removed.Detail"
## [56] "Len.Penalty"
## [57] "Playmaker.Fantasy.Points"
## [58] "Passer.Fantasy.Points"
## [59] "Fantasy.Points.Rushing.Yards"
## [60] "Fantasy.Points.Receiving.Yards"
## [61] "Fantasy.Points.Receptions"
## [62] "Fantasy.Points.Rushing.TD"
## [63] "Fantasy.Points.Receiving.TD"
## [64] "Fantasy.Points.Playmaker.Fumble"
## [65] "Fantasy.Points.Passer.Yards"
## [66] "Fantasy.Points.Passer.TD"
## [67] "Fantasy.Points.Passer.Interception"
## [68] "Team.Game.Number"
If you have very large csv files (in the GB range), you may want to swith to the read_csv function in the readr library:
library(readr)
nfl.data <- read_csv(FileName)
This is quite a bit faster than than the base read.csv function and allows for fine control over variable formatting (see the documentation for the readr package). A cool feature of the readr library is that you can read raw data stored in zip files without “unzipping” them first:
ZipFileName <- 'c:/mydata/NFL/2013 NFL Play-by-Play Data.zip'
nfl.data <- read_csv(ZipFileName)
An alternative to using the above functions in R is to use RIO. This is an R package that provides data input functions for a variety of data formats.
You can store data in R’s compressed database format using the save command:
save(nfl.data,file='c:/mydata/nfl_data.rda')
You can store multiple objects in one file. Suppose you had two different versions of the data, nfl.data.V1 and nfl.data.V2. Then you can store both data frames in one file with
save(nfl.data.V1,nfl.data.V2,file='c:/mydata/nfl_data.rda')
When you then use
load('c:/mydata/nfl_data.rda')
the data frames nfl.data.V1 and nfl.data.V2 will appear in R’s memory.
To save only a single object so that you can later rename it at the time of loading, use the saveRDS command:
saveRDS(nfl.data,file='c:/mydata/nfl_data.rds')
In this case you can only save a single object.
Always start by converting raw data to R data files. You can then start any subsequent data analysis by reading in the data in R format rather than in raw format. This will be much faster. Save your code for converting raw data to R format in case you want to go back to check for errors: Make an R source file called something like convert_data_to_R.r. This file should contain the code you used to (a) read in the raw data, (b) perform any initial data clean-up or transformations and (c) save data in R format.
A good prescription is to save large data files using the saveRDS command. You can then load them later using a customized name.
You will often have a need for merging data frames. For example, suppose you have a customer transaction file where each customer is identified by some unique id. In another file you may have other information on each customer, e.g., address, type, length of relationship etc. If you want to link customer characteristics to transaction behavior, you will need to merge the two files. In this case, we would merge on the unique customer id.
You can merge data frames in R using the join command in the dplyr library. As an example, suppose we first count the number of turnovers for each NFL team in the 2013 season (for details on the group_by command see the group summaries section):
library(dplyr)
turnover.total <- nfl.data %>%
group_by(Tm) %>%
summarize(Turnovers=sum(Turnover))
The result is
## Tm Turnovers
## 1 49ers 21
## 2 Bears 29
## 3 Bengals 36
## 4 Bills 40
## 5 Broncos 29
## 6 Browns 34
## 7 Buccaneers 25
## 8 Cardinals 39
## 9 Chargers 21
## 10 Chiefs 26
## 11 Colts 20
## 12 Cowboys 27
## 13 Dolphins 31
## 14 Eagles 28
## 15 Falcons 31
## 16 Giants 50
## 17 Jaguars 38
## 18 Jets 37
## 19 Lions 46
## 20 Packers 31
## 21 Panthers 21
## 22 Patriots 29
## 23 Raiders 39
## 24 Rams 26
## 25 Ravens 36
## 26 Redskins 45
## 27 Saints 19
## 28 Seahawks 27
## 29 Steelers 29
## 30 Texans 41
## 31 Titans 36
## 32 Vikings 37
Now let’s read in another file with information on each teams win-loss percentage in the same season:
nfl.record <- read_csv(file = 'data/NFL_2013_Record.csv')
nfl.record
## Source: local data frame [32 x 3]
##
## Tm Record Conference
## (chr) (dbl) (chr)
## 1 Eagles 0.625 NFC
## 2 Cowboys 0.500 NFC
## 3 Giants 0.438 NFC
## 4 Redskins 0.188 NFC
## 5 Packers 0.531 NFC
## 6 Bears 0.500 NFC
## 7 Lions 0.438 NFC
## 8 Vikings 0.344 NFC
## 9 Panthers 0.750 NFC
## 10 Saints 0.688 NFC
## .. ... ... ...
We can now merge the two data frames on the variable Tm:
turnover.total.record <- turnover.total %>%
left_join(nfl.record,by="Tm")
This will return the following data frame:
## Tm Turnovers Record Conference
## 1 49ers 21 0.750 NFC
## 2 Bears 29 0.500 NFC
## 3 Bengals 36 0.688 AFC
## 4 Bills 40 0.375 AFC
## 5 Broncos 29 0.813 AFC
## 6 Browns 34 0.250 AFC
## 7 Buccaneers 25 0.250 NFC
## 8 Cardinals 39 0.625 NFC
## 9 Chargers 21 0.563 AFC
## 10 Chiefs 26 0.688 AFC
## 11 Colts 20 0.688 AFC
## 12 Cowboys 27 0.500 NFC
## 13 Dolphins 31 0.500 AFC
## 14 Eagles 28 0.625 NFC
## 15 Falcons 31 0.250 NFC
## 16 Giants 50 0.438 NFC
## 17 Jaguars 38 0.250 AFC
## 18 Jets 37 0.500 AFC
## 19 Lions 46 0.438 NFC
## 20 Packers 31 0.531 NFC
## 21 Panthers 21 0.750 NFC
## 22 Patriots 29 0.750 AFC
## 23 Raiders 39 0.250 AFC
## 24 Rams 26 0.438 NFC
## 25 Ravens 36 0.500 AFC
## 26 Redskins 45 0.188 NFC
## 27 Saints 19 0.688 NFC
## 28 Seahawks 27 0.813 NFC
## 29 Steelers 29 0.500 AFC
## 30 Texans 41 0.125 AFC
## 31 Titans 36 0.438 AFC
## 32 Vikings 37 0.344 NFC
By default left_join will return all columns from both data frames. If there are rows in the “left” data frame (here turnover.total) that do not have a match in the “right” data frame (here nfl.record), then these will have a NA in the corresponding row. To see this in action, let’s remove two teams from the nfl.record data frame and then merge:
nfl.record.miss <- nfl.record %>%
filter(!Tm %in% c('Bears','Chiefs'))
turnover.total.record.miss <- turnover.total %>%
left_join(nfl.record.miss,by="Tm")
The result is now
## Tm Turnovers Record Conference
## 1 49ers 21 0.750 NFC
## 2 Bears 29 NA <NA>
## 3 Bengals 36 0.688 AFC
## 4 Bills 40 0.375 AFC
## 5 Broncos 29 0.813 AFC
## 6 Browns 34 0.250 AFC
## 7 Buccaneers 25 0.250 NFC
## 8 Cardinals 39 0.625 NFC
## 9 Chargers 21 0.563 AFC
## 10 Chiefs 26 NA <NA>
## 11 Colts 20 0.688 AFC
## 12 Cowboys 27 0.500 NFC
## 13 Dolphins 31 0.500 AFC
## 14 Eagles 28 0.625 NFC
## 15 Falcons 31 0.250 NFC
## 16 Giants 50 0.438 NFC
## 17 Jaguars 38 0.250 AFC
## 18 Jets 37 0.500 AFC
## 19 Lions 46 0.438 NFC
## 20 Packers 31 0.531 NFC
## 21 Panthers 21 0.750 NFC
## 22 Patriots 29 0.750 AFC
## 23 Raiders 39 0.250 AFC
## 24 Rams 26 0.438 NFC
## 25 Ravens 36 0.500 AFC
## 26 Redskins 45 0.188 NFC
## 27 Saints 19 0.688 NFC
## 28 Seahawks 27 0.813 NFC
## 29 Steelers 29 0.500 AFC
## 30 Texans 41 0.125 AFC
## 31 Titans 36 0.438 AFC
## 32 Vikings 37 0.344 NFC
If you want the result only for teams where there is a positive match, you can use inner_join instead:
turnover.total.record.miss <- turnover.total %>%
inner_join(nfl.record.miss,by="Tm")
This results in
## Tm Turnovers Record Conference
## 1 49ers 21 0.750 NFC
## 2 Bengals 36 0.688 AFC
## 3 Bills 40 0.375 AFC
## 4 Broncos 29 0.813 AFC
## 5 Browns 34 0.250 AFC
## 6 Buccaneers 25 0.250 NFC
## 7 Cardinals 39 0.625 NFC
## 8 Chargers 21 0.563 AFC
## 9 Colts 20 0.688 AFC
## 10 Cowboys 27 0.500 NFC
## 11 Dolphins 31 0.500 AFC
## 12 Eagles 28 0.625 NFC
## 13 Falcons 31 0.250 NFC
## 14 Giants 50 0.438 NFC
## 15 Jaguars 38 0.250 AFC
## 16 Jets 37 0.500 AFC
## 17 Lions 46 0.438 NFC
## 18 Packers 31 0.531 NFC
## 19 Panthers 21 0.750 NFC
## 20 Patriots 29 0.750 AFC
## 21 Raiders 39 0.250 AFC
## 22 Rams 26 0.438 NFC
## 23 Ravens 36 0.500 AFC
## 24 Redskins 45 0.188 NFC
## 25 Saints 19 0.688 NFC
## 26 Seahawks 27 0.813 NFC
## 27 Steelers 29 0.500 AFC
## 28 Texans 41 0.125 AFC
## 29 Titans 36 0.438 AFC
## 30 Vikings 37 0.344 NFC
There is also a semi_join and anti_join but they tend to be less useful (for the details see the help for join in the dplyr library).
What is the result? Let’s do a quick visualization (you will learn how to do this in the Data Visualization section):
What is the relationship between a quarterback’s interception rate (defined as number of interceptions thrown divided by the total number of passes) and the quaterback’s age and experience?
First we load data on frequently playing quarterbacks’ interception rate:
FileName <- 'data/interceptions_qb.rds'
interceptions.qb <- readRDS(FileName)
Next load data on NFL player characteristics:
FileName <- 'data/NFL_2013_Player_Census.rds'
player.data <- readRDS(FileName)
Hint: To perform the relevant merge, note that the player name variable has a different name in the two data frames to be merged. To find out how to solve this problem look in the help section for left_join (write help(“left_join”) in the R console).