Raw data is typically not available as R data files. If your data has been extracted from a database, you will most likely receive it as flat text files, e.g., CSV files. Data recording web activity is often stored in a format called JSON. In this section we will look at how to easily read data files into R's memory and how to store them as R data files.

Reading Data into R

In the following we will use some data from the NFL as our example data. You can download all data and code for this lesson as an Rstudio project here. You should unzip this file somewhere on your computer's hard drive and then open the file Data Import and Export.Rproj in Rstudio.

The best current R package for importing data is readr, which is part of the tidyverse collection of packages. If you have not already installed this collection, you should do so now:

#install.packages("tidyverse")    ## only if you have not already installed this package
library(tidyverse)

R data files

The recommended format for storing a single data file for use in R (e.g., a large data frame) is as an RDS file. This is a highly compressed file format, typically much smaller than, for example, a CSV file (and often even smaller than a zipped CSV file). RDS files can also be read into R very fast. If your data is already stored in RDS format, you can read it in with the read_rds command from the readr package (or with base R's readRDS command):

FileName <- 'c:/mydata/nflplaybyplay2015.rds'
nfl2015 <- read_rds(FileName)

This will read the file “nflplaybyplay2015.rds” into an R data frame called nfl2015 (you can give this data frame any name you wish). This assumes that the file is located in the folder mydata on the c-drive of your computer.


Digression: Paths in R

Note that in the example above we provided the full file path and name when loading the data. If the file you are loading is located in the current working directory, then you can simply write the file name:

FileName <- 'nflplaybyplay2015.rds'
nfl2015 <- read_rds(FileName)

Don’t know your current working directory? Simply execute the command

getwd()

in the R console. If the file is located in a subdirectory of your current working directory called, for example, "data", then you can write

FileName <- 'data/nflplaybyplay2015.rds'
nfl2015 <- read_rds(FileName)

It is good practice to start a data analytics project by creating a new folder on your local drive, e.g., "SuperDuperAnalytics", and setting the working directory to this folder. You can then have subfolders called "data", "code", "results", etc. The easiest way to manage all of this is via an Rstudio project.
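For example, here is a minimal sketch of this setup in plain R (the folder name and location are purely illustrative; if you use an Rstudio project, the working directory is set automatically when you open the project):

setwd("c:/SuperDuperAnalytics")   ## make the project folder the working directory
dir.create("data")                ## subfolder for raw data files
dir.create("code")                ## subfolder for R scripts
dir.create("results")             ## subfolder for output and figures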


Getting Raw Data into R

It is rare that your raw data is available as an RDS file, so we need to be able to get non-R data into R.

Reading CSV Files

CSV files are probably the most popular format for exchanging and storing data. They are straightforward to read into R:

FileName <- 'c:/mydata/nflplaybyplay2015.csv'
nfl2015 <- read_csv(FileName)

The read_csv command assumes that your data is separated by commas (CSV=Comma Separated Values). If you have a file where the values are separated by something else (e.g., “|”), then you can use the read_delim function in the readr package.
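For example, here is a minimal sketch for a hypothetical pipe-delimited version of the same data (the file name is purely illustrative):

FileName <- 'c:/mydata/nflplaybyplay2015.txt'
nfl2015 <- read_delim(FileName, delim = "|")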

The read_csv command will try to guess the best data type for each of your data fields (e.g., character, date, integer, etc.). If you want to see what read_csv has chosen, you can write

spec(nfl2015)

If you don’t like what read_csv has chosen, you can force different choices by using the col_types option (for details type vignette("column-types") in the R command line).
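For example, you could force a few columns to particular types and let read_csv guess the rest. The column names below are purely illustrative; use the names that spec reports for your own file:

nfl2015 <- read_csv(FileName,
                    col_types = cols(
                      .default = col_guess(),                  ## guess all other columns
                      GameID = col_character(),                ## illustrative column names
                      Date = col_date(format = "%Y-%m-%d"),
                      Yards.Gained = col_integer()
                    ))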

A cool thing about the readr package is that you can read zipped files without unzipping them first. This is a killer feature because you can then store your raw data in zipped files and just read those in directly:

FileName <- 'c:/mydata/nflplaybyplay2015.zip'
nfl2015 <- read_csv(FileName)

Digression: Using the RStudio Data Import Wizard

You can also use RStudio's built-in data import wizard (note: this requires that you have the preview version of RStudio installed). Simply click on the "Import Dataset" button in the top right window in RStudio and follow the dialogue boxes. If you decide to take this approach, make sure to save the sample R code used (you can find it in the lower right hand box in the data import window) by copy-and-pasting it into an R script file for later use.


Reading Excel Files

Excel files can be read using the read_excel command. You can read separate sheets from the Excel file by providing an optional sheet number. The Excel file used below contains data from a 2015 travel decision survey conducted in San Francisco (the file was downloaded from Data.Gov from this link). Let’s read in this data and plot the distribution of total number of monthly trips taken by the survey respondents:

library(readxl)   # library for reading excel files 

file.name <- "data/TDS_202015_20Data-WEBPAGE.xlsx"

survey.info <- read_excel(file.name,sheet = 1)
survey.data <- read_excel(file.name,sheet = 2)
survey.dict <- read_excel(file.name,sheet = 3)

survey.data %>%
  ggplot(aes(x=Trips)) + geom_histogram()

The first sheet contains some basic information about the survey, the second contains the actual survey response data, while the third sheet contains the data dictionary. The trip distribution is quite skewed, with most respondents taking very few trips (note: the ggplot command is the main plotting command in the ggplot2 library, an extremely powerful visualization package. You will learn all the details about visualization using ggplot in a later module). How does this break out across the different counties in the Bay Area? We can investigate this by re-plotting the histogram for each county. This is most easily done using facets (again, you will learn this at a later stage):

survey.data %>%
  ggplot(aes(x=Trips)) + geom_histogram() + facet_wrap(~Q2,scales='free')

Hmm…this is interesting but not all that informative since we don't know what the codes mean! These are hidden in the data dictionary. We can extract them using a little R data-munging magic and then redo the plot (don't worry about the details of the code below; you will learn these later):

county.codes <- data.frame(codes=unlist(strsplit( survey.dict$Codes[2], "\r\n"))) %>%
  separate(codes, into=c("Q2","Q2.Name"),sep='=') %>%
  mutate(Q2=as.numeric(Q2))

survey.data %>%
  left_join(county.codes,by='Q2') %>%
  ggplot(aes(x=Trips)) + geom_histogram() + facet_wrap(~Q2.Name,scales='free')

Reading Online Files

If your original data file exists online, you can read it into R without downloading it first by simply replacing the file path with the URL of the data file. Let’s read in some data from the city of San Diego’s open data portal:

the.file <- "http://data.sandiego.gov/sites/default/files/indicator_bacteria_tests_datasd_3.csv"
bacteria.test <- read_csv(the.file)

the.file <- "http://datasd-prod.s3.amazonaws.com/water_testing/sample_sites_datasd.csv"
test.sites <- read_csv(the.file)

The first file contains the results of about 2 1/2 years of bacteria tests of drinking water at each of 155 sample sites. The second file contains information on each of the sites. Let's quickly plot the measured pH value over time for 6 of the sites:

bacteria.test %>%
  filter(source %in% c("239 SYS","280 SYS","323 SYS","55A SYS","49A SYS","309 SYS")) %>%
  left_join(test.sites,by=c("source"="site_ID")) %>%
  ggplot(aes(x=sample_date,y=ph,group=source,color=source)) + geom_line() + 
  geom_smooth() + facet_wrap(~location_description)+
  theme(legend.position="none")+
  ggtitle("Drinking Water pH at 6 San Diego Sample Sites")+xlab('Date')

Reading JSON files

JSON data files are less well known than CSV or other such delimited text files, but are quite common for exchanging data measuring web activity. You can read in JSON files using the jsonlite package. Warning: Reading in JSON files can be a bit frustrating at times, as there are different types of JSON formats (flat JSON, hierarchical JSON, etc.), so you will often have to do some trial and error (coupled with some Google research). Having said that, here is an example where we read in a JSON file sitting in a zipped folder containing information on all products in Amazon's patio furniture category:

library(jsonlite)
patio.data <- stream_in(gzfile("data/Patio_Lawn_and_Garden.json.gz"))
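The stream_in command here is for newline-delimited JSON, where each line of the file is a separate JSON record, and gzfile lets it read the zipped file without unzipping it first. If your file is instead a single flat JSON array of records, jsonlite's fromJSON will typically return a data frame directly. A minimal sketch, assuming a hypothetical file data/products.json:

products <- fromJSON("data/products.json")   ## hypothetical flat JSON file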

Storing Data as R files

Once you have read your data into R's memory, you should immediately store it in R's own data format. Why? Because it is a highly compressed format and R can read in this data extremely fast. So your workflow should be: (1) get the raw data into R and then (2) save it in R data format. You only need to do this once; from then on you just read in the R data file for any subsequent analysis you do.

Let’s save the San Diego drinking water test files as R files:

saveRDS(bacteria.test,file='san_diego_water.rds')
saveRDS(test.sites,file='san_diego_water_test_sites.rds')

As we saw above, you can now read in the data as

water <- read_rds('san_diego_water.rds')
sites <- read_rds('san_diego_water_test_sites.rds')

Copyright © 2016 Karsten T. Hansen. All rights reserved.