In many situations your data will be stored in separate files with dimensions defined by the nature of the objects that the data represents. For example, a large transactional database might contain four individual files (or “tables”): (1) a transaction file (2) a product file (3) a customer file and (4) a store file. Each of these will contain different information that can be linked to each other. This is also called a relational database since the objects stored in it can be linked or related to each other.
You can download data and code for this module as an Rstudio project here. You should unzip this file somewhere on you computer’s hard drive and then open the file Relational data.Rproj in Rstudio.
Here is a simple example of a relational database: Suppose you have data on the results of two running races. Each set of race results form a table. A third table contains a list of runners, while a fourth contains the list of races. These four tables define a relational database:
load('data/run_race_database.rda')
Race1Results
Race2Results
Runners
Races
## RaceId RunnerId RaceTime
## 1 1 1 100
## 2 1 2 120
## 3 1 3 90
## 4 1 4 140
## RaceId RunnerId RaceTime
## 1 2 2 115
## 2 2 4 145
## 3 2 5 105
## 4 2 6 95
## RunnerId Name Gender Age
## 1 1 Peter M 35-39
## 2 2 Jenny F 30-34
## 3 3 Heather F 25-29
## 4 4 John M 30-34
## 5 5 Max M 35-39
## 6 6 Julie F 40-44
## RaceId RaceName
## 1 1 La Jolla Half Marathon
## 2 2 Silver Strand Half Marathon
There are multiple relationships across tables in this database. For example, we can link the same runners across races or we can link runner characteristics to race results or race name to results. All of these operations require us to join tables which is the standard operation done on relational databases. Once the join operation is done, we can continue with other summary operations, visualizations or other more advanced analytics tools.
There are different join operations available depending on what your objective is. First, suppose you are interested in comparing the same runner’s performance across different races. In this case we need to extract results for runners who show up in both of the two race results tables. You can think of this as the intersection between the two tables. This is done by using the inner_join function:
library(tidyverse)
RunnersRace12 <- inner_join(Race1Results, Race2Results, by = 'RunnerId')
RunnersRace12
## RaceId.x RunnerId RaceTime.x RaceId.y RaceTime.y
## 1 1 2 120 2 115
## 2 1 4 140 2 145
Notice that R renamed the race specific variables by adding an “x” and “y”. This is because the RaceId and RaceTime variable has the same name in Race1Result and Race2Result and we obviously need to distinguish between them. The “by” option informs R on which variable the join operation is to be performed. There may situations where you want the racers who ran both races (the intersection) but you only want the results from the first race. You can do this with a semi_join :
RunnersRace12 <- semi_join(Race1Results, Race2Results, by = 'RunnerId')
RunnersRace12
## RaceId RunnerId RaceTime
## 1 1 2 120
## 2 1 4 140
Suppose instead that you were interested in a “master” race file that had information on all races for all runners. You can think of this as the union of the race 1 and race 2 table. You can do this operation by using full_join:
RunnersRace12All <- full_join(Race1Results, Race2Results, by = 'RunnerId')
RunnersRace12All
## RaceId.x RunnerId RaceTime.x RaceId.y RaceTime.y
## 1 1 1 100 NA NA
## 2 1 2 120 2 115
## 3 1 3 90 NA NA
## 4 1 4 140 2 145
## 5 NA 5 NA 2 105
## 6 NA 6 NA 2 95
Notice that we now have a bunch of NAs (=“Not Available” - this is how R represents missing data) in the merged result. That is because the result - being the union of the two race tables - has one row with all race results for each runner and several runners only ran in one race.
Suppose instead you wanted to extract the race data for all races that runners who ran race 1 has ever run. This can be accomplished by a left_join operation:
RunnersRace1All <- left_join(Race1Results, Race2Results, by = 'RunnerId')
RunnersRace1All
## RaceId.x RunnerId RaceTime.x RaceId.y RaceTime.y
## 1 1 1 100 NA NA
## 2 1 2 120 2 115
## 3 1 3 90 NA NA
## 4 1 4 140 2 145
Finally, you may sometimes be interested in extracting the records that do not overlap between two tables rather than the ones that do. For example, suppose you were interested in the runners who ran in race 1 but not in race 2. You can think of this as the complement of the inner join operation above when compared to the original race 1 table. This can be generated by using an anti_join operation:
Race1Only <- anti_join(Race1Results,Race2Results,by='RunnerId')
Race1Only
## RaceId RunnerId RaceTime
## 1 1 1 100
## 2 1 3 90
Suppose now you wanted to compare race results by age group for race1. How should we do that? Well, now we need to join the race1 and runners table:
Race1Groups <- inner_join(Race1Results, Runners, by = 'RunnerId')
Race1Groups
## RaceId RunnerId RaceTime Name Gender Age
## 1 1 1 100 Peter M 35-39
## 2 1 2 120 Jenny F 30-34
## 3 1 3 90 Heather F 25-29
## 4 1 4 140 John M 30-34
Suppose instead you wanted results for runners participating in both races and the associated race names. We can do this as
RunnersRace12 <- inner_join(
inner_join(Race1Results, Races, by = 'RaceId'),
inner_join(Race2Results, Races, by = 'RaceId'),
by = 'RunnerId'
)
RunnersRace12
## RaceId.x RunnerId RaceTime.x RaceName.x RaceId.y RaceTime.y
## 1 1 2 120 La Jolla Half Marathon 2 115
## 2 1 4 140 La Jolla Half Marathon 2 145
## RaceName.y
## 1 Silver Strand Half Marathon
## 2 Silver Strand Half Marathon
Before considering any join operation, first think about what you are trying to accomplish - what is the end result you are looking for? Then find the appropriate join operation consistent with you goal. When you are working with large databases, the wrong type of join method can be disastrous. For example, wrongly using a full_join when you should be using an inner_join or left_join can lead to enormous data frames being generated which will often crash you computer.
The classic example of a relational database is a transactional database. This database will contain one table with the actual transactions, i.e., one row for each transaction, where a transaction is an interaction between a certain customer in a certain store buying a certain product at a certain date. An “order” is then a series of transactions made by the same customer on a certain date. The product file will contain one row for each product available with a set of variables capturing various product characteristics. The customer file will contain one row for each customer (with associated customer information), while the store file will contain one row per store (with store information).
We can illustrate the structure of the relations between the files as:
Transaction.File | Product.File | Customer.File | Store.File | |
---|---|---|---|---|
Order ID | X | |||
Order Date | X | |||
Product ID | X | X | ||
Customer ID | X | X | ||
Store ID | X | X | ||
Volume | X | |||
Total Price | X | |||
Product Info | X | |||
Customer Info | X | |||
Store Info | X |
The specific joins to perform in this appliation will depend on your objectives. For example, if you wanted to investigate how sales of a certain product group varied with the regional location of stores you would need to join both the product file (to get information about which product IDs belong to the group of interest) and the store file (to get the regional location of each store) with the transaction file. On the other hand, if you wanted to know the characteristics of your top customers, you would need to join the transaction file with the customer file.
DonorsChoose is a non-profit organization that raises money for school projects. Potential donors can go to the website and pick among active projects and donate to any project of their choosing. When a project is funded the organization channels the funds to the school in question.
The data for this case study is not included in the Rstudio project file you downloaded above. There are two data files (or tables) for in this database - donations and projects. You can download the files in R format for 2015 and 2016 here. These are large files: 155mb for the donation file and 34mb for the projects file. If you wish to have a go at the full files, you can get the full donation file (552 mb) and projects file (115mb) here. You should download the files and put them in the data folder for the current Rstudio project. In the following we will use the full database.
The full database contains a total of 1,108,217 projects dating from 2002 to 2016 to which a total of 5,820,478 donations have been made.
Let’s load the data:
donations <- read_rds('data/donations.rds')
projects <- read_rds'data/projects.rds')
There is a lot of information about both donors and projects (you can see a sample of the data if you write glimpse(projects)) in R. Let us first look at where the projects are located: How many projects are there for each state? We can calculate that by using the table command:
table(projects$school_state)
##
## AK AL AR AZ CA CO CT DC DE FL
## 2819 11860 10636 20449 195055 11967 13532 9180 3004 54826
## GA HI IA ID IL IN KS KY La LA
## 32121 5472 5860 4527 58909 28412 5190 8602 3 19672
## MA MD ME MI MN MO MS MT NC ND
## 18925 15867 5544 24069 8618 21478 11599 1820 63154 974
## NE NH NJ NM NV NY OH OK OR PA
## 2873 2824 18602 4749 13745 103999 16221 25144 12398 26277
## RI SC SD TN TX UT VA VT WA WI
## 3356 33700 2006 22225 70754 16061 19316 1022 21606 12024
## WV WY
## 4471 700
Consider now the following question: How local is giving? In other words, do donors mainly care about their own state or do the give to projects in many states? A related question would be to ask where donations to projects located in certain state originated. We will later return to this question in more detail, but for now let’s focus on donors in New York state. Where do these donors give? We can start by only considering the donation data for donors who reside in New York:
ny.donations <- donations %>%
filter(donor_state == 'NY')
This leaves us with a donation file with 710100 donations. Now we need to calculate to where these donations were made. To do this we need information on the location on each project. This is in the projects file. Therefore we need to join the ny.donations file with the projects file:
ny.donations.projects <- ny.donations %>%
inner_join(projects, by = '_projectid')
This file will have the same number of rows as the ny.donations file, but will now have add columns with information on the project that each of the donations were made to - including the state id of the school for each project. We can now simply count the project locations:
table(ny.donations.projects$school_state)
##
## AK AL AR AZ CA CO CT DC DE FL
## 1335 5097 3523 8771 98895 6984 8736 5519 1675 29020
## GA HI IA ID IL IN KS KY La LA
## 16434 2128 1872 2246 29174 12261 1791 3599 1 8747
## MA MD ME MI MN MO MS MT NC ND
## 13115 9839 3181 13929 4393 9691 3914 1069 27534 304
## NE NH NJ NM NV NY OH OK OR PA
## 941 1831 10736 2346 5418 225965 8393 8358 5831 17490
## RI SC SD TN TX UT VA VT WA WI
## 2584 12514 882 11003 28324 5394 10165 717 8333 5791
## WV WY
## 1746 561
prop.table(table(ny.donations.projects$school_state))
##
## AK AL AR AZ CA
## 1.880017e-03 7.177862e-03 4.961273e-03 1.235178e-02 1.392691e-01
## CO CT DC DE FL
## 9.835234e-03 1.230249e-02 7.772145e-03 2.358823e-03 4.086748e-02
## GA HI IA ID IL
## 2.314322e-02 2.996761e-03 2.636248e-03 3.162935e-03 4.108435e-02
## IN KS KY La LA
## 1.726658e-02 2.522180e-03 5.068300e-03 1.408252e-06 1.231798e-02
## MA MD ME MI MN
## 1.846923e-02 1.385579e-02 4.479651e-03 1.961555e-02 6.186453e-03
## MO MS MT NC ND
## 1.364737e-02 5.511900e-03 1.505422e-03 3.877482e-02 4.281087e-04
## NE NH NJ NM NV
## 1.325165e-03 2.578510e-03 1.511900e-02 3.303760e-03 7.629911e-03
## NY OH OK OR PA
## 3.182157e-01 1.181946e-02 1.177017e-02 8.211520e-03 2.463033e-02
## RI SC SD TN TX
## 3.638924e-03 1.762287e-02 1.242079e-03 1.549500e-02 3.988734e-02
## UT VA VT WA WI
## 7.596113e-03 1.431489e-02 1.009717e-03 1.173497e-02 8.155189e-03
## WV WY
## 2.458809e-03 7.900296e-04
So 32 percent of the donations originating in New York are made to school projects in New York. What would you say this says about how local giving is?
Suppose you were to consider the related question: From what states do donations to projects in New York come from? Mainly New York or other places? How would you answer this question?
The standard data frame in R is organized in “short and wide” format, i.e., rows are records (e.g., users, transactions etc.) and columns are records/fields (i.e., different measures of users, transactions etc.). However, there are many situations where this is not the optimal structure for your data. In particular, it is often recommend to organize your data as “tall and thin” instead. This is somtimes referred to as “tidy” data, see here for a lengthy introduction to the idea of tidy data.
Here we will consider the most common problem in tidy data organization. This is the problem where your data looks like this:
Customer | V1 | V2 | V3 | V4 | V5 |
---|---|---|---|---|---|
ID1 | 5 | 5 | 4 | 4 | 8 |
ID2 | 9 | 4 | 3 | 1 | 9 |
ID3 | 1 | 8 | 7 | 2 | 6 |
…but you want it to look like this:
Customer | Variable | Value |
---|---|---|
ID1 | V1 | 5 |
ID1 | V2 | 4 |
ID1 | V3 | 8 |
. | . | . |
. | . | . |
The first version of the data is “short and wide”, while the second is “tall and thin”. For example, when you import data from an Excel sheet it will often be organized with one row per customer/user and one column for each measurement. However, this is often not the ideal way to organize your data. Suppose, for example, that you are analyzing a customer satisfaction survey. Each row is a customer and you may have measured 20 to 30 different dimensions of satisfaction (e.g., ordering process, product, packaging, etc.). Now you want to summarize mean satisfaction for each dimension and different customer segments. The most elegant and efficient way of doing this is first to make the data long and thin and then do a group-by statement.
Let’s look at a simple examples:
library(tidyverse)
states <- read_rds('data/states.rds')
states
## state Population Income Life.Exp Murder HS.Grad Frost
## 1 California 21198 5114 71.71 10.3 62.6 20
## 2 Illinois 11197 5107 70.14 10.3 52.6 127
## 3 Rhode Island 931 4558 71.90 2.4 46.4 127
This data is not tidy: We want each row to be a (state,measurement) combination. To make it tidy we use the gather function:
states.t <- states %>%
gather(variable, value, Population:Frost)
This will take all the columns from Population to Frost and stack them into one column called “variable” and all the values into another column called “value”. The result is
states.t
## state variable value
## 1 California Population 21198.00
## 2 Illinois Population 11197.00
## 3 Rhode Island Population 931.00
## 4 California Income 5114.00
## 5 Illinois Income 5107.00
## 6 Rhode Island Income 4558.00
## 7 California Life.Exp 71.71
## 8 Illinois Life.Exp 70.14
## 9 Rhode Island Life.Exp 71.90
## 10 California Murder 10.30
## 11 Illinois Murder 10.30
## 12 Rhode Island Murder 2.40
## 13 California HS.Grad 62.60
## 14 Illinois HS.Grad 52.60
## 15 Rhode Island HS.Grad 46.40
## 16 California Frost 20.00
## 17 Illinois Frost 127.00
## 18 Rhode Island Frost 127.00
We will see later on that it is much easier to perform group summaries and visualizations of data when it is “tidy”.
Another example where reshaping your data into tidy format is useful is when performing “join” operations as above. The following case study is an illustration of this.
Pew Research occasionally conducts a multinational survey on various issues. Here we will at the spring 2016 survey where n=23,462 respondents across 19 countries where surveyed on a number of attitudes around politics, government, news, education etc. When analyzing survey data like this you will often need to perform join operations between survey outcomes, survey questions and values.
We start by loading the required libraries and then import the data:
## load libraries
library(tidyverse)
library(stringr)
library(forcats)
## import data, questions and response options
surveyData <- read_rds('data/pew_global_attitudes_data.rds')
surveyQuestions <- read_rds('data/pew_global_attitudes_questions.rds')
surveyValueLabels <- read_rds('data/pew_global_attitudes_values.rds')
The columns of the surveyData data frame contains the survey question responses (there are a lot in this survey!), surveyQuestions contain the actual questions while surveyValueLabels contain the labels for all the responses.
Let’s focus on a small subset of questions (you can - and shoudd - explore the others on your own) about sources of news. We extract the relevant questions which have variable names Q37A to Q37E along with the ID for each respondent and country of residence:
CountryAttitudes <- surveyData %>%
select(ID,country,Q37A:Q37E)
We can now summarize or visualize these five attitudes. For example, we can calulate the average reponse for each of the questions:
MeanAttitudes <- CountryAttitudes %>%
summarise_at(vars(Q37A:Q37E), mean,na.rm = TRUE)
MeanAttitudes
## # A tibble: 1 x 5
## Q37A Q37B Q37C Q37D Q37E
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.505626 2.354403 2.280326 2.721507 2.903845
This is not very informative because (1) we don’t know what each question/attitude is and (2) we don’t know how the attitude is scaled (i.e., what does a “1” or a “2” mean?). Let’s start with the first problem which we can solve by first tidying the data and then joining the surveyQuestions data:
as.data.frame(
MeanAttitudes %>%
gather(variable,value) %>%
left_join(surveyQuestions,by=c("variable"))
)
## variable value
## 1 Q37A 1.505626
## 2 Q37B 2.354403
## 3 Q37C 2.280326
## 4 Q37D 2.721507
## 5 Q37E 2.903845
## label
## 1 Q37A. How often do you get news from each of the following? Do you get news from television often, sometimes, hardly ever or never?
## 2 Q37B. How often do you get news from each of the following? Do you get news from print newspapers often, sometimes, hardly ever or never?
## 3 Q37C. How often do you get news from each of the following? Do you get news from radio often, sometimes, hardly ever or never?
## 4 Q37D. How often do you get news from each of the following? Do you get news from online news websites, blogs or apps often, sometimes, hardly ever or never?
## 5 Q37E. How often do you get news from each of the following? Do you get news from social networking sites such as Facebook, Twitter, (INSERT COUNTRY SPECIFIC EXAMPLES) often, sometimes, hardly ever or never?
Ok - now we can see what each question/attitude is. But what about the reponses? To see how the question is scaled we can check the surveyValueLabels data frame. For example, the possible responses to the first question are
filter(surveyValueLabels,question=="Q37A")
## question label value
## 1 Q37A Often 1
## 2 Q37A Sometimes 2
## 3 Q37A Hardly ever 3
## 4 Q37A Never 4
## 5 Q37A Don<U+0092>t know (DO NOT READ) 8
## 6 Q37A Refused (DO NOT READ) 9
Yikes! now we realize that we have screwed up - we obviously shouldn’t be computing the average of these responses. Let’s do a bar chart instead so we can represent the number of responses at each possibility for each question.
We first make a tidy version of the raw data and then create a bar chart for each question:
CountryAttitudesTidy <- CountryAttitudes %>%
gather(Question,Value,Q37A:Q37E)
CountryAttitudesTidy %>%
ggplot(aes(x=factor(Value))) + geom_bar() + facet_wrap(~Question)
Ok - now we have a bar chart but one that is fairly useless since no one will know what the questions and x-values mean! Let’s add these by appropriate join operations. First add labels to each question:
CountryAttitudesTidy %>%
left_join(surveyQuestions,by=c("Question"="variable")) %>%
ggplot(aes(x=factor(Value))) + geom_bar() + facet_wrap(~label)
Hmm…that solved the problem but created a new one: The labels are too long so let’s “wrap” in the title bar:
CountryAttitudesTidy %>%
left_join(surveyQuestions,by=c("Question"="variable")) %>%
mutate(labelWrap = str_wrap(label,40)) %>%
ggplot(aes(x=factor(Value))) + geom_bar() + facet_wrap(~labelWrap)
Alright - one problem solved. Now let’s add labels for the x-axis:
CountryAttitudesTidy %>%
left_join(surveyQuestions,by=c("Question"="variable")) %>%
mutate(labelWrap = str_wrap(label,40)) %>%
select(-label) %>%
left_join(surveyValueLabels,by=c("Question"="question","Value"="value")) %>%
ggplot(aes(x=label)) + geom_bar() + facet_wrap(~labelWrap) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Almost done - let’s remove the answers with very few counts and change the order of the x-axis labels:
CountryAttitudesTidy %>%
left_join(surveyQuestions,by=c("Question"="variable")) %>%
mutate(labelWrap = str_wrap(label,40)) %>%
select(-label) %>%
left_join(surveyValueLabels,by=c("Question"="question","Value"="value")) %>%
filter(Value %in% c(1:4)) %>%
mutate(label = factor(label,levels=c("Often","Sometimes","Hardly ever","Never"))) %>%
ggplot(aes(x=label)) + geom_bar() + facet_wrap(~labelWrap) + labs(x="Answer")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Copyright © 2017 Karsten T. Hansen. All rights reserved.