In many situations your data will be stored in separate files with dimensions defined by the nature of the objects that the data represents. For example, a large transactional database might contain four individual files (or “tables”): (1) a transaction file (2) a product file (3) a customer file and (4) a store file. Each of these will contain different information that can be linked to each other. This is also called a relational database since the objects stored in it can be linked or related to each other.
You can download data and code for this module as an Rstudio project here. You should unzip this file somewhere on you computer’s hard drive and then open the file Relational Databases.Rproj in Rstudio.
Here is a simple example of a relational database: Suppose you have data on the results of two running races, race1 and race2. Each set of race results form a table. A third table contains a list of runners, while a fourth contains the list of races. These four tables define a relational database:
load('data/run_race_database.rda')
race1
race2
runners
races
## race.id runner.id race.time
## 1 1 1 100
## 2 1 2 120
## 3 1 3 90
## 4 1 4 140
## race.id runner.id race.time
## 1 2 2 115
## 2 2 4 145
## 3 2 5 105
## 4 2 6 95
## runner.id name gender age
## 1 1 Peter M 35-39
## 2 2 Jenny F 30-34
## 3 3 Heather F 25-29
## 4 4 John M 30-34
## 5 5 Max M 35-39
## 6 6 Julie F 40-44
## race.id race.name
## 1 1 La Jolla Half Marathon
## 2 2 Silver Strand Marathon
There are multiple relationships across tables in this database. For example, we can link the same runners across races or we can link runner characteristics to race results or race name to results. All of these operations require us to join tables which is the standard operation done on relational databases. Once the join operation is done, we can continue with other summary operations, visualizations or other more advanced analytics tools.
There are different join operations available depending on what your objective is. First, suppose you are interested in comparing the same runner’s performance across different races. In this case we need to extract results for runners who show up in both the race1 and race2 table. You can think of this as the intersection between the two tables. This is done by using the inner_join function:
library(tidyverse)
runners.race12 <- inner_join(race1,race2,by='runner.id')
runners.race12
## race.id.x runner.id race.time.x race.id.y race.time.y
## 1 1 2 120 2 115
## 2 1 4 140 2 145
Notice that R renamed the race specific variables by adding an “x” and “y”. This is because the race.id and race.time variable has the same name in race1 and race2 and we obviously need to distinguish between them. The “by” option informs R on which variable the join operation is to be performed.
Suppose instead that you were interested in a “master” race file that had information on all races for all runners. You can think of this as the union of the race1 and race2 table. You can do this operation by using full_join:
runners.race12.all <- full_join(race1,race2,by='runner.id')
runners.race12.all
## race.id.x runner.id race.time.x race.id.y race.time.y
## 1 1 1 100 NA NA
## 2 1 2 120 2 115
## 3 1 3 90 NA NA
## 4 1 4 140 2 145
## 5 NA 5 NA 2 105
## 6 NA 6 NA 2 95
Notice that we now have a bunch of NAs in the merged result. That is because the result - being the union of the two race tables - has one row with all race results for each runner and several runners only ran in one race.
Suppose instead you wanted to extract the race data for all races that runners in race1 has ever run. This can be accomplished by a left_join operation:
runners.race1.all <- left_join(race1,race2,by='runner.id')
runners.race1.all
## race.id.x runner.id race.time.x race.id.y race.time.y
## 1 1 1 100 NA NA
## 2 1 2 120 2 115
## 3 1 3 90 NA NA
## 4 1 4 140 2 145
Suppose now you wanted to compare race results by age group for race1. How should we do that? Well, now we need to join the race1 and runners table:
race1.groups <- inner_join(race1,runners,by='runner.id')
race1.groups
## race.id runner.id race.time name gender age
## 1 1 1 100 Peter M 35-39
## 2 1 2 120 Jenny F 30-34
## 3 1 3 90 Heather F 25-29
## 4 1 4 140 John M 30-34
Suppose instead you wanted results for runners participating in both races and the associated race names. We can do this as
runners.race12 <- inner_join(
inner_join(race1,races,by='race.id'),
inner_join(race2,races,by='race.id'),
by='runner.id'
)
runners.race12
## race.id.x runner.id race.time.x race.name.x race.id.y
## 1 1 2 120 La Jolla Half Marathon 2
## 2 1 4 140 La Jolla Half Marathon 2
## race.time.y race.name.y
## 1 115 Silver Strand Marathon
## 2 145 Silver Strand Marathon
Finally, you may sometimes be interested in extracting the records that do not overlap between two tables rather than the ones that do. For example, suppose you were interested in the runners who ran in race1 but not in race2. You can think of this as the complement of the runners.race12 table when compared to the original race1 table. This can be generated by using an anti_join operation:
race1.only <- anti_join(race1,race2,by='runner.id')
race1.only
## race.id runner.id race.time
## 1 1 1 100
## 2 1 3 90
Before considering any join operation, first think about what you are trying to accomplish - what is the end result you are looking for? Then find the appropriate join operation consistent with you goal. When you are working with large databases, the wrong type of join method can be disastrous. For example, wrongly using a full_join when you should be using an inner_join or left_join can lead to enormous data frames being generated which will often crash you computer.
The classic example of a relational database is a transactional database. This database will contain one table with the actual transactions, i.e., one row for each transaction, where a transaction is an interaction between a certain customer in a certain store buying a certain product at a certain date. An “order” is then a series of transactions made by the same customer on a certain date. The product file will contain one row for each product available with a set of variables capturing various product characteristics. The customer file will contain one row for each customer (with associated customer information), while the store file will contain one row per store (with store information).
We can illustrate the structure of the relations between the files as:
Transaction.File | Product.File | Customer.File | Store.File | |
---|---|---|---|---|
Order ID | X | |||
Order Date | X | |||
Product ID | X | X | ||
Customer ID | X | X | ||
Store ID | X | X | ||
Volume | X | |||
Total Price | X | |||
Product Info | X | |||
Customer Info | X | |||
Store Info | X |
The specific joins to perform in this appliation will depend on your objectives. For example, if you wanted to investigate how sales of a certain product group varied with the regional location of stores you would need to join both the product file (to get information about which product IDs belong to the group of interest) and the store file (to get the regional location of each store) with the transaction file. On the other hand, if you wanted to know the characteristics of your top customers, you would need to join the transaction file with the customer file.
DonorsChoose is a non-profit organization that raises money for school projects. Potential donors can go to the website and pick among active projects and donate to any project of their choosing. When a project is funded the organization channels the funds to the school in question.
The data for this case study is not included in the Rstudio project file you downloaded above. There are two data files (or tables) for in this database - donations and projects. You can download the files in R format for 2015 and 2016 (here for donations and here for projects). These are large files: 155mb for the donation file and 34mb for the projects file. If you wish to have a go at the full files, you can get the full donation file (552 mb) here and the full projects file (115mb) here. You should download the files and put them in the data folder for the current Rstudio project. In the following we will use the full database.
The full database contains a total of 1,108,217 projects dating from 2002 to 2016 to which a total of 5,820,478 donations have been made.
Let’s load the data:
donations <- readRDS('data/donations.rds')
projects <- readRDS('data/projects.rds')
There is a lot of information about both donors and projects (you can see a sample of the data if you write glimpse(projects)) in R. Let us first look at where the projects are located: How many projects are there for each state? We can calculate that by using the table command:
table(projects$school_state)
##
## AK AL AR AZ CA CO CT DC DE FL
## 2819 11860 10636 20449 195055 11967 13532 9180 3004 54826
## GA HI IA ID IL IN KS KY La LA
## 32121 5472 5860 4527 58909 28412 5190 8602 3 19672
## MA MD ME MI MN MO MS MT NC ND
## 18925 15867 5544 24069 8618 21478 11599 1820 63154 974
## NE NH NJ NM NV NY OH OK OR PA
## 2873 2824 18602 4749 13745 103999 16221 25144 12398 26277
## RI SC SD TN TX UT VA VT WA WI
## 3356 33700 2006 22225 70754 16061 19316 1022 21606 12024
## WV WY
## 4471 700
California, New York and Texas has the most number of projects. This makes sense - these are also the three most populous states in the US. You can easily convert the counts into proportions:
prop.table(table(projects$school_state))
##
## AK AL AR AZ CA
## 2.543726e-03 1.070188e-02 9.597398e-03 1.845216e-02 1.760079e-01
## CO CT DC DE FL
## 1.079843e-02 1.221060e-02 8.283576e-03 2.710660e-03 4.947226e-02
## GA HI IA ID IL
## 2.898440e-02 4.937661e-03 5.287773e-03 4.084940e-03 5.315656e-02
## IN KS KY La LA
## 2.563758e-02 4.683198e-03 7.762018e-03 2.707051e-06 1.775104e-02
## MA MD ME MI MN
## 1.707698e-02 1.431759e-02 5.002630e-03 2.171867e-02 7.776455e-03
## MO MS MT NC ND
## 1.938068e-02 1.046636e-02 1.642278e-03 5.698703e-02 8.788892e-04
## NE NH NJ NM NV
## 2.592453e-03 2.548237e-03 1.678552e-02 4.285262e-03 1.240281e-02
## NY OH OK OR PA
## 9.384353e-02 1.463703e-02 2.268870e-02 1.118734e-02 2.371106e-02
## RI SC SD TN TX
## 3.028288e-03 3.040921e-02 1.810115e-03 2.005474e-02 6.384490e-02
## UT VA VT WA WI
## 1.449265e-02 1.742980e-02 9.222021e-04 1.949618e-02 1.084986e-02
## WV WY
## 4.034408e-03 6.316452e-04
Consider now the following question: How local is giving? In other words, do donors mainly care about their own state or do the give to projects in many states? We will later return to this question in more detail, but for now let’s focus on donors in New York state. Where do these donors give? We can start by only considering the donation data for donors who reside in New York:
ny.donations <- donations %>%
filter(donor_state=='NY')
This leaves us with a donation file with 710100 donations. Now we need to calculate to where these donations were made. To do this we need information on the location on each project. This is in the projects file. Therefore we need to join the ny.donations file with the projects file:
ny.donations.projects <- ny.donations %>%
inner_join(projects,by='_projectid')
This file will have the same number of rows as the ny.donations file, but will now have add columns with information on the project that each of the donations were made to - including the state id of the school for each project. We can now simply count the project locations:
table(ny.donations.projects$school_state)
##
## AK AL AR AZ CA CO CT DC DE FL
## 1335 5097 3523 8771 98895 6984 8736 5519 1675 29020
## GA HI IA ID IL IN KS KY La LA
## 16434 2128 1872 2246 29174 12261 1791 3599 1 8747
## MA MD ME MI MN MO MS MT NC ND
## 13115 9839 3181 13929 4393 9691 3914 1069 27534 304
## NE NH NJ NM NV NY OH OK OR PA
## 941 1831 10736 2346 5418 225965 8393 8358 5831 17490
## RI SC SD TN TX UT VA VT WA WI
## 2584 12514 882 11003 28324 5394 10165 717 8333 5791
## WV WY
## 1746 561
prop.table(table(ny.donations.projects$school_state))
##
## AK AL AR AZ CA
## 1.880017e-03 7.177862e-03 4.961273e-03 1.235178e-02 1.392691e-01
## CO CT DC DE FL
## 9.835234e-03 1.230249e-02 7.772145e-03 2.358823e-03 4.086748e-02
## GA HI IA ID IL
## 2.314322e-02 2.996761e-03 2.636248e-03 3.162935e-03 4.108435e-02
## IN KS KY La LA
## 1.726658e-02 2.522180e-03 5.068300e-03 1.408252e-06 1.231798e-02
## MA MD ME MI MN
## 1.846923e-02 1.385579e-02 4.479651e-03 1.961555e-02 6.186453e-03
## MO MS MT NC ND
## 1.364737e-02 5.511900e-03 1.505422e-03 3.877482e-02 4.281087e-04
## NE NH NJ NM NV
## 1.325165e-03 2.578510e-03 1.511900e-02 3.303760e-03 7.629911e-03
## NY OH OK OR PA
## 3.182157e-01 1.181946e-02 1.177017e-02 8.211520e-03 2.463033e-02
## RI SC SD TN TX
## 3.638924e-03 1.762287e-02 1.242079e-03 1.549500e-02 3.988734e-02
## UT VA VT WA WI
## 7.596113e-03 1.431489e-02 1.009717e-03 1.173497e-02 8.155189e-03
## WV WY
## 2.458809e-03 7.900296e-04
So 32 percent of the donations originating in New York are made to school projects in New York. What would you say this says about how local giving is?
Copyright © 2016 Karsten T. Hansen. All rights reserved.