In many situations your data will be stored in separate files with dimensions defined by the nature of the objects that the data represents. For example, a large transactional database might contain four individual files (or “tables”): (1) a transaction file (2) a product file (3) a customer file and (4) a store file. Each of these will contain different information that can be linked to each other. This is also called a relational database since the objects stored in it can be linked or related to each other.

You can download data and code for this module as an Rstudio project here. You should unzip this file somewhere on you computer’s hard drive and then open the file Relational Databases.Rproj in Rstudio.

Here is a simple example of a relational database: Suppose you have data on the results of two running races, race1 and race2. Each set of race results form a table. A third table contains a list of runners, while a fourth contains the list of races. These four tables define a relational database:

load('data/run_race_database.rda')

race1

race2

runners

races
##   race.id runner.id race.time
## 1       1         1       100
## 2       1         2       120
## 3       1         3        90
## 4       1         4       140
##   race.id runner.id race.time
## 1       2         2       115
## 2       2         4       145
## 3       2         5       105
## 4       2         6        95
##   runner.id    name gender   age
## 1         1   Peter      M 35-39
## 2         2   Jenny      F 30-34
## 3         3 Heather      F 25-29
## 4         4    John      M 30-34
## 5         5     Max      M 35-39
## 6         6   Julie      F 40-44
##   race.id              race.name
## 1       1 La Jolla Half Marathon
## 2       2 Silver Strand Marathon

There are multiple relationships across tables in this database. For example, we can link the same runners across races or we can link runner characteristics to race results or race name to results. All of these operations require us to join tables which is the standard operation done on relational databases. Once the join operation is done, we can continue with other summary operations, visualizations or other more advanced analytics tools.

There are different join operations available depending on what your objective is. First, suppose you are interested in comparing the same runner’s performance across different races. In this case we need to extract results for runners who show up in both the race1 and race2 table. You can think of this as the intersection between the two tables. This is done by using the inner_join function:

library(tidyverse)
runners.race12 <- inner_join(race1,race2,by='runner.id')
runners.race12
##   race.id.x runner.id race.time.x race.id.y race.time.y
## 1         1         2         120         2         115
## 2         1         4         140         2         145

Notice that R renamed the race specific variables by adding an “x” and “y”. This is because the race.id and race.time variable has the same name in race1 and race2 and we obviously need to distinguish between them. The “by” option informs R on which variable the join operation is to be performed.

Suppose instead that you were interested in a “master” race file that had information on all races for all runners. You can think of this as the union of the race1 and race2 table. You can do this operation by using full_join:

runners.race12.all <- full_join(race1,race2,by='runner.id')
runners.race12.all
##   race.id.x runner.id race.time.x race.id.y race.time.y
## 1         1         1         100        NA          NA
## 2         1         2         120         2         115
## 3         1         3          90        NA          NA
## 4         1         4         140         2         145
## 5        NA         5          NA         2         105
## 6        NA         6          NA         2          95

Notice that we now have a bunch of NAs in the merged result. That is because the result - being the union of the two race tables - has one row with all race results for each runner and several runners only ran in one race.

Suppose instead you wanted to extract the race data for all races that runners in race1 has ever run. This can be accomplished by a left_join operation:

runners.race1.all <- left_join(race1,race2,by='runner.id')
runners.race1.all
##   race.id.x runner.id race.time.x race.id.y race.time.y
## 1         1         1         100        NA          NA
## 2         1         2         120         2         115
## 3         1         3          90        NA          NA
## 4         1         4         140         2         145

Suppose now you wanted to compare race results by age group for race1. How should we do that? Well, now we need to join the race1 and runners table:

race1.groups <- inner_join(race1,runners,by='runner.id')
race1.groups
##   race.id runner.id race.time    name gender   age
## 1       1         1       100   Peter      M 35-39
## 2       1         2       120   Jenny      F 30-34
## 3       1         3        90 Heather      F 25-29
## 4       1         4       140    John      M 30-34

Suppose instead you wanted results for runners participating in both races and the associated race names. We can do this as

runners.race12 <- inner_join(
  inner_join(race1,races,by='race.id'),
  inner_join(race2,races,by='race.id'),
  by='runner.id'
)

runners.race12
##   race.id.x runner.id race.time.x            race.name.x race.id.y
## 1         1         2         120 La Jolla Half Marathon         2
## 2         1         4         140 La Jolla Half Marathon         2
##   race.time.y            race.name.y
## 1         115 Silver Strand Marathon
## 2         145 Silver Strand Marathon

Finally, you may sometimes be interested in extracting the records that do not overlap between two tables rather than the ones that do. For example, suppose you were interested in the runners who ran in race1 but not in race2. You can think of this as the complement of the runners.race12 table when compared to the original race1 table. This can be generated by using an anti_join operation:

race1.only <- anti_join(race1,race2,by='runner.id')
race1.only
##   race.id runner.id race.time
## 1       1         1       100
## 2       1         3        90

Before considering any join operation, first think about what you are trying to accomplish - what is the end result you are looking for? Then find the appropriate join operation consistent with you goal. When you are working with large databases, the wrong type of join method can be disastrous. For example, wrongly using a full_join when you should be using an inner_join or left_join can lead to enormous data frames being generated which will often crash you computer.

The Structure of a Transactional Database

The classic example of a relational database is a transactional database. This database will contain one table with the actual transactions, i.e., one row for each transaction, where a transaction is an interaction between a certain customer in a certain store buying a certain product at a certain date. An “order” is then a series of transactions made by the same customer on a certain date. The product file will contain one row for each product available with a set of variables capturing various product characteristics. The customer file will contain one row for each customer (with associated customer information), while the store file will contain one row per store (with store information).

We can illustrate the structure of the relations between the files as:

Transaction.File Product.File Customer.File Store.File
Order ID X
Order Date X
Product ID X X
Customer ID X X
Store ID X X
Volume X
Total Price X
Product Info X
Customer Info X
Store Info X

The specific joins to perform in this appliation will depend on your objectives. For example, if you wanted to investigate how sales of a certain product group varied with the regional location of stores you would need to join both the product file (to get information about which product IDs belong to the group of interest) and the store file (to get the regional location of each store) with the transaction file. On the other hand, if you wanted to know the characteristics of your top customers, you would need to join the transaction file with the customer file.

Example: DonorsChoose

WWW

DonorsChoose is a non-profit organization that raises money for school projects. Potential donors can go to the website and pick among active projects and donate to any project of their choosing. When a project is funded the organization channels the funds to the school in question.

The data for this case study is not included in the Rstudio project file you downloaded above. There are two data files (or tables) for in this database - donations and projects. You can download the files in R format for 2015 and 2016 (here for donations and here for projects). These are large files: 155mb for the donation file and 34mb for the projects file. If you wish to have a go at the full files, you can get the full donation file (552 mb) here and the full projects file (115mb) here. You should download the files and put them in the data folder for the current Rstudio project. In the following we will use the full database.

The full database contains a total of 1,108,217 projects dating from 2002 to 2016 to which a total of 5,820,478 donations have been made.

Let’s load the data:

donations <- readRDS('data/donations.rds')
projects <- readRDS('data/projects.rds')

There is a lot of information about both donors and projects (you can see a sample of the data if you write glimpse(projects)) in R. Let us first look at where the projects are located: How many projects are there for each state? We can calculate that by using the table command:

table(projects$school_state)
## 
##     AK     AL     AR     AZ     CA     CO     CT     DC     DE     FL 
##   2819  11860  10636  20449 195055  11967  13532   9180   3004  54826 
##     GA     HI     IA     ID     IL     IN     KS     KY     La     LA 
##  32121   5472   5860   4527  58909  28412   5190   8602      3  19672 
##     MA     MD     ME     MI     MN     MO     MS     MT     NC     ND 
##  18925  15867   5544  24069   8618  21478  11599   1820  63154    974 
##     NE     NH     NJ     NM     NV     NY     OH     OK     OR     PA 
##   2873   2824  18602   4749  13745 103999  16221  25144  12398  26277 
##     RI     SC     SD     TN     TX     UT     VA     VT     WA     WI 
##   3356  33700   2006  22225  70754  16061  19316   1022  21606  12024 
##     WV     WY 
##   4471    700

California, New York and Texas has the most number of projects. This makes sense - these are also the three most populous states in the US. You can easily convert the counts into proportions:

prop.table(table(projects$school_state))
## 
##           AK           AL           AR           AZ           CA 
## 2.543726e-03 1.070188e-02 9.597398e-03 1.845216e-02 1.760079e-01 
##           CO           CT           DC           DE           FL 
## 1.079843e-02 1.221060e-02 8.283576e-03 2.710660e-03 4.947226e-02 
##           GA           HI           IA           ID           IL 
## 2.898440e-02 4.937661e-03 5.287773e-03 4.084940e-03 5.315656e-02 
##           IN           KS           KY           La           LA 
## 2.563758e-02 4.683198e-03 7.762018e-03 2.707051e-06 1.775104e-02 
##           MA           MD           ME           MI           MN 
## 1.707698e-02 1.431759e-02 5.002630e-03 2.171867e-02 7.776455e-03 
##           MO           MS           MT           NC           ND 
## 1.938068e-02 1.046636e-02 1.642278e-03 5.698703e-02 8.788892e-04 
##           NE           NH           NJ           NM           NV 
## 2.592453e-03 2.548237e-03 1.678552e-02 4.285262e-03 1.240281e-02 
##           NY           OH           OK           OR           PA 
## 9.384353e-02 1.463703e-02 2.268870e-02 1.118734e-02 2.371106e-02 
##           RI           SC           SD           TN           TX 
## 3.028288e-03 3.040921e-02 1.810115e-03 2.005474e-02 6.384490e-02 
##           UT           VA           VT           WA           WI 
## 1.449265e-02 1.742980e-02 9.222021e-04 1.949618e-02 1.084986e-02 
##           WV           WY 
## 4.034408e-03 6.316452e-04

Consider now the following question: How local is giving? In other words, do donors mainly care about their own state or do the give to projects in many states? We will later return to this question in more detail, but for now let’s focus on donors in New York state. Where do these donors give? We can start by only considering the donation data for donors who reside in New York:

ny.donations <- donations %>%
  filter(donor_state=='NY')

This leaves us with a donation file with 710100 donations. Now we need to calculate to where these donations were made. To do this we need information on the location on each project. This is in the projects file. Therefore we need to join the ny.donations file with the projects file:

ny.donations.projects <- ny.donations %>%
  inner_join(projects,by='_projectid')

This file will have the same number of rows as the ny.donations file, but will now have add columns with information on the project that each of the donations were made to - including the state id of the school for each project. We can now simply count the project locations:

table(ny.donations.projects$school_state)
## 
##     AK     AL     AR     AZ     CA     CO     CT     DC     DE     FL 
##   1335   5097   3523   8771  98895   6984   8736   5519   1675  29020 
##     GA     HI     IA     ID     IL     IN     KS     KY     La     LA 
##  16434   2128   1872   2246  29174  12261   1791   3599      1   8747 
##     MA     MD     ME     MI     MN     MO     MS     MT     NC     ND 
##  13115   9839   3181  13929   4393   9691   3914   1069  27534    304 
##     NE     NH     NJ     NM     NV     NY     OH     OK     OR     PA 
##    941   1831  10736   2346   5418 225965   8393   8358   5831  17490 
##     RI     SC     SD     TN     TX     UT     VA     VT     WA     WI 
##   2584  12514    882  11003  28324   5394  10165    717   8333   5791 
##     WV     WY 
##   1746    561
prop.table(table(ny.donations.projects$school_state))
## 
##           AK           AL           AR           AZ           CA 
## 1.880017e-03 7.177862e-03 4.961273e-03 1.235178e-02 1.392691e-01 
##           CO           CT           DC           DE           FL 
## 9.835234e-03 1.230249e-02 7.772145e-03 2.358823e-03 4.086748e-02 
##           GA           HI           IA           ID           IL 
## 2.314322e-02 2.996761e-03 2.636248e-03 3.162935e-03 4.108435e-02 
##           IN           KS           KY           La           LA 
## 1.726658e-02 2.522180e-03 5.068300e-03 1.408252e-06 1.231798e-02 
##           MA           MD           ME           MI           MN 
## 1.846923e-02 1.385579e-02 4.479651e-03 1.961555e-02 6.186453e-03 
##           MO           MS           MT           NC           ND 
## 1.364737e-02 5.511900e-03 1.505422e-03 3.877482e-02 4.281087e-04 
##           NE           NH           NJ           NM           NV 
## 1.325165e-03 2.578510e-03 1.511900e-02 3.303760e-03 7.629911e-03 
##           NY           OH           OK           OR           PA 
## 3.182157e-01 1.181946e-02 1.177017e-02 8.211520e-03 2.463033e-02 
##           RI           SC           SD           TN           TX 
## 3.638924e-03 1.762287e-02 1.242079e-03 1.549500e-02 3.988734e-02 
##           UT           VA           VT           WA           WI 
## 7.596113e-03 1.431489e-02 1.009717e-03 1.173497e-02 8.155189e-03 
##           WV           WY 
## 2.458809e-03 7.900296e-04

So 32 percent of the donations originating in New York are made to school projects in New York. What would you say this says about how local giving is?

Copyright © 2016 Karsten T. Hansen. All rights reserved.