There are times where you are faced with data of such a magnitude that reading the entire dataset into R - which requires that it fits in your computers memory - becomes problematic. One radical solution to this is: buy more memory! Ok - but what if you don’t want to do that? Is there a way where we can just leave the data in an external file and then pull the relevant sections of it whenever we need it? Yes. The standard here is a SQL database. Technically, “SQL” refers to the language used to make requests to the database but - using the power of R - we actually don’t need to know any SQL syntax to use a SQL database. There are different formats of SQL databases - in the following we will use a sqlite database as an example. This is becoming a very popular database format - in part because it ports easily to mobile platforms.
You can download all the code and data for this module as an Rstudio project here.
The file soccer.zip is a zip file containing a sqlite database on over 25,000 soccer matches for 11 European leagues in the period 2008-2016. This database has been assembled by Hugo Mathien - you can read more about the data here. If you want to try out the following code you need to first unzip the zip file (total size is around 311 MB).
You first need to install the RSQLite library (if you haven’t already). Then you need to tell R where the sqlite data is on your computer:
install.packages(RSQLite) #only do this once
library(tidyverse)
# connect to sqlite database
my_db <- src_sqlite("data/database.sqlite",create = F)
That’s pretty much it - R now has a direct line to the database and you can start making requests.
A SQL database will typically contain multiple “tables”. You can think of these as R data frames. What are the tables in the soccer database?
my_db
## src: sqlite 3.8.6 [../data/database.sqlite]
## tbls: Country, League, Match, Player, Player_Stats, sqlite_sequence, Team
Ok - we have tables for country, league, match, player and so on. Let’s look at the country and league tables:
country = tbl(my_db,"Country")
league = tbl(my_db,"League")
as.data.frame(country)
## id name
## 1 1 Belgium
## 2 1729 England
## 3 4735 France
## 4 7775 Germany
## 5 10223 Italy
## 6 13240 Netherlands
## 7 15688 Poland
## 8 17608 Portugal
## 9 19660 Scotland
## 10 21484 Spain
## 11 24524 Switzerland
as.data.frame(league)
## id country_id name
## 1 1 1 Belgium Jupiler League
## 2 1729 1729 England Premier League
## 3 4735 4735 France Ligue 1
## 4 7775 7775 Germany 1. Bundesliga
## 5 10223 10223 Italy Serie A
## 6 13240 13240 Netherlands Eredivisie
## 7 15688 15688 Poland Ekstraklasa
## 8 17608 17608 Portugal Liga ZON Sagres
## 9 19660 19660 Scotland Premier League
## 10 21484 21484 Spain LIGA BBVA
## 11 24524 24524 Switzerland Super League
So each country has its own league. Let’s do a quick data summary: calculate the average number of goals per league (i.e., country) and plot the averages in a chart:
# calculate average number of goals per country league
goals.match.country = tbl(my_db,"Match") %>%
group_by(country_id) %>%
summarise(avg.goal=mean(home_team_goal+away_team_goal)) %>%
left_join(country,by=c("country_id"="id")) %>%
arrange(desc(avg.goal))
# plot results
library(forcats)
as.data.frame(goals.match.country) %>%
ggplot(aes(x=fct_reorder(name,avg.goal),y=avg.goal)) + geom_point() +
ylab('Average Number of Goals per Match') + xlab('National League') + coord_flip()
Copyright © 2016 Karsten T. Hansen. All rights reserved.