You will on occasion have to change the organization of your data in order to accomplish a certain goal. The most common problem is the following: Your data looks like this:
Customer | V1 | V2 | V3 | V4 | V5 |
---|---|---|---|---|---|
ID1 | 5 | 5 | 4 | 4 | 8 |
ID2 | 9 | 4 | 3 | 1 | 9 |
ID3 | 1 | 8 | 7 | 2 | 6 |
…but you want it to look like this:
Customer | Variable | Value |
---|---|---|
ID1 | V1 | 5 |
ID1 | V2 | 4 |
ID1 | V3 | 8 |
. | . | . |
. | . | . |
The first version of the data is “short and wide”, while the second is “long and thin”. For example, when you import data from an Excel sheet it will often be organized with one row per customer/user and one column for each measurement. However, this is often not the ideal way to organize your data. Suppose, for example, that you are analyzing a customer satisfaction survey. Each row is a customer and you may have measured 20 to 30 different dimensions of satisfaction (e.g., ordering process, product, packaging, etc.). Now you want to summarize mean satisfaction for each dimension and different customer segments. The most elegant and efficient way of doing this is first to make the data long and thin and then do a group-by statement. Let’s look at some examples. You can download all code data for this lesson here.
load('data/states.rda')
states
## state Population Income Life.Exp Murder HS.Grad Frost
## 1 California 21198 5114 71.71 10.3 62.6 20
## 2 Illinois 11197 5107 70.14 10.3 52.6 127
## 3 Rhode Island 931 4558 71.90 2.4 46.4 127
This data is not tidy: We want each row to be a (state,measurement) combination. To make it tidy we use the appropriately named tidyr library:
library(tidyr)
library(dplyr)
states.t <- states %>%
gather(variable, value, Population:Frost)
This will take all the columns from Population to Frost and stack them into one column called “variable” and all the values into another column called “value”. The result is
states.t
## state variable value
## 1 California Population 21198.00
## 2 Illinois Population 11197.00
## 3 Rhode Island Population 931.00
## 4 California Income 5114.00
## 5 Illinois Income 5107.00
## 6 Rhode Island Income 4558.00
## 7 California Life.Exp 71.71
## 8 Illinois Life.Exp 70.14
## 9 Rhode Island Life.Exp 71.90
## 10 California Murder 10.30
## 11 Illinois Murder 10.30
## 12 Rhode Island Murder 2.40
## 13 California HS.Grad 62.60
## 14 Illinois HS.Grad 52.60
## 15 Rhode Island HS.Grad 46.40
## 16 California Frost 20.00
## 17 Illinois Frost 127.00
## 18 Rhode Island Frost 127.00
We can now easily summarize measures by state:
states.stats <- states.t %>%
group_by(variable) %>%
summarise(mean = mean(value))
states.stats
## Source: local data frame [6 x 2]
##
## variable mean
## (fctr) (dbl)
## 1 Population 11108.666667
## 2 Income 4926.333333
## 3 Life.Exp 71.250000
## 4 Murder 7.666667
## 5 HS.Grad 53.866667
## 6 Frost 91.333333
If you don’t need intermediate result later you could just use
states.stats <- states %>%
gather(variable, value, Population:Frost) %>%
group_by(variable) %>%
summarise(mean = mean(value))
This data contain measures of music attitudes for 48,645 users (This data is a subset of the EMI One-Million Interview dataset). For each user there is a small set of demographic information plus answers to music attitudinal questions. Attiudes are scored on a scale of 1 to 100 with higher numbers indicating stronger agreement with the corresponding attitude. First load the data:
load('data/music.rda')
There are two data frames in the file: users and questions. The users data frame contains the actual survey data, while questions contains labels for the survey questions. Let’s take a look at the first three rows of the survey data:
users[1:3,]
## RESPID GENDER AGE WORKING REGION
## 1 36927 Female 60 Other South
## 2 3566 Female 36 Full-time housewife / househusband South
## 3 20054 Female 52 Employed 30+ hours a week Midlands
## MUSIC
## 1 Music is important to me but not necessarily more important than other hobbies or interests
## 2 Music is important to me but not necessarily more important than other hobbies or interests
## 3 I like music but it does not feature heavily in my life
## LIST_OWN LIST_BACK Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13
## 1 1 hour 49 50 49 50 32 33 32 0 74 50 50 71 52
## 2 1 hour 1 hour 55 55 62 9 9 9 10 11 55 12 65 65 80
## 3 1 hour Less than an hour 11 50 9 8 45 10 30 29 8 50 94 51 74
## Q14 Q15 Q16 Q17 Q18 Q19
## 1 71 9 7 72 49 26
## 2 79 51 31 68 54 33
## 3 66 27 46 73 8 31
This data is clearly short and wide: there is one row per respondent. We want to make it long and thin instead, with each row corresponding to one (user,measurement) combination. The 19 attitudinal questions are labelled Q1-Q19. You can see their labels in the questions data frame.
To reshape data we can use the tidyr package. The main command we will use is gather. Try out the following:
library(dplyr)
library(tidyr)
users.tidy <- users %>%
gather(attitude,value,Q1:Q19)
What did this do? If you check the dimensions of the new data frame users.tidy, you will see that it has 924,255 rows. This is exactly 19 times the number of rows of the users data frame. This happened because the gather command stacked all the 19 attitudes into one column (that we decided to call attitude).
It is now very easy to summarize attitudes by numerical summaries or visualizations. For example, what is the average attitude score for each of the 19 attitudes? Easy:
users.tidy %>%
group_by(attitude) %>%
summarize(mean=mean(value,na.rm=TRUE))
## Source: local data frame [19 x 2]
##
## attitude mean
## (fctr) (dbl)
## 1 Q1 49.11357
## 2 Q2 54.62442
## 3 Q3 51.28445
## 4 Q4 37.30913
## 5 Q5 34.58543
## 6 Q6 39.33361
## 7 Q7 33.84533
## 8 Q8 29.16174
## 9 Q9 47.83174
## 10 Q10 55.01103
## 11 Q11 58.63643
## 12 Q12 53.66590
## 13 Q13 46.96266
## 14 Q14 53.44644
## 15 Q15 39.66456
## 16 Q16 35.57926
## 17 Q17 53.82629
## 18 Q18 42.23245
## 19 Q19 41.36263
Do these averages differ for men and women? Let’s check by doing a visualization:
library(ggplot2)
users.tidy %>%
group_by(GENDER,attitude) %>%
summarize(mean=mean(value,na.rm=TRUE)) %>%
ggplot(aes(x=attitude,fill=GENDER,y=mean)) + geom_bar(stat='identity',position='dodge')
Let’s plot the average scores for each gender for the attitudes where the gender difference is the biggest, using the correct attitude labels:
users.tidy %>%
group_by(GENDER,attitude) %>%
summarize(mean=mean(value,na.rm=TRUE)) %>%
filter(attitude %in% c('Q1','Q11','Q14')) %>%
left_join(questions,by=c('attitude'='variable')) %>%
ggplot(aes(x=label,fill=GENDER,y=mean)) + geom_bar(stat='identity',position='dodge') + coord_flip()
The variable MUSIC capures how important music is to the respondent overall. Summarize the average attitude Q1-Q19 for each level of MUSIC. Do you see any differences? Do the differences make sense?