Text data is messy, and to make sense of it you often have to clean it a bit first. For example, do you want “Tuesday” and “Tuesdays” to count as separate words or as the same word? Most of the time we would want to count them as the same word, and similarly with “run” and “running”. Furthermore, we are often not interested in including punctuation in the analysis - we just want to treat the text as a “bag of words”. There are libraries in R that will help you do this.
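For instance, stemming is the standard way to collapse variants like “run” and “running” into a single root. Here is a minimal sketch, assuming the SnowballC package (which tm relies on for stemming) is installed:

library(SnowballC)
wordStem(c("run", "running", "runs"))  # all three reduce to the root "run"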

In the following let’s look at Yelp reviews of Las Vegas hotels:

load('vegas_hotels.rda')

This data contains customer reviews of 18 hotels in Las Vegas. We can use the ggmap library to plot the hotel locations:

library(ggmap)
ggmap(get_map("The Strip, Las Vegas, Nevada",zoom=15,color = "bw")) +   
    geom_text(data=business,
              aes(x=longitude,y=latitude,label=name),
              size=3,color='red')  
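Note that recent versions of ggmap require a registered Google Maps API key before get_map() can fetch map tiles or geocode a location name. If the call above fails, register a key first (the key string here is a placeholder):

register_google(key = "YOUR_API_KEY")  # placeholder - substitute your own key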

In addition to text, each review also contains a star rating from 1 to 5. Let’s see how the 18 hotels fare on average in terms of star ratings:

library(dplyr)

reviews %>%
  left_join(select(business,business_id,name),
             by='business_id') %>%
  group_by(name) %>%
  summarize(n = n(),
            mean.star = mean(as.numeric(stars))) %>%
  arrange(desc(mean.star)) %>%
  ggplot() + 
  geom_point(aes(x=reorder(name,mean.star),y=mean.star,size=n))+
  coord_flip() +
  ylab('Mean Star Rating (1-5)') + 
  xlab('Hotel')

So The Venetian, Bellagio and The Cosmopolitan are clearly the highest rated hotels, while Luxor and LVH are the lowest rated. Ok, but what is behind these ratings? What are customers actually saying about these hotels? This is what we can hope to find through a text analysis.

Constructing a Document Term Matrix

The foundation of a text analysis is a document term matrix. This is an array where each row corresponds to a document and each column corresponds to a word. The entries of the array are simply counts of how many times a certain word occurs in a certain document. Let’s look at a simple example:

example <- data.frame(document=c(1:4),
                the.text=c("I have a brown dog. My dog loves walks.",
                       "My dog likes food.",
                       "I like food.",
                       "Some dogs are black."))

This data contains four documents. The first document contains 8 unique words (or “terms”). The word “brown” occurs once while “dog” occurs twice.
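As a quick sanity check, we can tally these words in base R before bringing in the tm package (a rough version that just strips punctuation, lower-cases, and splits on spaces):

txt <- "I have a brown dog. My dog loves walks."
words <- tolower(unlist(strsplit(gsub("[[:punct:]]", "", txt), " ")))
table(words)           # "dog" occurs twice, every other word once
length(unique(words))  # 8 unique terms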

library(tm)

text.c <- VCorpus(DataframeSource(select(example,the.text)))
DTM <- DocumentTermMatrix(text.c,
                          control=list(removePunctuation=TRUE,
                                       wordLengths=c(1, Inf)))
inspect(DTM)
## <<DocumentTermMatrix (documents: 4, terms: 15)>>
## Non-/sparse entries: 19/41
## Sparsity           : 68%
## Maximal term length: 5
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs a are black brown dog dogs food have i like likes loves my some walks
##    1 1   0     0     1   2    0    0    1 1    0     0     1  1    0     1
##    2 0   0     0     0   1    0    1    0 0    0     1     0  1    0     0
##    3 0   0     0     0   0    0    1    0 1    1     0     0  0    0     0
##    4 0   1     1     0   0    1    0    0 0    0     0     0  0    1     0

The first statement creates a corpus from the data frame example using the the.text variable. We then create a document term matrix from this corpus. The control argument tells R to keep terms of any length (without it, short words would be dropped) and to remove punctuation before creating the DTM.
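Under the hood the DTM is stored in a sparse format. For a small example like this we can safely convert it to an ordinary matrix and index it by document and term:

m <- as.matrix(DTM)  # dense representation - only advisable for small DTMs
m["1", "dog"]        # how many times "dog" occurs in document 1 (here: 2)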

In addition to removing punctuation (and lower-casing terms, which is done by default), there are two other standard “cleaning” operations. The first is to remove stopwords from the corpus. Stopwords are common words in a language that (usually) don’t carry any important significance for the analysis. For example, we could replace the first document in the example with “brown dog loves walks” and still be able to infer that the person writing it is talking about a brown dog. Of course, some information is lost in this process, but for many applications this is not really an issue. The second standard operation is stemming, which reduces words to their root form, e.g., turning “dogs” into “dog” and “loves” into “love”:

text.c <- VCorpus(DataframeSource(select(example,the.text)))
DTM <- DocumentTermMatrix(text.c,
                          control=list(removePunctuation=TRUE,
                                       wordLengths=c(1, Inf),
                                       stopwords=TRUE,
                                       stemming=TRUE
                                       ))
inspect(DTM)
## <<DocumentTermMatrix (documents: 4, terms: 7)>>
## Non-/sparse entries: 11/17
## Sparsity           : 61%
## Maximal term length: 5
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs black brown dog food like love walk
##    1     0     1   2    0    0    1    1
##    2     0     0   1    1    1    0    0
##    3     0     0   0    1    1    0    0
##    4     1     0   1    0    0    0    0
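If you are curious which words get dropped, the built-in English stopword list that tm uses can be inspected directly:

head(stopwords("en"), 10)  # the first few entries of tm's English stopword list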

Now let’s return to the hotel reviews and try to summarize those for the Aria hotel:

aria.id <-  filter(business, 
        name=='Aria Hotel & Casino')$business_id
aria.reviews <- filter(reviews, 
         business_id==aria.id)
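Before building the DTM it is worth checking how many reviews we are working with:

nrow(aria.reviews)
## [1] 2011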

Next, we construct the DTM by using the operations described above (and we also remove numbers from the reviews). We inspect the first 10 documents and 10 terms:

text.c <- VCorpus(DataframeSource(select(aria.reviews,text)))

DTM.aria <- DocumentTermMatrix(text.c,
                          control=list(removePunctuation=TRUE,
                                       wordLengths=c(3, Inf),
                                       stopwords=TRUE,
                                       stemming=TRUE,
                                       removeNumbers=TRUE
                                       ))
inspect(DTM.aria[1:10, 1:10])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 0/100
## Sparsity           : 100%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs aaa aaah aaauugghh aaay aah aback abandon abbrevi abc aberr
##   1    0    0         0    0   0     0       0       0   0     0
##   2    0    0         0    0   0     0       0       0   0     0
##   3    0    0         0    0   0     0       0       0   0     0
##   4    0    0         0    0   0     0       0       0   0     0
##   5    0    0         0    0   0     0       0       0   0     0
##   6    0    0         0    0   0     0       0       0   0     0
##   7    0    0         0    0   0     0       0       0   0     0
##   8    0    0         0    0   0     0       0       0   0     0
##   9    0    0         0    0   0     0       0       0   0     0
##   10   0    0         0    0   0     0       0       0   0     0

Ok - here we can already see a problem: users are writing all kinds of weird stuff in their reviews! You know - stuff like “aaauugghh”. Terms like these are likely to occur in only a few documents, which means that we should probably just get rid of them. To see how many unique terms actually are in the DTM, we can simply print it:

print(DTM.aria)
## <<DocumentTermMatrix (documents: 2011, terms: 10197)>>
## Non-/sparse entries: 158219/20347948
## Sparsity           : 99%
## Maximal term length: 117
## Weighting          : term frequency (tf)

There are a total of 10,197 terms. That’s a lot, and many of them are meaningless and sparse, i.e., they occur in only a few documents. The following command removes terms that are absent from more than 99.5% of the documents, i.e., terms appearing in fewer than 0.5% of the reviews:

DTM.aria.sp <- removeSparseTerms(DTM.aria,0.995)
inspect(DTM.aria.sp[1:10, 1:10])
## <<DocumentTermMatrix (documents: 10, terms: 10)>>
## Non-/sparse entries: 2/98
## Sparsity           : 98%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs abl absolut accept access accident accommod account acknowledg across
##   1    0       0      0      0        0        0       0          0      0
##   2    0       0      0      0        0        0       0          0      0
##   3    0       0      0      0        0        0       0          0      0
##   4    0       0      0      0        0        0       0          0      0
##   5    0       0      0      0        0        0       0          0      0
##   6    0       0      0      0        0        0       0          0      0
##   7    0       0      0      0        0        0       0          0      0
##   8    0       0      0      0        0        1       0          0      0
##   9    0       0      0      0        0        0       0          0      0
##   10   0       0      0      0        0        0       0          0      0
##     Terms
## Docs act
##   1    0
##   2    0
##   3    0
##   4    0
##   5    1
##   6    0
##   7    0
##   8    0
##   9    0
##   10   0

This looks much better. We now have a lot fewer terms in the DTM:

print(DTM.aria.sp)
## <<DocumentTermMatrix (documents: 2011, terms: 1788)>>
## Non-/sparse entries: 140761/3454907
## Sparsity           : 96%
## Maximal term length: 13
## Weighting          : term frequency (tf)

So by removing all the sparse terms we went from 10,197 to 1,788 terms.

Summarizing a Document Term Matrix

Now that we have created our document term matrix, we can start summarizing the corpus. What are the most frequent terms? Easy:

term.count <- as.data.frame(as.table(DTM.aria.sp)) %>%
  group_by(Terms) %>%
  summarize(n=sum(Freq))

term.count %>% 
  filter(cume_dist(n) > 0.99) %>%
  ggplot(aes(x=reorder(Terms,n),y=n)) + geom_bar(stat='identity') + 
  coord_flip() + xlab('') + ylab('Counts')
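If we just want a list of frequent terms without a plot, tm’s findFreqTerms() returns all terms whose total count is at least a given threshold:

findFreqTerms(DTM.aria.sp, lowfreq = 500)  # terms occurring 500 or more times overall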

A popular visualization tool for term counts is a word cloud. This is simply an alternative visual representation of the counts. We can easily make one:

library(wordcloud)
popular.terms <- filter(term.count,n > 200)
wordcloud(popular.terms$Terms,popular.terms$n,colors=brewer.pal(8,"Dark2"))
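The layout of a word cloud is randomized, so to get the exact same cloud every time, fix the random seed before the call:

set.seed(1)  # any fixed seed makes the layout reproducible
wordcloud(popular.terms$Terms,popular.terms$n,colors=brewer.pal(8,"Dark2"))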

Ok, now let’s look at associations between terms. When people talk about “room”, what other terms are used?

room <- data.frame(findAssocs(DTM.aria.sp, "room", 0.35)) # find terms correlated with "room" 

room %>%
  tibble::rownames_to_column() %>%   # add_rownames() is deprecated in newer dplyr
  ggplot(aes(x=reorder(rowname,room),y=room)) + geom_point(size=4) + 
  coord_flip() + ylab('Correlation') + xlab('Term') + 
  ggtitle('Terms correlated with Room')

These seem to be terms related to the check-in process. This is something that can be explored in more detail using topic models.

How about “bathroom”?

bathroom <- data.frame(findAssocs(DTM.aria.sp, "bathroom", 0.2))

bathroom %>%
  tibble::rownames_to_column() %>%   # add_rownames() is deprecated in newer dplyr
  ggplot(aes(x=reorder(rowname,bathroom),y=bathroom)) + geom_point(size=4) + 
  coord_flip() + ylab('Correlation') + xlab('Term') + 
  ggtitle('Terms correlated with Bathroom')

When talking about the bathroom, users are most likely to mention “tub”, “shower”, “light”, “toilet” and “room”.
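Note that findAssocs() can query several terms in one call, with one correlation threshold per term; it returns a named list with one entry per term. The terms below are just illustrative picks:

findAssocs(DTM.aria.sp, c("pool", "casino"), c(0.25, 0.25))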

Exercise

In the analysis above we pooled all users into the same document term matrix. This may be quite misleading for a heterogeneous set of users. For example, we might suspect that satisfied and dissatisfied users talk about different things - or talk differently about the same things.
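A possible starting point, reusing the pipeline from above: split the reviews by star rating (here a simple cut at 4+ stars for “satisfied” and 2 or fewer for “dissatisfied”) and build a separate DTM for each group:

happy <- filter(aria.reviews, as.numeric(stars) >= 4)
unhappy <- filter(aria.reviews, as.numeric(stars) <= 2)

make_dtm <- function(df) {  # same cleaning steps as for the pooled DTM
  corp <- VCorpus(DataframeSource(select(df, text)))
  DocumentTermMatrix(corp,
                     control = list(removePunctuation = TRUE,
                                    wordLengths = c(3, Inf),
                                    stopwords = TRUE,
                                    stemming = TRUE,
                                    removeNumbers = TRUE))
}
DTM.happy <- make_dtm(happy)
DTM.unhappy <- make_dtm(unhappy)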