class: center, middle, inverse, title-slide

# Text analysis
### Becky Tang
### 07.08.2021

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Packages

In addition to `tidyverse` we will be using a few other packages today.

```r
library(tidyverse)
library(tidytext)
library(wordcloud)
```

---

## Tidy Data

.question[
What makes a data frame tidy?
]

--

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

---

## Tidytext

- Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use.

- Learn more at https://www.tidytextmining.com/.

---

## What is tidy text?

- Thus far in class you have been employing tidy data principles, and we will continue to do so here!

- Specifically, we will work in the 'tidy text format': a table with one token per row

```r
text <- c("Thus far in class you have been employing tidy data principles, and we will continue to do so here!",
          "Specifically, we will work in the 'tidy text format': a table with one token per row")
text
```

```
## [1] "Thus far in class you have been employing tidy data principles, and we will continue to do so here!"
## [2] "Specifically, we will work in the 'tidy text format': a table with one token per row"
```

- How can we make this text into a format we can work with?

---

## What is tidy text?

```r
text_df <- tibble(line = 1:2, text = text)
text_df
```

```
## # A tibble: 2 x 2
##    line text                                                                    
##   <int> <chr>                                                                   
## 1     1 Thus far in class you have been employing tidy data principles, and we …
## 2     2 Specifically, we will work in the 'tidy text format': a table with one …
```

---

## Tokens

- A token is a meaningful unit of text. For us, that will be a single word

- Tokenizing is simply splitting a body of text into tokens, which can be achieved using the function `unnest_tokens()`

```r
tidy_text <- text_df %>%
  unnest_tokens(word, text)
tidy_text
```

```
## # A tibble: 35 x 2
##     line word     
##    <int> <chr>    
##  1     1 thus     
##  2     1 far      
##  3     1 in       
##  4     1 class    
##  5     1 you      
##  6     1 have     
##  7     1 been     
##  8     1 employing
##  9     1 tidy     
## 10     1 data     
## # … with 25 more rows
```

---

## Tokenizing

- With the `unnest_tokens()` function, we can easily format any body of text into a user-friendly data frame

- First argument: name of the column/variable that the text is being unnested *into*

- Second argument: name of the column/variable that is to be unnested

- Other details:
  - All other columns are kept
  - Punctuation is removed
  - Tokens are converted to lowercase by default

```
## # A tibble: 3 x 2
##    line word 
##   <int> <chr>
## 1     1 thus 
## 2     1 far  
## 3     1 in   
```
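---

## Tokenizing: changing the defaults

The defaults listed on the previous slide can be changed through additional arguments to `unnest_tokens()`. A minimal sketch, reusing the `text_df` tibble from before (these examples are added illustrations, not taken from the original slides):

```r
# Keep the original capitalization instead of converting to lowercase
text_df %>%
  unnest_tokens(word, text, to_lower = FALSE)

# Tokenize into bigrams (pairs of consecutive words) instead of single words
text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
```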
---

## Democratic candidate tweets

- Time to work with some fun data!

- Tweets from two Democratic candidates during the 2019 campaign: Joe Biden and Elizabeth Warren

```
##   screen_name
## 1    JoeBiden
## 2    JoeBiden
## 3    JoeBiden
##   text
## 1 .@DrBiden and I are sending our best wishes to @BernieSanders, Jane, and the whole Sanders family. Anyone who knows Bernie understands what a force he is. We are confident that he will have a full and speedy recovery and look forward to seeing him on the trail soon.
## 2 President Trump is morally unfit to lead our country. https://t.co/r3SKYqK26K
## 3 President Trump asked a foreign government to interfere in our elections, and now he's spending $10 million on attack ads against me. It's clear that he is trying to influence the primary and pick his opponent.\n\nWhy? Because he knows I’ll beat him like a drum.
```

---

## Tidy tweets

```r
remove_reg <- "&|<|>"
tidy_tweets <- tweets %>%
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text, token = "tweets")
tidy_tweets %>%
  slice(1:5)
```

```
##               timestamp screen_name    word
## 1   2019-10-02 15:28:06    JoeBiden drbiden
## 1.1 2019-10-02 15:28:06    JoeBiden     and
## 1.2 2019-10-02 15:28:06    JoeBiden       i
## 1.3 2019-10-02 15:28:06    JoeBiden     are
## 1.4 2019-10-02 15:28:06    JoeBiden sending
```

---

## Word counts

What are the most commonly tweeted words by these candidates?

```r
counts <- tidy_tweets %>%
  count(word, sort = TRUE)
counts %>%
  slice(1:10)
```

```
##    word    n
## 1    to 3212
## 2   the 3183
## 3   and 2276
## 4    of 1402
## 5     a 1378
## 6    in 1309
## 7   for 1255
## 8    we 1202
## 9   our 1073
## 10   is  822
```

---

## Stop words

- In computing, stop words are words that are filtered out before or after processing of natural language data (text).

- They usually refer to the most common words in a language, but there is no single list of stop words used by all natural language processing tools.

```r
data("stop_words")
head(stop_words)
```

```
## # A tibble: 6 x 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART  
```

---

## Stop words

Let's remove the stop words and find the most common words again:

```r
tidy_tweets <- tidy_tweets %>%
  anti_join(stop_words)
tidy_tweets %>%
  count(word, sort = TRUE) %>%
  slice(1:10)
```

```
##         word   n
## 1  president 381
## 2         im 334
## 3      fight 267
## 4     people 261
## 5      trump 258
## 6       plan 231
## 7    country 218
## 8       time 211
## 9     change 205
## 10      care 197
```

---

## Word frequency

While it's nice to get raw counts, we may prefer to know which words are used more often *relative* to other words. For this we can look toward word frequencies:

```r
frequency_all <- tidy_tweets %>%
  count(word, sort = TRUE) %>%
  mutate(freq = n / sum(n))
```

<img src="21-text-analysis_files/figure-html/freq-plot-1.png" style="display: block; margin: auto;" />

---

## Frequency by candidate

```
## # A tibble: 8 x 5
## # Groups:   screen_name [2]
##   screen_name word          n total    freq
##   <fct>       <chr>     <int> <int>   <dbl>
## 1 ewarren     im          224 18125 0.0124 
## 2 ewarren     fight       221 18125 0.0122 
## 3 ewarren     people      150 18125 0.00828
## 4 ewarren     plan        146 18125 0.00806
## 5 JoeBiden    president   288 19255 0.0150 
## 6 JoeBiden    trump       179 19255 0.00930
## 7 JoeBiden    country     150 19255 0.00779
## 8 JoeBiden    care        119 19255 0.00618
```
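---

## Frequency by candidate: code sketch

The table on the previous slide was built from per-candidate counts. A minimal sketch of one way to compute it (the object name `frequency` and the choice of keeping the top four words per candidate are assumptions, not taken from the original code):

```r
frequency <- tidy_tweets %>%
  # count each word separately for each candidate
  count(screen_name, word) %>%
  group_by(screen_name) %>%
  # total words per candidate, then the relative frequency of each word
  mutate(total = sum(n),
         freq  = n / total) %>%
  # keep the four most frequent words within each candidate
  slice_max(freq, n = 4) %>%
  arrange(screen_name, desc(freq))
```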
---

class: middle, center

## Sentiment analysis

---

## Sentiment analysis

- One way to analyze the sentiment of a text is to consider the text as a combination of its individual words

- The sentiment content of the whole text is then the sum of the sentiment content of the individual words

- The sentiment attached to each word is given by a *lexicon*, which may be downloaded from external sources

---

## Sentiment lexicons

- AFINN assigns each word an integer score between -5 and 5. Negative scores indicate negative sentiment, and positive scores indicate positive sentiment

- The bing lexicon categorizes words into one of two categories: 'positive' or 'negative'

.pull-left[
```r
get_sentiments("afinn") %>%
  slice(1:6)
```

```
## # A tibble: 6 x 2
##   word       value
##   <chr>      <dbl>
## 1 abandon       -2
## 2 abandoned     -2
## 3 abandons      -2
## 4 abducted      -2
## 5 abduction     -2
## 6 abductions    -2
```
]

--

.pull-right[
```r
get_sentiments("bing") %>%
  slice(1:6)
```

```
## # A tibble: 6 x 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faces    negative 
## 2 abnormal   negative 
## 3 abolish    negative 
## 4 abominable negative 
## 5 abominably negative 
## 6 abominate  negative 
```
]

---

## Notes about sentiment lexicons

Not every word is in a lexicon!

```r
get_sentiments("bing") %>%
  filter(word == "data")
```

```
## # A tibble: 0 x 2
## # … with 2 variables: word <chr>, sentiment <chr>
```

--

- Lexicons do not account for qualifiers before a word (e.g., "not happy") because they were constructed for one-word tokens only

--

- Summing up each word's sentiment may result in a neutral overall sentiment, even if there are strong positive and negative sentiments in the body

---

## Sentiment of Elizabeth Warren's tweets: bing lexicon

We will use the bing lexicon to estimate the sentiment of Elizabeth Warren's tweets:

```r
(warren_sent <- tidy_tweets %>%
  filter(screen_name == "ewarren") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment))
```

```
##   sentiment    n
## 1  negative 1120
## 2  positive 1130
```

```r
warren_sent %>%
  pull(n) %>%
  diff()
```

```
## [1] 10
```

It appears that Warren is pretty neutral...or is she?

---

## Sentiment of Elizabeth Warren's tweets by month

What if we look at Warren's tweets by month?

```
##       month negative positive sentiment
## 1    August      386      391         5
## 2      July      322      312       -10
## 3   October       27       27         0
## 4 September      385      400        15
```

---

## Most common positive/negative words

```r
bing_counts <- tidy_tweets %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)
bing_counts %>%
  slice(1:4)
```

```
##       word sentiment   n
## 1    trump  positive 258
## 2  protect  positive  96
## 3  support  positive  92
## 4 grateful  positive  80
```

---

## Most common positive/negative words

<img src="21-text-analysis_files/figure-html/pos_neg_plot-1.png" style="display: block; margin: auto;" />

---

## Customize stop words

- For this analysis, we should remove the word 'trump': the lexicon codes it as a word with positive connotation, but in these tweets it almost always refers to the candidate

- To do so, we can make a custom list of stop words:

```r
my_stop_words <- tibble(word = c("trump"),
                        lexicon = c("custom"))
custom_stop_words <- bind_rows(my_stop_words, stop_words)
custom_stop_words
```

```
## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 trump       custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # … with 1,140 more rows
```

---

## Most common positive/negative words

<img src="21-text-analysis_files/figure-html/pos_neg_custom-1.png" style="display: block; margin: auto;" />

---

## Wordcloud

We can visualize the frequencies using a word cloud:

```r
tidy_tweets %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 40, scale = c(3, .5)))
```

---

## Wordcloud

<img src="21-text-analysis_files/figure-html/wordcloud-1.png" style="display: block; margin: auto;" />

---

## Sentiment Wordcloud

<img src="21-text-analysis_files/figure-html/sent_cloud-1.png" style="display: block; margin: auto;" />
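---

## Sentiment Wordcloud: code sketch

A comparison cloud like the one on the previous slide can be made by reshaping the positive/negative word counts into a matrix and passing it to `wordcloud::comparison.cloud()`. A minimal sketch, assuming the `reshape2` package is available (the colors and the `max.words` value are assumptions, not taken from the original code):

```r
library(reshape2)

tidy_tweets %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  # one row per word, one column per sentiment, counts as the cell values
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red3", "blue3"), max.words = 50)
```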
---

## Word frequencies of Biden vs. Warren

We can visualize the differences and similarities in word frequencies between the two candidates:

<img src="21-text-analysis_files/figure-html/word_freq-1.png" style="display: block; margin: auto;" />

- Words near the blue line are used with roughly equal frequencies by Joe Biden and Elizabeth Warren

- Words far away from the blue line are favored much more by one candidate than the other

---

## Probability of using a word

- We can also explore questions such as: given a word, which candidate is more likely to use that word in a tweet?

- We will utilize the log-odds ratio. The odds ratio for Candidate A versus Candidate B is

`$$\text{OR}_{A:B}(word) = \dfrac{\text{odds}_A}{\text{odds}_B} = \dfrac{\text{Prob(A uses word)}}{\text{Prob(B uses word)}}$$`

- The probability that Candidate A uses the word is the number of times A used the word, divided by the total number of words used by A:

`$$\text{Prob(A uses word)} = \dfrac{n_A}{\text{total}_A}$$`

---

## Log-odds ratio

`$$\log \text{OR}_{A:B}(word) = \log \left( \dfrac{n_A / \text{total}_A}{n_B / \text{total}_B}\right)$$`

- Equal odds corresponds to OR = 1, which corresponds to log(OR) = 0

- If Candidate A uses the word with higher probability, then log(OR) > 0

- We use the following approximation in case a candidate does not use the word at all:

`$$\log \text{OR}_{A:B}(word) \approx \log \left( \dfrac{(n_A + 1) / (\text{total}_A + 1)}{(n_B + 1) / (\text{total}_B + 1)}\right)$$`

---

## Word usage: equally likely

Log-odds ratios for Biden versus Warren, displayed in ascending order of the absolute value of the log-odds ratio:

```
## # A tibble: 721 x 2
##    word       logratio
##    <chr>         <dbl>
##  1 funding    -0.00461
##  2 support    -0.00647
##  3 climate    -0.00707
##  4 actions     0.0137 
##  5 agree       0.0137 
##  6 attacks     0.0137 
##  7 commitment  0.0137 
##  8 supremacy   0.0137 
##  9 love       -0.0147 
## 10 millions   -0.0175 
## # … with 711 more rows
```

- Words about equally likely to be tweeted by the two candidates (log(OR) `\(\approx\)` 0) are non-buzzwords

---

## Word usage: most distinctive

- The words that are most likely to be from Biden and not Warren will have the largest, most positive log-odds ratios

- The words that are most likely to be from Warren and not Biden will have the smallest, most negative log-odds ratios

- The plot on the next slide shows the 16 most positive and most negative ratios

---

## Word usage: most distinctive cont.

<img src="21-text-analysis_files/figure-html/word-usage-diff-1.png" style="display: block; margin: auto;" />

---

## Additional resources

[Text Mining with R](https://www.tidytextmining.com/)

- Chapter 1: The tidy text format

- Chapter 2: Sentiment analysis with tidy data

---

## Announcements

- Exam 02 released tomorrow, July 9. Due Sunday, July 11 at 11:59pm. No late work will be accepted.

- Project proposal is due tomorrow, July 9 at 11:59pm. Upload a PDF to Gradescope as well.
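---

## Appendix: computing the log-odds ratios (sketch)

The log-odds ratios on the "Word usage" slides could be computed along these lines. This is a sketch rather than the original code: the object name `word_ratios` and the choice to keep only words used at least 10 times overall are assumptions.

```r
word_ratios <- tidy_tweets %>%
  count(word, screen_name) %>%
  # keep words used a reasonable number of times across both candidates
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  # one column of counts per candidate, 0 when a candidate never used the word
  pivot_wider(names_from = screen_name, values_from = n, values_fill = 0) %>%
  # (n + 1) / (total + 1) for each candidate, then the log of the ratio
  mutate(across(c(JoeBiden, ewarren), ~ (.x + 1) / (sum(.x) + 1)),
         logratio = log(JoeBiden / ewarren)) %>%
  arrange(abs(logratio))
```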