class: center, middle, inverse, title-slide

# Text analysis
### Becky Tang
### 07.08.2021

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Packages

In addition to `tidyverse` we will be using a few other packages today.

```r
library(tidyverse)
library(tidytext)
library(wordcloud)
```

---

## Tidy Data

.question[
What makes a data frame tidy?
]

--

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

---

## Tidytext

- Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use.

- Learn more at https://www.tidytextmining.com/.

---

## What is tidy text?

- Thus far in class you have been employing tidy data principles, and we will continue to do so here!

- Specifically, we will work in the 'tidy text format': a table with one token per row

```r
text <- c("Thus far in class you have been employing tidy data principles, and we will continue to do so here!",
          "Specifically, we will work in the 'tidy text format': a table with one token per row")
text
```

```
## [1] "Thus far in class you have been employing tidy data principles, and we will continue to do so here!"
## [2] "Specifically, we will work in the 'tidy text format': a table with one token per row"
```

- How can we make this text into a format we can work with?

---

## What is tidy text?

```r
text_df <- tibble(line = 1:2, text = text)
text_df
```

```
## # A tibble: 2 x 2
##    line text                                                                    
##   <int> <chr>                                                                   
## 1     1 Thus far in class you have been employing tidy data principles, and we …
## 2     2 Specifically, we will work in the 'tidy text format': a table with one …
```

---

## Tokens

- A token is a meaningful unit of text. For us, that will be a single word

- Tokenizing is simply splitting a body of text into tokens, which can be achieved using the function `unnest_tokens()`

```r
tidy_text <- text_df %>%
  unnest_tokens(word, text)
tidy_text
```

```
## # A tibble: 35 x 2
##     line word     
##    <int> <chr>    
##  1     1 thus     
##  2     1 far      
##  3     1 in       
##  4     1 class    
##  5     1 you      
##  6     1 have     
##  7     1 been     
##  8     1 employing
##  9     1 tidy     
## 10     1 data     
## # … with 25 more rows
```

---

## Tokenizing

- With the `unnest_tokens()` function, we can easily format any body of text into a user-friendly data frame

- First argument: name of the column/variable that the text is being unnested *into*

- Second argument: name of the column/variable that is to be unnested

- Other details:
  - All other columns are kept
  - Punctuation is removed
  - Tokens are converted to lowercase by default

```
## # A tibble: 3 x 2
##    line word 
##   <int> <chr>
## 1     1 thus 
## 2     1 far  
## 3     1 in   
```
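---

## Tokenizing: changing the defaults

The defaults listed on the previous slide can be changed through additional arguments to `unnest_tokens()`. A minimal sketch, reusing the `text_df` tibble from before (these examples are added illustrations, not taken from the original slides):

```r
# Keep the original capitalization instead of converting to lowercase
text_df %>%
  unnest_tokens(word, text, to_lower = FALSE)

# Tokenize into bigrams (pairs of consecutive words) instead of single words
text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
```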
---

## Democratic candidate tweets

- Time to work with some fun data!

- Tweets from two Democratic candidates during the 2019 campaign: Joe Biden and Elizabeth Warren

```
##   screen_name
## 1    JoeBiden
## 2    JoeBiden
## 3    JoeBiden
##   text
## 1 .@DrBiden and I are sending our best wishes to @BernieSanders, Jane, and the whole Sanders family. Anyone who knows Bernie understands what a force he is. We are confident that he will have a full and speedy recovery and look forward to seeing him on the trail soon.
## 2 President Trump is morally unfit to lead our country. https://t.co/r3SKYqK26K
## 3 President Trump asked a foreign government to interfere in our elections, and now he's spending $10 million on attack ads against me. It's clear that he is trying to influence the primary and pick his opponent.\n\nWhy? Because he knows I’ll beat him like a drum.
```

---

## Tidy tweets

```r
remove_reg <- "&|<|>"
tidy_tweets <- tweets %>%
  mutate(text = str_remove_all(text, remove_reg)) %>%
  unnest_tokens(word, text, token = "tweets")
tidy_tweets %>%
  slice(1:5)
```

```
##               timestamp screen_name    word
## 1   2019-10-02 15:28:06    JoeBiden drbiden
## 1.1 2019-10-02 15:28:06    JoeBiden     and
## 1.2 2019-10-02 15:28:06    JoeBiden       i
## 1.3 2019-10-02 15:28:06    JoeBiden     are
## 1.4 2019-10-02 15:28:06    JoeBiden sending
```

---

## Word counts

What are the most commonly tweeted words by these candidates?

```r
counts <- tidy_tweets %>%
  count(word, sort = TRUE)
counts %>%
  slice(1:10)
```

```
##    word    n
## 1    to 3212
## 2   the 3183
## 3   and 2276
## 4    of 1402
## 5     a 1378
## 6    in 1309
## 7   for 1255
## 8    we 1202
## 9   our 1073
## 10   is  822
```

---

## Stop words

- In computing, stop words are words that are filtered out before or after processing of natural language data (text).

- They usually refer to the most common words in a language, but there is no single list of stop words used by all natural language processing tools.

```r
data("stop_words")
head(stop_words)
```

```
## # A tibble: 6 x 2
##   word      lexicon
##   <chr>     <chr>  
## 1 a         SMART  
## 2 a's       SMART  
## 3 able      SMART  
## 4 about     SMART  
## 5 above     SMART  
## 6 according SMART  
```

---

## Stop words

Let's remove the stop words and find the most common words again:

```r
tidy_tweets <- tidy_tweets %>%
  anti_join(stop_words)
tidy_tweets %>%
  count(word, sort = TRUE) %>%
  slice(1:10)
```

```
##         word   n
## 1  president 381
## 2         im 334
## 3      fight 267
## 4     people 261
## 5      trump 258
## 6       plan 231
## 7    country 218
## 8       time 211
## 9     change 205
## 10      care 197
```

---

## Word frequency

While it's nice to get raw counts, we may prefer to know which words are used more often *relative* to other words. For this we can look toward word frequencies:

```r
frequency_all <- tidy_tweets %>%
  count(word, sort = TRUE) %>%
  mutate(freq = n / sum(n))
```

<img src="21-text-analysis_files/figure-html/freq-plot-1.png" style="display: block; margin: auto;" />

---

## Frequency by candidate

```
## # A tibble: 8 x 5
## # Groups:   screen_name [2]
##   screen_name word          n total    freq
##   <fct>       <chr>     <int> <int>   <dbl>
## 1 ewarren     im          224 18125 0.0124 
## 2 ewarren     fight       221 18125 0.0122 
## 3 ewarren     people      150 18125 0.00828
## 4 ewarren     plan        146 18125 0.00806
## 5 JoeBiden    president   288 19255 0.0150 
## 6 JoeBiden    trump       179 19255 0.00930
## 7 JoeBiden    country     150 19255 0.00779
## 8 JoeBiden    care        119 19255 0.00618
```
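---

## Frequency by candidate: code sketch

The table on the previous slide was built from per-candidate counts. A minimal sketch of one way to compute it (the object name `frequency` and the choice of keeping the top four words per candidate are assumptions, not taken from the original code):

```r
frequency <- tidy_tweets %>%
  # count each word separately for each candidate
  count(screen_name, word) %>%
  group_by(screen_name) %>%
  # total words per candidate, then the relative frequency of each word
  mutate(total = sum(n),
         freq  = n / total) %>%
  # keep the four most frequent words within each candidate
  slice_max(freq, n = 4) %>%
  arrange(screen_name, desc(freq))
```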
---

class: middle, center

## Sentiment analysis

---

## Sentiment analysis

- One way to analyze the sentiment of a text is to consider the text as a combination of its individual words

- The sentiment content of the whole text is then the sum of the sentiment content of the individual words

- The sentiment attached to each word is given by a *lexicon*, which may be downloaded from external sources

---

## Sentiment lexicons

- AFINN assigns each word an integer score between -5 and 5. Negative scores indicate negative sentiment, and positive scores indicate positive sentiment

- The bing lexicon categorizes words into one of two categories: 'positive' or 'negative'

.pull-left[
```r
get_sentiments("afinn") %>%
  slice(1:6)
```

```
## # A tibble: 6 x 2
##   word       value
##   <chr>      <dbl>
## 1 abandon       -2
## 2 abandoned     -2
## 3 abandons      -2
## 4 abducted      -2
## 5 abduction     -2
## 6 abductions    -2
```
]

--

.pull-right[
```r
get_sentiments("bing") %>%
  slice(1:6)
```

```
## # A tibble: 6 x 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faces    negative 
## 2 abnormal   negative 
## 3 abolish    negative 
## 4 abominable negative 
## 5 abominably negative 
## 6 abominate  negative 
```
]

---

## Notes about sentiment lexicons

Not every word is in a lexicon!

```r
get_sentiments("bing") %>%
  filter(word == "data")
```

```
## # A tibble: 0 x 2
## # … with 2 variables: word <chr>, sentiment <chr>
```

--

- Lexicons do not account for qualifiers before a word (e.g., "not happy") because they were constructed for one-word tokens only

--

- Summing up each word's sentiment may result in a neutral overall sentiment, even if there are strong positive and negative sentiments in the body

---

## Sentiment of Elizabeth Warren's tweets: bing lexicon

We will use the bing lexicon to estimate the sentiment of Elizabeth Warren's tweets:

```r
(warren_sent <- tidy_tweets %>%
  filter(screen_name == "ewarren") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment))
```

```
##   sentiment    n
## 1  negative 1120
## 2  positive 1130
```

```r
warren_sent %>%
  pull(n) %>%
  diff()
```

```
## [1] 10
```

It appears that Warren is pretty neutral...or is she?

---

## Sentiment of Elizabeth Warren's tweets by month

What if we look at Warren's tweets by month?

```
##       month negative positive sentiment
## 1    August      386      391         5
## 2      July      322      312       -10
## 3   October       27       27         0
## 4 September      385      400        15
```

---

## Most common positive/negative words

```r
bing_counts <- tidy_tweets %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)
bing_counts %>%
  slice(1:4)
```

```
##       word sentiment   n
## 1    trump  positive 258
## 2  protect  positive  96
## 3  support  positive  92
## 4 grateful  positive  80
```

---

## Most common positive/negative words

<img src="21-text-analysis_files/figure-html/pos_neg_plot-1.png" style="display: block; margin: auto;" />

---

## Customize stop words

- For this analysis, we should remove the word 'trump': the lexicon codes it as a word with positive connotation, but in these tweets it almost always refers to the candidate

- To do so, we can make a custom list of stop words:

```r
my_stop_words <- tibble(word = c("trump"),
                        lexicon = c("custom"))
custom_stop_words <- bind_rows(my_stop_words, stop_words)
custom_stop_words
```

```
## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 trump       custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # … with 1,140 more rows
```

---

## Most common positive/negative words

<img src="21-text-analysis_files/figure-html/pos_neg_custom-1.png" style="display: block; margin: auto;" />

---

## Wordcloud

We can visualize the frequencies using a word cloud:

```r
tidy_tweets %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 40, scale = c(3, .5)))
```

---

## Wordcloud

<img src="21-text-analysis_files/figure-html/wordcloud-1.png" style="display: block; margin: auto;" />

---

## Sentiment Wordcloud

<img src="21-text-analysis_files/figure-html/sent_cloud-1.png" style="display: block; margin: auto;" />
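---

## Sentiment Wordcloud: code sketch

A comparison cloud like the one on the previous slide can be made by reshaping the positive/negative word counts into a matrix and passing it to `wordcloud::comparison.cloud()`. A minimal sketch, assuming the `reshape2` package is available (the colors and the `max.words` value are assumptions, not taken from the original code):

```r
library(reshape2)

tidy_tweets %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  # one row per word, one column per sentiment, counts as the cell values
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red3", "blue3"), max.words = 50)
```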
---

## Word frequencies of Biden vs. Warren

We can visualize the differences and similarities in word frequencies between the two candidates:

<img src="21-text-analysis_files/figure-html/word_freq-1.png" style="display: block; margin: auto;" />

- Words near the blue line are used with roughly equal frequencies by Joe Biden and Elizabeth Warren

- Words far away from the blue line are favored much more by one candidate than the other

---

## Probability of using a word

- We can also explore questions such as: given a word, which candidate is more likely to use that word in a tweet?

- We will utilize the log-odds ratio. The odds ratio for Candidate A versus Candidate B is

`$$\text{OR}_{A:B}(word) = \dfrac{\text{odds}_A}{\text{odds}_B} = \dfrac{\text{Prob(A uses word)}}{\text{Prob(B uses word)}}$$`

- The probability that Candidate A uses the word is the number of times A used the word, divided by the total number of words used by A:

`$$\text{Prob(A uses word)} = \dfrac{n_A}{\text{total}_A}$$`

---

## Log-odds ratio

`$$\log \text{OR}_{A:B}(word) = \log \left( \dfrac{n_A / \text{total}_A}{n_B / \text{total}_B}\right)$$`

- Equal odds corresponds to OR = 1, which corresponds to log(OR) = 0

- If Candidate A uses the word with higher probability, then log(OR) > 0

- We use the following approximation in case a candidate does not use the word at all:

`$$\log \text{OR}_{A:B}(word) \approx \log \left( \dfrac{(n_A + 1) / (\text{total}_A + 1)}{(n_B + 1) / (\text{total}_B + 1)}\right)$$`

---

## Word usage: equally likely

Log-odds ratios for Biden versus Warren, displayed in ascending order of the absolute value of the log-odds ratio:

```
## # A tibble: 721 x 2
##    word       logratio
##    <chr>         <dbl>
##  1 funding    -0.00461
##  2 support    -0.00647
##  3 climate    -0.00707
##  4 actions     0.0137 
##  5 agree       0.0137 
##  6 attacks     0.0137 
##  7 commitment  0.0137 
##  8 supremacy   0.0137 
##  9 love       -0.0147 
## 10 millions   -0.0175 
## # … with 711 more rows
```

- Words about equally likely to be tweeted by the two candidates (log(OR) `\(\approx\)` 0) are non-buzzwords

---

## Word usage: most distinctive

- The words that are most likely to be from Biden and not Warren will have the largest, most positive log-odds ratios

- The words that are most likely to be from Warren and not Biden will have the smallest, most negative log-odds ratios

- The plot on the next slide shows the 16 most positive and most negative ratios

---

## Word usage: most distinctive cont.

<img src="21-text-analysis_files/figure-html/word-usage-diff-1.png" style="display: block; margin: auto;" />

---

## Additional resources

[Text Mining with R](https://www.tidytextmining.com/)

- Chapter 1: The tidy text format

- Chapter 2: Sentiment analysis with tidy data

---

## Announcements

- Exam 02 released tomorrow, July 9. Due Sunday, July 11 at 11:59pm. No late work will be accepted.

- Project proposal is due tomorrow, July 9 at 11:59pm. Upload a PDF to Gradescope as well.
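---

## Appendix: computing the log-odds ratios (sketch)

The log-odds ratios on the "Word usage" slides could be computed along these lines. This is a sketch rather than the original code: the object name `word_ratios` and the choice to keep only words used at least 10 times overall are assumptions.

```r
word_ratios <- tidy_tweets %>%
  count(word, screen_name) %>%
  # keep words used a reasonable number of times across both candidates
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  # one column of counts per candidate, 0 when a candidate never used the word
  pivot_wider(names_from = screen_name, values_from = n, values_fill = 0) %>%
  # (n + 1) / (total + 1) for each candidate, then the log of the ratio
  mutate(across(c(JoeBiden, ewarren), ~ (.x + 1) / (sum(.x) + 1)),
         logratio = log(JoeBiden / ewarren)) %>%
  arrange(abs(logratio))
```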