Web scraping

# Web scraping
### Becky Tang
### 07.15.2021

---

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

# Scraping the web

---

## Scraping the web: what? why?

- Increasing amount of data is available on the web
--

- These data are provided in an unstructured format: you can always copy&paste, 
but it's time-consuming and prone to errors

- Web scraping is the process of extracting this information automatically and transform it into a structured dataset

- Two different scenarios:
    - Screen scraping: extract data from source code of website, with html 
    parser (easy) or regular expression matching (less easy).
    - Web APIs (application programming interface): website offers a set of 
    structured http requests that return JSON or XML files.

---

# Web Scraping with rvest

---

## Hypertext Markup Language (HTML)

.midi[
- HTML describes the structure of a web page; your browser interprets the 
  structure and contents and displays the results.
  
- The basic building blocks include elements, tags, and attributes.
    - an element is a component of an HTML document
    - elements contain tags (start and end tag)
    - attributes provide additional information about HTML elements
]

---

## Hypertext Markup Language (HTML)

---

## Simple HTML document

```html
<html>
<head>
<title>Web Scraping</title>
</head>
<body>

<h1>Using rvest</h1>
<p>To get started...</p>

</body>
</html>
```

We can visualize this in a tree-like structure.

---

## HTML tree-like structure

---

### If we have access to an HTML document, then how can we easily extract information and get it in a format that's useful for anlaysis?

---

## rvest

.pull-left[
- The **rvest** package makes basic processing and manipulation of HTML data straight forward
- It's designed to work with pipelines built with `%>%`
]
.pull-right[
<img src="img/25/rvest.png" width="230" style="display: block; margin: auto 0 auto auto;" />
]

---

## Core rvest functions

- `read_html`   - Read HTML data from a url or character string
- `html_node `  - Select a specified node from HTML document
- `html_nodes`  - Select specified nodes from HTML document
- `html_table`  - Parse an HTML table into a data frame
- `html_text`   - Extract tag pairs' content
- `html_name`   - Extract tags' names
- `html_attrs`  - Extract all of each tag's attributes
- `html_attr`   - Extract tags' attribute value by name

---

## Example: simple_html

Let's suppose we have the following HTML document from the example website with the URL `simple_html`

```html 
<html>
<head>
<title>Web Scraping</title>
</head>
<body>
<h1>Using rvest</h1>
<p>To get started...</p>
</body>
</html>
```

---

## HTML in R

Read in the document with `read_html()`.

```r
page <- read_html(simple_html) #replace with URL in practice
```

What does this look like?

```r
page
```

```
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n<h1>Using rvest</h1>\n<p>To get started...</p>\n</body>
```

---

## Subset with `html_nodes()`

Let's extract the highlighted component below.

```r
<html>
<head>
<title>Web Scraping</title>
</head>
<body>
*<h1>Using rvest</h1>
<p>To get started...</p>
</body>
</html>
```
]
--

```r
h1_nodes <-page %>%
  html_nodes(css = "h1")
h1_nodes
```

```
## {xml_nodeset (1)}
## [1] <h1>Using rvest</h1>
```
]
---

## Extract contents and tag name

Let's extract "Using rvest" and `h1`.

```r
<html>
<head>
<title>Web Scraping</title>
</head>
<body>
*<h1>Using rvest</h1>
<p>To get started...</p>
</body>
</html>
```
]
--

```r
h1_nodes %>% 
  html_text()
```

```
## [1] "Using rvest"
```
]
]

```r
h1_nodes %>% 
  html_name()
```

```
## [1] "h1"
```
]
]
---

## Scaling up

Most HTML documents are not as simple as what we just examined. There may be tables, hundreds of links, paragraphs of text, and more. Naturally, we may wonder:

1. How do we handle larger HTML documents?

2. How do we know what to provide to `css` in function `html_nodes()` when
   we attempt to subset the HTML document?
   
3. Are these functions in `rvest` vectorized? For instance, are we able to get 
   all the content in the `td` tags on the slide that follows?

In Chrome, you can view the HTML document associated with a web page by going
to `View > Developer > View Source`.
---

## SelectorGadget

.pull-left[
- Open source tool that eases CSS selector generation and discovery
- Easiest to use with the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb) 
- Find out more on the [SelectorGadget vignette](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html)
]
.pull-right[
<img src="img/25/selector-gadget.png" width="1901" style="display: block; margin: auto;" />
]

---

## Using the SelectorGadget

.pull-left[
- .midi[Click on the app logo next to the search bar. A box will open in the bottom right of the website.]
- .midi[Click on a page element (it will turn green), SelectorGadget will generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.]
]

.pull-right[
<img src="img/25/selector-gadget.gif" height="250" style="display: block; margin: auto;" />
]

- .midi[Click on a highlighted element to remove it from the selector (red), or click on an unhighlighted element to add it to the selector.]

- .midi[Through this process of selection and rejection, SelectorGadget helps you come up with the appropriate CSS selector for your needs.]

---

# Top 250 movies on IMDB

---

## Top 250 movies on IMDB

Take a look at the source code, look for the tag `table` tag:
<br>
http://www.imdb.com/chart/top

---

## First check if you're allowed!

```r
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
## [1] TRUE
```

vs. e.g.

```r
paths_allowed("http://www.facebook.com")
```

```
## [1] FALSE
```

---

## Select and format pieces

```r
page <- read_html("http://www.imdb.com/chart/top")
```
]

```r
titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
```
]

```r
years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\$", "") %>% # remove (
  str_replace("\$", "") %>% # remove )
  as.numeric()
```
]

```r
scores <- page %>%
  html_nodes("#main strong") %>%
  html_text() %>%
  as.numeric()
```
]

```r
imdb_top_250 <- tibble(
  title = titles, 
  year = years, 
  score = scores)
```
]

---

```r
imdb_top_250
```

```
## # A tibble: 250 x 3
##    title                                              year score
##    <chr>                                             <dbl> <dbl>
##  1 The Shawshank Redemption                           1994   9.2
##  2 The Godfather                                      1972   9.1
##  3 The Godfather: Part II                             1974   9  
##  4 The Dark Knight                                    2008   9  
##  5 12 Angry Men                                       1957   8.9
##  6 Schindler's List                                   1993   8.9
##  7 The Lord of the Rings: The Return of the King      2003   8.9
##  8 Pulp Fiction                                       1994   8.8
##  9 The Good, the Bad and the Ugly                     1966   8.8
## 10 The Lord of the Rings: The Fellowship of the Ring  2001   8.8
## # … with 240 more rows
```

---
.small[

```r
imdb_top_250 %>%
* DT::datatable(options(list(dom = "p")))
```
]

---

.small[
<div id="htmlwidget-0e71f80f00073b272572" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-0e71f80f00073b272572">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51","52","53","54","55","56","57","58","59","60","61","62","63","64","65","66","67","68","69","70","71","72","73","74","75","76","77","78","79","80","81","82","83","84","85","86","87","88","89","90","91","92","93","94","95","96","97","98","99","100","101","102","103","104","105","106","107","108","109","110","111","112","113","114","115","116","117","118","119","120","121","122","123","124","125","126","127","128","129","130","131","132","133","134","135","136","137","138","139","140","141","142","143","144","145","146","147","148","149","150","151","152","153","154","155","156","157","158","159","160","161","162","163","164","165","166","167","168","169","170","171","172","173","174","175","176","177","178","179","180","181","182","183","184","185","186","187","188","189","190","191","192","193","194","195","196","197","198","199","200","201","202","203","204","205","206","207","208","209","210","211","212","213","214","215","216","217","218","219","220","221","222","223","224","225","226","227","228","229","230","231","232","233","234","235","236","237","238","239","240","241","242","243","244","245","246","247","248","249","250"],["The Shawshank Redemption","The Godfather","The Godfather: Part II","The Dark Knight","12 Angry Men","Schindler's List","The Lord of the Rings: The Return of the King","Pulp Fiction","The Good, the Bad and the Ugly","The Lord of the Rings: The Fellowship of the Ring","Fight Club","Forrest Gump","Inception","The Lord of the Rings: The Two Towers","Star Wars: Episode V - The Empire Strikes Back","The Matrix","Goodfellas","One Flew Over the Cuckoo's Nest","Seven Samurai","Se7en","The Silence of the Lambs","City of God","Life Is Beautiful","It's a Wonderful Life","Star Wars: Episode IV - A New Hope","Saving Private Ryan","Spirited Away","Interstellar","The Green Mile","Parasite","Léon: The Professional","Hara-Kiri","The Usual Suspects","The Pianist","Back to the Future","Terminator 2: Judgment Day","Psycho","Modern Times","The Lion King","American History X","City Lights","Gladiator","Grave of the Fireflies","Whiplash","The Departed","The Intouchables","The Prestige","Casablanca","Once Upon a Time in the West","Rear Window","Cinema Paradiso","Alien","Apocalypse Now","Memento","Indiana Jones and the Raiders of the Lost Ark","The Great Dictator","The Lives of Others","Django Unchained","Paths of Glory","Sunset Blvd.","WALL·E","The Shining","Avengers: Infinity War","Witness for the Prosecution","Joker","Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb","Spider-Man: Into the Spider-Verse","Oldboy","Princess Mononoke","Hamilton","Your Name.","Once Upon a Time in America","The Dark Knight Rises","Aliens","Coco","Das Boot","Capharnaüm","Avengers: Endgame","High and Low","American Beauty","Toy Story","Pather Panchali","Braveheart","Amadeus","3 Idiots","Inglourious Basterds","Good Will Hunting","Star Wars: Episode VI - Return of the Jedi","2001: A Space Odyssey","Reservoir Dogs","M","Like Stars on Earth","Vertigo","Citizen Kane","The Hunt","Come and See","Requiem for a Dream","Singin' in the Rain","North by Northwest","Eternal Sunshine of the Spotless Mind","Bicycle Thieves","Ikiru","Lawrence of Arabia","The Kid","Full Metal Jacket","Dangal","The Father","A Clockwork Orange","Metropolis","Taxi Driver","Double Indemnity","The Sting","The Apartment","A Separation","1917","Incendies","Amélie","Snatch","Scarface","Toy Story 3","To Kill a Mockingbird","For a Few Dollars More","Up","Indiana Jones and the Last Crusade","L.A. Confidential","Heat","Rashomon","Yojimbo","Ran","Die Hard","Monty Python and the Holy Grail","Green Book","Downfall","Batman Begins","All About Eve","Some Like It Hot","Unforgiven","Children of Heaven","Howl's Moving Castle","The Wolf of Wall Street","The Great Escape","Judgment at Nuremberg","Casino","The Treasure of the Sierra Madre","Pan's Labyrinth","There Will Be Blood","A Beautiful Mind","The Secret in Their Eyes","Raging Bull","My Neighbor Totoro","Chinatown","Lock, Stock and Two Smoking Barrels","The Gold Rush","Dial M for Murder","Three Billboards Outside Ebbing, Missouri","No Country for Old Men","Shutter Island","The Seventh Seal","The Elephant Man","The Thing","The Sixth Sense","Klaus","Inside Out","V for Vendetta","The Third Man","Wild Strawberries","Blade Runner","The Bridge on the River Kwai","Trainspotting","The Truman Show","Jurassic Park","Warrior","Memories of Murder","Fargo","My Father and My Son","Finding Nemo","Gone with the Wind","Kill Bill: Vol. 1","Tokyo Story","On the Waterfront","Stalker","The General","Wild Tales","The Deer Hunter","Sherlock Jr.","Gran Torino","Room","The Grand Budapest Hotel","Before Sunrise","Persona","Mary and Max","Prisoners","In the Name of the Father","Mr. Smith Goes to Washington","Catch Me If You Can","Gone Girl","Hacksaw Ridge","To Be or Not to Be","Andhadhun","Barry Lyndon","The Big Lebowski","Ford v Ferrari","12 Years a Slave","The Passion of Joan of Arc","How to Train Your Dragon","Mad Max: Fury Road","The Wages of Fear","Ben-Hur","Million Dollar Baby","Dead Poets Society","Network","Harry Potter and the Deathly Hallows: Part 2","Autumn Sonata","Stand by Me","The Handmaiden","Cool Hand Luke","The 400 Blows","Logan","The Bandit","Hachi: A Dog's Tale","Platoon","La Haine","Monty Python's Life of Brian","Spotlight","Hotel Rwanda","Rebecca","Rush","Into the Wild","Andrei Rublev","Monsters, Inc.","Gangs of Wasseypur","Amores Perros","Rocky","A Silent Voice: The Movie","In the Mood for Love","Raatchasan","Demon Slayer: Mugen Train","Nausicaä of the Valley of the Wind","It Happened One Night","The Battle of Algiers","Before Sunset","Drishyam 2","Rififi","The Princess Bride","Three Colors: Red","Fanny and Alexander","Neon Genesis Evangelion: The End of Evangelion","Soul","Sunrise","Tumbbad"],[1994,1972,1974,2008,1957,1993,2003,1994,1966,2001,1999,1994,2010,2002,1980,1999,1990,1975,1954,1995,1991,2002,1997,1946,1977,1998,2001,2014,1999,2019,1994,1962,1995,2002,1985,1991,1960,1936,1994,1998,1931,2000,1988,2014,2006,2011,2006,1942,1968,1954,1988,1979,1979,2000,1981,1940,2006,2012,1957,1950,2008,1980,2018,1957,2019,1964,2018,2003,1997,2020,2016,1984,2012,1986,2017,1981,2018,2019,1963,1999,1995,1955,1995,1984,2009,2009,1997,1983,1968,1992,1931,2007,1958,1941,2012,1985,2000,1952,1959,2004,1948,1952,1962,1921,1987,2016,2020,1971,1927,1976,1944,1973,1960,2011,2019,2010,2001,2000,1983,2010,1962,1965,2009,1989,1997,1995,1950,1961,1985,1988,1975,2018,2004,2005,1950,1959,1992,1997,2004,2013,1963,1961,1995,1948,2006,2007,2001,2009,1980,1988,1974,1998,1925,1954,2017,2007,2010,1957,1980,1982,1999,2019,2015,2005,1949,1957,1982,1957,1996,1998,1993,2011,2003,1996,2005,2003,1939,2003,1953,1954,1979,1926,2014,1978,1924,2008,2015,2014,1995,1966,2009,2013,1993,1939,2002,2014,2016,1942,2018,1975,1998,2019,2013,1928,2010,2015,1953,1959,2004,1989,1976,2011,1978,1986,2016,1967,1959,2017,1996,2009,1986,1995,1979,2015,2004,1940,2013,2007,1966,2001,2012,2000,1976,2016,2000,2018,2020,1984,1934,1966,2004,2021,1955,1987,1994,1982,1997,2020,1927,2018],[9.2,9.1,9,9,8.9,8.9,8.9,8.8,8.8,8.8,8.8,8.7,8.7,8.7,8.7,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>title<\/th>\n      <th>year<\/th>\n      <th>score<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-right","targets":[2,3]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]

---

## Clean up / enhance

May or may not be a lot of work depending on how messy the data are

**See if you like what you got:**

```r
glimpse(imdb_top_250)
```

```
## Rows: 250
## Columns: 3
## $ title <chr> "The Shawshank Redemption", "The Godfather", "The Godfather: Par…
## $ year  <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 2001, 1999…
## $ score <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.8, 8.7, 8.7,…
```
]

**Add a variable for rank**

```r
imdb_top_250 <- imdb_top_250 %>%
  mutate(rank = 1:nrow(imdb_top_250))
```
]

---

```r
imdb_top_250 %>%
  DT::datatable(options(list(dom = "p")), height = 350)
```

<div id="htmlwidget-3e3345e34ed5104efa98" style="width:100%;height:350px;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-3e3345e34ed5104efa98">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51","52","53","54","55","56","57","58","59","60","61","62","63","64","65","66","67","68","69","70","71","72","73","74","75","76","77","78","79","80","81","82","83","84","85","86","87","88","89","90","91","92","93","94","95","96","97","98","99","100","101","102","103","104","105","106","107","108","109","110","111","112","113","114","115","116","117","118","119","120","121","122","123","124","125","126","127","128","129","130","131","132","133","134","135","136","137","138","139","140","141","142","143","144","145","146","147","148","149","150","151","152","153","154","155","156","157","158","159","160","161","162","163","164","165","166","167","168","169","170","171","172","173","174","175","176","177","178","179","180","181","182","183","184","185","186","187","188","189","190","191","192","193","194","195","196","197","198","199","200","201","202","203","204","205","206","207","208","209","210","211","212","213","214","215","216","217","218","219","220","221","222","223","224","225","226","227","228","229","230","231","232","233","234","235","236","237","238","239","240","241","242","243","244","245","246","247","248","249","250"],["The Shawshank Redemption","The Godfather","The Godfather: Part II","The Dark Knight","12 Angry Men","Schindler's List","The Lord of the Rings: The Return of the King","Pulp Fiction","The Good, the Bad and the Ugly","The Lord of the Rings: The Fellowship of the Ring","Fight Club","Forrest Gump","Inception","The Lord of the Rings: The Two Towers","Star Wars: Episode V - The Empire Strikes Back","The Matrix","Goodfellas","One Flew Over the Cuckoo's Nest","Seven Samurai","Se7en","The Silence of the Lambs","City of God","Life Is Beautiful","It's a Wonderful Life","Star Wars: Episode IV - A New Hope","Saving Private Ryan","Spirited Away","Interstellar","The Green Mile","Parasite","Léon: The Professional","Hara-Kiri","The Usual Suspects","The Pianist","Back to the Future","Terminator 2: Judgment Day","Psycho","Modern Times","The Lion King","American History X","City Lights","Gladiator","Grave of the Fireflies","Whiplash","The Departed","The Intouchables","The Prestige","Casablanca","Once Upon a Time in the West","Rear Window","Cinema Paradiso","Alien","Apocalypse Now","Memento","Indiana Jones and the Raiders of the Lost Ark","The Great Dictator","The Lives of Others","Django Unchained","Paths of Glory","Sunset Blvd.","WALL·E","The Shining","Avengers: Infinity War","Witness for the Prosecution","Joker","Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb","Spider-Man: Into the Spider-Verse","Oldboy","Princess Mononoke","Hamilton","Your Name.","Once Upon a Time in America","The Dark Knight Rises","Aliens","Coco","Das Boot","Capharnaüm","Avengers: Endgame","High and Low","American Beauty","Toy Story","Pather Panchali","Braveheart","Amadeus","3 Idiots","Inglourious Basterds","Good Will Hunting","Star Wars: Episode VI - Return of the Jedi","2001: A Space Odyssey","Reservoir Dogs","M","Like Stars on Earth","Vertigo","Citizen Kane","The Hunt","Come and See","Requiem for a Dream","Singin' in the Rain","North by Northwest","Eternal Sunshine of the Spotless Mind","Bicycle Thieves","Ikiru","Lawrence of Arabia","The Kid","Full Metal Jacket","Dangal","The Father","A Clockwork Orange","Metropolis","Taxi Driver","Double Indemnity","The Sting","The Apartment","A Separation","1917","Incendies","Amélie","Snatch","Scarface","Toy Story 3","To Kill a Mockingbird","For a Few Dollars More","Up","Indiana Jones and the Last Crusade","L.A. Confidential","Heat","Rashomon","Yojimbo","Ran","Die Hard","Monty Python and the Holy Grail","Green Book","Downfall","Batman Begins","All About Eve","Some Like It Hot","Unforgiven","Children of Heaven","Howl's Moving Castle","The Wolf of Wall Street","The Great Escape","Judgment at Nuremberg","Casino","The Treasure of the Sierra Madre","Pan's Labyrinth","There Will Be Blood","A Beautiful Mind","The Secret in Their Eyes","Raging Bull","My Neighbor Totoro","Chinatown","Lock, Stock and Two Smoking Barrels","The Gold Rush","Dial M for Murder","Three Billboards Outside Ebbing, Missouri","No Country for Old Men","Shutter Island","The Seventh Seal","The Elephant Man","The Thing","The Sixth Sense","Klaus","Inside Out","V for Vendetta","The Third Man","Wild Strawberries","Blade Runner","The Bridge on the River Kwai","Trainspotting","The Truman Show","Jurassic Park","Warrior","Memories of Murder","Fargo","My Father and My Son","Finding Nemo","Gone with the Wind","Kill Bill: Vol. 1","Tokyo Story","On the Waterfront","Stalker","The General","Wild Tales","The Deer Hunter","Sherlock Jr.","Gran Torino","Room","The Grand Budapest Hotel","Before Sunrise","Persona","Mary and Max","Prisoners","In the Name of the Father","Mr. Smith Goes to Washington","Catch Me If You Can","Gone Girl","Hacksaw Ridge","To Be or Not to Be","Andhadhun","Barry Lyndon","The Big Lebowski","Ford v Ferrari","12 Years a Slave","The Passion of Joan of Arc","How to Train Your Dragon","Mad Max: Fury Road","The Wages of Fear","Ben-Hur","Million Dollar Baby","Dead Poets Society","Network","Harry Potter and the Deathly Hallows: Part 2","Autumn Sonata","Stand by Me","The Handmaiden","Cool Hand Luke","The 400 Blows","Logan","The Bandit","Hachi: A Dog's Tale","Platoon","La Haine","Monty Python's Life of Brian","Spotlight","Hotel Rwanda","Rebecca","Rush","Into the Wild","Andrei Rublev","Monsters, Inc.","Gangs of Wasseypur","Amores Perros","Rocky","A Silent Voice: The Movie","In the Mood for Love","Raatchasan","Demon Slayer: Mugen Train","Nausicaä of the Valley of the Wind","It Happened One Night","The Battle of Algiers","Before Sunset","Drishyam 2","Rififi","The Princess Bride","Three Colors: Red","Fanny and Alexander","Neon Genesis Evangelion: The End of Evangelion","Soul","Sunrise","Tumbbad"],[1994,1972,1974,2008,1957,1993,2003,1994,1966,2001,1999,1994,2010,2002,1980,1999,1990,1975,1954,1995,1991,2002,1997,1946,1977,1998,2001,2014,1999,2019,1994,1962,1995,2002,1985,1991,1960,1936,1994,1998,1931,2000,1988,2014,2006,2011,2006,1942,1968,1954,1988,1979,1979,2000,1981,1940,2006,2012,1957,1950,2008,1980,2018,1957,2019,1964,2018,2003,1997,2020,2016,1984,2012,1986,2017,1981,2018,2019,1963,1999,1995,1955,1995,1984,2009,2009,1997,1983,1968,1992,1931,2007,1958,1941,2012,1985,2000,1952,1959,2004,1948,1952,1962,1921,1987,2016,2020,1971,1927,1976,1944,1973,1960,2011,2019,2010,2001,2000,1983,2010,1962,1965,2009,1989,1997,1995,1950,1961,1985,1988,1975,2018,2004,2005,1950,1959,1992,1997,2004,2013,1963,1961,1995,1948,2006,2007,2001,2009,1980,1988,1974,1998,1925,1954,2017,2007,2010,1957,1980,1982,1999,2019,2015,2005,1949,1957,1982,1957,1996,1998,1993,2011,2003,1996,2005,2003,1939,2003,1953,1954,1979,1926,2014,1978,1924,2008,2015,2014,1995,1966,2009,2013,1993,1939,2002,2014,2016,1942,2018,1975,1998,2019,2013,1928,2010,2015,1953,1959,2004,1989,1976,2011,1978,1986,2016,1967,1959,2017,1996,2009,1986,1995,1979,2015,2004,1940,2013,2007,1966,2001,2012,2000,1976,2016,2000,2018,2020,1984,1934,1966,2004,2021,1955,1987,1994,1982,1997,2020,1927,2018],[9.2,9.1,9,9,8.9,8.9,8.9,8.8,8.8,8.8,8.8,8.7,8.7,8.7,8.7,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.6,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.5,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.4,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.3,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.2,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8.1,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8],[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>title<\/th>\n      <th>year<\/th>\n      <th>score<\/th>\n      <th>rank<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"dom":"p","columnDefs":[{"className":"dt-right","targets":[2,3,4]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
]

---

## Analyze

```r
imdb_top_250 %>% 
  filter(year == 1995)
```

```
## # A tibble: 8 x 4
##   title               year score  rank
##   <chr>              <dbl> <dbl> <int>
## 1 Se7en               1995   8.6    20
## 2 The Usual Suspects  1995   8.5    33
## 3 Toy Story           1995   8.3    81
## 4 Braveheart          1995   8.3    83
## 5 Heat                1995   8.2   126
## 6 Casino              1995   8.2   143
## 7 Before Sunrise      1995   8.1   189
## 8 La Haine            1995   8     222
```
]

---

## Analyze

.question[
How would you go about answering this question: Which years have the most movies on the list?
]

```r
imdb_top_250 %>% 
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(5)
```

```
## # A tibble: 5 x 2
##    year total
##   <dbl> <int>
## 1  1995     8
## 2  2018     7
## 3  1957     6
## 4  1994     6
## 5  1997     6
```
]

---

.question[
How would you go about creating this visualization: Visualize the average yearly score for movies that made it on the top 250 list over time.
]

Code:

```r
imdb_top_250 %>% 
  group_by(year) %>%
  summarise(avg_score = mean(score)) %>%
  ggplot(aes(y = avg_score, x = year)) +
    geom_point() +
    geom_smooth(method = "lm") +
    labs(x = "Year", y = "Average score")
```

---

---

## Potential challenges

- Unreliable formatting at the source
- Data broken into many pages
- ...

.question[
Compare the display of information at [raleigh.craigslist.org/search/apa](https://raleigh.craigslist.org/search/apa) to the list on the IMDB top 250 list. What challenges can you foresee in scraping a list of the available apartments?
]