Tidy data and data wrangling

# Tidy data and data wrangling
### Becky Tang

---

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

# Tidy data

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way. 
>
>Leo Tolstoy
--

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
]
--
.pull-right[
**Characteristics of untidy data:**

!@#$%^&*()
]

---

<div class="figure" style="text-align: center">
<img src="img/04/untidy-data/hyperwar-airplanes-on-hand.png" alt="WW2 Army Air Force combat aircraft from [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)" width="80%" />
<p class="caption">WW2 Army Air Force combat aircraft from [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)</p>
</div>

---

<br>

<div class="figure" style="text-align: center">
<img src="img/04/untidy-data/us-general-economic-characteristic-acs-2017.png" alt="[US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&amp;src=pt)" width="90%" />
<p class="caption">[US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt)</p>
</div>

---

## Summary tables

```
## # A tibble: 87 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # … with 77 more rows
```
]
.pull-right[

```
## # A tibble: 3 x 2
##   gender    avg_height
##   <chr>          <dbl>
## 1 feminine        165.
## 2 masculine       177.
## 3 <NA>            181.
```
]
]

---

## Displaying data

```r
starwars %>%
  select(name, height, mass)
```

<br>

## Summarizing data

```r
starwars %>%
  group_by(gender) %>%
  summarize(
    avg_height = mean(height, na.rm = TRUE) %>% round(2)
  )
```

---

# Grammar of data wrangling

---

## A grammar of data wrangling...

... based on the concepts of functions as verbs that manipulate data frames

.pull-left[
<img src="img/04/dplyr-part-of-tidyverse.png" width="70%" style="display: block; margin: auto;" />
]
.pull-right[
.small[
- `select`: pick columns by name
- `arrange`: reorder rows
- `slice`: pick rows using index(es)
- `filter`: pick rows matching criteria
- `distinct`: filter for unique rows
- `mutate`: add new variables
- `summarise`: reduce variables to values
- `group_by`: for grouped operations
- ... (many more)
]
]

---

## Rules of **dplyr** functions

- First argument is *always* a data frame

- Subsequent arguments say what to do with that data frame

- Always return a data frame

---

## Data: Hotel bookings

- Data from two hotels: one resort and one city hotel

- Observations: Each row represents a hotel booking

- Goal for original data collection: Development of prediction models to classify a hotel booking's likelihood to be cancelled ([Antonia et al., 2019](https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5))

- Featured in [TidyTuesday](https://github.com/rfordatascience/tidytuesday)!

```r
hotels <- read_csv("data/hotels.csv")
```

---

## First look: Variables

```r
names(hotels)
```

```
##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "company"                        "days_in_waiting_list"          
## [27] "customer_type"                  "adr"                           
## [29] "required_car_parking_spaces"    "total_of_special_requests"     
## [31] "reservation_status"             "reservation_status_date"
```

---

## Second look: Overview

```r
glimpse(hotels)
```

```
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
## $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type                  <chr> "Transient", "Transient", "Transient", …
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…
```

---

## Select a single column

View only the `lead_time` (number of days between booking and arrival date):

--
.pull-left[

```r
hotels %>%
  select(lead_time)
```

```
## # A tibble: 119,390 x 1
##    lead_time
##        <dbl>
##  1       342
##  2       737
##  3         7
##  4        13
##  5        14
##  6        14
##  7         0
##  8         9
##  9        85
## 10        75
## # … with 119,380 more rows
```
]
--
.pull-right[
- Start with a data frame
- Pass it to the `select()` function.
- Argument is variable we want to select: `lead_time`
- The result is a data frame with 119,300 and 1 column: --dplyr functions always 
expect a data frame and always yield a data frame.
]

---

## Select multiple columns

View only the `hotel` type and `lead_time`:

```r
hotels %>%
  select(hotel, lead_time)
```

```
## # A tibble: 119,390 x 2
##    hotel        lead_time
##    <chr>            <dbl>
##  1 Resort Hotel       342
##  2 Resort Hotel       737
##  3 Resort Hotel         7
##  4 Resort Hotel        13
##  5 Resort Hotel        14
##  6 Resort Hotel        14
##  7 Resort Hotel         0
##  8 Resort Hotel         9
##  9 Resort Hotel        85
## 10 Resort Hotel        75
## # … with 119,380 more rows
```
]
--
.pull-right[
.discussion[
What if we wanted to select these columns, and then arrange the data in descending order of lead time?
]
]

---

## Data wrangling, step-by-step

```r
hotels %>%
  select(hotel, lead_time)
```

--
.pull-right[
.vocab[Select, then arrange:]

```r
hotels %>%
  select(hotel, lead_time) %>%
  arrange(desc(lead_time))
```

```
## # A tibble: 119,390 x 2
##    hotel        lead_time
##    <chr>            <dbl>
##  1 Resort Hotel       737
##  2 Resort Hotel       709
##  3 City Hotel         629
##  4 City Hotel         629
##  5 City Hotel         629
##  6 City Hotel         629
##  7 City Hotel         629
##  8 City Hotel         629
##  9 City Hotel         629
## 10 City Hotel         629
## # … with 119,380 more rows
```
]

---

## What's happening here?

```r
hotels %>%
  select(hotel, lead_time) %>%
  arrange(hotel)
```

```
## # A tibble: 119,390 x 2
##    hotel      lead_time
##    <chr>          <dbl>
##  1 City Hotel         6
##  2 City Hotel        88
##  3 City Hotel        65
##  4 City Hotel        92
##  5 City Hotel       100
##  6 City Hotel        79
##  7 City Hotel         3
##  8 City Hotel        63
##  9 City Hotel        62
## 10 City Hotel        62
## # … with 119,380 more rows
```

---

# Pipes

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
]
.pull-right[
.small[

```r
*hotels %>%
  select(hotel, lead_time) %>%
  arrange(desc(lead_time))
```

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
- then we select the variables `hotel` and `lead_time`,
]
.pull-right[
.small[

```r
hotels %>%
* select(hotel, lead_time) %>%
  arrange(desc(lead_time))
```

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `hotels`, and pass it to the `select()` function,
- then we select the variables `hotel` and `lead_time`,
- and then we arrange the data frame by `lead_time` in descending order.
]
.pull-right[
.small[

```r
hotels %>%
  select(hotel, lead_time) %>% 
* arrange(desc(lead_time))
```

---

## Aside

The pipe operator is implemented in the package **magrittr**, though we don't need to load this package explicitly since **tidyverse** does this for us.

.pull-left[
<img src="img/04/magritte.jpg" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/04/magrittr.jpg" width="100%" style="display: block; margin: auto;" />
]

---

## How does a pipe work?

- You can think about the following sequence of actions - find keys,  start car, drive to work, park.

--
- Expressed as a set of nested functions in R pseudocode this would look like:

```r
park(drive(start_car(find("keys")), to = "work"))
```

--
- Writing it out using pipes give it a more natural (and easier to read)
structure:
  - Read the pipe as "and then"

```r
find("keys") %>%
  start_car() %>%
  drive(to = "work") %>%
  park()
```

---

## What about other arguments?

Use the dot to

- send results to a function argument other than first one or 
- use the previous result for multiple arguments

```r
hotels %>%
  filter(hotel == "Resort Hotel") %>%
* ggplot(., aes(x = lead_time)) +
  geom_histogram(bins = 10)
```

---

# Working with a single data frame

---

### You have a single data frame, and you want to process it and prepare it for anlaysis!

---

## `select` to keep variables

```r
hotels %>%
* select(hotel, lead_time)
```

---

## `select` to exclude variables

```r
hotels %>%
* select(-agent)
```

```
## # A tibble: 119,390 x 31
##    hotel        is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>              <dbl>     <dbl>             <dbl> <chr>             
##  1 Resort Hotel           0       342              2015 July              
##  2 Resort Hotel           0       737              2015 July              
##  3 Resort Hotel           0         7              2015 July              
##  4 Resort Hotel           0        13              2015 July              
##  5 Resort Hotel           0        14              2015 July              
##  6 Resort Hotel           0        14              2015 July              
##  7 Resort Hotel           0         0              2015 July              
##  8 Resort Hotel           0         9              2015 July              
##  9 Resort Hotel           1        85              2015 July              
## 10 Resort Hotel           1        75              2015 July              
## # … with 119,380 more rows, and 26 more variables:
## #   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## #   deposit_type <chr>, company <chr>, days_in_waiting_list <dbl>,
## #   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## #   total_of_special_requests <dbl>, reservation_status <chr>,
## #   reservation_status_date <date>
```
]

---

## `select` a range of variables

```r
hotels %>%
* select(hotel:arrival_date_month)
```

```
## # A tibble: 119,390 x 5
##    hotel        is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>              <dbl>     <dbl>             <dbl> <chr>             
##  1 Resort Hotel           0       342              2015 July              
##  2 Resort Hotel           0       737              2015 July              
##  3 Resort Hotel           0         7              2015 July              
##  4 Resort Hotel           0        13              2015 July              
##  5 Resort Hotel           0        14              2015 July              
##  6 Resort Hotel           0        14              2015 July              
##  7 Resort Hotel           0         0              2015 July              
##  8 Resort Hotel           0         9              2015 July              
##  9 Resort Hotel           1        85              2015 July              
## 10 Resort Hotel           1        75              2015 July              
## # … with 119,380 more rows
```

---

## `arrange` in ascending / descending order

```r
hotels %>%
  select(adults, children, babies) %>%
* arrange(babies)
```

```
## # A tibble: 119,390 x 3
##    adults children babies
##     <dbl>    <dbl>  <dbl>
##  1      2        0      0
##  2      2        0      0
##  3      1        0      0
##  4      1        0      0
##  5      2        0      0
##  6      2        0      0
##  7      2        0      0
##  8      2        0      0
##  9      2        0      0
## 10      2        0      0
## # … with 119,380 more rows
```
]
.pull-right[

```r
hotels %>%
  select(adults, children, babies) %>%
* arrange(desc(babies))
```

```
## # A tibble: 119,390 x 3
##    adults children babies
##     <dbl>    <dbl>  <dbl>
##  1      2        0     10
##  2      1        0      9
##  3      2        0      2
##  4      2        0      2
##  5      2        0      2
##  6      2        0      2
##  7      2        0      2
##  8      2        0      2
##  9      2        0      2
## 10      2        0      2
## # … with 119,380 more rows
```
]

---

## `slice` for certain row numbers

```r
# first five
hotels %>%
* slice(1:5)
```

```
## # A tibble: 5 x 32
##   hotel is_canceled lead_time arrival_date_ye… arrival_date_mo… arrival_date_we…
##   <chr>       <dbl>     <dbl>            <dbl> <chr>                       <dbl>
## 1 Reso…           0       342             2015 July                           27
## 2 Reso…           0       737             2015 July                           27
## 3 Reso…           0         7             2015 July                           27
## 4 Reso…           0        13             2015 July                           27
## 5 Reso…           0        14             2015 July                           27
## # … with 26 more variables: arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## #   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## #   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## #   total_of_special_requests <dbl>, reservation_status <chr>,
## #   reservation_status_date <date>
```
]

---

.tip[
In R, you can use the `#` (hashtag or pound sign, depending on your age for adding comments to your code. 
Any text following `#` will be printed as is, and won't be run as R code.
This is useful for leaving comments in your code and for temporarily disabling 
certain lines of code while debugging.
]

```r
hotels %>%
  # slice the first five rows  # this line is a comment
  #select(hotel) %>%           # this one doesn't run
  slice(1:5)                   # this line runs
```

---

## `slice` for certain row numbers

```r
# last five
last_row <- nrow(hotels) # nrow() gives the number of rows in a data frame
hotels %>%
* slice((last_row - 4):last_row)
```

```
## # A tibble: 5 x 32
##   hotel is_canceled lead_time arrival_date_ye… arrival_date_mo… arrival_date_we…
##   <chr>       <dbl>     <dbl>            <dbl> <chr>                       <dbl>
## 1 City…           0        23             2017 August                         35
## 2 City…           0       102             2017 August                         35
## 3 City…           0        34             2017 August                         35
## 4 City…           0       109             2017 August                         35
## 5 City…           0       205             2017 August                         35
## # … with 26 more variables: arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## #   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## #   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## #   total_of_special_requests <dbl>, reservation_status <chr>,
## #   reservation_status_date <date>
```
]

---

## `filter` to select a subset of rows

```r
# bookings in City Hotels
hotels %>%
* filter(hotel == "City Hotel")
```

```
## # A tibble: 79,330 x 32
##    hotel      is_canceled lead_time arrival_date_year arrival_date_month
##    <chr>            <dbl>     <dbl>             <dbl> <chr>             
##  1 City Hotel           0         6              2015 July              
##  2 City Hotel           1        88              2015 July              
##  3 City Hotel           1        65              2015 July              
##  4 City Hotel           1        92              2015 July              
##  5 City Hotel           1       100              2015 July              
##  6 City Hotel           1        79              2015 July              
##  7 City Hotel           0         3              2015 July              
##  8 City Hotel           1        63              2015 July              
##  9 City Hotel           1        62              2015 July              
## 10 City Hotel           1        62              2015 July              
## # … with 79,320 more rows, and 27 more variables:
## #   arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## #   stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## #   children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## #   market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## #   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## #   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## #   total_of_special_requests <dbl>, reservation_status <chr>,
## #   reservation_status_date <date>
```
]

---

## `filter` for many conditions at once

```r
hotels %>%
  filter( 
*   adults == 0,
*   children >= 1
    ) %>% 
  select(adults, babies, children)
```

```
## # A tibble: 223 x 3
##    adults babies children
##     <dbl>  <dbl>    <dbl>
##  1      0      0        3
##  2      0      0        2
##  3      0      0        2
##  4      0      0        2
##  5      0      0        2
##  6      0      0        3
##  7      0      1        2
##  8      0      0        2
##  9      0      0        2
## 10      0      0        2
## # … with 213 more rows
```

---

## `filter` for more complex conditions

```r
# bookings with no adults and some children or babies in the room
hotels %>%
  filter( 
    adults == 0,     
*   children >= 1 | babies >= 1     # | means or
    ) %>%
  select(adults, babies, children)
```

---

## Logical operators in R

<br>

operator    | definition                   || operator     | definition
------------|------------------------------||--------------|----------------
`<`         | less than                   ||`x`&nbsp;&#124;&nbsp;`y`     | `x` OR `y` 
`<=`        |	less than or equal to        ||`is.na(x)`    | test if `x` is `NA`
`>`         | greater than                 ||`!is.na(x)`   | test if `x` is not `NA`
`>=`        |	greater than or equal to     ||`x %in% y`    | test if `x` is in `y`
`==`        |	exactly equal to             ||`!(x %in% y)` | test if `x` is not in `y`
`!=`        |	not equal to                 ||`!x`          | not `x`
`x & y`     | `x` AND `y`                  ||              |

---

## Demo

---

## `distinct` to filter for unique rows

... and `arrange` to order alphabetically. Note which variables are kept in the resulting data frame!

```r
hotels %>% 
* distinct(market_segment) %>%
  arrange(market_segment)
```

```
## # A tibble: 8 x 1
##   market_segment
##   <chr>         
## 1 Aviation      
## 2 Complementary 
## 3 Corporate     
## 4 Direct        
## 5 Groups        
## 6 Offline TA/TO 
## 7 Online TA     
## 8 Undefined
```
]

```r
hotels %>% 
* distinct(hotel, market_segment) %>%
  arrange(hotel, market_segment)
```

```
## # A tibble: 14 x 2
##    hotel        market_segment
##    <chr>        <chr>         
##  1 City Hotel   Aviation      
##  2 City Hotel   Complementary 
##  3 City Hotel   Corporate     
##  4 City Hotel   Direct        
##  5 City Hotel   Groups        
##  6 City Hotel   Offline TA/TO 
##  7 City Hotel   Online TA     
##  8 City Hotel   Undefined     
##  9 Resort Hotel Complementary 
## 10 Resort Hotel Corporate     
## 11 Resort Hotel Direct        
## 12 Resort Hotel Groups        
## 13 Resort Hotel Offline TA/TO 
## 14 Resort Hotel Online TA
```
]
]

---

## `count` to create frequency tables

```r
# alphabetical order by default
hotels %>%
* count(market_segment)
```

```
## # A tibble: 8 x 2
##   market_segment     n
##   <chr>          <int>
## 1 Aviation         237
## 2 Complementary    743
## 3 Corporate       5295
## 4 Direct         12606
## 5 Groups         19811
## 6 Offline TA/TO  24219
## 7 Online TA      56477
## 8 Undefined          2
```
]
--
.pull-right[

```r
# descending frequency order
hotels %>%
  count(market_segment, 
*       sort = TRUE)
```

```
## # A tibble: 8 x 2
##   market_segment     n
##   <chr>          <int>
## 1 Online TA      56477
## 2 Offline TA/TO  24219
## 3 Groups         19811
## 4 Direct         12606
## 5 Corporate       5295
## 6 Complementary    743
## 7 Aviation         237
## 8 Undefined          2
```
]

---

## `count` and `arrange`

```r
# ascending frequency order
hotels %>%
  count(market_segment) %>%
* arrange(n)
```

```
## # A tibble: 8 x 2
##   market_segment     n
##   <chr>          <int>
## 1 Undefined          2
## 2 Aviation         237
## 3 Complementary    743
## 4 Corporate       5295
## 5 Direct         12606
## 6 Groups         19811
## 7 Offline TA/TO  24219
## 8 Online TA      56477
```
]
.pull-right[

```r
# descending frequency order
# just like adding sort = TRUE
hotels %>%
  count(market_segment) %>%
* arrange(desc(n))
```

---

## `count` for multiple variables

```r
hotels %>%
* count(hotel, market_segment)
```

```
## # A tibble: 14 x 3
##    hotel        market_segment     n
##    <chr>        <chr>          <int>
##  1 City Hotel   Aviation         237
##  2 City Hotel   Complementary    542
##  3 City Hotel   Corporate       2986
##  4 City Hotel   Direct          6093
##  5 City Hotel   Groups         13975
##  6 City Hotel   Offline TA/TO  16747
##  7 City Hotel   Online TA      38748
##  8 City Hotel   Undefined          2
##  9 Resort Hotel Complementary    201
## 10 Resort Hotel Corporate       2309
## 11 Resort Hotel Direct          6513
## 12 Resort Hotel Groups          5836
## 13 Resort Hotel Offline TA/TO   7472
## 14 Resort Hotel Online TA      17729
```

---

## order matters when you `count`

```r
# hotel type first
hotels %>%
* count(hotel, market_segment)
```

```r
# market segment first
hotels %>%
* count(market_segment, hotel)
```

```
## # A tibble: 14 x 3
##    market_segment hotel            n
##    <chr>          <chr>        <int>
##  1 Aviation       City Hotel     237
##  2 Complementary  City Hotel     542
##  3 Complementary  Resort Hotel   201
##  4 Corporate      City Hotel    2986
##  5 Corporate      Resort Hotel  2309
##  6 Direct         City Hotel    6093
##  7 Direct         Resort Hotel  6513
##  8 Groups         City Hotel   13975
##  9 Groups         Resort Hotel  5836
## 10 Offline TA/TO  City Hotel   16747
## 11 Offline TA/TO  Resort Hotel  7472
## 12 Online TA      City Hotel   38748
## 13 Online TA      Resort Hotel 17729
## 14 Undefined      City Hotel       2
```
]
]

---

## Demo

---

## `mutate` to add a new variable

```r
hotels %>%
* mutate(little_ones = children + babies) %>%
  select(children, babies, little_ones) %>%
  arrange(desc(little_ones))
```

```
## # A tibble: 119,390 x 3
##    children babies little_ones
##       <dbl>  <dbl>       <dbl>
##  1       10      0          10
##  2        0     10          10
##  3        0      9           9
##  4        2      1           3
##  5        2      1           3
##  6        2      1           3
##  7        3      0           3
##  8        2      1           3
##  9        2      1           3
## 10        3      0           3
## # … with 119,380 more rows
```

---

## Little ones in resort and city hotels

```r
# Resort Hotel
hotels %>%
  mutate(little_ones = children + babies) %>%
  filter(
    little_ones >= 1,
    hotel == "Resort Hotel"
    ) %>%
  select(hotel, little_ones)
```

```
## # A tibble: 3,929 x 2
##    hotel        little_ones
##    <chr>              <dbl>
##  1 Resort Hotel           1
##  2 Resort Hotel           2
##  3 Resort Hotel           2
##  4 Resort Hotel           2
##  5 Resort Hotel           1
##  6 Resort Hotel           1
##  7 Resort Hotel           2
##  8 Resort Hotel           2
##  9 Resort Hotel           1
## 10 Resort Hotel           1
## # … with 3,919 more rows
```
]
.pull-right[

```r
# City Hotel
hotels %>%
  mutate(little_ones = children + babies) %>%
  filter(
    little_ones > 1,
    hotel == "City Hotel"
    )  %>%
  select(hotel, little_ones)
```

```
## # A tibble: 2,140 x 2
##    hotel      little_ones
##    <chr>            <dbl>
##  1 City Hotel           2
##  2 City Hotel           2
##  3 City Hotel           2
##  4 City Hotel           2
##  5 City Hotel           2
##  6 City Hotel           2
##  7 City Hotel           2
##  8 City Hotel           2
##  9 City Hotel           2
## 10 City Hotel           3
## # … with 2,130 more rows
```
]
]

---

```r
hotels %>%
  mutate(little_ones = children + babies) %>%
  count(hotel, little_ones) %>%
  mutate(prop = n / sum(n))
```

```
## # A tibble: 12 x 4
##    hotel        little_ones     n       prop
##    <chr>              <dbl> <int>      <dbl>
##  1 City Hotel             0 73923 0.619     
##  2 City Hotel             1  3263 0.0273    
##  3 City Hotel             2  2056 0.0172    
##  4 City Hotel             3    82 0.000687  
##  5 City Hotel             9     1 0.00000838
##  6 City Hotel            10     1 0.00000838
##  7 City Hotel            NA     4 0.0000335 
##  8 Resort Hotel           0 36131 0.303     
##  9 Resort Hotel           1  2183 0.0183    
## 10 Resort Hotel           2  1716 0.0144    
## 11 Resort Hotel           3    29 0.000243  
## 12 Resort Hotel          10     1 0.00000838
```
]

---

## `summarise` for summary stats

```r
# mean average daily rate for all bookings
hotels %>%
* summarise(mean_adr = mean(adr))
```

```
## # A tibble: 1 x 1
##   mean_adr
##      <dbl>
## 1     102.
```

.tip[
`summarise()` changes the data frame entirely, it collapses rows down to a single 
summary statistics, and removes all columns that are irrelevant to the calculation.
]

---

.tip[
`summarise()` also lets you get away with being sloppy and not naming your new 
column, but that's not recommended!
]

```r
hotels %>%
  summarise(mean(adr))
```

```
## # A tibble: 1 x 1
##   `mean(adr)`
##         <dbl>
## 1        102.
```
]

```r
hotels %>%
  summarise(mean_adr = mean(adr))
```

```
## # A tibble: 1 x 1
##   mean_adr
##      <dbl>
## 1     102.
```
]
]
---

## `group_by` for grouped operations

```r
# mean average daily rate for all booking at city and resort hotels
hotels %>%
* group_by(hotel) %>%
  summarise(mean_adr =  mean(adr))
```

```
## # A tibble: 2 x 2
##   hotel        mean_adr
##   <chr>           <dbl>
## 1 City Hotel      105. 
## 2 Resort Hotel     95.0
```

---

## Calculating frequencies

The following two give the same result, so `count` is simply short for `group_by` then determine frequencies

```r
hotels %>%
  group_by(hotel) %>%
  summarise(n = n())
```

```
## # A tibble: 2 x 2
##   hotel            n
##   <chr>        <int>
## 1 City Hotel   79330
## 2 Resort Hotel 40060
```
]
.pull-right[

```r
hotels %>%
  count(hotel)
```

```
## # A tibble: 2 x 2
##   hotel            n
##   <chr>        <int>
## 1 City Hotel   79330
## 2 Resort Hotel 40060
```
]

---

## Multiple summary statistics

`summarise` can be used for multiple summary statistics as well

```r
hotels %>%
  summarise(
    min_adr = min(adr),
    mean_adr = mean(adr),
    median_adr = median(adr),
    max_adr = max(adr)
    )
```

```
## # A tibble: 1 x 4
##   min_adr mean_adr median_adr max_adr
##     <dbl>    <dbl>      <dbl>   <dbl>
## 1   -6.38     102.       94.6    5400
```

---

## Demo