class: center, middle, inverse, title-slide

# Simulation-based testing
## Part 2
### Becky Tang
### 06.10.2021

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Terminology

- .vocab[Population]: a group of individuals or objects we are interested in studying

--

- .vocab[Parameter]: a numerical quantity derived from the population (almost always unknown)

--

- .vocab[Statistical inference]: the process of using sample data to make conclusions about the underlying population the sample came from

--

- .vocab[Testing]: evaluating whether our observed sample provides evidence for or against some claim about the population

---

## The hypothesis testing framework

--

1. Start with two hypotheses about the population: the null hypothesis and the alternative hypothesis.

--

2. Choose a (representative) sample, collect data, and analyze the data.

--

3. Figure out how likely it is to see data like what we observed, IF the null hypothesis were in fact true (this probability is called the **p-value**).

--

4. If our data would have been extremely unlikely if the null hypothesis were true, then we reject it in favor of the alternative hypothesis. Otherwise, we cannot reject the null hypothesis. We define "unlikely" via the `\(\alpha\)` level.

---

## What can go wrong?

Suppose we test a certain null hypothesis, which can be either true or false (we never know for sure!). We make one of two decisions given our data: either reject or fail to reject `\(H_0\)`.

--

We have the following four scenarios:

| Decision                 | `\(H_0\)` is true | `\(H_0\)` is false |
|--------------------------|-------------------|--------------------|
| Fail to reject `\(H_0\)` | Correct decision  | **Type II Error**  |
| Reject `\(H_0\)`         | **Type I Error**  | Correct decision   |

--

It is important to weigh the consequences of making each type of error.

---

## What can go wrong?

| Decision                 | `\(H_0\)` is true | `\(H_0\)` is false |
|--------------------------|-------------------|--------------------|
| Fail to reject `\(H_0\)` | Correct decision  | **Type II Error**  |
| Reject `\(H_0\)`         | **Type I Error**  | Correct decision   |

--

- `\(\alpha\)` is the probability of making a Type I error.
- `\(\beta\)` is the probability of making a Type II error.
- The .vocab[power] of a test is `\(1 - \beta\)`: the probability that, if the null hypothesis is actually false, we correctly reject it.

--

Though we'd like to know whether we're making a correct decision or a Type I or Type II error, hypothesis testing does **NOT** give us the tools to determine this.

---

## Equivalency of confidence and significance levels

- In the previous lecture, our hypotheses were `\(H_0: p = 0.10\)` and `\(H_a: p < 0.10\)`.
- This form of `\(H_a\)` is a .vocab[one sided] hypothesis.
- If instead `\(H_a: p \neq 0.10\)`, then we have a .vocab[two sided] hypothesis.

---

## Hypothesis and p-value

Recall the organ donor example, where we observed `\(\hat{p} = \frac{3}{62} \approx 0.048\)`. When calculating the p-value, how does the meaning of "as or more extreme" differ for a one sided vs. a two sided alternative?

- `\(H_a: p < 0.10\)`: find the proportion of simulations that resulted in a sample proportion `\(\leq 0.048\)`.
- `\(H_a: p \neq 0.10\)`: find the proportion of simulations that resulted in a sample proportion `\(\leq 0.048\)` OR `\(\geq 0.10 + (0.10 - 0.048) = 0.152\)`.
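---

## Computing the two tails: a sketch

To make "as or more extreme" concrete, here is a minimal sketch in R. The object `null_props` is a hypothetical stand-in for the simulated null distribution from last time; here it is generated directly with `rbinom()` purely for illustration, not with the `infer` pipeline we used in class.

```r
library(dplyr)

# a sketch: simulate 5,000 sample proportions under H_0: p = 0.10 for
# samples of size 62, mirroring the organ donor example
set.seed(123)
null_props <- tibble(stat = rbinom(5000, size = 62, prob = 0.10) / 62)

p_hat <- 3 / 62  # observed sample proportion, approximately 0.048

# one sided H_a: p < 0.10 -- only the lower tail counts
null_props %>%
  summarise(p_value = mean(stat <= p_hat))

# two sided H_a: p != 0.10 -- both tails, reflected around the null value 0.10
null_props %>%
  summarise(p_value = mean(stat <= p_hat | stat >= 0.10 + (0.10 - p_hat)))
```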
---

## Equivalency of confidence and significance levels

- One sided alternative hypothesis test with significance level `\(\alpha\)` `\(\rightarrow\)` `\(CL = 1 - (2 \times \alpha)\)`
- Two sided alternative hypothesis test with significance level `\(\alpha\)` `\(\rightarrow\)` `\(CL = 1 - \alpha\)`

<img src="12-sim-inference-pt2_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

## Back to Asheville!

<img src="img/10/asheville.jpg" width="40%" style="display: block; margin: auto;" />

Your friend claims that the mean price per guest per night for Airbnbs in Asheville is $100. **What do you make of this statement?**

Let's use hypothesis testing to assess this claim!

---

## 1. Defining the hypotheses

Remember, the null and alternative hypotheses are defined for **parameters**, not statistics.

.question[
What will our null and alternative hypotheses be for this example?
]

--

- `\(H_0\)`: the true mean price per guest is $100 per night
- `\(H_a\)`: the true mean price per guest is NOT $100 per night

--

Expressed in symbols:

- `\(H_0: \mu = 100\)`
- `\(H_a: \mu \neq 100\)`

---

## 2. Collecting and summarizing data

With these two hypotheses, we now take our sample and summarize the data.

--

The choice of summary statistic depends on the type of data. In our example, we use the sample mean, `\(\bar{x} = 76.6\)`:

--

```r
asheville <- read_csv("data/asheville.csv")
mean_ppg <- asheville %>%
  summarise(mean_ppg = mean(ppg)) %>%
  pull()
mean_ppg
```

```
## [1] 76.58667
```

---

## `pull()`

- `pull()` extracts a single column from a data frame as a vector
- Why do we use it? Remember that in the tidyverse, a data frame is always returned, but sometimes we just want a number.

.pull-left[
```r
asheville %>%
  summarise(mean_ppg = mean(ppg))
```

```
## # A tibble: 1 x 1
##   mean_ppg
##      <dbl>
## 1     76.6
```
]

.pull-right[
```r
asheville %>%
  summarise(mean_ppg = mean(ppg)) %>%
  pull()
```

```
## [1] 76.58667
```
]

---

## 3. Assessing the evidence

Next, we calculate the probability of getting data like ours, *<u>or more extreme</u>*, if `\(H_0\)` were in fact true.

This is a conditional probability:

> Given that `\(H_0\)` is true (i.e., if `\(\mu\)` were *actually* 100), what would
> be the probability of observing `\(\bar{x} = 76.6\)` or more extreme?

.question[
This probability is known as the **p-value**.
]

---

## Simulating the null distribution

We know that our sample mean was 76.6, but we also know that if we were to take another random sample of size 50 from all Airbnb listings, we might get a different sample mean.

--

There is some variability in the .vocab[sampling distribution] of the mean, and we want to make sure we quantify this.

--

.question[
How might we quantify the sampling distribution of the mean using only the data that we have from our original sample?
]
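---

## Aside: what one bootstrap resample looks like

One answer is the bootstrap. Before using `infer` on the next slide, here is a minimal sketch of the resampling idea by hand, under the same setup: draw 50 rows from our sample *with replacement* and compute the mean, then repeat.

```r
# a sketch of one bootstrap resample: draw n rows from our sample of
# 50 listings, with replacement, then average the resampled prices
set.seed(12345)
asheville %>%
  slice_sample(n = nrow(asheville), replace = TRUE) %>%
  summarise(boot_mean = mean(ppg))

# repeating this many times traces out the bootstrap distribution of the mean
boot_means_by_hand <- replicate(
  5000,
  mean(sample(asheville$ppg, replace = TRUE))
)
```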
---

## Bootstrap distribution of the mean

```r
set.seed(12345)
library(infer)
boot_means <- asheville %>%
*  specify(response = ppg) %>%
*  generate(reps = 5000, type = "bootstrap") %>%
*  calculate(stat = "mean")
```

```r
boot_means %>%
  slice(1:6)
```

```
## # A tibble: 6 x 2
##   replicate  stat
##       <int> <dbl>
## 1         1  81.0
## 2         2  63.2
## 3         3  81.2
## 4         4  76.1
## 5         5  81.3
## 6         6  84.6
```

---

## Bootstrap distribution of the mean

```r
ggplot(data = boot_means, aes(x = stat)) +
  geom_histogram(binwidth = 2, color = "darkblue", fill = "skyblue") +
  labs(x = "Price per night", y = "Count") +
  geom_vline(xintercept = mean(boot_means$stat), lwd = 2, color = "red")
```

---

## Bootstrap distribution of the mean

<img src="12-sim-inference-pt2_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

## Shifting the distribution

We've captured the variability in the sample mean among samples of size 50 from Asheville-area Airbnbs, but remember that in the hypothesis testing paradigm, we must assess our observed evidence under the assumption that the null hypothesis is true.

.pull-left[
```r
boot_means %>%
  summarize(mean(stat))
```

```
## # A tibble: 1 x 1
##   `mean(stat)`
##          <dbl>
## 1         76.6
```
]

.pull-right[
Remember,

`\(H_0: \mu = 100\)`

`\(H_a: \mu \neq 100\)`
]

---

class: middle, center

.question[
Where is our bootstrap distribution of means currently centered? Where should it be centered if `\(H_0\)` were actually true?
]

<img src="12-sim-inference-pt2_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

## Shifting the distribution

.question[
Where is our bootstrap distribution of means currently centered?
]

```r
ash_boot_mean <- boot_means %>%
  summarize(mean = mean(stat)) %>%
  pull()
ash_boot_mean
```

```
## [1] 76.59556
```

---

## Shifting the distribution

If we shift the bootstrap distribution by `offset`, it will be centered at `\(\mu_0\)`, the null-hypothesized value of the mean.

```r
*offset <- 100 - ash_boot_mean
offset
```

```
## [1] 23.40444
```

```r
boot_means <- boot_means %>%
*  mutate(null_dist_stat = stat + offset)
boot_means %>%
  slice(1:6)
```

```
## # A tibble: 6 x 3
##   replicate  stat null_dist_stat
##       <int> <dbl>          <dbl>
## 1         1  81.0           104.
## 2         2  63.2           86.6
## 3         3  81.2           105.
## 4         4  76.1           99.5
## 5         5  81.3           105.
## 6         6  84.6           108.
```
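---

## Shifting the distribution: a quick check

As a quick sanity check (a sketch, not part of the original analysis), the shifted statistics should now average to the null value `\(\mu_0 = 100\)`:

```r
# the shifted bootstrap means should average to exactly 100, since we
# added offset = 100 - mean(stat) to every bootstrap mean
boot_means %>%
  summarize(mean_shifted = mean(null_dist_stat))

# equivalently, compare against mu_0 = 100 up to floating point error
all.equal(mean(boot_means$null_dist_stat), 100)
```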
---

## Shifting the distribution

```r
*ggplot(data = boot_means, aes(x = null_dist_stat)) +
  geom_histogram(binwidth = 2, color = "darkblue", fill = "skyblue") +
  labs(x = "Price per night", y = "Count") +
  geom_vline(xintercept = mean(boot_means$null_dist_stat), lwd = 2, color = "red")
```

---

## Distribution of `\(\bar{x}\)` under `\(H_0\)`

<img src="12-sim-inference-pt2_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

## Simulating the null distribution with infer

Rather than `mutate()`-ing to shift the bootstrap distribution of `\(\bar{x}\)` to be centered at `\(\mu_0\)`, we can simulate the null distribution automatically:

.pull-left[
What we had before:

```r
boot_means <- asheville %>%
  specify(response = ppg) %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "mean")
# then shift by offset
```
]

.pull-right[
More streamlined version:

```r
null_dist <- asheville %>%
  specify(response = ppg) %>%
*  hypothesize(null = "point", mu = 100) %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "mean")
```
]

---

## Simulating the null distribution with infer

```r
null_dist <- asheville %>%
  specify(response = ppg) %>%
*  hypothesize(null = "point", mu = 100) %>%
  generate(reps = 5000, type = "bootstrap") %>%
  calculate(stat = "mean")
```

.pull-left[
```r
null_dist
```

```
## # A tibble: 5,000 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1 104.
##  2         2 112.
##  3         3  92.7
##  4         4 102.
##  5         5  93.8
##  6         6 123.
##  7         7 104.
##  8         8 109.
##  9         9 106.
## 10        10 102.
## # … with 4,990 more rows
```
]

.pull-right[
```r
null_dist %>%
  summarise(mean = mean(stat))
```

```
## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1  100.
```
]

---

## 3. Assessing the evidence

<img src="12-sim-inference-pt2_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---

## 3. Assessing the evidence

Remember, `\(H_0: \mu = 100\)` and `\(H_a: \mu \neq 100\)`.

```r
null_dist %>%
  # everything as or more extreme than what we observed, in either tail
  filter(stat <= mean_ppg | stat >= (100 + (100 - mean_ppg))) %>%
  # number of such simulated values (n()), divided by the total number of simulations
  summarise(p_value = n() / nrow(null_dist)) %>%
  pull(p_value)
```

```
## [1] 8e-04
```

---

## 4. Making a conclusion

.question[
What might we conclude at the `\(\alpha = 0.05\)` level?
]

The p-value, 0.0008, is less than 0.05, so we .vocab[reject] `\(H_0\)`.

The data provide sufficient evidence that the true mean price per guest per night for Airbnbs in Asheville is not equal to $100.

---

## Discussion questions

- `\(H_a\)` here was a .vocab[two-sided] hypothesis `\((H_a: \mu \neq 100)\)`. How does this compare to the .vocab[one-sided] hypothesis from last time `\((H_a: p < 0.10)\)`?

--

- How might the p-value change depending on what type of alternative hypothesis is specified? (See the sketch on the next slide.)

--

- Why did we need to "shift" the bootstrap distribution when we generated the null distribution in this example, but we didn't need to shift the distribution last time when we generated the null distribution for inference on the population proportion?
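---

## Exploring the p-value in code

A sketch for the second discussion question: `infer`'s `get_p_value()` computes the p-value directly, and its `direction` argument controls which tail(s) count. With the one-sided alternative `\(H_a: \mu < 100\)`, only the lower tail contributes.

```r
# two-sided p-value: both tails, matching the filter() calculation above
null_dist %>%
  get_p_value(obs_stat = mean_ppg, direction = "two-sided")

# one-sided p-value (H_a: mu < 100): lower tail only, so roughly half
# the two-sided value here
null_dist %>%
  get_p_value(obs_stat = mean_ppg, direction = "less")
```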