Two-sample inference

# Two-sample inference
### Becky Tang
### 06.21.2021

---

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Recap

So far, we've talked about performing interval estimation and hypothesis 
testing for means using

- simulation-based methods, such as bootstrap or direct simulation, and
- the Central Limit Theorem

In all cases so far, we've only compared one sample against a hypothesized 
value.

---

# Difference in means

---

## Two-sample inference for means

Suppose we have two (representative)
samples, and wanted to either

- estimate the .vocab[difference in means] in the two
populations 
  - confidence interval for `$\mu_1 - \mu_2$`
  
- Test the hypotheses

`\begin{align*}
H_0: \mu_1 = \mu_2 \\
H_a: \mu_1 \neq \mu_2,
\end{align*}`

where `$\mu_1$` and `$\mu_2$` are the population means in groups 1 and 2.

---

.question[
How might you calculate a confidence interval and address the above hypothesis test using simulation-based methods? How about the CLT?
]

---

## Data

.footnote[Adapted from Erdogdu Sakar, B., et al. *Collection and Analysis of a Parkinson*
*Speech Dataset with Multiple Types of Sound Recordings*, IEEE Journal of 
Biomedical and Health Informatics, vol. 17(4), pp. 828-834, 2013
(image from [Wikipedia](https://en.wikipedia.org/wiki/Spectrogram))]

---

## Some voice analysis terminology

- .vocab[Jitter]: frequency variation from cycle to cycle
- .vocab[Shimmer]: amplitude variation of the sound wave

Jitter and shimmer are affected by lack of control of vocal cord vibration, and
pathological differences from average values may be indicative of Parkinson's
Disease (PD).

(from Teixeira, Oliveira, and Lopes, 2013)

---

## Question of interest

Is there a difference in average voice jitter between patients with Parkinson's disease (PD) and those who don't have Parkinson's disease (control group)?

The `parkinsons.csv` we looked at in lab contains repeated voice recordings from a number of patients,
some with PD and some serving as non-PD controls (Erdogdu B et al.). For now,
**assume that all samples were taken independently from each other** (this is
not actually the case, but we'll make this assumption).

Jitter is given in milliseconds (ms), and shimmer is given in decibels (dB).

---

## Bootstrap estimation

Let's construct the bootstrap distribution for the **difference in means**.

```r
set.seed(2020)
parkinsons <- read_csv("data/parkinsons.csv")
library(infer)
boot_diffs <- parkinsons %>% 
* specify(jitter ~ status) %>%
  generate(reps = 1000, type = "bootstrap") %>% 
* calculate(stat = "diff in means",
*           order = c("Healthy", "PD"))
```

---

## Bootstrap estimation

Let's construct the bootstrap distribution for the **difference in means**.

---

## CI for difference in means

Let's construct the bootstrap distribution for the **difference in means**.

```r
boot_diffs %>% 
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
```

```
## # A tibble: 1 x 2
##      lower    upper
##      <dbl>    <dbl>
## 1 -0.00412 -0.00222
```

---

## CI for difference in means

```
## # A tibble: 1 x 2
##      lower    upper
##      <dbl>    <dbl>
## 1 -0.00412 -0.00222
```

.vocab[Interpretation: ]We are 95% confident that the mean voice jitter for people without Parkinson's disease is about 0.002 to 0.004 ms less than the mean voice jitter for those with Parkinson's disease.

.question[
Is there evidence that there is a difference in mean voice jitter between PD patients and healthy patients?
]

---

## Hypothesis testing

Let `$\mu_P$` be the mean voice jitter among PD patients, and `$\mu_H$` be the mean
voice jitter among healthy patients. Let's test

`\begin{align*}
H_0: \mu_P = \mu_H\\
H_a: \mu_P \neq \mu_H
\end{align*}`

If the two means are truly equal (i.e., if `$H_0$` is true), then the difference, `$\mu_H - \mu_P$`, should be **zero**.

---

## Hypothesis testing

Let's construct the simulated .vocab[null distribution] for the difference in means, `$\mu_H - \mu_P$`. If the two means are truly equal (i.e., if `$H_0$` is true), then this difference should be zero.

```r
null_dist <- parkinsons %>%
  specify(jitter ~ status) %>%
* hypothesize(null = "independence") %>%
* generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", 
*           order = c("Healthy", "PD"))
```

---

## Hypothesis testing
  
<img src="15-two-samp-inf_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## Hypothesis testing

```r
obs_diff <- parkinsons %>% 
  specify(jitter ~ status) %>% 
  calculate(stat = "diff in means", order = c("Healthy", "PD")) %>% 
  pull()
obs_diff
```

```
## [1] -0.00312321
```

```r
null_dist %>% 
  filter(abs(stat) >= abs(obs_diff)) %>%
  summarise(p_val = n() / nrow(null_dist))
```

```
## # A tibble: 1 x 1
##   p_val
##   <dbl>
## 1     0
```

---

## Conclusion

The p-value is very small, so we reject `$H_0$`. The data provide sufficient evidence that there is a difference in the mean voice jitter between patients who have Parkinson's disease and those who don't have the disease.

---

## Difference in means using CLT
--

CLT-based inference for a difference in means relies on the 
.vocab[two-sample t-test for independent samples]. Like the t-test we've seen
before, the **test statistic** takes on the following form:

`\begin{align*}
T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2) }{\widehat{SE}_{diff}}
\end{align*}`

The test statistic depends on whether we can assume that the two groups have the 
same underlying variability in their observations.

The exact form of the test statistic under the null hypothesis, including the
degrees of freedom, are a complicated fraction that no one calculates by hand. Let's let R handle this!

---

## CLT: Difference in means

```r
parkinsons %>%
  t_test(jitter ~ status, 
         mu = 0,
         order = c("Healthy", "PD"), 
         alternative = "two-sided", 
         conf_int = TRUE, conf_level = 0.95)
```

```
## # A tibble: 1 x 6
##   statistic  t_df      p_value alternative lower_ci upper_ci
##       <dbl> <dbl>        <dbl> <chr>          <dbl>    <dbl>
## 1     -5.96  187. 0.0000000124 two.sided   -0.00416 -0.00209
```

---

## CLT: Difference in means

.question[
Comprehensively evaluate the research question by specifying the hypotheses, 
the test statistic and its the distribution under the null, the p-value,
and decision at the `$\alpha = 0.05$` significance level. Interpret the 
conclusions from your hypothesis test in context of the original research 
question.
]

---

# Permutation tests for independence

---

## Is yawning contagious?

---

## Is yawning contagious?

An experiment conducted by the MythBusters tested if a person can be subconsciously influenced into yawning if another person near them yawns.

They recruited 50 people, spoke to each subject one-one-one, and intentially either yawned
or did not yawn during the session. Each subject sat in a small room for a fixed amount of time, and the Mythbusters secretly observed to see whether they yawned.

---

## Description

Randomly assigned people to two groups: 34 to a group where a person near them yawned (treatment) and 16 to a control group where they didn't see someone yawn (control).

```r
glimpse(mb_yawn)
```

```
## Rows: 50
## Columns: 2
## $ group   <fct> treatment, treatment, treatment, control, control, control, tr…
## $ outcome <fct> yawn, yawn, no yawn, no yawn, yawn, yawn, yawn, no yawn, no ya…
```

```r
mb_yawn %>%
  count(group, outcome)
```

```
##       group outcome  n
## 1   control no yawn 12
## 2   control    yawn  4
## 3 treatment no yawn 24
## 4 treatment    yawn 10
```

---

## Proportion of yawners

```r
mb_yawn %>%
  count(group, outcome) %>%
  group_by(group) %>%
  mutate(rel_prop = round(n / sum(n), 2))
```

```
## # A tibble: 4 x 4
## # Groups:   group [2]
##   group     outcome     n rel_prop
##   <fct>     <fct>   <int>    <dbl>
## 1 control   no yawn    12     0.75
## 2 control   yawn        4     0.25
## 3 treatment no yawn    24     0.71
## 4 treatment yawn       10     0.29
```

The Mythbusters claimed that the difference in proportion of yawners between the two groups (0.04) was significant, based on intuition. Let's see if hypothesis testing agrees...

---

## Independence?

Based on the proportions we calculated, do you think yawning is really contagious, i.e. are seeing someone yawn and yawning dependent?

```
## # A tibble: 4 x 4
## # Groups:   group [2]
##   group     outcome     n p_hat
##   <fct>     <fct>   <int> <dbl>
## 1 control   no yawn    12  0.75
## 2 control   yawn        4  0.25
## 3 treatment no yawn    24  0.71
## 4 treatment yawn       10  0.29
```

---

## Possible explanations

- The observed differences might suggest that yawning is contagious, i.e. seeing someone yawn and yawning are dependent

- But the differences are small enough that we might wonder if they might simple be **due to chance**

- If we were to repeat the experiment on another group of 50 people, we may see different results. So let's simulate using a **permutation test**

---

## Hypotheses

`$H_0$`: yawning (outcome) and seeimg someone yawn (treatment vs. control) are independent:

`$$p_{treat} = p_{control}$$`

`$H_{a}$`: yawning and seeing someone yawn are not independent (in fact, we will specify that the proportion of yawners is greater in the treatment group than in the control group:

`$$p_{treat} > p_{control}$$`

where `$p_{treat}$` is the true proportion of yawners among those who saw someone else yawn, and similarly for 
`$p_{control}$`.

Note, these are equivalent to `$H_0: p_{treat} - p_{control} = 0$` and `$H_a: p_{treat} - p_{control} > 0$`.

---

## Permuting the treatment assignments

We know from our observed data that 14 people yawned and 36 people didn't yawn.

We also know that there were 16 people in the control group (didn't see someone yawn) and 34 people in the treatment group (saw someone yawn), resulting in a test statistic of 0.0441.

```r
p_hat_diff <- mb_yawn %>%
  count(group, outcome) %>%
  group_by(group) %>%
  mutate(p_hat = n / sum(n), 2) %>%
  filter(outcome == "yawn") %>%
  pull(p_hat) %>%
  diff()
p_hat_diff
```

```
## [1] 0.04411765
```

---

##  Permuting the treatment assignments

While keeping the responses (yawn vs. no yawn) fixed at what we observed, we will *permute* or shuffle the treatment assignments of the observations and recalculate the difference in proportions.

That is, we will recalculate the difference in proportions as if some of the yawners and some of the non-yawners perhaps might have been in a different treatment group.

Why do we do this?

---

## The null hypothesis

Recall that under `$H_0$`, there is no association between seeing someone yawn (treatment vs control) and the act of yawning (outcome).

If there truly is no association between treatment and control, then shuffling whether someone was assigned to watch somebody yawn or not yawn shouldn't matter -- we would expect similar proportions of people who yawn in each group.

Because we are in the hypothesis testing framework, we generate our null distribution assuming `$H_0$` true. How can we obtain that null distribution?

---

## Repeated permutations

We will do this treatment-shuffling again and again, calculate the difference in proportion for each simulation, and use this as an approximation to the null distribution.

This distribution approximates all the possible differences in proportion we could have seen if 
`$H_0$` were in fact true.

This is similar in spirit to bootstrap estimation, but here we **sample without replacement**; we merely permute/shuffle the treatment labels of each of our outcomes.

---

## Using `infer`

```r
null_dist <- mb_yawn %>%
* specify(response = outcome, explanatory = group, success = "yawn") %>%
  hypothesize(null = "independence") %>%
  generate(1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("treatment", "control"))
```

Because we are interested in the whether or not someone yawned, `response = outcome`.

We are comparing the proportions between two levels of the variable `group`, so `explanatory = group`.

Since the response variable is categorical, specify which level should be success

---

## Using `infer`

```r
null_dist <- mb_yawn %>%
  specify(response = outcome, explanatory = group, success = "yawn") %>%  
* hypothesize(null = "independence") %>%
  generate(1000, type = "permute") %>% 
  calculate(stat = "diff in props", order = c("treatment", "control"))
```

Null hypothesis: the outcome and treatment are independent.

---

## Using `infer`

```r
null_dist <- mb_yawn %>%
  specify(response = outcome, explanatory = group, success = "yawn") %>%  
  hypothesize(null = "independence") %>% 
* generate(1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("treatment", "control"))
```

Generate simulated results via pemutation

---

## Using `infer`

```r
set.seed(100)
null_dist <- mb_yawn %>%
  specify(response = outcome, explanatory = group, success = "yawn") %>%  
  hypothesize(null = "independence") %>% 
  generate(1000, type = "permute") %>% 
* calculate(stat = "diff in props", order = c("treatment", "control"))
```

Calculate the sample statistic of interest (difference in proportions).

Because the explanatory variable `group` is categorical, we need to tell R the
order in which to take the different in proportions for the calculation: `$(\hat{p}_{treat} - \hat{p}_{control})$`.

---

## Visualizing the null distirbution

What would you expect the center of the null distribution to be?

```r
visualize(null_dist) 
```

---

## Calculating the p-value

```r
null_dist %>%
  filter(stat >= p_hat_diff) %>%
  summarise(p_val = n()/nrow(null_dist))
```

```
## # A tibble: 1 x 1
##   p_val
##   <dbl>
## 1  0.53
```

```r
null_dist %>%
  get_p_value(obs_stat = p_hat_diff, direction = "greater")
```

```
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1    0.53
```
--

What is the conclusion of our hypothesis test?