class: center, middle, inverse, title-slide

# ANOVA

### Becky Tang

### 07.12.2021

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Data

```r
rat_feed %>%
  head()
```

```
##   wt_gain  type
## 1      57 grain
## 2     118  beef
## 3      82 grain
## 4      86  beef
## 5      82  pork
## 6      97 grain
```

.question[How do we compare across three groups? Which groups are different?]

---

## Group means

```r
rat_feed %>%
  group_by(type) %>%
  summarise(xbar = mean(wt_gain))
```

```
## # A tibble: 3 x 2
##   type   xbar
##   <fct> <dbl>
## 1 beef   89.6
## 2 grain  74.9
## 3 pork   89.1
```

<img src="23-anova_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Hypotheses

We are interested in testing

`$$H_{0}: \mu_{b} = \mu_{g} = \mu_{p}$$`

against the alternative `\(H_{a}\)`: at least one mean is different, where `\(\mu_{*}\)` is the average weight gain under treatment `\(*\)`.

--

We could use `\(t\)`-tests (or a simulation-based analogue) on all possible pairs of groups. In this case, there would be three pairs: `\((\mu_{b}\)` vs `\(\mu_{g})\)`, `\((\mu_{b}\)` vs `\(\mu_{p})\)`, `\((\mu_{g}\)` vs `\(\mu_{p})\)`.

But what if we had 10 different groups? That would be `\(\binom{10}{2} = 45\)` tests!

---

## Multiple comparisons

- Aside from being tedious, as you've investigated in homework/lab, carrying out multiple tests can lead to an inflated Type I error rate.

- We would need some way to account for the fact that multiple comparisons are being performed.

---

## Multiple comparisons

- Suppose all means are actually equal, i.e., `\(H_0\)` is true, and we conduct all three pairwise tests

- Suppose the tests are independent, and each is conducted at the `\(\alpha = 0.05\)` significance level

- Making the "correct decision" means failing to reject in all three tests. This probability is `\((1-0.05)^3 = 0.95^3 = 0.857\)`.

- Thus, the probability of rejecting **at least** one of the three nulls = 1 - probability of failing to reject all three = `\(1-0.857 = 0.143 > 0.05 = \alpha\)`. This is called the .vocab[family-wise error rate]

---

## Multiple comparisons

- In reality, this is a little more complicated, but the issue of an inflated Type I error rate still exists

- ANOVA extends the `\(t\)`-test and is one way to control the overall Type I error rate at a fixed level `\(\alpha\)`

---

## ANOVA

.vocab[ANOVA] is short for .vocab[analysis of variance]. We use ANOVA to compare means across more than two groups.

ANOVA extends the two-sample `\(t\)`-test to `\(K \geq 2\)` groups.

`\begin{align}
& H_{0}: \mu_{1} = \mu_{2} = \ldots = \mu_{K}\\
& H_{a}: \text{ at least one group's mean is different}
\end{align}`

---

## ANOVA testing procedure

In ANOVA, we typically follow this testing procedure. First, we conduct an *overall* test of the null hypothesis that the means of all of the groups are equal.

- If we reject this null, then we usually proceed to see which means are different from each other. A .vocab[multiple comparisons correction] is sometimes used for these pairwise comparisons of means.

--

- If we fail to reject this null, we usually stop testing.

---

## ANOVA alternate hypothesis

In our example, our null is `\(H_{0}: \mu_{b} = \mu_{g} = \mu_{p}\)`. What could happen under the alternative?
- `\(\mu_{b} \neq \mu_{g} \neq \mu_{p}\)`

- `\(\mu_{b} = \mu_{g}\)`, but `\(\mu_{p}\)` different

- `\(\mu_{b} = \mu_{p}\)`, but `\(\mu_{g}\)` different

- `\(\mu_{g} = \mu_{p}\)`, but `\(\mu_{b}\)` different

--

If we reject the null hypothesis, any of these situations could be true, and we may wish to conduct further tests to discover which setting we are in. Conducting these further tests without first rejecting the overall test of `\(H_{0}\)` will lead to an inflated Type I error rate unless we use another method to adjust for multiple comparisons.

---

## Setup

- `\(H_{0}: \mu_{1} = \mu_{2} = \ldots = \mu_{K}\)`

- `\(H_{a}\)`: at least one of the means is different

- `\(n_{k}\)`: number of subjects/observations in the `\(k\)`-th group

- `\(y_{ik}\)`: response of subject `\(i\)` in group `\(k\)`, where `\(i = 1,\ldots, n_{k}\)` and `\(k = 1, \ldots, K\)`

---

## Setup

- The ANOVA model is

`$$y_{ik} = \mu_{k} + \epsilon_{ik}$$`

where we assume the `\(\epsilon_{ik}\)` are independent and normally distributed with mean 0 and variance `\(\sigma^2\)`

- If `\(y_{ik}\)` comes from group `\(k\)`, we expect it to be similar to the observations of other subjects in the same group `\(k\)`. However, there will be some variability within the group.

- This is equivalent to saying that `\(y_{ik} \sim N(\mu_{k}, \sigma)\)`

---

## Assumptions

1. Outcomes within groups are normally distributed

2. Homoscedastic variance (the within-group variance is the same for all groups)

3. Independent samples

---

## Rat feed data: assumptions

<img src="23-anova_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Rat feed data: assumptions

```r
rat_feed %>%
  group_by(type) %>%
  summarise(n = n(),
            var = var(wt_gain))
```

```
## # A tibble: 3 x 3
##   type      n   var
##   <fct> <int> <dbl>
## 1 beef     20  314.
## 2 grain    20  225.
## 3 pork     20  300.
```

In our example, `\(K = 3\)`, with `\(n_{1} = n_{2} = n_{3} = 20\)` subjects in each group.

---

## ANOVA

Obtain `\(\bar{y}\)`, the overall mean of all observations. Then for each group `\(k\)`, calculate the group mean and variance:

- beef: `\(\bar{y}_{1}\)` and `\(s_{1}^2\)`

- grain: `\(\bar{y}_{2}\)` and `\(s_{2}^2\)`

- pork: `\(\bar{y}_{3}\)` and `\(s_{3}^2\)`

--

.question[Why is ANOVA called analysis of variance?]

- .vocab[Within-groups variance]: variance of individual observations around their respective group means

- .vocab[Between-groups variance]: variance of the group means around the overall mean `\(\bar{y}\)` of all observations

---

## Within-groups variance

`$$s_{k}^2 = \dfrac{\sum_{i=1}^{n_{k}} (y_{ik} - \bar{y}_{k})^2}{n_{k}-1}$$`

`\(s_{k}^2\)` is a measure of the variance of the individual observations around their group mean. To get a pooled estimate of the common variance of individuals around their respective group means, we calculate

`$$s_{W}^2 = \dfrac{(n_{1} - 1) s_{1}^2 + (n_{2}-1)s_{2}^2 + \ldots + (n_{K} - 1)s_{K}^2}{N-K}$$`

where `\(N = n_{1} + n_{2} + \ldots + n_{K}\)`.

We can think of the within-groups variance as the inherent variability in the population.

---

## Between-groups variance

`$$s_{B}^2 = \dfrac{n_{1}(\bar{y}_{1} - \bar{y})^2 + n_{2}(\bar{y}_{2} - \bar{y})^2 + \ldots + n_{K}(\bar{y}_{K} - \bar{y})^2}{K-1}$$`

We can think of the between-groups variance as the sum of the inherent variability **and** any systematic variability due to the group effect. A sketch of computing both quantities follows on the next slide.
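---

## Computing `\(s_{W}^2\)` and `\(s_{B}^2\)` in R

As a sanity check on the formulas above, here is a minimal sketch that computes the within- and between-groups variances directly from their definitions. It assumes `rat_feed` is loaded as in the earlier slides; the variable names are illustrative.

```r
# Overall mean of all observations
ybar <- mean(rat_feed$wt_gain)

# Group sizes, group means, and group variances
group_stats <- rat_feed %>%
  group_by(type) %>%
  summarise(n = n(),
            ybar_k = mean(wt_gain),
            s2_k = var(wt_gain))

N <- sum(group_stats$n)  # total number of observations
K <- nrow(group_stats)   # number of groups

# Pooled within-groups variance
s2_W <- sum((group_stats$n - 1) * group_stats$s2_k) / (N - K)

# Between-groups variance
s2_B <- sum(group_stats$n * (group_stats$ybar_k - ybar)^2) / (K - 1)
```

Comparing `s2_B` to `s2_W` is exactly what the `\(F\)`-test on the following slides formalizes.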
---

## Analysis of variance to compare means

- If the sample means `\(\bar{y}_{k}\)` vary around the overall mean `\(\bar{y}\)` more than the individual observations `\(y_{ik}\)` vary around the sample means `\(\bar{y}_{k}\)`, then we have evidence that the corresponding population means are in fact different

- This is achieved by comparing the between-groups variance `\(s_{B}^2\)` to the within-groups variance `\(s_{W}^2\)`

- We want to see whether the between-groups variance is much larger than the within-groups variance `\((s_{B}^2 \gg s_{W}^2)\)`

---

## `\(F\)`-test

- Consider the `\(F\)` statistic given by

`$$F = \dfrac{s_{B}^2}{s_{W}^2}$$`

- If `\(H_{0}\)` is true, this `\(F\)` statistic has an `\(F\)` distribution with `\(K-1\)` and `\(N-K\)` degrees of freedom, associated with `\(s_{B}^2\)` and `\(s_{W}^2\)`, respectively.

- The `\(F\)`-distribution only takes on non-negative values.

<img src="23-anova_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

## `\(F\)`-test

- The `\(F\)`-test for ANOVA is inherently one-tailed. We reject `\(H_{0}\)` only if our `\(F\)` statistic is considerably larger than one `\((s_{B}^2 \gg s_{W}^2)\)`

- This does not mean we have a one-sided alternative; we just look at one tail of the `\(F\)` distribution to get the p-value

- If we only have `\(K=2\)` groups, then the `\(F\)`-test yields the same result as the two-sample `\(t\)`-test

---

## `aov()`

```r
mod_aov <- aov(wt_gain ~ type, data = rat_feed)
tidy(mod_aov)
```

```
## # A tibble: 2 x 6
##   term         df  sumsq meansq statistic p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
## 1 type          2  2787.  1393.      4.98  0.0101
## 2 Residuals    57 15932.   280.     NA    NA
```

Here, the grouping variable `type` is the "between", and `Residuals` is the "within". Why?

---

## `\(F\)`-test for rat feed data

```r
tidy(mod_aov)
```

```
## # A tibble: 2 x 6
##   term         df  sumsq meansq statistic p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
## 1 type          2  2787.  1393.      4.98  0.0101
## 2 Residuals    57 15932.   280.     NA    NA
```

Our `\(F\)` statistic is `\(s_{B}^2/s_{W}^2 = 1393/280 = 4.98\)` with `\(3-1 = 2\)` and `\(60-3 = 57\)` degrees of freedom.

<img src="23-anova_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

## `\(F\)`-test for rat feed data

```r
tidy(mod_aov)
```

```
## # A tibble: 2 x 6
##   term         df  sumsq meansq statistic p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
## 1 type          2  2787.  1393.      4.98  0.0101
## 2 Residuals    57 15932.   280.     NA    NA
```

This corresponds to a p-value of 0.01. At `\(\alpha = 0.05\)`, we reject the null hypothesis. There is sufficient evidence that at least one of the three types of feed leads to a population mean weight gain that differs from the others.

--

.question[Which group(s) is/are different?]

---

## Bonferroni correction

As we showed earlier, conducting multiple tests on a data set inflates the **family-wise error rate**. One very conservative way to control it is to simply divide `\(\alpha\)` by the number of tests to be done, and use the result as the significance level for each individual test. This is known as the .vocab[Bonferroni correction].

---

## Bonferroni correction

For example, if we want to conduct three pairwise tests while preserving an overall `\(0.05\)` Type I error rate, the Bonferroni correction would use `\(\alpha^* = 0.05/3 = 0.017\)` for each test.

Bonferroni is a conservative correction, making it harder to reject the null hypothesis, but it is a safe bet for controlling the Type I error rate.
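---

## Bonferroni correction in R

R can carry out all of the pairwise tests and apply the correction in one step with the built-in `pairwise.t.test()` function. Here is a minimal sketch (output not shown), again assuming `rat_feed` is loaded as in the earlier slides.

```r
# All pairwise t-tests on wt_gain by feed type, with
# Bonferroni-adjusted p-values that can be compared to alpha = 0.05
pairwise.t.test(rat_feed$wt_gain, rat_feed$type,
                p.adjust.method = "bonferroni")
```

Note that by default `pairwise.t.test()` pools the standard deviation across all groups, so its results may differ slightly from running three separate two-sample `\(t\)`-tests.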
---

## Rat feed data: group differences

Using a Bonferroni correction, we can conduct three pairwise tests of the following null hypotheses:

1. `\(H_{0}: \mu_{B} = \mu_{G}\)`
2. `\(H_{0}: \mu_{B} = \mu_{P}\)`
3. `\(H_{0}: \mu_{P} = \mu_{G}\)`

```
## # A tibble: 3 x 3
##   test    statistic p_value
##   <chr>       <dbl>   <dbl>
## 1 B vs. G    2.83   0.00743
## 2 B vs. P    0.0903 0.929
## 3 G vs. P   -2.77   0.00865
```

These are the raw/uncorrected p-values for the pairwise `\(t\)`-tests. What do we conclude?
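---

## Rat feed data: conclusions

To answer this, we can compare the raw p-values above to the Bonferroni-corrected threshold `\(\alpha^* = 0.05/3\)`. A minimal sketch, with the p-values copied from the table on the previous slide:

```r
# Raw p-values from the three pairwise t-tests
p_raw <- c("B vs. G" = 0.00743,
           "B vs. P" = 0.929,
           "G vs. P" = 0.00865)

alpha_star <- 0.05 / 3

# Reject H0 for a pair when its raw p-value falls below alpha*
p_raw < alpha_star

# Equivalently, compare Bonferroni-adjusted p-values to alpha = 0.05
p.adjust(p_raw, method = "bonferroni") < 0.05
```

Both comparisons agree: beef vs. grain and grain vs. pork are significant at the corrected level, while beef vs. pork is not. This suggests that grain feed yields a different mean weight gain than the two meat feeds.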