class: center, middle, inverse, title-slide

# Inference with the CLT
### Becky Tang
### 06.17.2021

---
layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## The Central Limit Theorem

For a population with a well-defined mean `\(\mu\)` and standard deviation `\(\sigma\)`, these three properties hold for the distribution of the sample average `\(\bar{X}\)`, assuming certain conditions hold:

1. The distribution of the sample statistic is nearly normal
2. The distribution is centered at the (often unknown) population parameter
3. The variability of the distribution is inversely proportional to the square root of the sample size

---

## Why do we care?

Knowing the distribution of the sample statistic `\(\bar{X}\)` can help us

--

- estimate a population parameter as **point estimate** `\(\boldsymbol{\pm}\)` **margin of error**
  - the .vocab[margin of error] combines a measure of how confident we want to be with a measure of how variable the sample statistic is

--

<br>

- test a claim about a population parameter by evaluating how likely it is to obtain the observed sample statistic when assuming that the null hypothesis is true
  - this probability will depend on how variable the sampling distribution is

---

class: center, middle

## Inference based on the CLT

---

## The Central Limit Theorem

If the necessary conditions are met, then for a population with a well-defined mean `\(\mu\)` and standard deviation `\(\sigma\)`, these three properties hold for the distribution of the sample average `\(\bar{X}\)`:

--

1. The mean of the sampling distribution of the mean is identical to the population mean `\(\mu\)`.

--

2. The standard deviation of the distribution of the sample averages is `\(\sigma/\sqrt{n}\)`.
  - This is called the .vocab[standard error] (SE) of the mean.

--

3. For `\(n\)` large enough, the sampling distribution of the mean is approximately .vocab[normally distributed].
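
A quick simulation can make these three properties concrete. The sketch below assumes a hypothetical right-skewed population (exponential with rate 1, so `\(\mu = 1\)` and `\(\sigma = 1\)`); the names `n`, `reps`, and `x_bar` are illustrative and are not part of the Durham example.

```r
set.seed(1)

# Hypothetical population: exponential with rate 1, so mu = 1 and sigma = 1
mu <- 1
sigma <- 1
n <- 50        # size of each sample
reps <- 5000   # number of simulated samples

# Draw `reps` samples of size n and record the mean of each one
x_bar <- replicate(reps, mean(rexp(n, rate = 1)))

mean(x_bar)      # property 1: compare with mu
sd(x_bar)        # property 2: compare with sigma / sqrt(n)
sigma / sqrt(n)  # theoretical standard error
```

With these settings, `mean(x_bar)` should land close to `mu` and `sd(x_bar)` close to `sigma / sqrt(n)`; calling `hist(x_bar)` should show a roughly bell-shaped histogram even though the population itself is skewed.
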
--

That is, the CLT tells us that `\(\bar{X}\)` approximately has the distribution `\(N\left(\mu, \sigma/\sqrt{n}\right)\)`.

---

## Inference based on the CLT

Since the CLT tells us that `\(\bar{X}\)` approximately has the distribution `\(N\left(\mu, \sigma/\sqrt{n}\right)\)`, properties of the normal distribution give us, approximately,

`$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$`

---

## What if `\(\sigma\)` isn't known?

<img src="img/14/guinness.jpg" width="80%" style="display: block; margin: auto;" />

---

## T distribution

In practice, we rarely know the true value of `\(\sigma\)`, and so we estimate it from our data with `\(s\)`, the .vocab[sample standard deviation].

--

We cannot directly substitute `\(s\)` for `\(\sigma\)` and expect to have the same result as in the CLT. After all, `\(s\)` is just an estimate.

--

But we can form the following test statistic for a single population mean, which has a .vocab[t-distribution with n-1 degrees of freedom]:

.question[
$$ T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$
]

---

## T distribution

Like the standard normal distribution, the t-distribution is unimodal, symmetric, and centered at 0.

--

It has one parameter: the degrees of freedom.
--

It has thicker tails than the normal distribution

- This makes up for the additional variability introduced by using `\(s\)` instead of `\(\sigma\)` in the calculation of the SE

---

## T vs Z distributions

<img src="14-clt-inference_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## T distribution

.pull-left[
.vocab[Finding probabilities under the t curve:]

```r
# P(t < -1.96)
pt(-1.96, df = 9)
```

```
## [1] 0.0408222
```

```r
# P(t > -1.96)
pt(-1.96, df = 9, lower.tail = FALSE)
```

```
## [1] 0.9591778
```
]

--

.pull-right[
.vocab[Finding cutoff values under the t curve:]

```r
# Find Q1
qt(0.25, df = 9)
```

```
## [1] -0.7027221
```

```r
# Find Q3
qt(0.75, df = 9)
```

```
## [1] 0.7027221
```
]

---

## Resident satisfaction in Durham

`durham_survey` contains resident responses to a survey given by the City of Durham in 2018. The respondents are a randomly selected, representative sample of Durham residents. Questions were rated 1 - 5, with 1 being "highly dissatisfied" and 5 being "highly satisfied."

--

.question[
Is there evidence that, on average, Durham residents are generally satisfied (score greater than 3) with the quality of the public library system?
]

---

## Exploratory Data Analysis

```r
library(tidyverse) # read_csv(), %>%, filter(), summarise()

durham <- read_csv("data/durham_survey.csv") %>%
  filter(quality_library != 9)
```

```r
durham %>%
  summarise(x_bar = mean(quality_library),
            med = median(quality_library),
            sd = sd(quality_library),
            n = n())
```

```
## # A tibble: 1 x 4
##   x_bar   med    sd     n
##   <dbl> <dbl> <dbl> <int>
## 1  3.97     4 0.900   521
```

---

## Exploratory Data Analysis

<img src="14-clt-inference_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

## Hypotheses

.question[
What are the hypotheses for evaluating if Durham residents, on average, are generally satisfied with the public library system?
]

--

`$$H_0: \mu = 3$$`
`$$H_a: \mu > 3$$`

---

## Conditions

.question[
What conditions must be satisfied to conduct this hypothesis test using methods based on the CLT? Are these conditions satisfied?
]

--

**Independence?**

--

The residents were randomly selected for the survey, and 521 is less than 10% of the Durham population (~ 270,000).

--

**Sample size / distribution?**

--

521 > 30, so the sample is large enough to apply the Central Limit Theorem.

---

## Calculating the test statistic

Summary statistics from the sample:

```
## # A tibble: 1 x 3
##    xbar     s     n
##   <dbl> <dbl> <int>
## 1  3.97 0.900   521
```

--

And the CLT says:

`$$\bar{x} \sim N\left(\text{mean} = \mu, \text{SE} = \frac{\sigma}{\sqrt{n}}\right)$$`

--

.question[
How many standard errors away from the hypothesized population mean is the observed sample mean?
]

.question[
How likely are we to observe a sample mean that is at least as extreme as the observed sample mean, if in fact the null hypothesis is true?
]

---

## Calculations

As we do not know the true `\(\sigma\)` and only have `\(s\)`, we use the `\(t\)`-distribution.

--

`\(SE = s / \sqrt{n}\)`

```r
(se <- durham_summary$s / sqrt(durham_summary$n)) # SE
```

```
## [1] 0.03944416
```

--

`\(t = \frac{\bar{X} - \mu}{s / \sqrt{n}}\)`

```r
(t <- (durham_summary$xbar - 3) / se) # Test statistic
```

```
## [1] 24.57372
```

---

## Calculations

The degrees of freedom for this test are `\(n-1\)`

```r
(df <- durham_summary$n - 1) # Degrees of freedom
```

```
## [1] 520
```

--

P-value

```r
pt(t, df, lower.tail = FALSE) # P-value, P(T > t | H_0 true)
```

```
## [1] 2.247911e-89
```

---

## Conclusion

The p-value is far smaller than `\(\alpha = 0.05\)`, so we reject `\(H_0\)`.

--

The data provide sufficient evidence at the `\(\alpha = 0.05\)` level that Durham residents, on average, are satisfied with the quality of the public library system.

--

.question[
Would you expect a 95% confidence interval to include 3?
]

---

## Confidence interval for a mean

.alert[
**General form of the confidence interval**

`$$point~estimate \pm critical~value \times SE$$`
]

--

.alert[
**Confidence interval for the mean**

`$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$`
]

---

## Calculate 95% confidence interval

.alert[
`$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$`
]

--

```r
# Critical value
t_star <- qt(0.975, df)
```

--

```r
# Point estimate
point_est <- durham_summary$xbar
```

--

```r
# Confidence interval
round(point_est + c(-1,1) * t_star * se, 2)
```

```
## [1] 3.89 4.05
```

---

## Interpret 95% confidence interval

```r
round(point_est + c(-1,1) * t_star * se, 2)
```

```
## [1] 3.89 4.05
```

.question[
Interpret this interval in context of the data.
]

--

**We are 95% confident that the true mean rating for Durham residents' satisfaction with the library system is between 3.89 and 4.05.**

---

class: middle, center

## Inference with the CLT using `infer`

---

## CLT-based hypothesis testing in `infer`

`$$H_0: \mu = 3 \text{ vs }H_a: \mu > 3$$`

--

```r
library(infer) # provides t_test()

durham %>%
  t_test(response = quality_library,
         mu = 3,
         alternative = "greater",
         conf_int = FALSE)
```

```
## # A tibble: 1 x 4
##   statistic  t_df  p_value alternative
##       <dbl> <dbl>    <dbl> <chr>
## 1      24.6   520 2.25e-89 greater
```

---

## CLT-based confidence intervals in `infer`

Calculate a 95% confidence interval for the mean satisfaction rating.

--

```r
durham %>%
  t_test(response = quality_library,
         alternative = "two-sided",
         conf_int = TRUE,
         conf_level = 0.95)
```

```
## # A tibble: 1 x 6
##   statistic  t_df p_value alternative lower_ci upper_ci
##       <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>
## 1      101.   520       0 two.sided       3.89     4.05
```

The `statistic` and `p_value` columns here test against the default null value `mu = 0`, so only `lower_ci` and `upper_ci` are of interest.

---

## Other built-in functionality in R

- There are more built-in functions for doing some of these tests in R.
- However, a learning goal of this course is not to go through an exhaustive list of all CLT-based tests and how to implement them
- Instead, the goal is to understand how these methods are / are not like the simulation-based methods we learned about earlier

--

.question[
What is similar, and what is different, between a CLT-based test of means and a simulation-based test?
]
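
---

## Built-in functionality: `t.test()`

As one example of the built-in functions mentioned on the previous slide, base R's `t.test()` runs the same one-sample t-test. This is only a sketch: the `ratings` vector is simulated, hypothetical data standing in for `durham$quality_library`, so the numbers will not match the Durham results.

```r
set.seed(2021)

# Hypothetical 1-5 ratings (simulated, NOT the Durham survey data)
ratings <- sample(1:5, size = 200, replace = TRUE,
                  prob = c(0.02, 0.05, 0.18, 0.45, 0.30))

# One-sided test of H0: mu = 3 vs Ha: mu > 3
test_greater <- t.test(ratings, mu = 3, alternative = "greater")
test_greater$statistic  # t statistic
test_greater$parameter  # degrees of freedom, n - 1
test_greater$p.value    # P(T > t | H0 true)

# Two-sided call, used only for its 95% confidence interval for the mean
t.test(ratings, conf.level = 0.95)$conf.int
```

The statistic matches the hand calculation `(mean(ratings) - 3) / (sd(ratings) / sqrt(length(ratings)))`, which is one way to check that the built-in function and the CLT-based formulas agree.
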