class: center, middle, inverse, title-slide

# Central Limit Theorem (CLT)

### Becky Tang

### 06.16.2021

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: center, middle

## Sample Statistics and Sampling Distributions

---

## Variability of sample statistics

- We've seen that each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)

- Previously, we quantified this variability via simulation

- Today we discuss some of the theory underlying .vocab[sampling distributions], particularly as they relate to sample means

---

## Statistical inference

- Statistical inference is the act of generalizing from a sample in order to make conclusions regarding a population.

- We are interested in population parameters, which we do not observe. Instead, we must calculate statistics from our sample in order to learn about them.

- As part of this process, we must quantify the degree of uncertainty in our sample statistic.

---

## Random variables and distributions

A .vocab[random variable] is a variable whose value is the outcome of a random event.

In statistics, a .vocab[probability distribution] is the mathematical function that gives the probabilities of occurrence of the different possible outcomes/events of a random variable.

--

It can be as simple as a list of outcomes and their associated probabilities. For example, let `\(X\)` be a random variable denoting the outcome of a roll of a fair six-sided die:

|X           | 1   | 2   | 3   | 4   | 5   | 6   |
|------------|-----|-----|-----|-----|-----|-----|
|Probability | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 |
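
--

We can sanity-check this table by simulating many rolls. A minimal sketch (the number of rolls and the seed are arbitrary):

```r
set.seed(123)
rolls <- sample(1:6, size = 10000, replace = TRUE)
prop.table(table(rolls)) # each proportion should be close to 1/6
```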
---

## Random variables and distributions

However, it's rarely the case that our random variables have so few possible outcomes. It would be impossible to write out a table of probabilities for a distribution with (infinitely) many possible outcomes. For example, what if `\(X\)` = height of students at Duke?

--

Instead, we can specify a function that describes the distribution of `\(X\)`. This plays the same role as the table: for a given outcome of `\(X\)`, the function tells you the probability with which that outcome occurs, according to the distribution.

---

## Sampling distribution of the mean

Suppose we’re interested in the mean resting heart rate of students at Duke, and are able to do the following:

--

1. Take a random sample of size `\(n\)` from this population, and calculate the mean resting heart rate in this sample, `\(\bar{X}_1\)`

--

2. Put the sample back, take a second random sample of size `\(n\)`, and calculate the mean resting heart rate from this new sample, `\(\bar{X}_2\)`

--

3. Put the sample back, take a third random sample of size `\(n\)`, and calculate the mean resting heart rate from this sample, too...

--

...and so on.

---

## Sampling distribution of the mean

After repeating this many times, we have a dataset that contains the sample averages from the population:

`\(\bar{X}_1\)`, `\(\bar{X}_2\)`, `\(\cdots\)`, `\(\bar{X}_K\)` (assuming we took `\(K\)` total samples).

--

.question[
Can we say anything about the distribution of these sample means (that is, the .vocab[sampling distribution] of the mean)?
]

*(Keep in mind, we don't know what the underlying distribution of resting heart rate looks like in Duke students!)*

---

class: center, middle

## The Central Limit Theorem

---

class: middle

A quick caveat...

For now, let's assume we know the standard deviation, `\(\sigma\)`, of our underlying distribution. This is a measure of the amount of variation in a set of values.

---

## The Central Limit Theorem

For a population with a well-defined mean `\(\mu\)` and standard deviation `\(\sigma\)`, the following three properties hold for the distribution of the sample average `\(\bar{X}\)`, given certain conditions:

--

1. The mean of the sampling distribution of the mean is identical to the population mean `\(\mu\)`.

--

2. The standard deviation of the distribution of the sample averages is `\(\sigma/\sqrt{n}\)`.
    - This is called the .vocab[standard error] (SE) of the mean.

--

3. For `\(n\)` large enough, the shape of the sampling distribution of means is approximately .vocab[normally distributed].

---

## The normal (Gaussian) distribution

The normal distribution is unimodal and symmetric and is described by its .vocab[density function]:

If a random variable `\(X\)` follows the normal distribution, then

`$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2} \right\}$$`

where `\(\mu\)` is the mean and `\(\sigma^2\)` is the variance `\((\sigma \text{ is the standard deviation})\)`.

.alert[
We often write `\(N(\mu, \sigma)\)` to describe this distribution, and write `\(X \sim N(\mu, \sigma)\)` to denote that `\(X\)` follows this distribution.
]
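
---

## The normal density in R

R provides this density as `dnorm()`. As a quick check that the formula above and the built-in function agree (the evaluation point `x = 1.5` is arbitrary):

```r
mu <- 0
sigma <- 1
x <- 1.5

# the density formula, written out by hand
(1 / sqrt(2 * pi * sigma^2)) * exp(-0.5 * (x - mu)^2 / sigma^2)

# the built-in density function; the two values should match
dnorm(x, mean = mu, sd = sigma)
```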
---

## The normal distribution (graphically)

<img src="13-clt_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

## Wait, *any* distribution?

The central limit theorem tells us that *<b>sample averages</b>* are approximately normally distributed, if we have enough data and certain assumptions hold.

This is true *even if our original variables are not normally distributed*.

Click [here](http://onlinestatbook.com/stat_sim/sampling_dist/index.html) to see an interactive demonstration of this idea.

---

## Conditions for CLT

We need to check two conditions for the CLT to hold: independence and sample size/distribution.

--

1) .vocab[Independence:] The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:

- the sample must be taken randomly
- if sampling without replacement, the sample size must be less than 10% of the population size

--

If observations are independent, then by definition one observation's value does not "influence" another observation's value.

---

## Conditions for CLT

2) .vocab[Sample size / distribution:] If we know for sure that the underlying data are normally distributed, then the distribution of sample averages will also be exactly normal, regardless of the sample size. Otherwise, we have the following guidelines:

- if the data are numerical, usually n > 30 is considered a large enough sample for the CLT to kick in
- if the data are categorical, we need at least 10 successes and 10 failures

---

class: middle, center

## Let's run our own simulation

---

### Underlying population (not observed in real life!)

.small[
```r
library(tidyverse)
library(infer) # provides rep_sample_n(), used below

rs_pop <- tibble(x = rbeta(100000, 1, 5) * 100)
```
]

<img src="13-clt_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

**The true population parameters**

.small[
```
## # A tibble: 1 x 2
##      mu sigma
##   <dbl> <dbl>
## 1  16.7  14.1
```
]

---

## Sampling from the population - 1

```r
set.seed(1)
samp_1 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

```r
samp_1
```

```
## # A tibble: 1 x 1
##   x_bar
##   <dbl>
## 1  14.2
```

---

## Sampling from the population - 2

```r
set.seed(2)
samp_2 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

```r
samp_2
```

```
## # A tibble: 1 x 1
##   x_bar
##   <dbl>
## 1  16.1
```

---

## Sampling from the population - 3

```r
set.seed(3)
samp_3 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

```r
samp_3
```

```
## # A tibble: 1 x 1
##   x_bar
##   <dbl>
## 1  17.7
```

--

keep repeating...

---

## Sampling distribution

.small[
```r
set.seed(1212)
sampling <- rs_pop %>%
  rep_sample_n(size = 50, replace = TRUE, reps = 5000) %>%
  group_by(replicate) %>%
  summarise(xbar = mean(x))
```
]

<img src="13-clt_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

```r
sampling %>%
  summarise(mean = mean(xbar), se = sd(xbar))
```

```
## # A tibble: 1 x 2
##    mean    se
##   <dbl> <dbl>
## 1  16.7  1.97
```
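
--

As a check, we can compute the CLT's prediction for the standard error directly from the population `\(\sigma\)`, reusing the `rs_pop` population defined above:

```r
# CLT prediction: sigma / sqrt(n), which should be close to
# the simulated standard error of 1.97
sd(rs_pop$x) / sqrt(50)
```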
---

.question[
How do the shapes, centers, and spreads of these distributions compare?
]

<img src="13-clt_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## Recap

- If certain assumptions are satisfied, regardless of the shape of the population distribution, the sampling distribution of the mean follows an approximately normal distribution.

--

- The center of the sampling distribution of the mean is at the center of the population distribution.

--

- The sampling distribution of the mean is less variable than the population distribution (and we can quantify by how much).

--

.question[
What is the standard error, and how are the standard error and sample size related? What does that say about how the spread of the sampling distribution changes as `\(n\)` increases?
]

---

class: center, middle

## Finding probabilities in R

---

## Probabilities under the N(0,1) curve

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(Z < -1.5)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

## Probabilities under the N(0,1) curve

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(Z < -1.5)\)`?
]

The `p<dist>(z)` family of functions in R lets us calculate the probability that a random variable following the specified distribution takes a value less than `\(z\)`.

```r
# P(Z < -1.5)
pnorm(-1.5)
```

```
## [1] 0.0668072
```

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

--

<img src="13-clt_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

```r
pnorm(2) - pnorm(-1)
```

```
## [1] 0.8185946
```

---

## Finding cutoff values under the N(0,1) curve

.question[
What value of `\(Z\)` lies at the 25th percentile of the `\(N(0,1)\)` distribution?
]

<img src="13-clt_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

---

## Finding cutoff values under the N(0,1) curve

.question[
What value of `\(Z\)` lies at the 25th percentile of the `\(N(0,1)\)` distribution?
]

The `q<dist>(p)` family of functions in R lets us find the value of the random variable `\(Z\)` that lies at the `\(p\)`-th percentile, when `\(Z\)` follows the specified distribution.

```r
# find Q1
qnorm(0.25)
```

```
## [1] -0.6744898
```

---

## Standard Normal

We have been using `\(Z\)` to denote a random variable that follows the `\(N(0,1)\)` distribution, which is called the .vocab[standard normal].

- Mean `\(\mu = 0\)`
- Standard deviation `\(\sigma = 1\)` (and hence variance `\(\sigma^2 = 1\)`)

These are the default parameter values for the `_norm()` functions in R.

--

If you want a different mean and standard deviation (e.g., `\(\mu = -2\)` and `\(\sigma = 0.25\)`), you have to provide these values:

```r
# value at the 75th percentile of the N(-2, 0.25) distribution
qnorm(0.75, -2, 0.25)
```

```
## [1] -1.831378
```

---

## Different standard deviation

<img src="13-clt_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
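
---

## Putting it together

We can combine the CLT with `pnorm()`. Under the CLT, sample means of size 50 from the `rs_pop` population above are approximately `\(N(16.7, 14.1/\sqrt{50})\)`. A sketch of how we might use this (the cutoff of 15 is arbitrary):

```r
# approximate P(x_bar < 15) for a sample of size 50 from rs_pop,
# using the CLT approximation: x_bar ~ N(mu, sigma / sqrt(n))
pnorm(15, mean = 16.7, sd = 14.1 / sqrt(50))
```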