infer
library(tidyverse)
library(infer)
The General Social Survey (GSS) gathers data on contemporary American society in order to understand and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years.
The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.
We will analyze data from the 2018 GSS to understand how much time US adults spend on email.
gss_email <- read_csv("data/gss_email_2018.csv")
We’ll use the variable email: the number of minutes the respondents spend on email weekly.
We want to calculate a 95% confidence interval for the mean amount of time Americans spend on email weekly.
Set a seed to control R’s random sampling process.
set.seed(06072021)
boot_dist <- gss_email %>%
specify(response = email) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean")
specify() is used to specify which variable in our data frame is the relevant response variable (i.e. the variable we’re bootstrapping).
generate() generates the bootstrap samples.
calculate() calculates the sample statistic. In this case, we’re using the mean, but we could just as easily calculate, say, bootstrap medians.
glimpse(boot_dist)
## Rows: 1,000
## Columns: 2
## $ replicate <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ stat <dbl> 451.2368, 443.8189, 439.5772, 427.5405, 426.8520, 436.1001, …
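The three steps above (specify, generate, calculate) can also be written out by hand, which may help show what the pipeline is doing. The following is a minimal base R sketch, not the infer implementation. It uses a simulated vector, toy_email, as a hypothetical stand-in for the email column, so the numbers it produces are not GSS results.

```r
# A hand-rolled bootstrap of the mean in base R.
# ASSUMPTION: toy_email is simulated data, a hypothetical stand-in for gss_email$email.
set.seed(1234)
toy_email <- rexp(200, rate = 1 / 400)

n_reps <- 1000

# Each replicate resamples the data with replacement, keeping the original
# sample size, and then records the mean of that resample.
boot_means <- replicate(n_reps, {
  resample <- sample(toy_email, size = length(toy_email), replace = TRUE)
  mean(resample)
})

length(boot_means)  # one bootstrap mean per replicate
mean(toy_email)     # the observed sample mean
mean(boot_means)    # the center of the bootstrap distribution
```

The vector boot_means plays the same role as the stat column of boot_dist: one sample statistic per bootstrap replicate.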
ci <- boot_dist %>%
summarise(lower = quantile(stat, 0.025),
upper = quantile(stat, 0.975))
ci
## # A tibble: 1 x 2
## lower upper
## <dbl> <dbl>
## 1 396. 469.
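As an alternative to taking the quantiles by hand, infer provides get_confidence_interval(), which computes a percentile interval from a bootstrap distribution. The sketch below compares the two approaches on simulated data. The data frame toy_email and the object toy_boot are hypothetical stand-ins for gss_email and boot_dist, and the column names lower_ci and upper_ci assume a reasonably recent version of infer.

```r
library(infer)

# ASSUMPTION: toy_email is simulated data, a hypothetical stand-in for gss_email.
set.seed(1234)
toy_email <- data.frame(email = rexp(200, rate = 1 / 400))

# Same pipeline as in the text, applied to the simulated data
toy_boot <- toy_email %>%
  specify(response = email) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# Percentile interval from the quantiles of the bootstrap statistics
ci_quantile <- c(
  lower = unname(quantile(toy_boot$stat, 0.025)),
  upper = unname(quantile(toy_boot$stat, 0.975))
)

# The same interval using infer's helper
ci_infer <- get_confidence_interval(toy_boot, level = 0.95, type = "percentile")

ci_quantile
ci_infer
```

Both approaches take the middle 95% of the bootstrap statistics, so the two intervals should agree.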
ggplot(data = boot_dist, aes(x = stat)) +
geom_histogram(alpha = 0.7) +
geom_vline(xintercept = ci$lower, lty = 2, size = 1, color = "steelblue") +
geom_vline(xintercept = ci$upper, lty = 2, size = 1, color = "steelblue") +
labs(title = "Bootstrap distribution of means",
subtitle = "With 95% confidence interval",
x = "Bootstrap Means",
y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The 95% confidence interval for the mean number of minutes Americans spend on email weekly is (395.522, 468.575). Which of the following is the correct interpretation of this interval?
95% of the time Americans spend on email in the sample is between 395.522 and 468.575 minutes weekly.
95% of all Americans spend between 395.522 and 468.575 minutes on email weekly.
We are 95% confident that the mean time spent on email by all Americans is between 395.522 and 468.575 minutes weekly.
We are 95% confident that the time Americans spend on email in this sample is between 395.522 and 468.575 minutes weekly.
Go to the ae-10-[GITHUB USERNAME] repo, clone it, and start a new project in RStudio.
Run the following code to configure Git. Fill in your GitHub username and the email address associated with your GitHub account.
library(usethis)
use_git_config(user.name = "your github username", user.email = "your email")
We will use the infer package to calculate a 95% confidence interval for the median amount of time Americans spend on email weekly.
We start by setting a seed. We’ll use 1234 to set our seed today but you can use any value you want on assignments.
set.seed(1234)
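To see why the seed matters, the short base R sketch below draws the same random sample twice. Resetting the seed to the same value before each draw makes the two draws identical, which is what makes a bootstrap analysis reproducible. The object names first_draw and second_draw are only for illustration.

```r
# Resetting the seed to the same value reproduces the same random draw.
set.seed(1234)
first_draw <- sample(1:100, size = 5)

set.seed(1234)
second_draw <- sample(1:100, size = 5)

identical(first_draw, second_draw)  # TRUE: both draws started from the same seed
```

Without the second call to set.seed(), the second draw would continue from wherever the random number generator left off and would generally differ from the first.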
Uncomment the lines and fill in the blanks to create a data frame containing the bootstrapped data - sample medians in our case.
boot_dist <- gss_email #%>%
#specify(______) %>%
#generate(______) %>%
#calculate(______)
Glimpse the data frame of the bootstrapped data.
### Glimpse the bootstrapped data frame
Uncomment the lines and fill in the blanks to construct the 95% bootstrap confidence interval for the median amount of time Americans spend on email weekly.
#___ %>%
# summarise(lower = quantile(______),
# upper = quantile(______))
Write the interpretation for the interval calculated above.
Modify the code used to calculate a 95% confidence interval to calculate a 90% confidence interval for the median time Americans spend on email weekly. Is your interval wider, narrower, or the same as the 95% confidence interval?
#calculate a 90% confidence interval
Now let’s calculate a 99% confidence interval for the median time Americans spend on email weekly.
#calculate a 99% confidence interval
Leaving everything else the same, how does decreasing the confidence level affect the confidence interval? How does increasing the confidence level affect the confidence interval?