AE 10: Time spent on email

Announcements

No lab this week
Exam 01: will be released Friday morning, and is Sunday, June 13 at 11:59pm Eastern. Note: 12-hour grace period will NOT apply for the exam
- Material from Weeks 01 - 04 (data viz, data wrangling, probability)
- Open book/ open note
- You cannot discuss the exam with anyone
- TA Marc will not have office hours during exam period
- Email Becky if you have questions. I will be checking my email consistently
- See the Academic honesty policy in the course syllabus.
- Will have an Exam 01 repo on GitHub. Exam instructions will be in the README of the repo. Push your changes to the repo, then submit the final PDF on Gradescope. Be sure to mark the pages corresponding to each question.

Example: Bootstrap Estimation with `infer`

library(tidyverse)
library(infer)

Data

The General Social Survey (GSS) gathers data on contemporary American society in order to understand and explain trends and constants in attitudes, behaviors, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years.

The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

We will analyze data from the 2018 GSS, using to understand how much time US adults spend on email.

gss_email <- read_csv("data/gss_email_2018.csv")

We’ll use the variable email: the number of minutes the respondents spend on email weekly.

We want to calculate a 95% confidence interval for the mean amount of time Americans spend on email weekly.

Set seed

Set a seed to control R’s random sampling process.

set.seed(06072021)

Bootstrap distribution

boot_dist <- gss_email %>%
  specify(response = email) %>% 
  generate(reps = 1000, type = "bootstrap") %>% 
  calculate(stat = "mean")

specify() is used to specify which variable in our data frame is the relevant response variable (i.e. the variable we’re bootstrapping).
generate() generates the bootstrap samples.
calculate() calculates the sample statistic. In this case, we’re using the mean, but we could just as easily calculate, say, bootstrap medians.

glimpse(boot_dist)

## Rows: 1,000
## Columns: 2
## $ replicate <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ stat      <dbl> 451.2368, 443.8189, 439.5772, 427.5405, 426.8520, 436.1001, …

Calculate 95% confidence interval

ci <- boot_dist %>%
  summarise(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
ci

## # A tibble: 1 x 2
##   lower upper
##   <dbl> <dbl>
## 1  396.  469.

ggplot(data = boot_dist, aes(x = stat)) +
  geom_histogram(alpha = 0.7) + 
  geom_vline(xintercept = ci$lower, lty = 2, size = 1, color = "steelblue") +
  geom_vline(xintercept = ci$upper, lty = 2, size = 1, color = "steelblue") + 
  labs(title = "Bootstrap distribution of means", 
       subtitle = "With 95% confidence interval", 
       x = "Bootstrap Means", 
       y = "Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation (Zoom poll)

The 95% confidence interval for the mean number of minutes Americans spend on email weekly is (395.522, 468.575). Which of the following is the correct interpretation of this interval?

95% of the time Americans spend on email in the sample is between 395.522 and 468.575 minutes weekly.
95% of all Americans spend between 395.522 and 468.575 minutes on email weekly.
We are 95% confident that the mean time spent on email by all Americans is between 395.522 and 468.575 minutes weekly.
We are 95% confident that the time American spend on email in this sample is between 395.522 and 468.575 minutes weekly.

Clone a repo + start a new project

Go to the ae-10-[GITHUB USERNAME] repo, clone it, and start a new project in RStudio.

Configure git

Run the following code to configure Git. Fill in your GitHub username and the email address associated with your GitHub account.

library(usethis)
use_git_config(user.name = "your github username", user.email ="your email")

Bootstrap confidence intervals

We will use the infer package to calculate a 95% confidence interval for the median amount of time Americans spend on email weekly.

We start by setting a seed. We’ll use 1234 to set our seed today but you can use any value you want on assignments.

set.seed(1234)

Creating the data frame

Uncomment the lines and fill in the blanks to create a data frame containing the bootstrapped data - sample medians in our case.

boot_dist <- gss_email #%>%
  #specify(______) %>%
  #generate(______) %>%
  #calculate(______)

Glimpse the data frame of the bootstrapped data.

### Glimpse the bootstrapped data frame

Find the confidence interval

Uncomment the lines and fill in the blanks to construct the 95% bootstrap confidence interval for the median amount of time Americans spend on email weekly.

#___ %>%
#  summarise(lower = quantile(______),
  #          upper = quantile(______))

Interpret the interval

Write the interpretation for the interval calculated above.

Changing the confidence level

Modify the code used to calculate a 95% confidence interval to calculate a 90% confidence interval of the median time American spend on email weekly. Is your interval wider, narrower, or the same as the 95% confidence interval?

#calculate a 90% confidence interval

Now let’s calculate a 99% confidence interval for the median time Americans spend on email weekly.

#calculate a 99% confidence interval

Leaving everything else the same, how does decreasing the confidence level affect the confidence interval? How does increasing the confidence level affect the confidence interval?