Introduction and Data

The goal of today’s lab is to practice creating bootstrap confidence intervals, and visualizing bootstrap distributions.

Tips

Use a small number of reps (about 100) as you write and test out your code. Once you have finalized all of your code, increase the number of reps to at least 1000 (10,000 if possible) to produce your final results.
Write your code and narrative in a reproducible way, so you can easily change the number of reps. Tips on making your code more reproducible:

Tip 1: Set your seed

When dealing with randomness (as often the case in simulation in statistics), it is important to specify which pseudo-random draw you used in your analysis, so that you or someone else can reproduce the exact numbers you initially report. The set.seed() function in R allows you to ensure that all of your analysis relies on a specific pseudo-random draw:

set.seed(42)

Tip 2: Define global parameters

Often, we rely on specific parameters values throughout our analysis, and at a later point, we may want to replace them. In order to minimize the need to change your code later, we can assign the parameter values to a name, and use the name (rather than the hard-coded value) downstream. Then, to update your code at a later point, you can just change the value. Here, we are assinging the number of reps to a variable called num_reps:

num_reps <- 100
boot_dist <- diamonds %>%
  infer::specify(response = price) %>%
  infer::generate(reps=num_reps, type = "bootstrap") %>%
  infer::calculate(stat="mean")

The data

The data for today’s lab may be found by cloning your lab-05-durham- repository available in the GitHub course organization.

Today’s data comes from the City of Durham’s annual Resident Satisfaction Survey for 2018 (more information accessible here). In particular, the durham_survey dataset contains data from 608 Durham residents on the survey questions below. Assume that the data are representative of Durham residents and may be generalized to the wider population of all city residents.

Any variable starting with quality refers to the perceived quality of the listed variable, with 1 being “highly dissatisfied,” 3 being “neutral”, and 5 being “highly satisfied.” A value of 9 indicates that the subject responded with N/A.

You may load in the data with the following code, where ____ should be replaced by a meaningful name of your choosing:

library(tidyverse)
library(infer)
____ <- read_csv("data/durham_survey.csv")

Exercises

Write all R code according to the style guidelines discussed in class.

⊕Hint: be careful with how missing values are coded in this survey. As well, don’t forget to set a seed in order to ensure reproducibility!

Provide a point estimate of the mean satisfaction with the fire department (quality_fire) among Durham residents in 2018.
Construct a 95% bootstrap interval for the mean satisfaction with the fire department among durham residents in 2018. Use at least 1,000 bootstrap samples. Make sure your interval is reproducible.
Visualize the bootstrap distribution and your confidence interval from Exercise 2. Interpret the confidence interval you constructed.
Provide a point estimate of the proportion of the respondents in the survey who were satisfied (score of 4 or 5) with the quality of parks and recreation (quality_parks_rec) in Durham.

⊕Hint: see if you can reuse parts of code used in previous exercises.

Construct a 99% bootstrap interval for the proportion of respondents in the survey who were satisfied with the quality of parks and recreation in Durham. Make sure your interval is reproducible.
Visualize the bootstrap distribution and your confidence interval from Exercise 5. Interpret the confidence interval you constructed.

⊕Hint: If either of the two scores is missing, then that observation cannot be used to calculate the correlation.

Provide a point estimate of the correlation between survey scores of perceived quality of bike paths (quality_bike_path) and the perceived quality of pedestrian paths (quality_ped_path).

⊕Hint: To simulate the correlation between two variables, use specify(var1 ~ var2). Remember that correlation is still a numerical quantity, so that should help you choose the type of simulation you want to perform.

Construct a 95% bootstrap interval for the correlation between survey responses for perceived quality of bike paths and pedestrian paths. Make sure your interval is reproducible.
Construct a 99% bootstrap interval using the bootstrap distribution from Exercise 8.

How does the 99% bootstrap interval compare to the 95% bootstrap interval calculated in the previous exercise?

In general, how does the bootstrap interval change when the confidence level increases?

Submission

Knit to PDF to create a PDF document. Knit and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Please only upload your PDF document to Gradescope. Associate the “Overall” graded section with the first page of your PDF, and mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages.

Lab 05 - Bootstrap estimation

due Thursday, June 10 at 11:59p