The goal of today’s lab is to practice two-sample statistical inference using both simulation- and CLT-based approaches. Use the lecture notes, readings, and application exercises to help you complete the lab.
Today’s dataset has been adapted from a large data set of wine reviews from Kaggle. Reviews were scraped from WineEnthusiast, which contain information about both the wine itself as well as the reviewer’s description of that wine. Here, I provide a subset of the data. You may consider each of these observations to be an independent, representative sample of all wines.
The variables are as follows
country
: the country that the wine is from (France, Italy, US)points
: the number of points WineEnthusiast rated the wine on a scale of 1-100.price
: cost for a bottle (USD)variety
: type of grapes used to make the winetype
: red or white winetitle
: title of the wine reviewdescription
: reviewer’s description of the wineThe data for today’s lab may be found by cloning your repository available at the class GitHub repository. Load the data into your RStudio environment, and don’t forget to configure GitHub beforehand. Write all R code according to the style guidelines discussed in class.
To following resource provides code needed to make useful symbols. You may use the code to typeset the characters of interest in the narrative of your document:
$\mu$
$\alpha$
$\ge$
$\le$
$\neq$
$H_0$
$H_a$
Overall hint: When performing a hypothesis test, you must provide the significance level of your test, the null and alternative hypotheses, the p-value, your decision, and an interpretation of the p-value in context of the original research question.
You may load in the data with the following code, where ____
should be replaced by a meaningful name of your choosing:
At the start of each exercise that requires simulation, set a random seed equal to the exercise number in the R chunk.
Suppose you are interested in knowing if red or white wines in these three countries are more rated heigher. Comprehensively evaluate the hypothesis that the average points
of white wines is lower than the average points
of red wines using a simulation-based approach.
Now suppose you are now interested in how the type of wine (red or white) impacts its price. Comprehensively evaluate the hypothesis that the average price of red wines is different from the average price of white wines in these three countries. Use a CLT-based approach at the \(\alpha = 0.05\) level.
Construct and interpret a a 95% two-sided confidence interval for the difference you investigated in Exercise 2 using a CLT-based approach.
For this exercise, consider wines from US and Italy only. Evaluate the hypothesis that the country
where the wine is produced is independent of its type
(red or white). Use a simulation-based approach at the \(\alpha = 0.05\) level. Visualize your null distribution.
Let’s say I’m doing to a nice dinner party, and I’m told to bring a bottle of wine. I want to bring an acceptable wine, but I don’t like these friends enough to spend more than $20 on a bottle of wine, so I have to be strategic. I’ve been told that a cheap white is usually better than a cheap red, but that an expensive red is always better than an expensive white. Maybe I should buy a wine based on its points-to-price ratio.
Create a new variable for this ratio. Wines with higher values for this ratio might indicate that the wine is a good deal.
Then, using a CLT-based approach, comprehensively evaluate the hypothesis at the \(\alpha = 0.05\) level that among wines that are at most $20 per bottle, white wines have a higher average point-to-price ratio compared to red wines.
How would your decision and conclusion change if we used \(\alpha = 0.10\) instead?
Hint: refer back to the lectures about probability and the definition of Type I error to answer this question!
Suppose you perform 10 independent hypothesis tests at the \(\alpha = 0.05\) level, and further suppose that in reality, the null hypothesis is TRUE for all 10 tests. What is the probability that you make at least one Type I error?
Let’s say you just really hate white wine and are trying to demonstrate to others that it’s associated with lower ratings in terms of points. Describe the potential ethical ramifications of performing all of these tests and reporting only the significant results from the tests that support your narrative. Consider your answer from Exercise 6 in crafting this narrative.
Hint: if your distribution is skewed, then median and IQR are appropriate summary statistics. If your distribution is symmetric, then mean and standard deviation are appropriate.
Plot and describe the distribution of the points
of wine. Also provide appropriate summary statistics when describing your distribution, obtained by using the summarise()
function.
From Ex. 4, you should have noticed that the values that points
takes on lie between 80 and 100, even though the points are awarded on the 1-100 scale. This is because WineEnthusiast only provides reviews for wines that are at least considered “Good”. They using the following scale for classification of quality:
Create a new variable for the quality of each wine based on its points.
Is there evidence to suggest that the quality of a wine and the country
where its produced are associated? Comprehensively evaluate this hypothesis using an appropriate method of your choice.
Knit to PDF to create a PDF document. Knit and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.
Please only upload your PDF document to Gradescope. Associate the “Overall” graded section with the first page of your PDF, and mark where each answer is to the exercises. If any answer spans multiple pages, then mark all pages.