Will be released this Friday at 9:00am, and will be due Sunday, July 11 at 11:59pm, with the same format as Exam 01. Exam 02 will cover Inference, as well as Simple Linear Regression.
No lab or TA office hours this Friday.
Write the proposal in the project-proposal.Rmd
file in the project repo.
Let’s filter spam emails! We’ll examine a data set of 3921 emails and use logistic regression to determine which ones are potentially spam.
We’ll use the following variables in the analysis:
spam
- 1: email is spam, 0: email is not spamwinner
- yes: “winner” appeared in the email, no: “winner” did not appear in emailnum_char
- Number of characters in the email (in thousands)email <- read_csv("data/email.csv") %>%
mutate(spam = factor(spam))
Visualize the relationship between (1) spam
and winner
, and (2) spam
and num_char
. Use these plots to describe the relationship between the response variable and each of the explanatory variables.
Fit a logistic regression model with spam
as the response and winner
and num_char
as explanatory variables. Use the tidy function to display the model output. Hint: You need to use the glm function and family = "binomial"
for the model.
Write the model equation. You can use “log-odds-spam” as the response variable.
Interpret \(\hat{\beta}_{num\_char}\) in terms of the log odds an email is spam.
Interpret \(\hat{\beta}_{winner}\) in terms of the log odds an email is spam.
Calculate the predicted log odds that an email that has 400 words and contains the word “winner” is spam.
Calculate the predicted probability that an email that has 400 words and contains the word “winner” is spam.
Suppose your model is used to identify which emails should be classified as spam and moved to the junk folder. Should the email from the previous question be classified as spam? Briefly explain your decision.