Announcements

Exam 02

Will be released this Friday at 9:00am, and will be due Sunday, July 11 at 11:59pm, with the same format as Exam 01. Exam 02 will cover Inference, as well as Simple Linear Regression.

No lab or TA office hours this Friday.

Project - Proposal due this Friday!

Write the proposal in the project-proposal.Rmd file in the project repo.

Exercises

Let’s filter spam emails! We’ll examine a data set of 3921 emails and use logistic regression to determine which ones are potentially spam.

We’ll use the following variables in the analysis:

email <- read_csv("data/email.csv") %>%
    mutate(spam = factor(spam))

Exercise 1

Visualize the relationship between (1) spam and winner, and (2) spam and num_char. Use these plots to describe the relationship between the response variable and each of the explanatory variables.

Exercise 2

Fit a logistic regression model with spam as the response and winner and num_char as explanatory variables. Use the tidy function to display the model output. Hint: You need to use the glm function and family = "binomial" for the model.

Exercise 3

Write the model equation. You can use “log-odds-spam” as the response variable.

Exercise 4

Interpret \(\hat{\beta}_{num\_char}\) in terms of the log odds an email is spam.

Exercise 5

Interpret \(\hat{\beta}_{winner}\) in terms of the log odds an email is spam.

Exericse 6

Calculate the predicted log odds that an email that has 400 words and contains the word “winner” is spam.

Exercise 7

Calculate the predicted probability that an email that has 400 words and contains the word “winner” is spam.

Exercise 8

Suppose your model is used to identify which emails should be classified as spam and moved to the junk folder. Should the email from the previous question be classified as spam? Briefly explain your decision.