class: center, middle, inverse, title-slide

# Logistic Regression

### Becky Tang

### 07.07.2021

---

<!-- layout: true
<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>
-->

---

class: middle center

## Model Selection

---

## Multiple linear regression

Let's say that we have 3 explanatory variables that we can use to predict the response variable `\(y\)`. Should we use all 3? What about interactions?

`$$\begin{align}
&\hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_2 x_{2}\\
&\hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_2 x_{2} + \beta_{3} x_{3}\\
&\hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_2 x_{2} + \beta_{3} x_{1}x_{2}\\
& \quad \vdots
\end{align}$$`

--

Use adjusted `\(R^2\)` to help us determine the best-fitting linear model.

---

## Backwards elimination

- Start with the full model (including all candidate explanatory variables and all candidate interactions)
- Remove one variable at a time, and select the model with the highest adjusted `\(R^2\)`
- Continue until adjusted `\(R^2\)` does not increase

---

## Forward selection

- Start with the empty model (intercept only)
- Add one variable (or interaction effect) at a time, and select the model with the highest adjusted `\(R^2\)`
- Continue until adjusted `\(R^2\)` does not increase

--

Note: adjusted `\(R^2\)` is just one criterion; other metrics can be used as well

---

## Backwards elimination

```r
sports_car_prices <- read_csv("data/sports_cars.csv")
```

```r
m_full <- lm(mileage ~ price + age + car + price*car + age*car,
             data = sports_car_prices)
```

```r
round(glance(m_full)$adj.r.squared, 2)
```

```
## [1] 0.77
```

--

Current model: `\(\widehat{mileage} = \beta_{0} + \beta_{1}~ price + \beta_{2} ~ age + \beta_{3} ~ carPorsche +\beta_{4} ~price*carPorsche + \beta_{5} ~age*carPorsche\)`

Current adjusted `\(R^2\)`: 0.77

---

## Backwards elimination

Current model: `\(\widehat{mileage} = \beta_{0} + \beta_{1}~ price + \beta_{2} ~ age + \beta_{3} ~ carPorsche +\beta_{4} ~price*carPorsche + \beta_{5} ~age*carPorsche\)`

Current adjusted `\(R^2\)`: 0.77

```r
m_cand1 <- lm(mileage ~ price + age + car + price*car, data = sports_car_prices)
m_cand2 <- lm(mileage ~ price + age + car + age*car, data = sports_car_prices)
round(glance(m_cand1)$adj.r.squared, 2)
```

```
## [1] 0.76
```

```r
round(glance(m_cand2)$adj.r.squared, 2)
```

```
## [1] 0.77
```

--

New model: `\(\widehat{mileage} = \beta_{0} + \beta_{1}~ price + \beta_{2} ~ age + \beta_{3} ~ carPorsche + \beta_{4} ~age*carPorsche\)`

New adjusted `\(R^2\)`: 0.77

---

## Backwards elimination

Current model: `\(\widehat{mileage} = \beta_{0} + \beta_{1}~ price + \beta_{2} ~ age + \beta_{3} ~ carPorsche + \beta_{4} ~age*carPorsche\)`

Current adjusted `\(R^2\)`: 0.77

```r
m_cand1 <- lm(mileage ~ price + age + car, data = sports_car_prices)
m_cand2 <- lm(mileage ~ age + car + age*car, data = sports_car_prices)
round(glance(m_cand1)$adj.r.squared, 2)
```

```
## [1] 0.77
```

```r
round(glance(m_cand2)$adj.r.squared, 2)
```

```
## [1] 0.67
```

--

New model: `\(\widehat{mileage} = \beta_{0} + \beta_{1}~ price + \beta_{2} ~ age + \beta_{3} ~ carPorsche\)`

New adjusted `\(R^2\)`: 0.77

---

## Backwards elimination

Current model: `\(\widehat{mileage} = \beta_{0} + \beta_{1}~ price + \beta_{2} ~ age + \beta_{3} ~ carPorsche\)`

Current adjusted `\(R^2\)`: 0.77

```r
m_cand1 <- lm(mileage ~ price + age, data = sports_car_prices)
m_cand2 <- lm(mileage ~ age + car, data = sports_car_prices)
m_cand3 <- lm(mileage ~ price + car, data = sports_car_prices)
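# Optional sketch (not from the original slides): pull all three adjusted
# R^2 values in one call instead of running glance() on each model separately
sapply(list(m_cand1, m_cand2, m_cand3),
       function(m) round(glance(m)$adj.r.squared, 2))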
round(glance(m_cand1)$adj.r.squared, 2)
```

```
## [1] 0.74
```

```r
round(glance(m_cand2)$adj.r.squared, 2)
```

```
## [1] 0.61
```

```r
round(glance(m_cand3)$adj.r.squared, 2)
```

```
## [1] 0.67
```

---

## Backwards elimination

Removing any remaining variable decreases adjusted `\(R^2\)`, so we stop: we have our final model.

```r
m_final <- lm(mileage ~ price + age + car, data = sports_car_prices)
tidy(m_final)
```

```
## # A tibble: 4 x 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 48960.      5822.         8.41 1.64e-11
## 2 price          -0.735      0.119     -6.18 7.60e- 8
## 3 age          2100.        430.        4.89 8.94e- 6
## 4 carPorsche  10035.       3781.        2.65 1.03e- 2
```

`\(\widehat{mileage} = 48960.15 - 0.73 ~ price + 2100.21 ~ age + 10034.86 ~ carPorsche\)`

---

## `step()` function

- Automate the process of model selection using the `step()` function.
- Note: the criterion `step()` uses to compare models is the Akaike information criterion (AIC)

```r
m_final <- step(m_full, direction = "backward")
```

---

## `step()` function

```
## Start:  AIC=1116.75
## mileage ~ price + age + car + price * car + age * car
## 
##             Df Sum of Sq        RSS    AIC
## - price:car  1 147257329 6097924111 1116.2
## <none>                   5950666782 1116.8
## - age:car    1 376369525 6327036307 1118.4
## 
## Step:  AIC=1116.21
## mileage ~ price + age + car + age:car
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                   6097924111 1116.2
## - age:car  1  229960767  6327884878 1116.4
## - price    1 2970101807  9068025918 1138.0
```

---

## `step()` function

```r
tidy(m_final)
```

```
## # A tibble: 5 x 5
##   term            estimate std.error statistic    p.value
##   <chr>              <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)    39802.     8585.         4.64 0.0000223 
## 2 price             -0.663     0.128     -5.18 0.00000330
## 3 age             3478.     1047.         3.32 0.00160   
## 4 carPorsche     15627.     5395.         2.90 0.00541   
## 5 age:carPorsche -1393.      967.        -1.44 0.155     
```

---

class: middle center

## Logistic Regression

---

## Introduction

Multiple regression allows us to relate a numerical response variable to one or more numerical or categorical predictors. We can use multiple regression models to understand relationships, assess differences, and make predictions.

But what about a situation where the response of interest is categorical and binary?

--

- spam or not spam
- malignant or benign tumor
- survived or died
- admitted or not admitted

---

## Titanic

On April 15, 1912, the famous ocean liner *Titanic* sank in the North Atlantic after striking an iceberg on its maiden voyage. The dataset `titanic.csv` contains the survival status and other attributes of individuals aboard the *Titanic*.

- `survived`: survival status (1 = survived, 0 = died)
- `pclass`: passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
- `name`: name of individual
- `sex`: sex (male or female)
- `age`: age in years
- `fare`: passenger fare in British pounds

We are interested in investigating the variables that contribute to passenger survival. Do women and children really come first?

---

## Data and Packages

```r
library(tidyverse)
library(broom)
```

```r
glimpse(titanic)
```

```
## Rows: 887
## Columns: 7
## $ pclass   <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ name     <chr> "Mr. Owen Harris Braund", "Mrs. John Bradley (Florence Briggs…
## $ sex      <chr> "male", "female", "female", "female", "male", "male", "male",…
## $ age      <dbl> 22, 38, 26, 35, 35, 27, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
## $ fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ died     <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1…
## $ survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
```
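The import step isn't shown above; a minimal sketch, assuming the file sits in the same `data/` folder as `sports_cars.csv` (the path is an assumption):

```r
# Hypothetical read-in of the Titanic data with readr's read_csv()
titanic <- read_csv("data/titanic.csv")
```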
---

## Exploratory Data Analysis

<img src="20-classification_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

## The linear model with multiple predictors

- Population model:

$$ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon $$

--

- Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$

--

Denote by `\(p\)` the probability of survival and consider the model below.

$$ p = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon$$

--

Can you see any problems with this approach?

---

## Linear Regression?

```r
lm_survival <- lm(survived ~ age + sex, data = titanic)
tidy(lm_survival)
```

```
## # A tibble: 3 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  0.752     0.0356      21.1   2.88e-80
## 2 age         -0.000343  0.000979    -0.350 7.26e- 1
## 3 sexmale     -0.551     0.0289     -19.1   3.50e-68
```

---

## Visualizing the Model

<img src="20-classification_files/figure-html/linear-model-viz-1.png" style="display: block; margin: auto;" />

---

## Diagnostics

<img src="20-classification_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

--

This isn't helpful! We need to develop a new tool.

---

## Preliminaries

- Denote by `\(p\)` the probability of some event
- The .vocab[odds] that the event occurs are

`$$\frac{p}{1-p}$$`

--

Odds are sometimes expressed as X : Y and read as "X to Y". They are the ratio of successes to failures, where values larger than 1 favor a success and values smaller than 1 favor a failure.

--

If `\(P(A) = 1/2\)`, the odds of `\(A\)` are `\(\color{#1689AE}{\frac{1/2}{1/2} = 1}\)`

--

If `\(P(B) = 1/3\)`, the odds of `\(B\)` are `\(\color{#1689AE}{\frac{1/3}{2/3} = 0.5}\)`

--

If `\(P(B) = 2/3\)`, the odds of `\(B\)` are `\(\color{#1689AE}{\frac{2/3}{1/3} = 2}\)`

--

An .vocab[odds ratio] is a ratio of odds.

---

## Preliminaries

- Taking the natural log of the odds yields the .vocab[logit] of `\(p\)`

`$$\text{logit}(p) = \text{log}\left(\frac{p}{1-p}\right)$$`

--

The logit takes a probability `\(p\)` between 0 and 1 and outputs a value between `\(-\infty\)` and `\(\infty\)`.

--

The .vocab[inverse logit (expit)] takes a value between `\(-\infty\)` and `\(\infty\)` and outputs a value between 0 and 1.

`$$\text{inverse logit}(x) = \frac{e^x}{1+e^x}$$`

---

## Logistic Regression Model

.alert[
`$$\text{log}\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}$$`
]

--

Use the inverse logit to find the expression for `\(p\)`.

.alert[
`$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}}}$$`
]

We can use the logistic regression model to obtain predicted probabilities of success for a binary response variable.
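As a quick numerical check (an illustrative sketch, not from the original slides), the logit and the inverse logit undo one another:

```r
p <- 0.75
log_odds <- log(p / (1 - p))          # logit(p); base R's qlogis(p) is equivalent
exp(log_odds) / (1 + exp(log_odds))   # inverse logit; returns 0.75 (same as plogis(log_odds))
```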
---

## Logistic Regression Model

We fit the model with the `glm()` function, specifying `family = "binomial"` for a binary response.

```r
logit_mod <- glm(survived ~ sex + age, data = titanic, family = "binomial")
tidy(logit_mod)
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  1.11      0.208       5.34  9.05e- 8
## 2 sexmale     -2.50      0.168     -14.9   3.24e-50
## 3 age         -0.00206   0.00586    -0.351 7.25e- 1
```

---

## Logistic Regression Model

We can then use `augment()` to find the predicted log-odds.

```r
pred_log_odds <- augment(logit_mod)
```

---

## The Estimated Logistic Regression Model

.midi[

```r
tidy(logit_mod)
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  1.11      0.208       5.34  9.05e- 8
## 2 sexmale     -2.50      0.168     -14.9   3.24e-50
## 3 age         -0.00206   0.00586    -0.351 7.25e- 1
```
]

`$$\text{log}\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.11 - 2.50~sex - 0.00206~age$$`

`$$\hat{p} = \frac{e^{1.11 - 2.50~sex - 0.00206~age}}{1+e^{1.11 - 2.50~sex - 0.00206~age}}$$`

---

## Interpreting coefficients

.alert[
`$$\text{log}\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.11 - 2.50~sex - 0.00206~age$$`
]

<br>

--

Holding sex constant, for every additional year of age, we expect the **log-odds** of survival to decrease by approximately 0.002.

<br>

--

Holding age constant, we expect males to have a **log-odds** of survival that is 2.50 lower than that of females.

---

## Interpreting coefficients

.alert[
`$$\frac{\hat{p}}{1-\hat{p}} = e^{1.11 - 2.50~sex - 0.00206~age}$$`
]

<br>

--

`$$e^{x+1} = e^{x} \times e^{1} \Rightarrow e^{1.11 - 2.50~sex - 0.00206~(age + 1)} = e^{1.11 - 2.50~sex - 0.00206~age} \times e^{-0.00206}$$`

--

Holding sex constant, for every one-year increase in age, the **odds** of survival are expected to multiply by a factor of `\(e^{-0.00206} = 0.998\)`.

<br>

--

Holding age constant, the **odds** of survival for males are `\(e^{-2.50} = 0.082\)` times the odds of survival for females.

---

## Classification

- Logistic regression allows us to obtain predicted probabilities of success for a binary variable.
- By imposing a threshold (for example, predicting survival if the probability is greater than `\(0.50\)`), we can create a classifier.

--

Observed vs. predicted survival:

```
## # A tibble: 2 x 3
##   survived  Died Survived
##      <dbl> <int>    <int>
## 1        0   464       81
## 2        1   109      233
```

---

## Strengths and Weaknesses

.vocab[Weaknesses]

- Logistic regression has assumptions: independence and linearity in the log-odds (some other methods require fewer assumptions)
- If the predictors are correlated, coefficient estimates may be unreliable

--

.vocab[Strengths]

- Straightforward interpretation of coefficients
- Handles numerical and categorical predictors
- Can quantify uncertainty around a prediction
- Can extend to more than 2 categories (multinomial regression)
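---

## Classification: code sketch

The observed vs. predicted table on the Classification slide can be built along these lines (a sketch: the 0.50 cutoff matches the threshold mentioned there, and the `Died`/`Survived` labels are assumptions):

```r
# Predicted survival probabilities from the fitted logistic model,
# then a 0.50 classification cutoff
preds <- augment(logit_mod, type.predict = "response") %>%
  mutate(pred_class = if_else(.fitted > 0.5, "Survived", "Died"))

# Cross-tabulate observed outcome against predicted class
preds %>%
  count(survived, pred_class) %>%
  pivot_wider(names_from = pred_class, values_from = n)
```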