class: center, middle, inverse, title-slide

# Logistic Regression

### Becky Tang

### 07.07.2021

---

<!-- layout: true
<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>
-->

---

class: middle center

## Model Selection

---

## Multiple linear regression

Let's say that we have 3 explanatory variables that we can use to predict the response variable `\(y\)`. Should we use all 3? What about interactions?

`$$\begin{align}
&\hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_2 x_{2}\\
&\hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_2 x_{2} + \beta_{3} x_{3}\\
&\hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_2 x_{2} + \beta_{3} x_{1}x_{2}\\
& \quad \vdots
\end{align}$$`

--

Use adjusted `\(R^2\)` to help us determine the best-fitting linear model.

---

## Backwards elimination

- Start with the full model (including all candidate explanatory variables and all candidate interactions)
- Remove one variable at a time, and select the model with the highest adjusted `\(R^2\)`
- Continue until adjusted `\(R^2\)` does not increase

---

## Forward selection

- Start with the empty model (intercept only)
- Add one variable (or interaction effect) at a time, and select the model with the highest adjusted `\(R^2\)`
- Continue until adjusted `\(R^2\)` does not increase

--

Note: adjusted `\(R^2\)` is just one criterion; other metrics can be used as well

---

## Backwards elimination

```r
sports_car_prices <- read_csv("data/sports_cars.csv")
```

```r
m_full <- lm(mileage ~ price + age + car + price*car + age*car,
             data = sports_car_prices)
```

```r
round(glance(m_full)$adj.r.squared, 2)
```

```
## [1] 0.77
```

--

Current model: `\(\widehat{mileage} = \beta_{0} + \beta_{1}~ price + \beta_{2} ~ age + \beta_{3} ~ carPorsche +\beta_{4} ~price*carPorsche + \beta_{5} ~age*carPorsche\)`

Current adjusted `\(R^2\)`: 0.77

---

## Backwards elimination

Current model: `\(\widehat{mileage} = \beta_{0} + \beta_{1}~ price + \beta_{2} ~ age + \beta_{3} ~ carPorsche +\beta_{4} ~price*carPorsche + \beta_{5} ~age*carPorsche\)`

Current adjusted `\(R^2\)`: 0.77

```r
m_cand1 <- lm(mileage ~ price + age + car + price*car, data = sports_car_prices)
m_cand2 <- lm(mileage ~ price + age + car + age*car, data = sports_car_prices)
round(glance(m_cand1)$adj.r.squared, 2)
```

```
## [1] 0.76
```

```r
round(glance(m_cand2)$adj.r.squared, 2)
```

```
## [1] 0.77
```

--

New model: `\(\widehat{mileage} = \beta_{0} + \beta_{1}~ price + \beta_{2} ~ age + \beta_{3} ~ carPorsche + \beta_{4} ~age*carPorsche\)`

New adjusted `\(R^2\)`: 0.77

---

## Backwards elimination

Current model: `\(\widehat{mileage} = \beta_{0} + \beta_{1}~ price + \beta_{2} ~ age + \beta_{3} ~ carPorsche + \beta_{4} ~age*carPorsche\)`

Current adjusted `\(R^2\)`: 0.77

```r
m_cand1 <- lm(mileage ~ price + age + car, data = sports_car_prices)
m_cand2 <- lm(mileage ~ age + car + age*car, data = sports_car_prices)
round(glance(m_cand1)$adj.r.squared, 2)
```

```
## [1] 0.77
```

```r
round(glance(m_cand2)$adj.r.squared, 2)
```

```
## [1] 0.67
```

--

New model: `\(\widehat{mileage} = \beta_{0} + \beta_{1}~ price + \beta_{2} ~ age + \beta_{3} ~ carPorsche\)`

New adjusted `\(R^2\)`: 0.77

---

## Backwards elimination

Current model: `\(\widehat{mileage} = \beta_{0} + \beta_{1}~ price + \beta_{2} ~ age + \beta_{3} ~ carPorsche\)`

Current adjusted `\(R^2\)`: 0.77

```r
m_cand1 <- lm(mileage ~ price + age, data = sports_car_prices)
m_cand2 <- lm(mileage ~ age + car, data = sports_car_prices)
m_cand3 <- lm(mileage ~ price + car, data = sports_car_prices)
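# Optional sketch (not from the original slides): pull all three adjusted
# R^2 values in one call instead of running glance() on each model separately
sapply(list(m_cand1, m_cand2, m_cand3),
       function(m) round(glance(m)$adj.r.squared, 2))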
round(glance(m_cand1)$adj.r.squared, 2)
```

```
## [1] 0.74
```

```r
round(glance(m_cand2)$adj.r.squared, 2)
```

```
## [1] 0.61
```

```r
round(glance(m_cand3)$adj.r.squared, 2)
```

```
## [1] 0.67
```

---

## Backwards elimination

Removing any remaining variable decreases adjusted `\(R^2\)`, so we stop: we have our final model.

```r
m_final <- lm(mileage ~ price + age + car, data = sports_car_prices)
tidy(m_final)
```

```
## # A tibble: 4 x 5
##   term          estimate std.error statistic  p.value
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 48960.      5822.         8.41 1.64e-11
## 2 price          -0.735      0.119     -6.18 7.60e- 8
## 3 age          2100.        430.        4.89 8.94e- 6
## 4 carPorsche  10035.       3781.        2.65 1.03e- 2
```

`\(\widehat{mileage} = 48960.15 - 0.73 ~ price + 2100.21 ~ age + 10034.86 ~ carPorsche\)`

---

## `step()` function

- Automate the process of model selection using the `step()` function.
- Note: the criterion `step()` uses to compare models is the Akaike information criterion (AIC)

```r
m_final <- step(m_full, direction = "backward")
```

---

## `step()` function

```
## Start:  AIC=1116.75
## mileage ~ price + age + car + price * car + age * car
## 
##             Df Sum of Sq        RSS    AIC
## - price:car  1 147257329 6097924111 1116.2
## <none>                   5950666782 1116.8
## - age:car    1 376369525 6327036307 1118.4
## 
## Step:  AIC=1116.21
## mileage ~ price + age + car + age:car
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                   6097924111 1116.2
## - age:car  1  229960767  6327884878 1116.4
## - price    1 2970101807  9068025918 1138.0
```

---

## `step()` function

```r
tidy(m_final)
```

```
## # A tibble: 5 x 5
##   term            estimate std.error statistic    p.value
##   <chr>              <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)    39802.     8585.         4.64 0.0000223 
## 2 price             -0.663     0.128     -5.18 0.00000330
## 3 age             3478.     1047.         3.32 0.00160   
## 4 carPorsche     15627.     5395.         2.90 0.00541   
## 5 age:carPorsche -1393.      967.        -1.44 0.155     
```

---

class: middle center

## Logistic Regression

---

## Introduction

Multiple regression allows us to relate a numerical response variable to one or more numerical or categorical predictors. We can use multiple regression models to understand relationships, assess differences, and make predictions.

But what about a situation where the response of interest is categorical and binary?

--

- spam or not spam
- malignant or benign tumor
- survived or died
- admitted or not admitted

---

## Titanic

On April 15, 1912, the famous ocean liner *Titanic* sank in the North Atlantic after striking an iceberg on its maiden voyage. The dataset `titanic.csv` contains the survival status and other attributes of individuals aboard the *Titanic*.

- `survived`: survival status (1 = survived, 0 = died)
- `pclass`: passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
- `name`: name of individual
- `sex`: sex (male or female)
- `age`: age in years
- `fare`: passenger fare in British pounds

We are interested in investigating the variables that contribute to passenger survival. Do women and children really come first?

---

## Data and Packages

```r
library(tidyverse)
library(broom)
```

```r
glimpse(titanic)
```

```
## Rows: 887
## Columns: 7
## $ pclass   <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2…
## $ name     <chr> "Mr. Owen Harris Braund", "Mrs. John Bradley (Florence Briggs…
## $ sex      <chr> "male", "female", "female", "female", "male", "male", "male",…
## $ age      <dbl> 22, 38, 26, 35, 35, 27, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,…
## $ fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21…
## $ died     <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1…
## $ survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
```
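The import step isn't shown above; a minimal sketch, assuming the file sits in the same `data/` folder as `sports_cars.csv` (the path is an assumption):

```r
# Hypothetical read-in of the Titanic data with readr's read_csv()
titanic <- read_csv("data/titanic.csv")
```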
---

## Exploratory Data Analysis

<img src="20-classification_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

## The linear model with multiple predictors

- Population model:

$$ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon $$

--

- Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$

--

Denote by `\(p\)` the probability of survival and consider the model below.

$$ p = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon$$

--

Can you see any problems with this approach?

---

## Linear Regression?

```r
lm_survival <- lm(survived ~ age + sex, data = titanic)
tidy(lm_survival)
```

```
## # A tibble: 3 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  0.752     0.0356      21.1   2.88e-80
## 2 age         -0.000343  0.000979    -0.350 7.26e- 1
## 3 sexmale     -0.551     0.0289     -19.1   3.50e-68
```

---

## Visualizing the Model

<img src="20-classification_files/figure-html/linear-model-viz-1.png" style="display: block; margin: auto;" />

---

## Diagnostics

<img src="20-classification_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

--

This isn't helpful! We need to develop a new tool.

---

## Preliminaries

- Denote by `\(p\)` the probability of some event
- The .vocab[odds] that the event occurs are

`$$\frac{p}{1-p}$$`

--

Odds are sometimes expressed as X : Y and read as "X to Y". They are the ratio of successes to failures, where values larger than 1 favor a success and values smaller than 1 favor a failure.

--

If `\(P(A) = 1/2\)`, the odds of `\(A\)` are `\(\color{#1689AE}{\frac{1/2}{1/2} = 1}\)`

--

If `\(P(B) = 1/3\)`, the odds of `\(B\)` are `\(\color{#1689AE}{\frac{1/3}{2/3} = 0.5}\)`

--

If `\(P(B) = 2/3\)`, the odds of `\(B\)` are `\(\color{#1689AE}{\frac{2/3}{1/3} = 2}\)`

--

An .vocab[odds ratio] is a ratio of odds.

---

## Preliminaries

- Taking the natural log of the odds yields the .vocab[logit] of `\(p\)`

`$$\text{logit}(p) = \text{log}\left(\frac{p}{1-p}\right)$$`

--

The logit takes a probability `\(p\)` between 0 and 1 and outputs a value between `\(-\infty\)` and `\(\infty\)`.

--

The .vocab[inverse logit (expit)] takes a value between `\(-\infty\)` and `\(\infty\)` and outputs a value between 0 and 1.

`$$\text{inverse logit}(x) = \frac{e^x}{1+e^x}$$`

---

## Logistic Regression Model

.alert[
`$$\text{log}\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}$$`
]

--

Use the inverse logit to find the expression for `\(p\)`.

.alert[
`$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_{k}}}$$`
]

We can use the logistic regression model to obtain predicted probabilities of success for a binary response variable.
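As a quick numerical check (an illustrative sketch, not from the original slides), the logit and the inverse logit undo one another:

```r
p <- 0.75
log_odds <- log(p / (1 - p))          # logit(p); base R's qlogis(p) is equivalent
exp(log_odds) / (1 + exp(log_odds))   # inverse logit; returns 0.75 (same as plogis(log_odds))
```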
---

## Logistic Regression Model

We fit the model with the `glm()` function, specifying `family = "binomial"` for a binary response.

```r
logit_mod <- glm(survived ~ sex + age, data = titanic, family = "binomial")
tidy(logit_mod)
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  1.11      0.208       5.34  9.05e- 8
## 2 sexmale     -2.50      0.168     -14.9   3.24e-50
## 3 age         -0.00206   0.00586    -0.351 7.25e- 1
```

---

## Logistic Regression Model

We can then use `augment()` to find the predicted log-odds.

```r
pred_log_odds <- augment(logit_mod)
```

---

## The Estimated Logistic Regression Model

.midi[

```r
tidy(logit_mod)
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  1.11      0.208       5.34  9.05e- 8
## 2 sexmale     -2.50      0.168     -14.9   3.24e-50
## 3 age         -0.00206   0.00586    -0.351 7.25e- 1
```
]

`$$\text{log}\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.11 - 2.50~sex - 0.00206~age$$`

`$$\hat{p} = \frac{e^{1.11 - 2.50~sex - 0.00206~age}}{1+e^{1.11 - 2.50~sex - 0.00206~age}}$$`

---

## Interpreting coefficients

.alert[
`$$\text{log}\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.11 - 2.50~sex - 0.00206~age$$`
]

<br>

--

Holding sex constant, for every additional year of age, we expect the **log-odds** of survival to decrease by approximately 0.002.

<br>

--

Holding age constant, we expect males to have a **log-odds** of survival that is 2.50 lower than that of females.

---

## Interpreting coefficients

.alert[
`$$\frac{\hat{p}}{1-\hat{p}} = e^{1.11 - 2.50~sex - 0.00206~age}$$`
]

<br>

--

`$$e^{x+1} = e^{x} \times e^{1} \Rightarrow e^{1.11 - 2.50~sex - 0.00206~(age + 1)} = e^{1.11 - 2.50~sex - 0.00206~age} \times e^{-0.00206}$$`

--

Holding sex constant, for every one-year increase in age, the **odds** of survival are expected to multiply by a factor of `\(e^{-0.00206} = 0.998\)`.

<br>

--

Holding age constant, the **odds** of survival for males are `\(e^{-2.50} = 0.082\)` times the odds of survival for females.

---

## Classification

- Logistic regression allows us to obtain predicted probabilities of success for a binary variable.
- By imposing a threshold (for example, predicting survival if the probability is greater than `\(0.50\)`), we can create a classifier.

--

Observed vs. predicted survival:

```
## # A tibble: 2 x 3
##   survived  Died Survived
##      <dbl> <int>    <int>
## 1        0   464       81
## 2        1   109      233
```

---

## Strengths and Weaknesses

.vocab[Weaknesses]

- Logistic regression has assumptions: independence and linearity in the log-odds (some other methods require fewer assumptions)
- If the predictors are correlated, coefficient estimates may be unreliable

--

.vocab[Strengths]

- Straightforward interpretation of coefficients
- Handles numerical and categorical predictors
- Can quantify uncertainty around a prediction
- Can extend to more than 2 categories (multinomial regression)
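---

## Classification: code sketch

The observed vs. predicted table on the Classification slide can be built along these lines (a sketch: the 0.50 cutoff matches the threshold mentioned there, and the `Died`/`Survived` labels are assumptions):

```r
# Predicted survival probabilities from the fitted logistic model,
# then a 0.50 classification cutoff
preds <- augment(logit_mod, type.predict = "response") %>%
  mutate(pred_class = if_else(.fitted > 0.5, "Survived", "Died"))

# Cross-tabulate observed outcome against predicted class
preds %>%
  count(survived, pred_class) %>%
  pivot_wider(names_from = pred_class, values_from = n)
```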