class: center, middle, inverse, title-slide

# Multiple linear regression

### Becky Tang

### 06.30.2021

---

class: center, middle

## Review

---

## Vocabulary

--

- .vocab[Response variable]: Variable whose behavior or variation you are trying to understand.

--

- .vocab[Explanatory variables]: Other variables that you want to use to explain the variation in the response.

--

- .vocab[Predicted value]: Output of the **model function**

    - The model function gives the typical value of the response variable *conditional* on the explanatory variables.

--

- .vocab[Residuals]: Show how far each case is from its predicted value

    - **Residual = Observed value - Predicted value**

---

## The linear model with a single predictor

- We're interested in `\(\beta_0\)` (the population parameter for the intercept) and `\(\beta_1\)` (the population parameter for the slope) in the following model:

$$ \hat{y} = \beta_0 + \beta_1~x $$

--

- Unfortunately, we can't observe these population parameters directly

- So we use sample statistics to estimate them:

$$ \hat{y} = b_0 + b_1~x $$

---

## Least squares regression

The regression line minimizes the sum of squared residuals.

- .vocab[Residuals]: `\(e_i = y_i - \hat{y}_i\)`

- The regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`.

- Equivalently, it minimizes `\(\sum_{i = 1}^n [y_i - (b_0 + b_1~x_i)]^2\)`.

---

## Data and Packages

```r
library(tidyverse)
library(broom)
```

```r
car_prices <- read_csv("data/car_prices.csv")
```

---

## Single numerical predictor

```r
m_pr_mi <- lm(price ~ mileage, data = car_prices)
tidy(m_pr_mi)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   21.4     0.609        35.1 1.69e-53
## 2 mileage       -0.128   0.00874     -14.6 2.55e-25
```

`$$\widehat{price_{1000}} = 21.37 - 0.13~mileage_{per1000}$$`

---

## Single categorical predictor (> 2 levels)

```r
m_pr_type <- lm(price ~ type, data = car_prices)
tidy(m_pr_type)
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    14.3      0.977    14.6   3.94e-25
## 2 typeMaxima      1.19     1.38      0.858 3.93e- 1
## 3 typeMazda6     -2.77     1.38     -2.01  4.79e- 2
```

`$$\widehat{price_{1000}} = 14.3 + 1.19~typeMaxima - 2.77~typeMazda6$$`

---

## Single categorical predictor (> 2 levels)

`$$\widehat{price_{1000}} = 14.3 + 1.19~typeMaxima - 2.77~typeMazda6$$`

How does the average price of Maximas compare to the average price of Accords?

--

For Maximas, `\(typeMaxima = 1\)` and `\(typeMazda6 = 0\)`. The average price of Maximas is

`$$\widehat{price_{1000}} = 14.3 + 1.19 \times 1 - 2.77 \times 0 = 14.3 + 1.19$$`

For Accords, `\(typeMaxima = 0\)` and `\(typeMazda6 = 0\)`. The average price of Accords is

$$\widehat{price_{1000}} = 14.3 + 1.19 \times 0 - 2.77 \times 0 = 14.3$$

---

class: center, middle

## The linear model with multiple predictors

---

## The linear model with multiple predictors

- Population model:

$$ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k $$

where `\(k\)` is the number of explanatory variables.

--

- Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$

---

## Data

The data set contains prices for Porsche and Jaguar cars for sale on cars.com.
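A minimal sketch for loading these data, assuming a CSV saved as `data/sports_car_prices.csv` (the file name and folder are assumptions; adjust the path to match your project). The resulting `sports_car_prices` object is the one used in the models that follow.

```r
# Assumed file name and path; adjust to match your project
sports_car_prices <- read_csv("data/sports_car_prices.csv")
```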
.vocab[`car`]: car make (Jaguar or Porsche)

.vocab[`price`]: price in USD

.vocab[`age`]: age of the car in years

.vocab[`mileage`]: previous miles driven

---

## Price, age, and make

<img src="18-mlr_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Price vs. age and make

.question[
Does the relationship between age and price depend on the make of the car?
]

<img src="18-mlr_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Modeling with main effects

A .vocab[main effect] is the effect of a single independent variable on a dependent variable.

```r
m_main <- lm(price ~ age + car, data = sports_car_prices)
m_main %>%
  tidy() %>%
  select(term, estimate)
```

```
## # A tibble: 3 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)   44310.
## 2 age           -2487.
## 3 carPorsche    21648.
```

--

.midi[
$$ \widehat{price} = 44310 - 2487~age + 21648~carPorsche $$
]

---

.alert[
$$ \widehat{price} = 44310 - 2487~age + 21648~carPorsche $$
]

- Plug in 0 for `carPorsche` to get the linear model for Jaguars.

--

`$$\begin{align}\widehat{price} &= 44310 - 2487~age + 21648 \times 0\\ &= 44310 - 2487~age\\\end{align}$$`

--

- Plug in 1 for `carPorsche` to get the linear model for Porsches.

--

`$$\begin{align}\widehat{price} &= 44310 - 2487~age + 21648 \times 1\\ &= 65958 - 2487~age\\\end{align}$$`

---

.alert[

**Jaguar**

`$$\begin{align}\widehat{price}_{jaguar} &= 44310 - 2487~age + 21648 \times 0\\ &= 44310 - 2487~age\\\end{align}$$`

**Porsche**

`$$\begin{align}\widehat{price}_{porsche} &= 44310 - 2487~age + 21648 \times 1\\ &= 65958 - 2487~age\\\end{align}$$`

]

- Rate of change in price as the age of the car increases does not depend on the make of the car (.vocab[same slopes])

- Porsches are consistently more expensive than Jaguars (.vocab[different intercepts])

---

## Interpretation of main effects

<img src="18-mlr_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

## Main effects

```
## # A tibble: 3 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)   44310.
## 2 age           -2487.
## 3 carPorsche    21648.
```

--

- **All else held constant**, for each additional year of a car's age, the price of the car is predicted to decrease, on average, by $2,487.

--

- **All else held constant**, Porsches are predicted, on average, to have a price that is $21,648 greater than Jaguars.

--

- Jaguars that are new (age = 0) are predicted, on average, to have a price of $44,310.

---

.question[
Why is our linear regression model different from what we got from `geom_smooth(method = "lm")`?
]

<img src="18-mlr_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## What went wrong?

--

- .vocab[`car`] is the only variable in our model that affects the intercept.

--

- The model we specified assumes Jaguars and Porsches have the .vocab[same slope] and .vocab[different intercepts].

--

- What is the most appropriate model for these data?

    - same slope and intercept for Jaguars and Porsches?
    - same slope and different intercept for Jaguars and Porsches?
    - different slope and different intercept for Jaguars and Porsches?

---

## Interacting explanatory variables

- Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.

- This means that the relationship between an explanatory variable and the response depends on another explanatory variable.

- We can accomplish this by adding an .vocab[interaction variable].
This is the product of two explanatory variables.

---

## Price vs. age and car interacting

.midi[

```r
ggplot(data = sports_car_prices,
       mapping = aes(y = price, x = age, color = car)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Age (years)", y = "Price (USD)", color = "Car Make")
```

<img src="18-mlr_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]

---

## Modeling with interaction effects

```r
m_int <- lm(price ~ age + car + age * car, data = sports_car_prices)
m_int %>%
  tidy() %>%
  select(term, estimate)
```

```
## # A tibble: 4 x 2
##   term           estimate
##   <chr>             <dbl>
## 1 (Intercept)      56988.
## 2 age              -5040.
## 3 carPorsche        6387.
## 4 age:carPorsche    2969.
```

`$$\widehat{price} = 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche$$`

---

## Interpretation of interaction effects

.alert[
`$$\widehat{price} = 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche$$`
]

--

- Plug in 0 for `carPorsche` to get the linear model for Jaguars.

`$$\begin{align}\widehat{price} &= 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche \\ &= 56988 - 5040~age + 6387 \times 0 + 2969~age \times 0\\ &\color{#00b3b3}{= 56988 - 5040~age}\end{align}$$`

--

- Plug in 1 for `carPorsche` to get the linear model for Porsches.

`$$\begin{align}\widehat{price} &= 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche \\ &= 56988 - 5040~age + 6387 \times 1 + 2969~age \times 1\\ &\color{#00b3b3}{= 63375 - 2071~age}\end{align}$$`

---

## Interpretation of interaction effects

.alert[

**Jaguar**

`$$\widehat{price} = 56988 - 5040~age$$`

**Porsche**

`$$\widehat{price} = 63375 - 2071~age$$`

]

- Rate of change in price as the age of the car increases depends on the make of the car (.vocab[different slopes]).

- Porsches are consistently more expensive than Jaguars (.vocab[different intercepts]).

---

<img src="18-mlr_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

`$$\widehat{price} = 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche$$`

---

## Continuous by continuous interactions

- Interpretation becomes trickier

- Slopes are conditional on the values of other explanatory variables

--

## Third order interactions

- Can you? Yes

- Should you? Probably not if you want to interpret these interactions in the context of the data.

---

class: center, middle

## Assessing quality of model fit

---

## Assessing the quality of the fit

- The strength of the fit of a linear model is commonly evaluated using `\(R^2\)`.

- It tells us what percentage of the variability in the response variable is explained by the model. The remainder of the variability is unexplained.

- `\(R^2\)` is sometimes called the **coefficient of determination**.

.question[
What does "explained variability in the response variable" mean?
]

---

## Obtaining `\(R^2\)` in R

`price` vs. `age` and `car`

```r
glance(m_main)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.607         0.593 11848.      44.0 2.73e-12     2  -646. 1301. 1309.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(m_main)$r.squared
```

```
## [1] 0.6071375
```

--

About 60.7% of the variability in the price of used cars can be explained by age and make.
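---

## `\(R^2\)` by hand

To see what "explained variability" means here, we can recompute `\(R^2\)` for `m_main` directly from its residuals. Below is a minimal sketch using `augment()` from **broom**; the object name `m_main_aug` is just a placeholder.

```r
m_main_aug <- augment(m_main) # adds .fitted and .resid columns to the model data

# R^2 = 1 - (variability in residuals / variability in response)
1 - var(m_main_aug$.resid) / var(m_main_aug$price)
```

This value should agree with `glance(m_main)$r.squared` above.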
---

## `\(R^2\)`

```r
glance(m_main)$r.squared # model with main effects
```

```
## [1] 0.6071375
```

```r
glance(m_int)$r.squared # model with main effects + interactions
```

```
## [1] 0.6677881
```

--

- The model with interactions has a higher `\(R^2\)`.

--

- Using `\(R^2\)` for model selection in models with multiple explanatory variables is **<u>not</u>** a good idea, as `\(R^2\)` increases when **any** variable is added to the model.

---

## `\(R^2\)` - first principles

- We can write explained variation using the following ratio of sums of squares:

$$ R^2 = 1 - \left( \frac{ \text{variability in residuals}}{ \text{variability in response} } \right) $$

.question[
Why does this expression make sense?
]

- But remember, adding **any** explanatory variable will always increase `\(R^2\)`.

---

## Adjusted `\(R^2\)`

.alert[
$$ R^2_{adj} = 1 - \left( \frac{ \text{variability in residuals}}{ \text{variability in response} } \times \frac{n - 1}{n - k - 1} \right)$$

where `\(n\)` is the number of observations and `\(k\)` is the number of predictors in the model.
]

--

- Adjusted `\(R^2\)` doesn't increase if the new variable does not provide any new information or is completely unrelated to the response.

--

- This makes adjusted `\(R^2\)` a preferable metric for model selection in multiple regression models.

---

## Comparing models

.pull-left[

```r
glance(m_main)$r.squared
```

```
## [1] 0.6071375
```

```r
glance(m_int)$r.squared
```

```
## [1] 0.6677881
```
]

.pull-right[

```r
glance(m_main)$adj.r.squared
```

```
## [1] 0.5933529
```

```r
glance(m_int)$adj.r.squared
```

```
## [1] 0.649991
```
]

---

## In pursuit of Occam's Razor

- Occam's Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.

- Model selection follows this principle.

- We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power.

- In other words, we prefer the simplest best model, i.e. the most .vocab[parsimonious] model.
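---

## Adjusted `\(R^2\)` by hand

As a closing check on the adjusted `\(R^2\)` formula, here is a minimal sketch that applies it directly to `m_main`. Here `k = 2` because `m_main` has two predictor terms (`age` and the `carPorsche` indicator); the object names are placeholders.

```r
m_main_aug <- augment(m_main) # broom: adds .resid alongside the original columns

n <- nrow(m_main_aug)         # number of observations
k <- 2                        # number of predictors in m_main: age, carPorsche

1 - (var(m_main_aug$.resid) / var(m_main_aug$price)) * (n - 1) / (n - k - 1)
```

The result should match `glance(m_main)$adj.r.squared`.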