class: center, middle, inverse, title-slide

# Multiple linear regression
## Inference + conditions
### Becky Tang
### 07.01.2021

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: center, middle

## Review

---

## Vocabulary

- .vocab[Response variable]: Variable whose behavior or variation you are trying to understand.

- .vocab[Explanatory variables]: Other variables that you want to use to explain the variation in the response.

- .vocab[Predicted value]: Output of the **model function**.

- .vocab[Residuals]: Show how far each case is from its predicted value:
**Residual = Observed value - Predicted value**

---

## The linear model with multiple predictors

- Population model:

$$ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon $$

--

- Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$

---

## AE 18: Review

```r
paintings <- read_csv("data/paris_paintings.csv",
                      na = c("n/a", "", "NA"))
```

.question[Fit a main effects model using width and whether the painting has landscape elements to predict the height.]

```r
m_main <- lm(Height_in ~ Width_in + landsALL, data = paintings)
tidy(m_main)
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    5.62    0.269        20.9 1.24e-90
## 2 Width_in       0.777   0.00909      85.4 0
## 3 landsALL      -5.02    0.292       -17.2 3.70e-63
```

`$$\widehat{height}_{in} = 5.62 + 0.78 ~ width_{in} - 5.02 ~ landsALL$$`

---

## AE 18: Review

.alert[
`$$\widehat{height}_{in} = 5.62 + 0.78 ~ width_{in} - 5.02 ~ landsALL$$`
]

- Intercept: Paintings that are 0 inches wide and do not have landscape elements are expected to be 5.62 inches tall, on average.

- Coefficient for width: Holding all else constant, for each additional inch of width, we expect the height of the painting to increase, on average, by 0.78 inches.

- Coefficient for landsALL: Holding all else constant, paintings that have landscape elements are expected to be, on average, 5.02 inches shorter than paintings without landscape elements.

---

## AE 18: Review

.alert[
`$$\widehat{height}_{in} = 5.62 + 0.78 ~ width_{in} - 5.02 ~ landsALL$$`
]

Equation for paintings with landscape elements:

`$$\begin{align}\widehat{height}_{in} &= 5.62 + 0.78 ~ width_{in} - 5.02 \times 1 \\ &= 0.60 + 0.78 ~ width_{in}\\ \end{align}$$`

Equation for paintings without landscape elements:

`$$\begin{align}\widehat{height}_{in} &= 5.62 + 0.78 ~ width_{in} - 5.02 \times 0 \\ &= 5.62 + 0.78 ~ width_{in} \end{align}$$`
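---

## AE 18: Review

We can sanity-check the two sub-equations with `predict()`. A minimal sketch (the 20-inch width is an arbitrary illustrative value, not from the original slides):

```r
# Predicted heights for a hypothetical 20-inch-wide painting,
# without (landsALL = 0) and with (landsALL = 1) landscape elements
predict(m_main,
        newdata = tibble(Width_in = c(20, 20), landsALL = c(0, 1)))
```

The two predictions should differ by exactly the `landsALL` coefficient, about 5.02 inches.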
---

## AE 18: Review

.question[Fit a linear model using width, whether the painting has landscape elements, and the interaction between the two variables to predict the height.]

```r
m_int <- lm(Height_in ~ Width_in + landsALL + Width_in * landsALL,
            data = paintings)
tidy(m_int) %>%
  select(term, estimate)
```

```
## # A tibble: 4 x 2
##   term              estimate
##   <chr>                <dbl>
## 1 (Intercept)         5.15
## 2 Width_in            0.798
## 3 landsALL           -3.47
## 4 Width_in:landsALL  -0.0725
```

`$$\widehat{height}_{in} = 5.15 + 0.80 ~ width_{in} - 3.47 ~ landsALL - 0.07 ~ width_{in} \times landsALL$$`

---

## AE 18: Review

.alert[
`$$\widehat{height}_{in} = 5.15 + 0.80 ~ width_{in} - 3.47 ~ landsALL - 0.07 ~ width_{in} \times landsALL$$`
]

Equation for paintings with landscape elements:

`$$\begin{align} \widehat{height}_{in} &= 5.15 + 0.80 ~ width_{in} - 3.47 \times 1 - 0.07 ~ width_{in} \times 1 \\ &= 1.68 + 0.73 ~ width_{in}\\ \end{align}$$`

Equation for paintings without landscape elements:

`$$\begin{align} \widehat{height}_{in} &= 5.15 + 0.80 ~ width_{in} - 3.47 \times 0 - 0.07 ~ width_{in} \times 0 \\ &= 5.15 + 0.80 ~ width_{in} \end{align}$$`

---

## AE 18: Review

```r
glance(m_main)$adj.r.squared
```

```
## [1] 0.7100106
```

```r
glance(m_int)$adj.r.squared
```

```
## [1] 0.7111417
```

The adjusted `\(R^2\)` values are nearly identical, so the interaction term adds little explanatory power here.

---

class: center, middle

## Inference in linear regression

---

## Data and Packages

```r
library(tidyverse)
library(broom)
```

Recall that the file `sportscars.csv` contains prices for Porsche and Jaguar cars for sale on cars.com.

- `car`: car make (Jaguar or Porsche)
- `price`: price in USD
- `age`: age of the car in years
- `mileage`: previous miles driven

---

## Multiple Linear Regression

```r
m_int <- lm(price ~ age + car + age * car, data = sports_car_prices)
m_int %>%
  tidy() %>%
  select(term, estimate)
```

```
## # A tibble: 4 x 2
##   term           estimate
##   <chr>             <dbl>
## 1 (Intercept)      56988.
## 2 age              -5040.
## 3 carPorsche        6387.
## 4 age:carPorsche    2969.
```

`$$\widehat{price} = 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche$$`

---

class: center, middle, inverse

## CLT-based Inference in Regression

---

## The linear model with multiple predictors

Population model:

$$ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon $$

Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$

As with other sample statistics (mean, proportion, etc.), there is variability in our estimates of the slope and intercept.

--

.question[
- Do we have convincing evidence that the true linear model has a non-zero slope?
- What is a confidence interval for the population regression coefficient?
]

---

## Mileage vs. age

We will consider a simple linear regression model predicting mileage using age.

.midi[
```r
m_age_miles <- lm(mileage ~ age, data = sports_car_prices)
```
]

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

class: center, middle

# A confidence interval for `\(\beta_1\)`

---

## Confidence interval

.alert[
`$$point~estimate \pm critical~value \times SE$$`
]

--

.alert[
`$$b_1 \pm t^*_{n-2} \times SE_{b_1}$$`
where `\(t^*_{n-2}\)` is calculated using a `\(t\)` distribution with `\(n-2\)` degrees of freedom.
]

---

## Tidy confidence interval

```r
tidy(m_age_miles, conf.int = TRUE, conf.level = 0.95)
```

```
## # A tibble: 2 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   13967.     2876.      4.86 9.40e- 6    8211.    19723.
## 2 age            3837.      403.      9.52 1.86e-13    3030.     4643.
```

```r
est <- tidy(m_age_miles) %>%
  filter(term == "age") %>%
  pull(estimate)
se <- tidy(m_age_miles) %>%
  filter(term == "age") %>%
  pull(std.error)
```
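Base R's `confint()` returns the same `\(t\)`-based intervals, which makes for a quick cross-check (not part of the original code):

```r
# Same t-based 95% intervals as tidy(..., conf.int = TRUE)
confint(m_age_miles, level = 0.95)
```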
---

## Calculating the 95% CI manually

A 95% confidence interval for `\(\beta_1\)` can be calculated as follows:

--

```r
(df <- nrow(sports_car_prices) - 2)
```

```
## [1] 58
```

--

```r
(tstar <- qt(0.975, df))
```

```
## [1] 2.001717
```

--

```r
(ci <- est + c(-1, 1) * tstar * se)
```

```
## [1] 3029.838 4643.199
```

---

## Interpretation

```r
tidy(m_age_miles, conf.int = TRUE, conf.level = 0.95) %>%
  filter(term == "age") %>%
  select(conf.low, conf.high)
```

```
## # A tibble: 1 x 2
##   conf.low conf.high
##      <dbl>     <dbl>
## 1    3030.     4643.
```

.alert[We are 95% confident that for every additional year of a car's age, the mileage is expected to increase, on average, by between about 3030 and 4643 miles.]

---

class: center, middle

# A hypothesis test for `\(\beta_1\)`

---

## Hypothesis testing for `\(\beta_1\)`

Is there convincing evidence, based on our sample data, that age is associated with mileage?

We can set this up as a hypothesis test, with the hypotheses below.

--

.alert[
`\(H_0: \beta_1 = 0\)`. The slope is 0. There is no relationship between mileage and age.

`\(H_a: \beta_1 \ne 0\)`. The slope is not 0. There is a relationship between mileage and age.
]

--

We only reject `\(H_0\)` in favor of `\(H_a\)` if the data provide strong evidence that the true slope parameter is different from zero.

---

## Hypothesis testing for `\(\beta_1\)`

```r
tidy(m_age_miles)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   13967.     2876.      4.86 9.40e- 6
## 2 age            3837.      403.      9.52 1.86e-13
```

--

`$$T = \frac{b_1 - 0}{SE_{b_1}} \sim t_{n-2}$$`

--

The p-value in the output is associated with the two-sided hypothesis test `\(H_a: \beta_1 \neq 0\)`.

---

## Hypothesis testing for `\(\beta_1\)`

```r
tidy(m_age_miles)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   13967.     2876.      4.86 9.40e- 6
## 2 age            3837.      403.      9.52 1.86e-13
```

.vocab[The p-value is very small, so we reject] `\(\color{#1689AE}{H_0}\)`. .vocab[The data provide sufficient evidence that the coefficient of age is not equal to 0 and that there is a linear relationship between the mileage and age of a car.]

---

## Final Thoughts

We used a CLT-based approach to construct confidence intervals and perform hypothesis tests. Note that you can also use simulation-based methods to do inference with `infer`; a sketch follows on the next slide. [Click here](https://infer.netlify.app/articles/observed_stat_examples.html#two-numerical-vars---slr) for more examples.
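---

## Final Thoughts

For example, a permutation test for the slope might look like the sketch below (an illustration, not evaluated in these slides):

.midi[
```r
library(infer)

# Observed slope from the sample
obs_slope <- sports_car_prices %>%
  specify(mileage ~ age) %>%
  calculate(stat = "slope")

# Null distribution of slopes under "age and mileage are unrelated"
null_dist <- sports_car_prices %>%
  specify(mileage ~ age) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope")

# Two-sided p-value comparing the observed slope to the null distribution
null_dist %>%
  get_p_value(obs_stat = obs_slope, direction = "two-sided")
```
]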
---

class: center, middle

## Conditions for Inference in Regression

---

## Conditions

- **Linearity**: The relationship between the response and the predictor(s) is linear

- **Independence**: The residuals are independent

- **Normality**: The residuals are nearly normally distributed

- **Equal Variance**: The residuals have constant variance

---

## Conditions

- .vocab[L]**inearity**: The relationship between the response and the predictor(s) is linear

- .vocab[I]**ndependence**: The residuals are independent

- .vocab[N]**ormality**: The residuals are nearly normally distributed

- .vocab[E]**qual Variance**: The residuals have constant variance

--

<br>

*For multiple regression, the predictors shouldn't be too correlated with each other.*

---

## Examples

<img src="img/19/violated_conds.png" width="50%" style="display: block; margin: auto;" />

Top row: scatterplot of `\(x\)` vs. `\(y\)`

Bottom row: residual plot

---

## `augment` data with model results

- `.fitted`: Predicted value of the response variable
- `.resid`: Residual

.midi[
```r
m_age_miles_aug <- augment(m_age_miles)
m_age_miles_aug %>%
  slice(1:3)
```

```
## # A tibble: 3 x 8
##   mileage   age .fitted .resid   .hat .sigma  .cooksd .std.resid
##     <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
## 1   21500     3  25477. -3977. 0.0223 13981. 0.000959     -0.290
## 2   43000     3  25477. 17523. 0.0223 13793. 0.0186        1.28
## 3   19900     2  21640. -1740. 0.0275 13989. 0.000229     -0.127
```
]

--

We will use the fitted values and residuals to check the conditions by constructing .vocab[diagnostic plots].

---

## Residuals vs fitted plot

Use to check .vocab[**L**inearity] and .vocab[**E**qual variance.]

.midi[
```r
ggplot(m_age_miles_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
  labs(x = "Predicted Mileage", y = "Residuals")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]

---

## Residuals in order of collection

Use to check .vocab[**I**ndependence]

.midi[
```r
ggplot(data = m_age_miles_aug,
       aes(x = 1:nrow(sports_car_prices), y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
  labs(x = "Index", y = "Residual")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]

---

## Histogram of residuals

Use to check .vocab[**N**ormality]

.midi[
```r
ggplot(m_age_miles_aug, mapping = aes(x = .resid)) +
  geom_histogram(bins = 15) +
  labs(x = "Residuals")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
]

---

class: center, middle

## Multiple linear regression: inference

---

## Hypothesis testing

For null hypotheses, we now condition on the other variables that remain in the model. For example, let's say I want to fit the linear model

`$$price = \beta_0 + \beta_{1} ~ mileage + \beta_{2} ~ age + \epsilon$$`

--

We can write two different null hypotheses:

1. `\(H_{0}: \beta_{1} = 0\)`, given that `age` is included in the model
2. `\(H_{0}: \beta_{2} = 0\)`, given that `mileage` is included in the model

---

## Hypothesis tests

```r
m_mult <- lm(price ~ mileage + age, data = sports_car_prices)
tidy(m_mult)
```

```
## # A tibble: 3 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 62950.     3176.       19.8   3.03e-27
## 2 mileage        -0.695     0.122    -5.68  4.75e- 7
## 3 age           516.      601.        0.859 3.94e- 1
```

The p-value for testing `\(H_{0}: \beta_{1} = 0\)`, given that `age` is included in the model, is almost 0.

The p-value for testing `\(H_{0}: \beta_{2} = 0\)`, given that `mileage` is included in the model, is 0.39. These p-values can be reproduced by hand, as sketched on the next slide.
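---

## Calculating the p-value manually

Mirroring the manual CI calculation, the test for `age` can be reproduced by hand. A sketch (the degrees of freedom are now `\(n - 3\)` because this model estimates three coefficients):

```r
est_age <- tidy(m_mult) %>% filter(term == "age") %>% pull(estimate)
se_age  <- tidy(m_mult) %>% filter(term == "age") %>% pull(std.error)

# t statistic: should match the statistic column in the output above
(tstat <- est_age / se_age)

# Two-sided p-value from a t distribution with n - 3 degrees of freedom
2 * pt(abs(tstat), df = nrow(sports_car_prices) - 3, lower.tail = FALSE)
```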
---

## Assessing conditions

```r
m_mult_aug <- augment(m_mult)
```

Because we have multiple predictors, we also need to check that the explanatory variables are not too correlated with each other.

.vocab[Multicollinearity] occurs when the predictor variables are highly correlated with one another. Under multicollinearity, the coefficients in a multiple regression model can be difficult to interpret.

```r
cor(sports_car_prices %>% select(mileage, age))
```

```
##           mileage       age
## mileage 1.0000000 0.7808789
## age     0.7808789 1.0000000
```

---

## Assessing conditions

```r
corrplot::corrplot(sports_car_prices %>%
                     dplyr::select(-car) %>%
                     cor(),
                   type = "lower")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---

## Residuals vs fitted plot

Check .vocab[**L**inearity] and .vocab[**E**qual variance.]

.midi[
```r
ggplot(m_mult_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
  labs(x = "Predicted Price", y = "Residuals")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />
]

---

## Residuals in order of collection

Check .vocab[**I**ndependence]

.midi[
```r
ggplot(data = m_mult_aug,
       aes(x = 1:nrow(sports_car_prices), y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
  labs(x = "Index", y = "Residual")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

---

## Histogram of residuals

Use to check .vocab[**N**ormality]

.midi[
```r
ggplot(m_mult_aug, mapping = aes(x = .resid)) +
  geom_histogram(bins = 15) +
  labs(x = "Residuals")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]
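---

## Assessing conditions

Beyond the correlation matrix, a common numeric summary of multicollinearity is the variance inflation factor. This check is not in the original code; a sketch, assuming the `car` package is installed:

```r
# VIF for each predictor in m_mult; a common rule of thumb
# flags values above roughly 5 or 10 as problematic
car::vif(m_mult)
```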