class: center, middle, inverse, title-slide

# Multiple linear regression
## Inference + conditions
### Becky Tang
### 07.01.2021

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: center, middle

## Review

---

## Vocabulary

- .vocab[Response variable]: Variable whose behavior or variation you are trying to understand.

- .vocab[Explanatory variables]: Other variables that you want to use to explain the variation in the response.

- .vocab[Predicted value]: Output of the **model function**.

- .vocab[Residuals]: Show how far each case is from its predicted value:
**Residual = Observed value - Predicted value**

---

## The linear model with multiple predictors

- Population model:

$$ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon $$

--

- Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$

---

## AE 18: Review

```r
paintings <- read_csv("data/paris_paintings.csv",
                      na = c("n/a", "", "NA"))
```

.question[Fit a main effects model using width and whether the painting has landscape elements to predict the height.]

```r
m_main <- lm(Height_in ~ Width_in + landsALL, data = paintings)
tidy(m_main)
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    5.62    0.269        20.9 1.24e-90
## 2 Width_in       0.777   0.00909      85.4 0
## 3 landsALL      -5.02    0.292       -17.2 3.70e-63
```

`$$\widehat{height}_{in} = 5.62 + 0.78 ~ width_{in} - 5.02 ~ landsALL$$`

---

## AE 18: Review

.alert[
`$$\widehat{height}_{in} = 5.62 + 0.78 ~ width_{in} - 5.02 ~ landsALL$$`
]

- Intercept: Paintings that are 0 inches wide and do not have landscape elements are expected to be 5.62 inches tall, on average.

- Coefficient for width: Holding all else constant, for each additional inch of width, we expect the height of the painting to increase, on average, by 0.78 inches.

- Coefficient for landsALL: Holding all else constant, paintings that have landscape elements are expected to be, on average, 5.02 inches shorter than paintings without landscape elements.

---

## AE 18: Review

.alert[
`$$\widehat{height}_{in} = 5.62 + 0.78 ~ width_{in} - 5.02 ~ landsALL$$`
]

Equation for paintings with landscape elements:

`$$\begin{align}\widehat{height}_{in} &= 5.62 + 0.78 ~ width_{in} - 5.02 \times 1 \\ &= 0.60 + 0.78 ~ width_{in}\\ \end{align}$$`

Equation for paintings without landscape elements:

`$$\begin{align}\widehat{height}_{in} &= 5.62 + 0.78 ~ width_{in} - 5.02 \times 0 \\ &= 5.62 + 0.78 ~ width_{in} \end{align}$$`
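---

## AE 18: Review

We can sanity-check the two sub-equations with `predict()`. A minimal sketch (the 20-inch width is an arbitrary illustrative value, not from the original slides):

```r
# Predicted heights for a hypothetical 20-inch-wide painting,
# without (landsALL = 0) and with (landsALL = 1) landscape elements
predict(m_main,
        newdata = tibble(Width_in = c(20, 20), landsALL = c(0, 1)))
```

The two predictions should differ by exactly the `landsALL` coefficient, about 5.02 inches.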
---

## AE 18: Review

.question[Fit a linear model using width, whether the painting has landscape elements, and the interaction between the two variables to predict the height.]

```r
m_int <- lm(Height_in ~ Width_in + landsALL + Width_in * landsALL,
            data = paintings)
tidy(m_int) %>%
  select(term, estimate)
```

```
## # A tibble: 4 x 2
##   term              estimate
##   <chr>                <dbl>
## 1 (Intercept)         5.15
## 2 Width_in            0.798
## 3 landsALL           -3.47
## 4 Width_in:landsALL  -0.0725
```

`$$\widehat{height}_{in} = 5.15 + 0.80 ~ width_{in} - 3.47 ~ landsALL - 0.07 ~ width_{in} \times landsALL$$`

---

## AE 18: Review

.alert[
`$$\widehat{height}_{in} = 5.15 + 0.80 ~ width_{in} - 3.47 ~ landsALL - 0.07 ~ width_{in} \times landsALL$$`
]

Equation for paintings with landscape elements:

`$$\begin{align} \widehat{height}_{in} &= 5.15 + 0.80 ~ width_{in} - 3.47 \times 1 - 0.07 ~ width_{in} \times 1 \\ &= 1.68 + 0.73 ~ width_{in}\\ \end{align}$$`

Equation for paintings without landscape elements:

`$$\begin{align} \widehat{height}_{in} &= 5.15 + 0.80 ~ width_{in} - 3.47 \times 0 - 0.07 ~ width_{in} \times 0 \\ &= 5.15 + 0.80 ~ width_{in} \end{align}$$`

---

## AE 18: Review

```r
glance(m_main)$adj.r.squared
```

```
## [1] 0.7100106
```

```r
glance(m_int)$adj.r.squared
```

```
## [1] 0.7111417
```

The adjusted `\(R^2\)` values are nearly identical, so the interaction term adds little explanatory power here.

---

class: center, middle

## Inference in linear regression

---

## Data and Packages

```r
library(tidyverse)
library(broom)
```

Recall that the file `sportscars.csv` contains prices for Porsche and Jaguar cars for sale on cars.com.

- `car`: car make (Jaguar or Porsche)
- `price`: price in USD
- `age`: age of the car in years
- `mileage`: previous miles driven

---

## Multiple Linear Regression

```r
m_int <- lm(price ~ age + car + age * car, data = sports_car_prices)
m_int %>%
  tidy() %>%
  select(term, estimate)
```

```
## # A tibble: 4 x 2
##   term           estimate
##   <chr>             <dbl>
## 1 (Intercept)      56988.
## 2 age              -5040.
## 3 carPorsche        6387.
## 4 age:carPorsche    2969.
```

`$$\widehat{price} = 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche$$`

---

class: center, middle, inverse

## CLT-based Inference in Regression

---

## The linear model with multiple predictors

Population model:

$$ y = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k + \epsilon $$

Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$

As with other sample statistics (mean, proportion, etc.), there is variability in our estimates of the slope and intercept.

--

.question[
- Do we have convincing evidence that the true linear model has a non-zero slope?
- What is a confidence interval for the population regression coefficient?
]

---

## Mileage vs. age

We will consider a simple linear regression model predicting mileage using age.

.midi[
```r
m_age_miles <- lm(mileage ~ age, data = sports_car_prices)
```
]

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

class: center, middle

# A confidence interval for `\(\beta_1\)`

---

## Confidence interval

.alert[
`$$point~estimate \pm critical~value \times SE$$`
]

--

.alert[
`$$b_1 \pm t^*_{n-2} \times SE_{b_1}$$`
where `\(t^*_{n-2}\)` is calculated using a `\(t\)` distribution with `\(n-2\)` degrees of freedom.
]

---

## Tidy confidence interval

```r
tidy(m_age_miles, conf.int = TRUE, conf.level = 0.95)
```

```
## # A tibble: 2 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   13967.     2876.      4.86 9.40e- 6    8211.    19723.
## 2 age            3837.      403.      9.52 1.86e-13    3030.     4643.
```

```r
est <- tidy(m_age_miles) %>%
  filter(term == "age") %>%
  pull(estimate)
se <- tidy(m_age_miles) %>%
  filter(term == "age") %>%
  pull(std.error)
```
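Base R's `confint()` returns the same `\(t\)`-based intervals, which makes for a quick cross-check (not part of the original code):

```r
# Same t-based 95% intervals as tidy(..., conf.int = TRUE)
confint(m_age_miles, level = 0.95)
```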
---

## Calculating the 95% CI manually

A 95% confidence interval for `\(\beta_1\)` can be calculated as follows:

--

```r
(df <- nrow(sports_car_prices) - 2)
```

```
## [1] 58
```

--

```r
(tstar <- qt(0.975, df))
```

```
## [1] 2.001717
```

--

```r
(ci <- est + c(-1, 1) * tstar * se)
```

```
## [1] 3029.838 4643.199
```

---

## Interpretation

```r
tidy(m_age_miles, conf.int = TRUE, conf.level = 0.95) %>%
  filter(term == "age") %>%
  select(conf.low, conf.high)
```

```
## # A tibble: 1 x 2
##   conf.low conf.high
##      <dbl>     <dbl>
## 1    3030.     4643.
```

.alert[We are 95% confident that for every additional year of a car's age, the mileage is expected to increase, on average, by between about 3030 and 4643 miles.]

---

class: center, middle

# A hypothesis test for `\(\beta_1\)`

---

## Hypothesis testing for `\(\beta_1\)`

Is there convincing evidence, based on our sample data, that age is associated with mileage?

We can set this up as a hypothesis test, with the hypotheses below.

--

.alert[
`\(H_0: \beta_1 = 0\)`. The slope is 0. There is no relationship between mileage and age.

`\(H_a: \beta_1 \ne 0\)`. The slope is not 0. There is a relationship between mileage and age.
]

--

We only reject `\(H_0\)` in favor of `\(H_a\)` if the data provide strong evidence that the true slope parameter is different from zero.

---

## Hypothesis testing for `\(\beta_1\)`

```r
tidy(m_age_miles)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   13967.     2876.      4.86 9.40e- 6
## 2 age            3837.      403.      9.52 1.86e-13
```

--

`$$T = \frac{b_1 - 0}{SE_{b_1}} \sim t_{n-2}$$`

--

The p-value in the output is associated with the two-sided hypothesis test `\(H_a: \beta_1 \neq 0\)`.

---

## Hypothesis testing for `\(\beta_1\)`

```r
tidy(m_age_miles)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   13967.     2876.      4.86 9.40e- 6
## 2 age            3837.      403.      9.52 1.86e-13
```

.vocab[The p-value is very small, so we reject] `\(\color{#1689AE}{H_0}\)`. .vocab[The data provide sufficient evidence that the coefficient of age is not equal to 0 and that there is a linear relationship between the mileage and age of a car.]

---

## Final Thoughts

We used a CLT-based approach to construct confidence intervals and perform hypothesis tests. Note that you can also use simulation-based methods to do inference with `infer`; a sketch follows on the next slide. [Click here](https://infer.netlify.app/articles/observed_stat_examples.html#two-numerical-vars---slr) for more examples.
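---

## Final Thoughts

For example, a permutation test for the slope might look like the sketch below (an illustration, not evaluated in these slides):

.midi[
```r
library(infer)

# Observed slope from the sample
obs_slope <- sports_car_prices %>%
  specify(mileage ~ age) %>%
  calculate(stat = "slope")

# Null distribution of slopes under "age and mileage are unrelated"
null_dist <- sports_car_prices %>%
  specify(mileage ~ age) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope")

# Two-sided p-value comparing the observed slope to the null distribution
null_dist %>%
  get_p_value(obs_stat = obs_slope, direction = "two-sided")
```
]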
---

class: center, middle

## Conditions for Inference in Regression

---

## Conditions

- **Linearity**: The relationship between the response and the predictor(s) is linear

- **Independence**: The residuals are independent

- **Normality**: The residuals are nearly normally distributed

- **Equal Variance**: The residuals have constant variance

---

## Conditions

- .vocab[L]**inearity**: The relationship between the response and the predictor(s) is linear

- .vocab[I]**ndependence**: The residuals are independent

- .vocab[N]**ormality**: The residuals are nearly normally distributed

- .vocab[E]**qual Variance**: The residuals have constant variance

--

<br>

*For multiple regression, the predictors shouldn't be too correlated with each other.*

---

## Examples

<img src="img/19/violated_conds.png" width="50%" style="display: block; margin: auto;" />

Top row: scatterplot of `\(x\)` vs. `\(y\)`

Bottom row: residual plot

---

## `augment` data with model results

- `.fitted`: Predicted value of the response variable
- `.resid`: Residual

.midi[
```r
m_age_miles_aug <- augment(m_age_miles)
m_age_miles_aug %>%
  slice(1:3)
```

```
## # A tibble: 3 x 8
##   mileage   age .fitted .resid   .hat .sigma  .cooksd .std.resid
##     <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
## 1   21500     3  25477. -3977. 0.0223 13981. 0.000959     -0.290
## 2   43000     3  25477. 17523. 0.0223 13793. 0.0186        1.28
## 3   19900     2  21640. -1740. 0.0275 13989. 0.000229     -0.127
```
]

--

We will use the fitted values and residuals to check the conditions by constructing .vocab[diagnostic plots].

---

## Residuals vs fitted plot

Use to check .vocab[**L**inearity] and .vocab[**E**qual variance.]

.midi[
```r
ggplot(m_age_miles_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
  labs(x = "Predicted Mileage", y = "Residuals")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
]

---

## Residuals in order of collection

Use to check .vocab[**I**ndependence]

.midi[
```r
ggplot(data = m_age_miles_aug,
       aes(x = 1:nrow(sports_car_prices), y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
  labs(x = "Index", y = "Residual")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
]

---

## Histogram of residuals

Use to check .vocab[**N**ormality]

.midi[
```r
ggplot(m_age_miles_aug, mapping = aes(x = .resid)) +
  geom_histogram(bins = 15) +
  labs(x = "Residuals")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
]

---

class: center, middle

## Multiple linear regression: inference

---

## Hypothesis testing

For null hypotheses, we now condition on the other variables that remain in the model. For example, let's say I want to fit the linear model

`$$price = \beta_0 + \beta_{1} ~ mileage + \beta_{2} ~ age + \epsilon$$`

--

We can write two different null hypotheses:

1. `\(H_{0}: \beta_{1} = 0\)`, given that `age` is included in the model
2. `\(H_{0}: \beta_{2} = 0\)`, given that `mileage` is included in the model

---

## Hypothesis tests

```r
m_mult <- lm(price ~ mileage + age, data = sports_car_prices)
tidy(m_mult)
```

```
## # A tibble: 3 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) 62950.     3176.       19.8   3.03e-27
## 2 mileage        -0.695     0.122    -5.68  4.75e- 7
## 3 age           516.      601.        0.859 3.94e- 1
```

The p-value for testing `\(H_{0}: \beta_{1} = 0\)`, given that `age` is included in the model, is almost 0.

The p-value for testing `\(H_{0}: \beta_{2} = 0\)`, given that `mileage` is included in the model, is 0.39. These p-values can be reproduced by hand, as sketched on the next slide.
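---

## Calculating the p-value manually

Mirroring the manual CI calculation, the test for `age` can be reproduced by hand. A sketch (the degrees of freedom are now `\(n - 3\)` because this model estimates three coefficients):

```r
est_age <- tidy(m_mult) %>% filter(term == "age") %>% pull(estimate)
se_age  <- tidy(m_mult) %>% filter(term == "age") %>% pull(std.error)

# t statistic: should match the statistic column in the output above
(tstat <- est_age / se_age)

# Two-sided p-value from a t distribution with n - 3 degrees of freedom
2 * pt(abs(tstat), df = nrow(sports_car_prices) - 3, lower.tail = FALSE)
```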
---

## Assessing conditions

```r
m_mult_aug <- augment(m_mult)
```

Because we have multiple predictors, we also need to check that the explanatory variables are not too correlated with each other.

.vocab[Multicollinearity] occurs when the predictor variables are highly correlated with one another. Under multicollinearity, the coefficients in a multiple regression model can be difficult to interpret.

```r
cor(sports_car_prices %>% select(mileage, age))
```

```
##           mileage       age
## mileage 1.0000000 0.7808789
## age     0.7808789 1.0000000
```

---

## Assessing conditions

```r
corrplot::corrplot(sports_car_prices %>%
                     dplyr::select(-car) %>%
                     cor(),
                   type = "lower")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---

## Residuals vs fitted plot

Check .vocab[**L**inearity] and .vocab[**E**qual variance.]

.midi[
```r
ggplot(m_mult_aug, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
  labs(x = "Predicted Price", y = "Residuals")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />
]

---

## Residuals in order of collection

Check .vocab[**I**ndependence]

.midi[
```r
ggplot(data = m_mult_aug,
       aes(x = 1:nrow(sports_car_prices), y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, lwd = 2, col = "red", lty = 2) +
  labs(x = "Index", y = "Residual")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />
]

---

## Histogram of residuals

Use to check .vocab[**N**ormality]

.midi[
```r
ggplot(m_mult_aug, mapping = aes(x = .resid)) +
  geom_histogram(bins = 15) +
  labs(x = "Residuals")
```

<img src="19-mlr-inference_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />
]
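---

## Assessing conditions

Beyond the correlation matrix, a common numeric summary of multicollinearity is the variance inflation factor. This check is not in the original code; a sketch, assuming the `car` package is installed:

```r
# VIF for each predictor in m_mult; a common rule of thumb
# flags values above roughly 5 or 10 as problematic
car::vif(m_mult)
```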