class: center, middle, inverse, title-slide

# Multiple linear regression

### Becky Tang

### 06.30.2021

---

class: center, middle

## Review

---

## Vocabulary

--

- .vocab[Response variable]: Variable whose behavior or variation you are trying to understand.

--

- .vocab[Explanatory variables]: Other variables that you want to use to explain the variation in the response.

--

- .vocab[Predicted value]: Output of the **model function**

    - The model function gives the typical value of the response variable *conditional* on the explanatory variables.

--

- .vocab[Residuals]: Show how far each case is from its predicted value

    - **Residual = Observed value - Predicted value**

---

## The linear model with a single predictor

- We're interested in `\(\beta_0\)` (the population parameter for the intercept) and `\(\beta_1\)` (the population parameter for the slope) in the following model:

$$ \hat{y} = \beta_0 + \beta_1~x $$

--

- Unfortunately, we can't observe these population parameters directly

- So we use sample statistics to estimate them:

$$ \hat{y} = b_0 + b_1~x $$

---

## Least squares regression

The regression line minimizes the sum of squared residuals.

- .vocab[Residuals]: `\(e_i = y_i - \hat{y}_i\)`

- The regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`.

- Equivalently, it minimizes `\(\sum_{i = 1}^n [y_i - (b_0 + b_1~x_i)]^2\)`.

---

## Data and Packages

```r
library(tidyverse)
library(broom)
```

```r
car_prices <- read_csv("data/car_prices.csv")
```

---

## Single numerical predictor

```r
m_pr_mi <- lm(price ~ mileage, data = car_prices)
tidy(m_pr_mi)
```

```
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   21.4     0.609        35.1 1.69e-53
## 2 mileage       -0.128   0.00874     -14.6 2.55e-25
```

`$$\widehat{price_{1000}} = 21.37 - 0.13~mileage_{per1000}$$`

---

## Single categorical predictor (> 2 levels)

```r
m_pr_type <- lm(price ~ type, data = car_prices)
tidy(m_pr_type)
```

```
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    14.3      0.977    14.6   3.94e-25
## 2 typeMaxima      1.19     1.38      0.858 3.93e- 1
## 3 typeMazda6     -2.77     1.38     -2.01  4.79e- 2
```

`$$\widehat{price_{1000}} = 14.3 + 1.19~typeMaxima - 2.77~typeMazda6$$`

---

## Single categorical predictor (> 2 levels)

`$$\widehat{price_{1000}} = 14.3 + 1.19~typeMaxima - 2.77~typeMazda6$$`

How does the average price of Maximas compare to the average price of Accords?

--

For Maximas, `\(typeMaxima = 1\)` and `\(typeMazda6 = 0\)`. The average price of Maximas is

`$$\widehat{price_{1000}} = 14.3 + 1.19 \times 1 - 2.77 \times 0 = 14.3 + 1.19$$`

For Accords, `\(typeMaxima = 0\)` and `\(typeMazda6 = 0\)`. The average price of Accords is

$$\widehat{price_{1000}} = 14.3 + 1.19 \times 0 - 2.77 \times 0 = 14.3$$

---

class: center, middle

## The linear model with multiple predictors

---

## The linear model with multiple predictors

- Population model:

$$ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k $$

where `\(k\)` is the number of explanatory variables.

--

- Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$

---

## Data

The data set contains prices for Porsche and Jaguar cars for sale on cars.com.
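A minimal sketch for loading these data, assuming a CSV saved as `data/sports_car_prices.csv` (the file name and folder are assumptions; adjust the path to match your project). The resulting `sports_car_prices` object is the one used in the models that follow.

```r
# Assumed file name and path; adjust to match your project
sports_car_prices <- read_csv("data/sports_car_prices.csv")
```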
.vocab[`car`]: car make (Jaguar or Porsche)

.vocab[`price`]: price in USD

.vocab[`age`]: age of the car in years

.vocab[`mileage`]: previous miles driven

---

## Price, age, and make

<img src="18-mlr_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## Price vs. age and make

.question[
Does the relationship between age and price depend on the make of the car?
]

<img src="18-mlr_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Modeling with main effects

A .vocab[main effect] is the effect of a single independent variable on a dependent variable.

```r
m_main <- lm(price ~ age + car, data = sports_car_prices)
m_main %>%
  tidy() %>%
  select(term, estimate)
```

```
## # A tibble: 3 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)   44310.
## 2 age           -2487.
## 3 carPorsche    21648.
```

--

.midi[
$$ \widehat{price} = 44310 - 2487~age + 21648~carPorsche $$
]

---

.alert[
$$ \widehat{price} = 44310 - 2487~age + 21648~carPorsche $$
]

- Plug in 0 for `carPorsche` to get the linear model for Jaguars.

--

`$$\begin{align}\widehat{price} &= 44310 - 2487~age + 21648 \times 0\\ &= 44310 - 2487~age\\\end{align}$$`

--

- Plug in 1 for `carPorsche` to get the linear model for Porsches.

--

`$$\begin{align}\widehat{price} &= 44310 - 2487~age + 21648 \times 1\\ &= 65958 - 2487~age\\\end{align}$$`

---

.alert[

**Jaguar**

`$$\begin{align}\widehat{price}_{jaguar} &= 44310 - 2487~age + 21648 \times 0\\ &= 44310 - 2487~age\\\end{align}$$`

**Porsche**

`$$\begin{align}\widehat{price}_{porsche} &= 44310 - 2487~age + 21648 \times 1\\ &= 65958 - 2487~age\\\end{align}$$`

]

- Rate of change in price as the age of the car increases does not depend on the make of the car (.vocab[same slopes])

- Porsches are consistently more expensive than Jaguars (.vocab[different intercepts])

---

## Interpretation of main effects

<img src="18-mlr_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

## Main effects

```
## # A tibble: 3 x 2
##   term        estimate
##   <chr>          <dbl>
## 1 (Intercept)   44310.
## 2 age           -2487.
## 3 carPorsche    21648.
```

--

- **All else held constant**, for each additional year of a car's age, the price of the car is predicted to decrease, on average, by $2,487.

--

- **All else held constant**, Porsches are predicted, on average, to have a price that is $21,648 greater than Jaguars.

--

- Jaguars that are new (age = 0) are predicted, on average, to have a price of $44,310.

---

.question[
Why is our linear regression model different from what we got from `geom_smooth(method = "lm")`?
]

<img src="18-mlr_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## What went wrong?

--

- .vocab[`car`] is the only variable in our model that affects the intercept.

--

- The model we specified assumes Jaguars and Porsches have the .vocab[same slope] and .vocab[different intercepts].

--

- What is the most appropriate model for these data?

    - same slope and intercept for Jaguars and Porsches?
    - same slope and different intercept for Jaguars and Porsches?
    - different slope and different intercept for Jaguars and Porsches?

---

## Interacting explanatory variables

- Including an interaction effect in the model allows for different slopes, i.e. nonparallel lines.

- This means that the relationship between an explanatory variable and the response depends on another explanatory variable.

- We can accomplish this by adding an .vocab[interaction variable].
This is the product of two explanatory variables.

---

## Price vs. age and car interacting

.midi[

```r
ggplot(data = sports_car_prices,
       mapping = aes(y = price, x = age, color = car)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Age (years)", y = "Price (USD)", color = "Car Make")
```

<img src="18-mlr_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]

---

## Modeling with interaction effects

```r
m_int <- lm(price ~ age + car + age * car, data = sports_car_prices)
m_int %>%
  tidy() %>%
  select(term, estimate)
```

```
## # A tibble: 4 x 2
##   term           estimate
##   <chr>             <dbl>
## 1 (Intercept)      56988.
## 2 age              -5040.
## 3 carPorsche        6387.
## 4 age:carPorsche    2969.
```

`$$\widehat{price} = 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche$$`

---

## Interpretation of interaction effects

.alert[
`$$\widehat{price} = 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche$$`
]

--

- Plug in 0 for `carPorsche` to get the linear model for Jaguars.

`$$\begin{align}\widehat{price} &= 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche \\ &= 56988 - 5040~age + 6387 \times 0 + 2969~age \times 0\\ &\color{#00b3b3}{= 56988 - 5040~age}\end{align}$$`

--

- Plug in 1 for `carPorsche` to get the linear model for Porsches.

`$$\begin{align}\widehat{price} &= 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche \\ &= 56988 - 5040~age + 6387 \times 1 + 2969~age \times 1\\ &\color{#00b3b3}{= 63375 - 2071~age}\end{align}$$`

---

## Interpretation of interaction effects

.alert[

**Jaguar**

`$$\widehat{price} = 56988 - 5040~age$$`

**Porsche**

`$$\widehat{price} = 63375 - 2071~age$$`

]

- Rate of change in price as the age of the car increases depends on the make of the car (.vocab[different slopes]).

- Porsches are consistently more expensive than Jaguars (.vocab[different intercepts]).

---

<img src="18-mlr_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

`$$\widehat{price} = 56988 - 5040~age + 6387~carPorsche + 2969~age \times carPorsche$$`

---

## Continuous by continuous interactions

- Interpretation becomes trickier

- Slopes are conditional on the values of other explanatory variables

--

## Third order interactions

- Can you? Yes

- Should you? Probably not if you want to interpret these interactions in the context of the data.

---

class: center, middle

## Assessing quality of model fit

---

## Assessing the quality of the fit

- The strength of the fit of a linear model is commonly evaluated using `\(R^2\)`.

- It tells us what percentage of the variability in the response variable is explained by the model. The remainder of the variability is unexplained.

- `\(R^2\)` is sometimes called the **coefficient of determination**.

.question[
What does "explained variability in the response variable" mean?
]

---

## Obtaining `\(R^2\)` in R

`price` vs. `age` and `car`

```r
glance(m_main)
```

```
## # A tibble: 1 x 12
##   r.squared adj.r.squared  sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.607         0.593 11848.      44.0 2.73e-12     2  -646. 1301. 1309.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
glance(m_main)$r.squared
```

```
## [1] 0.6071375
```

--

About 60.7% of the variability in the price of used cars can be explained by age and make.
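---

## `\(R^2\)` by hand

To see what "explained variability" means here, we can recompute `\(R^2\)` for `m_main` directly from its residuals. Below is a minimal sketch using `augment()` from **broom**; the object name `m_main_aug` is just a placeholder.

```r
m_main_aug <- augment(m_main) # adds .fitted and .resid columns to the model data

# R^2 = 1 - (variability in residuals / variability in response)
1 - var(m_main_aug$.resid) / var(m_main_aug$price)
```

This value should agree with `glance(m_main)$r.squared` above.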
---

## `\(R^2\)`

```r
glance(m_main)$r.squared # model with main effects
```

```
## [1] 0.6071375
```

```r
glance(m_int)$r.squared # model with main effects + interactions
```

```
## [1] 0.6677881
```

--

- The model with interactions has a higher `\(R^2\)`.

--

- Using `\(R^2\)` for model selection in models with multiple explanatory variables is **<u>not</u>** a good idea, as `\(R^2\)` increases when **any** variable is added to the model.

---

## `\(R^2\)` - first principles

- We can write explained variation using the following ratio of sums of squares:

$$ R^2 = 1 - \left( \frac{ \text{variability in residuals}}{ \text{variability in response} } \right) $$

.question[
Why does this expression make sense?
]

- But remember, adding **any** explanatory variable will always increase `\(R^2\)`.

---

## Adjusted `\(R^2\)`

.alert[
$$ R^2_{adj} = 1 - \left( \frac{ \text{variability in residuals}}{ \text{variability in response} } \times \frac{n - 1}{n - k - 1} \right)$$

where `\(n\)` is the number of observations and `\(k\)` is the number of predictors in the model.
]

--

- Adjusted `\(R^2\)` doesn't increase if the new variable does not provide any new information or is completely unrelated to the response.

--

- This makes adjusted `\(R^2\)` a preferable metric for model selection in multiple regression models.

---

## Comparing models

.pull-left[

```r
glance(m_main)$r.squared
```

```
## [1] 0.6071375
```

```r
glance(m_int)$r.squared
```

```
## [1] 0.6677881
```
]

.pull-right[

```r
glance(m_main)$adj.r.squared
```

```
## [1] 0.5933529
```

```r
glance(m_int)$adj.r.squared
```

```
## [1] 0.649991
```
]

---

## In pursuit of Occam's Razor

- Occam's Razor states that among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.

- Model selection follows this principle.

- We only want to add another variable to the model if the addition of that variable brings something valuable in terms of predictive power.

- In other words, we prefer the simplest best model, i.e. the most .vocab[parsimonious] model.
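---

## Adjusted `\(R^2\)` by hand

As a closing check on the adjusted `\(R^2\)` formula, here is a minimal sketch that applies it directly to `m_main`. Here `k = 2` because `m_main` has two predictor terms (`age` and the `carPorsche` indicator); the object names are placeholders.

```r
m_main_aug <- augment(m_main) # broom: adds .resid alongside the original columns

n <- nrow(m_main_aug)         # number of observations
k <- 2                        # number of predictors in m_main: age, carPorsche

1 - (var(m_main_aug$.resid) / var(m_main_aug$price)) * (n - 1) / (n - k - 1)
```

The result should match `glance(m_main)$adj.r.squared`.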