class: center, middle, inverse, title-slide

# Scientific studies, confounding, and Simpson’s paradox
### Becky Tang
### 06.02.2021

---
layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Announcements

- HW 01 - due **Today 11:59pm**

---

class: center, middle

# Scientific studies

---

## Scientific studies

- <font class="vocab">Observational</font>
  - Collect data in a way that does not interfere with how the data arise ("observe")
  - Only establish an association
- <font class="vocab">Experimental</font>
  - Randomly assign subjects to treatments
  - Establish causal connections

.question[
Design a study comparing average energy levels of people who do and do not exercise -- both as an observational study and as an experiment.
]

---

### Study: "Cereal Keeps Girls Slim"

.small[
Girls who ate breakfast of any type had a lower average body mass index, a common obesity gauge, than those who said they didn't. The index was even lower for girls who said they ate cereal for breakfast, according to findings of the study conducted by the Maryland Medical Research Institute with funding from the National Institutes of Health (NIH) and cereal-maker General Mills. [...]

The results were gleaned from a larger NIH survey of 2,379 girls in California, Ohio, and Maryland who were tracked between the ages of 9 and 19. [...]

As part of the survey, the girls were asked once a year what they had eaten during the previous three days....
]

<br>

.footnote[
Source: [Study: Cereal Keeps Girls Slim](https://www.cbsnews.com/news/study-cereal-keeps-girls-slim/)
]

---

### 3 possible explanations

--

- Eating breakfast causes girls to be slimmer

<br>

--

- Being slim causes girls to eat breakfast

<br>

--

- A third variable is responsible for both -- a confounding variable

--

.alert[
A <font class="vocab">confounding</font> variable is an extraneous variable that affects both the explanatory and the response variable, and that makes it seem like there is a relationship between them.
]

---

## Correlation != causation

<br><br>

.center[
![](img/08/xkcdcorrelation.png)
]

<br><br>

.footnote[
Randall Munroe CC BY-NC 2.5 http://xkcd.com/552/
]

---

## Studies and conclusions

<img src="img/08/random_sample_assign_grid.png" width="700" style="display: block; margin: auto;" />

---

### Non-random samples: a cautionary tale

In 2016, the Natural Environment Research Council in England started an online competition in an effort to name a polar research ship. People were invited to submit suggestions and/or cast a vote for their favorite choice.

.question[
What type of sampling design is this?
]

[What happened?](https://www.cnn.com/2016/04/18/world/boaty-mcboatface-wins-vote/index.html)

---

class: center, middle

# Conditional probability

---

## Conditional probability: Review

.question[
A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether they are familiar with the DREAM act. The distribution of the responses by age category is shown below. What proportion of **<u>all respondents</u>** are very familiar with the DREAM act?
]

<br>

.pull-left[
|            | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very       | 90      | 32  | 122   |
| Somewhat   | 125     | 86  | 211   |
| Not very   | 56      | 33  | 89    |
| Not at all | 36      | 24  | 60    |
| Not sure   | 9       | 9   | 18    |
| Total      | 316     | 184 | 500   |

<br><br>
]

--

.pull-right[
`\(P(\text{Very}) = \frac{122}{500} = 0.244\)`
]

<br>

.footnote[
Source: [SurveyUSA News Poll 23754](http://www.surveyusa.com/client/PollReport.aspx?g=783743b0-efc1-4b67-9201-58352a8f61f1)
]

---

.question[
A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether they are familiar with the DREAM act. The distribution of the responses by age category is shown below. What proportion of **<u>respondents who are 18 - 49 years old</u>** are very familiar with the DREAM act?
]

<br>

.pull-left[
|            | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very       | 90      | 32  | 122   |
| Somewhat   | 125     | 86  | 211   |
| Not very   | 56      | 33  | 89    |
| Not at all | 36      | 24  | 60    |
| Not sure   | 9       | 9   | 18    |
| Total      | 316     | 184 | 500   |
]

--

.pull-right[
`\(P(\text{Very}~|~18-49) = \frac{90}{316} = 0.285\)`
]

---

.question[
A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether they are familiar with the DREAM act. The distribution of the responses by age category is shown below. What proportion of **<u>respondents who are 50+ years old</u>** are very familiar with the DREAM act?
]

<br>

.pull-left[
|            | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very       | 90      | 32  | 122   |
| Somewhat   | 125     | 86  | 211   |
| Not very   | 56      | 33  | 89    |
| Not at all | 36      | 24  | 60    |
| Not sure   | 9       | 9   | 18    |
| Total      | 316     | 184 | 500   |
]

--

.pull-right[
`\(P(\text{Very}~|~50+) = \frac{32}{184} = 0.174\)`
]

---

.question[
Given that

- `\(P(\text{Very}) = \frac{122}{500} = 0.244\)`
- `\(P(\text{Very}~|~18-49) = \frac{90}{316} = 0.285\)`
- `\(P(\text{Very}~|~50+) = \frac{32}{184} = 0.174\)`

does there appear to be a relationship between age and familiarity with the DREAM act? Explain your reasoning.
]

--

<br>

.question[
Could there be another variable that explains this relationship?
]

---

## Independence

.question[
Inspired by the previous example and how we used the conditional probabilities to draw conclusions, come up with a definition of independent events. If easier, you can keep the context limited to the example (independence/dependence of familiarity with the DREAM act and age), but try to push yourself to make a more general statement.
]

---

class: center, middle

# Simpson's paradox

---

## Relationships between variables

- **Bivariate relationship**: Fitness `\(\rightarrow\)` Heart health
- **Multivariate relationship**: Calories + Age + Fitness `\(\rightarrow\)` Heart health

---

## Simpson's paradox

- Not considering an important variable when studying a relationship can result in <font class="vocab">Simpson's paradox</font>, a phenomenon in which the omission of one explanatory variable can affect the measure of association between another explanatory variable and a response variable.
- In other words, the inclusion of a third variable in the analysis can change the apparent relationship between the other two variables.
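
---

## Simpson's paradox: why it can happen

One way to see why an omitted variable can change an association is the law of total probability. The sketch below uses notation introduced here for illustration, not taken from the slides: `\(Y\)` is the response event, `\(X\)` the explanatory variable, and `\(Z\)` the omitted third variable with levels `\(z\)`.

```latex
P(Y \mid X = x) \;=\; \sum_{z} P(Y \mid X = x,\, Z = z)\; P(Z = z \mid X = x)
```

The aggregate rate is a weighted average of the within-level rates, and the weights `\(P(Z = z \mid X = x)\)` can differ between values of `\(x\)`. A group can therefore have the higher rate within every level of `\(Z\)` and still have the lower aggregate rate.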
---

## Simpson's paradox

<img src="08-confounding_files/figure-html/simpsons_plot-1.png" style="display: block; margin: auto;" />

---

## Simpson's paradox

<img src="08-confounding_files/figure-html/simpsons_plot2-1.png" style="display: block; margin: auto;" />

---

## Glimpse of data in tidy form

```r
glimpse(admissions)
```

```
## Rows: 4,526
## Columns: 3
## $ department <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A,…
## $ gender     <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ decision   <fct> admit, admit, admit, admit, admit, admit, admit, admit, adm…
```

.footnote[
[https://en.wikipedia.org/wiki/Simpson%27s_paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox)
]

---

## Overall distribution of acceptance by gender

.question[
What type of visualization would be appropriate for representing this data?
]

```r
# Proportion of each decision within each gender, ignoring department
admissions %>%
  count(gender, decision) %>%
  group_by(gender) %>%
  mutate(prop_admit = n / sum(n))
```

```
## # A tibble: 4 x 4
## # Groups:   gender [2]
##   gender decision     n prop_admit
##   <chr>  <fct>    <int>      <dbl>
## 1 female deny      1278      0.696
## 2 female admit      557      0.304
## 3 male   deny      1493      0.555
## 4 male   admit     1198      0.445
```

---

## Overall distribution of acceptance

![](08-confounding_files/figure-html/berkley_hist-1.png)

---

## Closer look

Let's look at data from the six largest departments, labeled A-F:

| Department | Female: Admit | Female: Total | Male: Admit | Male: Total |
|------------|---------|-----|-------|------|
| A | 89 | 108 | 511 | 825 |
| B | 17 | 25 | 353 | 560 |
| C | 202 | 593 | 120 | 325 |
| D | 131 | 375 | 138 | 417 |
| E | 94 | 393 | 54 | 191 |
| F | 24 | 341 | 22 | 373 |

---

## UC Berkeley admissions: condition on department

.question[
Within each department, what is the probability of admittance for each gender? That is, within Department X, what are `\(P(\text{man admit} | \text{department X})\)` and `\(P(\text{woman admit} | \text{department X})\)`?
]

| Department | Female: Admit | Female: Total | Male: Admit | Male: Total |
|------------|---------|-----|-------|------|
| A | 89 | 108 | 511 | 825 |
| B | 17 | 25 | 353 | 560 |
| C | 202 | 593 | 120 | 325 |
| D | 131 | 375 | 138 | 417 |
| E | 94 | 393 | 54 | 191 |
| F | 24 | 341 | 22 | 373 |

---

## UC Berkeley admissions: Conditional probabilities

|   | women|  men|  all|
|:--|-----:|----:|----:|
|A  |  0.82| 0.62| 0.64|
|B  |  0.68| 0.63| 0.63|
|C  |  0.34| 0.37| 0.35|
|D  |  0.35| 0.33| 0.34|
|E  |  0.24| 0.28| 0.25|
|F  |  0.07| 0.06| 0.06|

---

## Distribution of acceptance by department

.pull_left[
```r
# Counts for every department / gender / decision combination
admissions %>%
  count(department, gender, decision)
```

```
##    department gender decision   n
## 1           A female     deny  19
## 2           A female    admit  89
## 3           A   male     deny 314
## 4           A   male    admit 511
## 5           B female     deny   8
## 6           B female    admit  17
## 7           B   male     deny 207
## 8           B   male    admit 353
## 9           C female     deny 391
## 10          C female    admit 202
## 11          C   male     deny 205
## 12          C   male    admit 120
## 13          D female     deny 244
## 14          D female    admit 131
## 15          D   male     deny 279
## 16          D   male    admit 138
## 17          E female     deny 299
## 18          E female    admit  94
## 19          E   male     deny 137
## 20          E   male    admit  54
## 21          F female     deny 317
## 22          F female    admit  24
## 23          F   male     deny 351
## 24          F   male    admit  22
```
]

.pull_right[
```r
# Proportion of each decision within each department and gender
admissions %>%
  count(department, gender, decision) %>%
  group_by(department, gender) %>%
  mutate(prop_admit = n / sum(n))
```

```
## # A tibble: 24 x 5
## # Groups:   department, gender [12]
##    department gender decision     n prop_admit
##    <fct>      <chr>  <fct>    <int>      <dbl>
##  1 A          female deny        19      0.176
##  2 A          female admit       89      0.824
##  3 A          male   deny       314      0.381
##  4 A          male   admit      511      0.619
##  5 B          female deny         8      0.32
##  6 B          female admit       17      0.68
##  7 B          male   deny       207      0.370
##  8 B          male   admit      353      0.630
##  9 C          female deny       391      0.659
## 10 C          female admit      202      0.341
## # … with 14 more rows
```
]

<br>

.question[
What type of visualization would be appropriate for representing this data?
]

---

## Distribution of acceptance by department

![](08-confounding_files/figure-html/berkley_hist_facet-1.png)

---

## UC Berkeley admissions: closer look

| Department | Female: Total | Female: Acceptance | Male: Total | Male: Acceptance |
|------------|---------|-----|-------|------|
| A | 108 | 0.82 | 825 | 0.62 |
| B | 25 | 0.68 | 560 | 0.63 |
| C | 593 | 0.34 | 325 | 0.37 |
| D | 375 | 0.35 | 417 | 0.33 |
| E | 393 | 0.24 | 191 | 0.28 |
| F | 341 | 0.07 | 373 | 0.06 |

<br>

- Are the departments uniform in their admission rates? Notice that **A** and **B** have the highest acceptance rates.
- Rank departments by total number of male applicants: **A** > **B** > D > F > C > E
- Rank departments by total number of female applicants: C > E > D > F > **A** > **B**

---

## UC Berkeley admissions: plot

<img src="img/08/bickel.png" width="400" style="display: block; margin: auto;" />

<br>

.footnote[
[https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf](https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf)
]
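
---

## UC Berkeley admissions: reproducing the reversal

The reversal can be checked directly from the admit and deny counts in the `count(department, gender, decision)` output. The following is a minimal sketch rather than the code behind the slides. It rebuilds a small summary table by hand instead of using the per-applicant `admissions` data frame, and the names `dept_counts`, `overall`, `by_dept`, `female_rate`, and `male_rate` are introduced here for illustration.

```r
library(dplyr)

# Admit / deny counts per department and gender, copied from the
# count(department, gender, decision) output (hand-built summary table).
dept_counts <- data.frame(
  department = rep(c("A", "B", "C", "D", "E", "F"), times = 2),
  gender     = rep(c("female", "male"), each = 6),
  admit      = c(89, 17, 202, 131, 94, 24, 511, 353, 120, 138, 54, 22),
  deny       = c(19, 8, 391, 244, 299, 317, 314, 207, 205, 279, 137, 351)
)

# Aggregated over departments: one admission rate per gender
overall <- dept_counts %>%
  group_by(gender) %>%
  summarise(prop_admit = sum(admit) / sum(admit + deny))

# Conditioned on department: one admission rate per department and gender
by_dept <- dept_counts %>%
  mutate(prop_admit = admit / (admit + deny))

# Rows are ordered A-F within each gender, so the two vectors line up
female_rate <- by_dept$prop_admit[by_dept$gender == "female"]
male_rate   <- by_dept$prop_admit[by_dept$gender == "male"]

overall
sum(female_rate > male_rate)  # number of departments where women's rate is higher
```

Women have the lower admission rate overall, yet the higher rate in four of the six departments (A, B, D, and F). Most women applied to departments with low admission rates, which pulls their aggregate rate down.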