# Scientific studies, confounding, and Simpson’s paradox
### Becky Tang
### 06.02.2021

---

<div class="my-footer">

<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>

</div>

---

## Announcements

- HW 01 - due **Today 11:59pm**

---

# Scientific studies

---

## Scientific studies

- Observational
  - Collect data in a way that does not interfere with how the data arise ("observe")
  - Can only establish an association
- Experimental
  - Randomly assign subjects to treatments
  - Can establish causal connections

.question[
Design a study comparing average energy levels of people who do and do not exercise -- both as an observational study and as an experiment.
]
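
What sets the experimental version of this study apart is that the researcher, not the subject, decides who exercises. A minimal base R sketch of that random assignment, assuming a hypothetical roster of 20 volunteers (the IDs and the group size are made up for illustration):

```r
set.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical roster of 20 volunteers (IDs are invented for this sketch)
subjects <- data.frame(id = 1:20)

# Experiment: shuffle 10 "exercise" and 10 "no exercise" labels
# and hand them out, so chance alone decides each subject's group
subjects$group <- sample(rep(c("exercise", "no exercise"), each = 10))

# Check that the two groups are balanced
table(subjects$group)
```

In the observational version there is no `sample()` step: the `group` column would simply record what each person already does.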

---

### Study: "Cereal Keeps Girls Slim"

.small[
Girls who ate breakfast of any type had a lower average body mass index, a common obesity gauge, than those who said they didn't. The index was even lower for girls who said they ate cereal for breakfast, according to findings of the study conducted by the Maryland Medical Research Institute with funding from the National Institutes of Health (NIH) and cereal-maker General Mills.

[...]

The results were gleaned from a larger NIH survey of 2,379 girls in California, Ohio, and Maryland who were tracked between the ages of 9 and 19.

[...]

As part of the survey, the girls were asked once a year what they had eaten during the previous three days....
]

.footnote[
Source: [Study: Cereal Keeps Girls Slim](https://www.cbsnews.com/news/study-cereal-keeps-girls-slim/)
]

---

### 3 possible explanations

- Eating breakfast causes girls to be slimmer

- Being slim causes girls to eat breakfast

- A third variable is responsible for both -- a confounding variable

.alert[
A confounding variable is an extraneous variable that affects both the explanatory and the response variable, and that makes it seem like there is a relationship between them.
]
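
A small simulation can make the definition concrete. This is a sketch on made-up data, not part of the cereal study: `z` plays the confounder, and `x` and `y` each depend on `z` but not on each other.

```r
set.seed(42)  # fixed seed so the illustration is reproducible
n <- 10000

z <- rnorm(n)      # confounding variable
x <- z + rnorm(n)  # explanatory variable: driven by z, not by y
y <- z + rnorm(n)  # response variable: driven by z, not by x

# Ignoring z, x and y look related (in theory the correlation is 0.5)
cor_overall <- cor(x, y)

# Holding z roughly fixed (a narrow slice around 0), the association
# between x and y all but disappears
near_zero  <- abs(z) < 0.1
cor_within <- cor(x[near_zero], y[near_zero])

c(overall = cor_overall, within_slice = cor_within)
```

The association between `x` and `y` shows up only when `z` is left free to vary, which is exactly what a confounder does.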

---

## Correlation != causation

---

## Studies and conclusions

---

### Non-random samples: a cautionary tale

In 2016, the Natural Environment Research Council in England
started an online competition in an effort to name a polar research
ship. People were invited to submit suggestions and/or cast a vote for
their favorite choice.

[What happened?](https://www.cnn.com/2016/04/18/world/boaty-mcboatface-wins-vote/index.html)

---

# Conditional probability

---

## Conditional probability: Review

.question[
A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether they are familiar with the DREAM act. The distribution of the responses by age category is shown below.

What proportion of **all respondents** are very familiar with the DREAM act? 
]
 
.pull-left[
| | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very | 90 | 32 | 122 |
| Somewhat | 125 | 86 | 211 |
| Not very | 56 | 33 | 89 |
| Not at all | 36 | 24 | 60 |
| Not sure | 9 | 9 | 18 |
| Total | 316 | 184 | 500 |

]

--
.pull-right[
`$P(\text{Very}) = \frac{122}{500} = 0.244$`
]

.footnote[
 Source: [SurveyUSA News Poll 23754](http://www.surveyusa.com/client/PollReport.aspx?g=783743b0-efc1-4b67-9201-58352a8f61f1)
]

---

.question[
A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether they are familiar with the DREAM act. The distribution of the responses by age category is shown below.

What proportion of **respondents who are 18 - 49 years old** are very familiar with the DREAM act?
]
 
.pull-left[
| | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very | 90 | 32 | 122 |
| Somewhat | 125 | 86 | 211 |
| Not very | 56 | 33 | 89 |
| Not at all | 36 | 24 | 60 |
| Not sure | 9 | 9 | 18 |
| Total | 316 | 184 | 500 |
]
--
.pull-right[
`$P(\text{Very}~|~18-49) = \frac{90}{316} = 0.285$`
]

---

.question[
A January 2018 SurveyUSA poll asked 500 randomly selected Californians whether they are familiar with the DREAM act. The distribution of the responses by age category is shown below.

What proportion of **respondents who are 50+ years old** are very familiar with the DREAM act?
]
 
.pull-left[
| | 18 - 49 | 50+ | Total |
|------------|---------|-----|-------|
| Very | 90 | 32 | 122 |
| Somewhat | 125 | 86 | 211 |
| Not very | 56 | 33 | 89 |
| Not at all | 36 | 24 | 60 |
| Not sure | 9 | 9 | 18 |
| Total | 316 | 184 | 500 |
]
--
.pull-right[
`$P(\text{Very}~|~50+) = \frac{32}{184} = 0.174$`
]

---

.question[
- `$P(\text{Very}) = \frac{122}{500} = 0.244$`

- `$P(\text{Very}~|~18-49) = \frac{90}{316} = 0.285$`

- `$P(\text{Very}~|~50+) = \frac{32}{184} = 0.174$`

Based on these probabilities, does there appear to be a relationship between age and familiarity with the DREAM act? Explain your reasoning.
]
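
The three probabilities on the preceding slides can be reproduced directly from the survey table. A base R sketch, with the counts typed in by hand from the table (the object names `dream`, `p_very`, and `p_very_given_age` are just for illustration):

```r
# Counts from the SurveyUSA table: rows are familiarity, columns are age group
dream <- matrix(
  c( 90, 32,
    125, 86,
     56, 33,
     36, 24,
      9,  9),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    familiarity = c("Very", "Somewhat", "Not very", "Not at all", "Not sure"),
    age         = c("18 - 49", "50+")
  )
)

# Marginal probability: "Very" respondents out of all 500 respondents
p_very <- sum(dream["Very", ]) / sum(dream)

# Conditional probabilities: "Very" respondents out of each age group's total
p_very_given_age <- dream["Very", ] / colSums(dream)

round(p_very, 3)
round(p_very_given_age, 3)
```

`prop.table(dream, margin = 2)` gives the same conditional probabilities for every familiarity level at once.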

---

## Independence

.question[
Inspired by the previous example and how we used the conditional probabilities to make conclusions, come up with a definition of independent events. If easier, you can keep the context limited to the example (independence/dependence of familiarity with the DREAM act and age), but try to push yourself to make a more general statement.
]

---

# Simpson's paradox

---

## Relationships between variables

- **Bivariate relationship**: Fitness `$\rightarrow$` Heart health

- **Multivariate relationship**: Calories + Age + Fitness `$\rightarrow$` Heart health

---

## Simpson's paradox

- Omitting an important variable when studying a relationship can produce Simpson's paradox: the measure of association between an explanatory variable and a response variable changes, and can even reverse direction, once the omitted variable is accounted for.

- In other words, the inclusion of a third variable in the analysis can change the apparent relationship between the other two variables.
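
One way to see why this can happen is the law of total probability. The symbols here are notation chosen for this sketch rather than taken from the slides: `$Y$` is the response event, `$X$` the explanatory variable, and `$Z$` the third variable.

```latex
P(Y \mid X = x) \;=\; \sum_{z} P(Y \mid X = x,\, Z = z)\; P(Z = z \mid X = x)
```

The overall rate for each value of `$X$` is a weighted average of the within-`$Z$` rates, and the weights `$P(Z = z \mid X = x)$` can differ from one value of `$X$` to another. A group can therefore have the higher rate within most or all levels of `$Z$` and still have the lower rate overall.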

---

## Glimpse of data in tidy form

```r
library(tidyverse)  # attaches dplyr, which provides glimpse(), count(), and %>%
glimpse(admissions)
```

```
## Rows: 4,526
## Columns: 3
## $ department <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A,…
## $ gender <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ decision <fct> admit, admit, admit, admit, admit, admit, admit, admit, adm…
```

.footnote[
[https://en.wikipedia.org/wiki/Simpson%27s_paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox)
]

---

## Overall distribution of acceptance by gender

```r
admissions %>%
  count(gender, decision) %>%
  group_by(gender) %>%
  mutate(prop_admit = n / sum(n))
```

```
## # A tibble: 4 x 4
## # Groups: gender [2]
## gender decision n prop_admit
## <chr> <fct> <int> <dbl>
## 1 female deny 1278 0.696
## 2 female admit 557 0.304
## 3 male deny 1493 0.555
## 4 male admit 1198 0.445
```

---

## Overall distribution of acceptance

![](08-confounding_files/figure-html/berkley_hist-1.png)

---

## Closer look

Let's look at data from the six largest departments, labeled A-F:

| Department | Female: Admit | Female: Total | Male: Admit | Male: Total |
|------------|---------------|---------------|-------------|-------------|
| A          | 89            | 108           | 511         | 825         |
| B          | 17            | 25            | 353         | 560         |
| C          | 202           | 593           | 120         | 325         |
| D          | 131           | 375           | 138         | 417         |
| E          | 94            | 393           | 54          | 191         |
| F          | 24            | 341           | 22          | 373         |

---

## UC Berkeley admissions: condition on department

.question[
  Within each department, what is the probability of admission for each gender? 
  That is, within Department X, what are `$P(\text{admit} ~|~ \text{man}, \text{department X})$` and `$P(\text{admit} ~|~ \text{woman}, \text{department X})$`?
]

| Department | Female: Admit | Female: Total | Male: Admit | Male: Total|
|------------|---------|-----|-------|------|
| A       | 89      | 108  | 511   | 825|
| B   | 17     | 25  | 353   | 560 |
| C   | 202      | 593  | 120    | 325 |
| D |  131     | 375  | 138    | 417 |
| E   | 94       | 393   | 54    | 191|
| F      | 24     | 341 | 22   | 373|

---

## UC Berkeley admissions: Conditional probabilities

|   | women|  men|  all|
|:--|-----:|----:|----:|
|A  |  0.82| 0.62| 0.64|
|B  |  0.68| 0.63| 0.63|
|C  |  0.34| 0.37| 0.35|
|D  |  0.35| 0.33| 0.34|
|E  |  0.24| 0.28| 0.25|
|F  |  0.07| 0.06| 0.06|
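
These conditional probabilities are each department's admit count divided by its applicant total. A base R sketch with the counts typed in by hand (the data frame name `berkeley` and its column names are invented for this example):

```r
# Admit counts and applicant totals for the six departments.
# Male total for C is 325 = 205 denied + 120 admitted, per the count() output.
berkeley <- data.frame(
  department   = c("A", "B", "C", "D", "E", "F"),
  female_admit = c( 89,  17, 202, 131,  94,  24),
  female_total = c(108,  25, 593, 375, 393, 341),
  male_admit   = c(511, 353, 120, 138,  54,  22),
  male_total   = c(825, 560, 325, 417, 191, 373)
)

# P(admit | gender, department) and P(admit | department), rounded to 2 digits
rates <- with(berkeley, data.frame(
  department,
  women = round(female_admit / female_total, 2),
  men   = round(male_admit / male_total, 2),
  all   = round((female_admit + male_admit) / (female_total + male_total), 2)
))

rates
```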

---

## Distribution of acceptance by department

```r
admissions %>%
  count(department, gender, decision)
```

```
##    department gender decision   n
## 1           A female     deny  19
## 2           A female    admit  89
## 3           A   male     deny 314
## 4           A   male    admit 511
## 5           B female     deny   8
## 6           B female    admit  17
## 7           B   male     deny 207
## 8           B   male    admit 353
## 9           C female     deny 391
## 10          C female    admit 202
## 11          C   male     deny 205
## 12          C   male    admit 120
## 13          D female     deny 244
## 14          D female    admit 131
## 15          D   male     deny 279
## 16          D   male    admit 138
## 17          E female     deny 299
## 18          E female    admit  94
## 19          E   male     deny 137
## 20          E   male    admit  54
## 21          F female     deny 317
## 22          F female    admit  24
## 23          F   male     deny 351
## 24          F   male    admit  22
```

```r
admissions %>%
  count(department, gender, decision) %>%
  group_by(department, gender) %>%
  mutate(prop_admit = n / sum(n)) 
```

```
## # A tibble: 24 x 5
## # Groups: department, gender [12]
## department gender decision n prop_admit
## <fct> <chr> <fct> <int> <dbl>
## 1 A female deny 19 0.176
## 2 A female admit 89 0.824
## 3 A male deny 314 0.381
## 4 A male admit 511 0.619
## 5 B female deny 8 0.32 
## 6 B female admit 17 0.68 
## 7 B male deny 207 0.370
## 8 B male admit 353 0.630
## 9 C female deny 391 0.659
## 10 C female admit 202 0.341
## # … with 14 more rows
```

---

## Distribution of acceptance by department

![](08-confounding_files/figure-html/berkley_hist_facet-1.png)

---

## UC Berkeley admissions: closer look

| Department | Female: Total | Female: Acceptance | Male: Total | Male: Acceptance |
|------------|---------|-----|-------|------|
| A       | 108      |  0.82 | 825 | 0.62 |
| B   | 25     |  0.68  | 560   | 0.63 |
| C   | 593     | 0.34  | 325    | 0.37 |
| D |  375    | 0.35  | 417    | 0.33 |
| E   | 393      | 0.24   | 191    | 0.28 |
| F      | 341     | 0.07 | 373   | 0.06 |

- Are the departments uniform in their admission rates? Notice that **A** and **B** have the highest acceptance rates.

- Rank departments by total number of male applicants: **A** > **B** > D > F > C > E

- Rank departments by total number of female applicants: C > E > D > F > **A** > **B**
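
The rankings matter because each gender's overall admission rate is an average of the department rates, weighted by how many of that gender's applicants applied to each department. A base R sketch of that calculation, with the counts typed in by hand (variable names are invented for this example):

```r
# Admit counts and applicant totals for departments A-F.
# Male total for C is 325 = 205 denied + 120 admitted, per the count() output.
female_admit <- c( 89,  17, 202, 131,  94,  24)
female_total <- c(108,  25, 593, 375, 393, 341)
male_admit   <- c(511, 353, 120, 138,  54,  22)
male_total   <- c(825, 560, 325, 417, 191, 373)

# Overall rate = department rates weighted by number of applicants,
# which is the same as sum(admit) / sum(total)
overall_female <- weighted.mean(female_admit / female_total, w = female_total)
overall_male   <- weighted.mean(male_admit / male_total, w = male_total)

# Share of each gender's applicants who applied to departments A or B,
# the two departments with the highest acceptance rates
share_ab_female <- sum(female_total[1:2]) / sum(female_total)
share_ab_male   <- sum(male_total[1:2]) / sum(male_total)

round(c(overall_female = overall_female, overall_male = overall_male), 3)
round(c(share_ab_female = share_ab_female, share_ab_male = share_ab_male), 3)
```

The overall rates match the `prop_admit` values for `admit` in the overall tibble (0.304 for women and 0.445 for men). The shares show where the gap comes from: roughly half of the male applicants applied to A or B, compared with well under a tenth of the female applicants.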

---

## UC Berkeley admissions: plot

.footnote[
  [https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf](https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf)
]