# Intro to Probability
### Becky Tang
### 05.26.2021

---

<div class="my-footer">

<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>

</div>

---

## What we've done so far...

- Use visualization techniques to *visualize* data  
- Use descriptive statistics to *describe* and *summarize* data
- Use data wrangling tools to *manipulate* data
- ...all using the reproducible, shareable tools of R and git

That's all great, but what we eventually want to do is *quantify uncertainty*
so that we can draw **principled conclusions** about the data.
  
---

## The statistical process

1. form a question of interest,

2. collect and summarize data,

3. and interpret the results.

---

## The population of interest

The .vocab[population] is the group we'd like to learn something about. For 
example:

.small[
- What is the prevalence of diabetes among **U.S. adults**, and has it changed
over time? 
- Does the average amount of caffeine vary by vendor in **12 oz. cups of**
**coffee at Duke coffee shops**?
- Is there a relationship between tumor type and five-year mortality among
**breast cancer patients**?
]

The .vocab[research question of interest] is what we want to answer - often 
relating one or more numerical quantities or summary statistics.

If we had data from every unit in the population, we could just calculate what
we wanted and be done!

---

## Sampling from the population

Unfortunately, we (usually) have to settle for a .vocab[sample] from the
population.

Ideally, the sample is .vocab[representative] (its characteristics are similar to those of the population), allowing us to draw conclusions 
that are .vocab[generalizable] (i.e., applicable) to the broader population of interest.

We'll use probability and statistical inference (more on this later!) to draw conclusions about the population based on our sample.

---

# Interpreting probabilities

---

## Interpretations of probability

---

# Formalizing probabilities

---

## What do we need?

We can think of probabilities as objects that model random phenomena. We'll use three components to talk about probabilities:

1.  .vocab[Sample space]: the set of all possible .vocab[outcomes]

2.  .vocab[Events]: subsets of the sample space, each comprising any number of possible outcomes (including none of them!)

3.  .vocab[Probability]: Proportion of times an event would occur if we observed the random phenomenon an infinite number of times.

---

## Sample spaces

Sample spaces depend on the random phenomenon in question:

- Tossing a single fair coin

- Sum of rolling two fair six-sided dice

- Guessing the answer on a multiple choice question with choices *a, b, c, d*.
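
A minimal Python sketch of what these three sample spaces could look like as sets. The labels `"H"` and `"T"` for the coin faces are an assumption made for illustration, as is treating each answer choice as one outcome.

```python
from itertools import product

# Tossing a single fair coin: two possible outcomes.
coin = {"H", "T"}

# Sum of rolling two fair six-sided dice: collect every total that can occur.
die = range(1, 7)
dice_sums = {a + b for a, b in product(die, repeat=2)}

# Guessing on a multiple choice question: one outcome per answer choice.
guess = {"a", "b", "c", "d"}

print(sorted(coin))       # ['H', 'T']
print(sorted(dice_sums))  # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(sorted(guess))      # ['a', 'b', 'c', 'd']
```

Note that the eleven possible sums are not equally likely, even though the 36 ordered pairs of faces are.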

---

## Events

.vocab[Events] are subsets of the sample space: an event comprises every outcome for which it is said to occur. These are the "plausibly reasonable" sets of outcomes whose probabilities we may want to calculate:

- Tossing a single fair coin

- Sum of rolling two fair six-sided dice

- Guessing the answer on a multiple choice question with choices *a, b, c, d*.
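
Because events are just subsets, a finite sample space has a fixed number of them. The sketch below lists every event for a small sample space; `all_events` is a hypothetical helper written for this illustration, and the coin labels are assumed.

```python
from itertools import chain, combinations

def all_events(space):
    """Every subset of a finite sample space, from the empty event to the whole space."""
    items = sorted(space)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return [set(s) for s in subsets]

coin_events = all_events({"H", "T"})
print(len(coin_events))   # 4: no outcome, heads only, tails only, either face

guess_events = all_events({"a", "b", "c", "d"})
print(len(guess_events))  # 16

# One event for the dice example: "the sum is even".
dice_sums = set(range(2, 13))
even_sum = {s for s in dice_sums if s % 2 == 0}
print(sorted(even_sum))   # [2, 4, 6, 8, 10, 12]
```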

---

## Probabilities

Consider the following possible events and their corresponding probabilities:

- Getting a head from a single fair coin toss: **0.5**
- Getting a prime number sum from rolling two fair six-sided dice: **5/12**
- Guessing the correct answer: **1/4**

*We'll talk more about how we calculated these probabilities, but for now remember that
a probability maps each event to a number between 0 and 1, inclusive, that describes
how likely the event is to occur.*
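
One way to reproduce the three numbers above is to count equally likely outcomes. This is a sketch under two assumptions: the guess is made uniformly at random, and `"c"` is an arbitrary stand-in for the correct choice. The helper `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def prob(event, space):
    """Probability of `event` when every outcome in `space` is equally likely."""
    return Fraction(len(event & space), len(space))

# Single fair coin toss.
coin = {"H", "T"}
p_head = prob({"H"}, coin)

# Two fair dice: the 36 ordered pairs are equally likely (the sums are not).
rolls = set(product(range(1, 7), repeat=2))
primes = {2, 3, 5, 7, 11}
prime_sum = {r for r in rolls if sum(r) in primes}
p_prime = prob(prime_sum, rolls)

# Guessing among four choices; "c" stands in for the correct answer.
choices = {"a", "b", "c", "d"}
p_correct = prob({"c"}, choices)

print(p_head, p_prime, p_correct)  # 1/2 5/12 1/4
```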

---

# Working with probabilities

---

## Set operations

Remember that events are (sub)sets of the sample space. For two sets (in this
case events) `\(A\)` and `\(B\)`, the most common operations are:
 
- .vocab[Intersection] `\((A \text{ and } B)\)`: `\(A\)` **and** `\(B\)` both occur
- .vocab[Union] `\((A \text{ or } B)\)`: `\(A\)` **or** `\(B\)` occurs (including when both occur)
- .vocab[Complement] `\((A^c)\)`: `\(A\)` does **not** occur

Two sets `\(A\)` and `\(B\)` are said to be .vocab[disjoint] or .vocab[mutually exclusive] if they cannot happen at the same time, i.e. `\(A \text{ and } B = \emptyset\)`.
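
These operations map directly onto Python's built-in `set` type. The sample space and the two events below (one roll of a fair six-sided die) are chosen only for illustration.

```python
# Sample space: the face shown by one fair six-sided die.
space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # the roll is even
B = {4, 5, 6}   # the roll is greater than 3

print(sorted(A & B))      # intersection, A and B: [4, 6]
print(sorted(A | B))      # union, A or B: [2, 4, 5, 6]
print(sorted(space - A))  # complement of A: [1, 3, 5]

# A and its complement can never occur together, so they are disjoint.
print(A.isdisjoint(space - A))  # True
print(A.isdisjoint(B))          # False
```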

---

## Combining set operations

- Complement of union: `\((A \text{ or } B)^c = A^c \text{ and } B^c\)`
- Complement of intersection: `\((A \text{ and } B)^c = A^c \text{ or } B^c\)`

These can be straightforwardly extended to more than two events.
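
As a sanity check (not a proof), the two identities can be tested by brute force over every pair of events in a small sample space. The four-outcome space here is an arbitrary choice.

```python
from itertools import combinations

space = frozenset(range(1, 5))  # a small illustrative sample space

# Every event: all subsets of the sample space.
events = [
    frozenset(c)
    for k in range(len(space) + 1)
    for c in combinations(sorted(space), k)
]

def complement(event):
    return space - event

checked = 0
for A in events:
    for B in events:
        assert complement(A | B) == complement(A) & complement(B)
        assert complement(A & B) == complement(A) | complement(B)
        checked += 1

print(checked)  # 256 pairs of events, no counterexample

# The same pattern with three events.
A, B, C = frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})
assert complement(A | B | C) == complement(A) & complement(B) & complement(C)
```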

---

## How do probabilities work?

- The probability of any event is a real number that's `\(\geq 0\)`

-  The probability of the entire sample space is 1

-  If `\(A\)` and `\(B\)` are disjoint events, then `\(P(A \text{ or } B) = P(A) + P(B)\)`

These three rules are the Kolmogorov axioms. They imply that every probability is between 0 and 1 inclusive, and they also lead to important rules...

---

## Two important rules

Suppose we have events `\(A\)` and `\(B\)`, with probabilities `\(P(A)\)` and `\(P(B)\)` of
occurring. Based on the Kolmogorov axioms:

- .vocab[Complement Rule]: `\(P(A^c) = 1 - P(A)\)`
- .vocab[Inclusion-Exclusion]: `\(P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)\)`
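
Both rules can be checked numerically on the two-dice example. In this sketch the events `A` (the sum is prime) and `B` (both dice show the same face) are chosen for illustration, and `prob` is a hypothetical helper that assumes all 36 rolls are equally likely.

```python
from fractions import Fraction
from itertools import product

rolls = set(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    return Fraction(len(event), len(rolls))

A = {r for r in rolls if sum(r) in {2, 3, 5, 7, 11}}  # the sum is prime
B = {r for r in rolls if r[0] == r[1]}                # both dice match

# Complement rule.
assert prob(rolls - A) == 1 - prob(A)

# Inclusion-exclusion: subtract the overlap so it is not counted twice.
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

print(prob(A), prob(B), prob(A & B), prob(A | B))  # 5/12 1/6 1/36 5/9
```

The only roll in both events is (1, 1), which is why simply adding `\(P(A)\)` and `\(P(B)\)` would overshoot by 1/36.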

---

## Practicing with probabilities: Admissions

UC Berkeley admission figures for Fall of 1973:

| | Admit | Deny | Total |
|----|----|-----|-----|
|Men | 3738 |  4704 | 8442|
|Women| 1494 | 2827 | 4321 |
|Total| 5232 | 7531 | 12763|

Source: https://homepage.stat.uiowa.edu/~mbognar/1030/Bickel-Berkeley.pdf

---

## Practicing with probabilities

UC Berkeley admission figures for Fall of 1973:

| | Admit | Deny | Total |
|----|----|-----|-----|
|Men | 3738 |  4704 | 8442|
|Women| 1494 | 2827 | 4321 |
|Total| 5232 | 7531 | 12763|

.question[
`\(P(\text{admission}) = \dfrac{\text{# admitted}}{\text{# applied}} = \dfrac{5232}{12763} \approx 0.41\)`
]

---

## Practicing with probabilities

| | Admit | Deny | Total |
|----|----|-----|-----|
|Men | 3738 |  4704 | 8442|
|Women| 1494 | 2827 | 4321 |
|Total| 5232 | 7531 | 12763|

.question[
- `\(\small{P(\text{admit among men}) =}\)` ?
- `\(\small{P(\text{admit among women})=}\)` ? 
]

---

## Practicing with probabilities

| | Admit | Deny | Total |
|----|----|-----|-----|
|Men | 3738 |  4704 | 8442|
|Women| 1494 | 2827 | 4321 |
|Total| 5232 | 7531 | 12763|

.question[
- `\(P(\text{admit among men}) = \dfrac{3738}{8442} \approx 0.44\)` 
- `\(P(\text{admit among women}) = \dfrac{1494}{4321} \approx 0.35\)` 
]
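
The same three proportions can be computed from the table's counts. This is a small sketch; the dictionary layout is just one convenient way to hold the numbers.

```python
# Counts copied from the admissions table.
admit = {"Men": 3738, "Women": 1494}
deny = {"Men": 4704, "Women": 2827}

# Overall: everyone admitted divided by everyone who applied.
total_applied = sum(admit.values()) + sum(deny.values())
p_admit = sum(admit.values()) / total_applied

# Within each group: restrict both numerator and denominator to that row.
p_admit_by_group = {g: admit[g] / (admit[g] + deny[g]) for g in admit}

print(round(p_admit, 2))  # 0.41
print({g: round(p, 2) for g, p in p_admit_by_group.items()})
# {'Men': 0.44, 'Women': 0.35}
```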

---

## Practicing with probabilities: Coffee

|                           | Did not die| Died|
|:--------------------------|-----------:|----:|
|Does not drink coffee      |        5438| 1039|
|Drinks coffee occasionally |       29712| 4440|
|Drinks coffee regularly    |       24934| 3601|

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5788283/

---

## Practicing with probabilities
.midi[

.question[
.midi[
Define events *A* = died and *B* = non-coffee drinker. Calculate the following for a randomly selected person in the cohort:]
- `\(\small{P(A)}\)`
- `\(\small{P(B)}\)`
- `\(\small{P(A \text{ and } B)}\)`
- `\(\small{P(A \text{ or } B)}\)`
- `\(\small{P(A \text{ or } B^c)}\)`
]
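
One way to check your answers is to compute each probability from the table's counts. The sketch below assumes every person in the table is equally likely to be selected; the variable names are illustrative.

```python
from fractions import Fraction

# (did not die, died) counts by coffee habit, copied from the table.
counts = {
    "none": (5438, 1039),
    "occasional": (29712, 4440),
    "regular": (24934, 3601),
}

total = sum(alive + died for alive, died in counts.values())
died_total = sum(died for _, died in counts.values())
non_drinkers = sum(counts["none"])

p_a = Fraction(died_total, total)               # P(A): died
p_b = Fraction(non_drinkers, total)             # P(B): non-coffee drinker
p_a_and_b = Fraction(counts["none"][1], total)  # died and non-coffee drinker
p_a_or_b = p_a + p_b - p_a_and_b                # inclusion-exclusion

# "A or B^c" fails only for non-drinkers who did not die (A^c and B),
# so the complement rule gives its probability.
p_a_or_not_b = 1 - Fraction(counts["none"][0], total)

for label, p in [
    ("P(A)", p_a),
    ("P(B)", p_b),
    ("P(A and B)", p_a_and_b),
    ("P(A or B)", p_a_or_b),
    ("P(A or B^c)", p_a_or_not_b),
]:
    print(label, round(float(p), 3))
```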