What If: Notions of Causality

Mon Jul 31 2023

Series: Working Through 'What If?'

These notes are effectively my own study notes from the book I’m working from.
I recommend reading the book. The preprint is free and it’s good.

Causal Inference Without Models

is the title of part 1. The first chapter is basically just a formalization of what a causal effect is. The biggest takeaway is:

Causal inference is hard, and I won’t learn everything I could possibly know about it in this book.
Damn.

Causal Effects

Individual Effects

Lenny and Trevor both have a headache. Lenny takes an aspirin and his headache goes away. Trevor takes an aspirin and his headache goes away. But we also have a universe-simulation-machine and can see what would have happened if each of them had not taken an aspirin. In that universe, Lenny’s headache would have gone away anyway, but Trevor’s would not have.

So, we say that the aspirin had a causal effect on Trevor’s headache, but not on Lenny’s. Pretty straightforward.

We use $Y$ to denote the outcome variable, and $A$ to denote the treatment variable. In this case we have $Y \in \{0, 1\}$ and $A \in \{0, 1\}$.
We denote the outcome under treatment value $a$ as $Y^a$ (so $Y^{a=1}$ is the outcome under treatment).

Trevor’s headache is $Y^{a=1}=1$, $Y^{a=0}=0$, indicating that under treatment his headache went away, but it wouldn’t have otherwise.
Lenny’s headache is $Y^{a=1}=1$, $Y^{a=0}=1$, indicating that under treatment his headache went away, but it would have anyway.

The treatment has a causal effect on Trevor’s headache, but not on Lenny’s, because $Y^{a=1} \neq Y^{a=0}$ for Trevor, but $Y^{a=1} = Y^{a=0}$ for Lenny. Of course, in practice we don’t have a universe-rewinding machine, so we can’t know what would have happened if Lenny hadn’t taken an aspirin. These outcomes are counterfactuals, and they are really how we humans intuitively think about causality. We try to imagine what would have happened if we had done something differently, and we use that to inform our decisions (with variable success).

A counterfactual is in fact factual if $Y = Y^A$; that is, the counterfactual under the treatment actually administered equals the outcome we actually observed. We call this consistency. This framing is simple but interesting: individual causal effects are contrasts of counterfactuals. Of course, we can only ever observe one of the counterfactuals for a given individual, so we can’t actually identify individual effects.
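This can be made concrete with a tiny sketch, using hypothetical data (in reality we never observe both potential outcomes):

```python
# A minimal sketch of individual causal effects, assuming (counterfactually!)
# that we can observe both potential outcomes for each person.
# Outcome coding follows the text: 1 = headache went away, 0 = it didn't.
counterfactuals = {
    "Trevor": {"Y_a0": 0, "Y_a1": 1},  # recovers only under treatment
    "Lenny":  {"Y_a0": 1, "Y_a1": 1},  # recovers either way
}

def has_individual_effect(y):
    # The treatment has an individual causal effect iff Y^{a=1} != Y^{a=0}.
    return y["Y_a1"] != y["Y_a0"]

for name, y in counterfactuals.items():
    print(name, has_individual_effect(y))  # Trevor True, Lenny False
```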

Average Effects

However, we can do some interesting stuff with averages.

At a population level, each individual has a pair of counterfactual outcomes $Y^{a=0}$ and $Y^{a=1}$.

Assume we have a population and complete knowledge of everyone’s counterfactual outcomes.

| Name | $Y^{a=0}$ | $Y^{a=1}$ |
| --- | --- | --- |
| John | 0 | 1 |
| Mary | 1 | 0 |
| Daniel | 0 | 0 |
| Emily | 1 | 1 |
| Max | 0 | 1 |
| Olivia | 1 | 0 |
| Ethan | 0 | 0 |
| Sophia | 1 | 1 |
| Benjamin | 0 | 1 |
| Victoria | 1 | 0 |

If we take the average of each column, we find $\Pr[Y^{a=1} = 1] = 0.5$ and $\Pr[Y^{a=0} = 1] = 0.5$. An average causal effect exists if $\Pr[Y^{a=1} = 1] \neq \Pr[Y^{a=0} = 1]$, which is not the case here. More formally, the null hypothesis of no average causal effect is true.
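As a sanity check, here’s the same calculation in code, with the counterfactual table hard-coded as $(Y^{a=0}, Y^{a=1})$ pairs:

```python
# Averages from the counterfactual table above (a sketch; in reality
# we never observe both columns at once).
table = {
    "John": (0, 1), "Mary": (1, 0), "Daniel": (0, 0), "Emily": (1, 1),
    "Max": (0, 1), "Olivia": (1, 0), "Ethan": (0, 0), "Sophia": (1, 1),
    "Benjamin": (0, 1), "Victoria": (1, 0),
}

n = len(table)
pr_y1_a1 = sum(y1 for _, y1 in table.values()) / n  # Pr[Y^{a=1} = 1]
pr_y1_a0 = sum(y0 for y0, _ in table.values()) / n  # Pr[Y^{a=0} = 1]
print(pr_y1_a1, pr_y1_a0)  # 0.5 0.5 -> no average causal effect
```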

One biggy to remember is that just because there is no average causal effect doesn’t mean there are no individual causal effects. In fact, we can see them right there in the table. Remember that we have perfect knowledge of counterfactuals, so we can see that John and Mary (among others) would have had different outcomes if they had taken different treatments.

The average causal effect $E[Y^{a=1}] - E[Y^{a=0}]$ is equal to the average $E[Y^{a=1} - Y^{a=0}]$ of the individual causal effects.

If there are truly no individual causal effects, then the sharp null hypothesis is true. This is the case where $Y^{a=1} = Y^{a=0}$ for all individuals.

Note: Interference

We implicitly assume that the treatment of one individual does not affect the outcome of another; violations of this assumption are called interference. If there is interference, it becomes… rather difficult to talk about counterfactuals. What does it even mean to say that “if John had not taken an aspirin, Mary’s headache would have gone away”? So usually we assume no interference, but I have a feeling this will come back later.

Note: Treatment Variation

We also assume that there is only one version of the treatment. For example, John and Mary aren’t taking different doses of aspirin. If all versions of a treatment have the same outcome, then we have treatment variation irrelevance.
As a personal note, it seems pretty easy to split up versions of a treatment and treat them individually, but maybe it’s more complicated than that?

Measures of Causal Effects

We already said that the causal null hypothesis holds in the example above because $\Pr[Y^{a=1} = 1] = \Pr[Y^{a=0} = 1]$.
But there are other ways of representing the same thing.

  • We can say the Causal Risk Difference is 0: $\Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1] = 0$
  • We can say the Risk Ratio is 1: $\frac{\Pr[Y^{a=1} = 1]}{\Pr[Y^{a=0} = 1]} = 1$
  • We can say the Odds Ratio is 1: $\frac{\Pr[Y^{a=1} = 1]/\Pr[Y^{a=1} = 0]}{\Pr[Y^{a=0} = 1]/\Pr[Y^{a=0} = 0]} = 1$

Consider a different example, where a population of 100 million people has a heart condition and is being prescribed a new drug (here $Y = 1$ means death within 5 years). Say we know that:

  • 20 million people will die within 5 years if they take the drug
  • 30 million people will die within 5 years if they don’t take the drug

Then:

  • The Causal Risk Difference is: $\Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1] = 0.2 - 0.3 = -0.1$
  • The Risk Ratio is: $\frac{\Pr[Y^{a=1} = 1]}{\Pr[Y^{a=0} = 1]} = \frac{0.2}{0.3} \approx 0.67$
  • The Odds Ratio is: $\frac{\Pr[Y^{a=1} = 1]/\Pr[Y^{a=1} = 0]}{\Pr[Y^{a=0} = 1]/\Pr[Y^{a=0} = 0]} = \frac{0.2/0.8}{0.3/0.7} \approx 0.58$

Sticking with the heart condition example, we can also talk about the number needed to treat (NNT). This is the number of people who need to take the drug for one person to benefit. There are 100 million people in the population, and 10 million of them (30 million minus 20 million) will be saved by taking the drug. So the NNT is 10. Generally speaking:

  • NNT = -1 / Causal Risk Difference

If this is a negative number, then taking the absolute value of it gives us the number needed to harm (NNH). This is the number of people who need to take the drug for one person to be harmed.
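A quick sketch computing all of these measures for the heart-drug numbers above:

```python
# Effect measures for the hypothetical heart-drug example.
pr_treated = 20 / 100    # Pr[Y^{a=1} = 1]: deaths within 5 years if all treated
pr_untreated = 30 / 100  # Pr[Y^{a=0} = 1]: deaths within 5 years if none treated

risk_difference = pr_treated - pr_untreated  # -0.1 (absolute measure)
risk_ratio = pr_treated / pr_untreated       # ~0.67 (relative measure)
odds_ratio = (pr_treated / (1 - pr_treated)) / (pr_untreated / (1 - pr_untreated))  # ~0.58
nnt = -1 / risk_difference                   # 10: treat 10 people to save 1
```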

Causal Risk Difference gives us an absolute measure of the effect of the treatment. If we want to know the relative measure of the effect of the treatment, we can use the Risk Ratio or Odds Ratio.

Random Variability

Sampling Variability

Even if we have perfect counterfactual knowledge of a sample, we can’t know the counterfactuals of the population. So we can’t know the true causal effect, only an estimate. We really should have used slightly different notation for the sample that we analyzed.

  • $\widehat{\Pr}[\ldots]$ is the sample estimate of the probability of an event.
  • $\Pr[\ldots]$ is the population probability of an event.

Even though we measured $\widehat{\Pr}[Y^{a=1} = 1] = 0.5$ and $\widehat{\Pr}[Y^{a=0} = 1] = 0.5$, it’s possible, for example, that $\Pr[Y^{a=1} = 1] = 0.6$ and $\Pr[Y^{a=0} = 1] = 0.4$ in the population. This difference is called sampling variability. $\widehat{\Pr}[Y^{a=1} = 1]$ is a consistent estimator of $\Pr[Y^{a=1} = 1]$ if $\widehat{\Pr}[Y^{a=1} = 1] \rightarrow \Pr[Y^{a=1} = 1]$ as the sample size increases (which it does here, by the law of large numbers). So we can’t actually say for certain whether a causal effect exists, but we can statistically estimate it. Sampling variability is not the only monster in these woods, though.
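A small simulation (with an assumed population probability of 0.6) illustrates consistency: the sample estimate drifts toward the population value as the sample grows.

```python
import random

random.seed(0)
true_pr = 0.6  # assumed population Pr[Y^{a=1} = 1]

def sample_estimate(n):
    # Sample estimate of Pr[Y^{a=1} = 1] from n simulated individuals.
    draws = [1 if random.random() < true_pr else 0 for _ in range(n)]
    return sum(draws) / n

for n in (10, 1_000, 100_000):
    print(n, sample_estimate(n))  # estimates approach 0.6 as n grows
```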

Nondeterministic Counterfactuals

We’ve been assuming that the counterfactuals are deterministic. That is, if John takes the aspirin, then his headache disappears. If he doesn’t, it stays. But what if it’s probabilistic? What if John has a 90% chance of recovery under treatment? What if he has a 25% chance of recovery under no treatment?

In the probabilistic case, $E[Y^a]$ is the weighted sum over the possible outcomes:

$$E[Y^a] = \sum_{y} y \cdot \Pr[Y^a = y]$$

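For instance, with the hypothetical 90% / 25% recovery probabilities above:

```python
# E[Y^a] as a weighted sum over possible outcomes y of Pr[Y^a = y].
def expected_outcome(pr_y):
    return sum(y * p for y, p in pr_y.items())

e_treated = expected_outcome({0: 0.10, 1: 0.90})    # E[Y^{a=1}] = 0.9
e_untreated = expected_outcome({0: 0.75, 1: 0.25})  # E[Y^{a=0}] = 0.25
```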
Causation and Association

Say it with me! They are not the same! Obviously we don’t actually know counterfactuals. If John took the aspirin, we can’t also know what would have happened if he hadn’t. This is reality, and we are stuck knowing only one outcome per person. So let’s randomly choose one of the two counterfactual outcomes in the table above for each person, to simulate more realistic data.

| Person | Treatment ($A$) | Outcome ($Y$) |
| --- | --- | --- |
| John | 1 | 1 |
| Mary | 1 | 0 |
| Daniel | 0 | 0 |
| Emily | 1 | 1 |
| Max | 0 | 0 |
| Olivia | 0 | 1 |
| Ethan | 0 | 0 |
| Sophia | 1 | 1 |
| Benjamin | 1 | 1 |
| Victoria | 0 | 0 |

Let’s go back to high school and do some simple probability. By way of the lost art of counting, we can see that

$\Pr[Y=1 \mid A=1] = \frac{4}{5}$ and $\Pr[Y=1 \mid A=0] = \frac{1}{5}$

Whoa, that’s quite the difference. But note, again, we actually just picked one of the two outcomes for each person. When we knew what would have happened under both treatments, there was no risk difference. Now that we don’t, we (somewhat unluckily) have a huge risk difference. This is the difference between causation and association, and it’s why our statistical forefathers warned us against conflating the two.

Mirroring our causal equations, we can also define them for associations.

  1. Association Risk Difference: $\Pr[Y=1 \mid A=1] - \Pr[Y=1 \mid A=0]$
  2. Association Risk Ratio: $\frac{\Pr[Y=1 \mid A=1]}{\Pr[Y=1 \mid A=0]}$
  3. Association Odds Ratio: $\frac{\Pr[Y=1 \mid A=1]/\Pr[Y=0 \mid A=1]}{\Pr[Y=1 \mid A=0]/\Pr[Y=0 \mid A=0]}$

These measures quantify the strength of the association between A and Y. They are not causal measures.
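Computing them from the observed data above (hard-coded here as $(A, Y)$ pairs):

```python
# Association measures from the observed (treatment, outcome) data.
observed = [  # (name, treatment A, outcome Y)
    ("John", 1, 1), ("Mary", 1, 0), ("Daniel", 0, 0), ("Emily", 1, 1),
    ("Max", 0, 0), ("Olivia", 0, 1), ("Ethan", 0, 0), ("Sophia", 1, 1),
    ("Benjamin", 1, 1), ("Victoria", 0, 0),
]

def pr_y1_given(a):
    # Pr[Y = 1 | A = a]: proportion of outcomes in the subgroup with A = a.
    group = [y for _, ai, y in observed if ai == a]
    return sum(group) / len(group)

p1, p0 = pr_y1_given(1), pr_y1_given(0)            # 0.8, 0.2
risk_difference = p1 - p0                          # 0.6
risk_ratio = p1 / p0                               # 4.0
odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))     # 16.0
```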

In fact, we can now easily formalize the difference between causation and association. Association is the risk difference (or risk ratio, or odds ratio) between the disjoint subgroups of the population, as separated by their treatment value. Causation is the risk difference in the same population under two different treatment values. And there are lots of reasons why association and causation measures will yield different results.

That brings us to the end of Chapter 1. If you’re interested in the tech portion of this chapter, you can find it here.