Causal Inference and Statistical Learning: A Comprehensive Guide

While artificial intelligence and predictive inference dominate current discussions, mastering causal inference is becoming increasingly crucial. Causal inference focuses on understanding the "why" behind data, moving beyond mere prediction to discern cause-and-effect relationships. This article serves as a self-study guide, suitable for all levels, to equip you with the knowledge to confidently determine causal relationships.

Introduction: Why Causal Inference Matters

Causal inference, the field dedicated to understanding cause-and-effect relationships, seeks to answer critical questions of 'Why?' and 'What if?'. Understanding causality is crucial for addressing a wide range of issues, from combating climate change to making strategic decisions.

Consider these examples of major questions requiring causal inference:

  • What impact might banning fuel cars have on pollution levels?
  • What are the causes behind the spread of certain health issues?
  • Could reducing screen time lead to increased happiness?
  • What is the Return On Investment of our ad campaign?

Causal inference is a valuable skill to acquire today:

  1. Broad Applicability: It is tremendously useful for virtually any job, extending beyond data scientists to include business leaders and managers.
  2. Niche Expertise: Few people are experts in this field, and interest in it is growing fast.
  3. Relevance to AI: "Causal machine learning" is a growing trend. Knowing causal inference will help you connect this knowledge with the current AI focus.

Key Concepts in Causal Inference

The Fundamental Problem of Causal Inference

The most fundamental concept necessary to understand causal inference can be illustrated through a common scenario. Imagine you've been working on your computer all day, a deadline is approaching, and you start to feel a headache coming on. You decide to take a pill, and after a while, your headache is gone. Was it really the pill that made the difference? Or was it because you drank tea or took a break? It is impossible to definitively answer this question because all those effects are confounded.

The only way to know for certain if it was the pill that cured your headache would be to have two parallel worlds. In one world, you take the pill, and in the other, you don’t (or you take a placebo). The pill's causal effect can only be proven if you feel better in the world where you took the pill, as the pill is the only difference between the two worlds.

Unfortunately, we do not have access to parallel worlds to experiment with and assess causality. Hence, many factors occur simultaneously and are confounded (e.g., taking a pill for a headache, drinking tea, and taking a break; increasing ad spending during peak sales seasons; assigning more police officers to areas with higher crime rates, etc.).

Formalizing Causality: Potential Outcomes

The potential outcomes framework allows for the clear articulation of model assumptions. These are essential for specifying the problems and identifying the solutions.

The central notation of this framework is:

  • Yᵢ(0) represents the potential outcome of individual i without the treatment.
  • Yᵢ(1) represents the potential outcome of individual i with the treatment.

The reference to the treatment (1 or 0) may appear in parentheses, superscript, or subscript. The letter “Y” refers to the outcome of interest, such as a binary variable that takes the value one if a headache is present and zero otherwise. The subscript “i” refers to the observed entity (e.g., a person, a lab rat, a city, etc.). Finally, the term “treatment” refers to the “cause” you are interested in (e.g., a pill, an advertisement, a policy, etc.).

Using this notation, we can refer to the fundamental problem of causal inference by stating that it is impossible to observe both Yᵢ(0) and Yᵢ(1) simultaneously. In other words, you never observe the outcome for the same individual with and without the treatment at the same time.

While we cannot identify the individual effect Yᵢ(1)-Yᵢ(0), we can estimate the Average Treatment Effect (ATE): E[Yᵢ(1)-Yᵢ(0)]. However, a naive estimate of the ATE, obtained by simply comparing treated and untreated groups, is biased if there are systematic differences between the two groups other than the treatment itself.
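
To make this concrete, here is a minimal simulation of the headache example (all numbers are made up): every individual's pill effect is -2 by construction, but because people with worse headaches self-select into taking the pill, the naive treated-versus-untreated comparison is badly biased:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
severity = rng.normal(0.0, 1.0, n)                   # confounder: how bad the headache is

# Hypothetical potential outcomes: pain with and without the pill.
y0 = 5.0 + 2.0 * severity + rng.normal(0.0, 1.0, n)  # Y_i(0): pain without the pill
y1 = y0 - 2.0                                        # Y_i(1): the pill lowers pain by 2

true_ate = (y1 - y0).mean()                          # exactly -2 by construction

# Self-selection: people with worse headaches are more likely to take the pill.
treated = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * severity))
y_obs = np.where(treated, y1, y0)                    # only one potential outcome is observed

# Naive comparison of treated vs. untreated group means is biased by self-selection.
naive = y_obs[treated].mean() - y_obs[~treated].mean()
print(round(true_ate, 2), round(naive, 2))
```

In this simulation the naive comparison even flips the sign of the effect, wrongly suggesting the pill makes headaches worse, because pill-takers had worse headaches to begin with.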

Visualizing Causal Links: Directed Acyclic Graphs

Visual representations clarify assumptions and facilitate communication. In causal inference, directed graphs are used: they depict the various elements (e.g., headache, pill, tea) as nodes, connected by unidirectional arrows that indicate the direction of causal relationships. Causal inference differs from predictive inference in that it assumes underlying causal relationships, and those relationships are explicitly represented using this special kind of graph, called a Directed Acyclic Graph (DAG). This tool, together with the potential outcomes framework, is at the core of causal inference and allows you to think clearly about the potential problems and, consequently, the solutions for assessing causality.
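
As a sketch, a DAG can be represented as a plain adjacency mapping, and the "acyclic" property checked with a depth-first search. The edges below are illustrative assumptions for the headache example, not claims from any study:

```python
# A tiny illustrative causal graph: edges point from cause to effect.
dag = {
    "deadline": ["headache", "pill"],   # assumed: stress causes headaches and pill-taking
    "pill": ["headache"],               # the causal effect we want to measure
    "tea": ["headache"],
    "headache": [],
}

def is_acyclic(graph):
    """Check the 'acyclic' part of DAG via depth-first search for back edges."""
    WHITE, GRAY, BLACK = 0, 1, 2        # unvisited, in progress, done
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for child in graph.get(node, []):
            if color[child] == GRAY:    # back edge found -> cycle
                return False
            if color[child] == WHITE and not visit(child):
                return False
        color[node] = BLACK
        return True

    return all(visit(n) for n in graph if color[n] == WHITE)

print(is_acyclic(dag))  # True: no feedback loops, so this is a valid DAG
```

A cycle such as "headache causes pill-taking causes headache" would make the graph fail this check, which is exactly why causal graphs must be acyclic.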

Technical Tools for Causal Inference

To apply these methods to data, you need:

  1. A foundational understanding of probability, statistics, and linear regression.
  2. Working knowledge of statistical software.

Probability, Statistics, and Linear Regression

These tools are valuable for data science in general, and you can concentrate specifically on what matters most for causal inference. Both of the reference books dedicate a chapter to this topic, covering exactly the concepts useful for causal inference. One of the most valuable yet often overlooked topics related to linear regression is the concept of “bad controls”: understanding what you should control for, and what will actually create problems, is key. Finally, understanding fixed effects regression is essential for causal inference. This type of regression allows us to account for numerous confounding factors that might be impossible to measure (e.g., culture) or for which data is simply unavailable.
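
Here is a minimal sketch of the fixed effects (within) estimator on simulated panel data (all numbers are made up): a naive pooled regression is biased by an unobserved entity-level confounder, while demeaning each variable within its entity removes that confounder and recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_periods = 200, 5
alpha = rng.normal(0.0, 2.0, n_entities)              # unobserved entity effect (e.g., culture)
ids = np.repeat(np.arange(n_entities), n_periods)     # panel index: each entity seen 5 times

# Treatment is correlated with the unobserved entity effect -> confounding.
x = 0.5 * alpha[ids] + rng.normal(0.0, 1.0, ids.size)
y = 2.0 * x + alpha[ids] + rng.normal(0.0, 1.0, ids.size)  # true effect of x is 2

def ols_slope(x, y):
    """Simple one-variable OLS slope."""
    x_c, y_c = x - x.mean(), y - y.mean()
    return (x_c @ y_c) / (x_c @ x_c)

# Naive pooled OLS: biased upward because alpha affects both x and y.
naive = ols_slope(x, y)

def demean_within(group, v):
    """Subtract each entity's own mean (the 'within' transformation)."""
    means = np.bincount(group, weights=v) / np.bincount(group)
    return v - means[group]

# Fixed-effects estimator: the entity effect alpha cancels out after demeaning.
fe = ols_slope(demean_within(ids, x), demean_within(ids, y))
print(round(naive, 2), round(fe, 2))
```

The within transformation works because anything constant within an entity, measured or not, is subtracted away along with the entity mean.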

Statistical Software: STATA, Python, and R

Numerous tools allow us to do causal inference and statistical analysis, and in my opinion, the best among them are STATA, Python, and R.

  • STATA: Specifically designed for statistics, particularly econometrics, making it an incredibly powerful tool. It offers the latest packages from cutting-edge research. However, it is expensive and not versatile.
  • Python: The leading programming language today. It is open-source and highly versatile. Additionally, ChatGPT performs very well with Python-related queries, which is an important advantage in this AI era.
  • R: Very powerful for statistics. The debate between R and Python is ongoing. Note that R is less versatile than Python, and it appears that ChatGPT’s proficiency with R is not as strong. Additionally, the two main books I reference contain Python code (‘Causal Inference for the Brave and True’; ‘Causal Inference: The Mixtape’). This further supports focusing on Python.

Randomized Experiments (A/B Testing)

In the first section, we discovered the fundamental problem of causal inference, which highlighted the difficulty of assessing causality. So what can we do? The first solution usually presented, and considered the “gold standard” of causal inference, is the randomized experiment (Randomized Controlled Trial, or RCT).

Essentially, the idea behind randomized experiments is to replicate, or at least approach as much as possible, a parallel world scenario. This allows us to isolate the effect (the consequence) of the treatment (the cause).

We take a sample that is, hopefully, representative of a larger population, and randomly allocate subjects into two or more groups (treatment and control). The subjects typically do not know whether they receive the treatment (a process known as blinding). The groups are therefore arguably comparable: since the only difference between them is the treatment, any effect we observe is potentially causal, provided that no other biases exist.
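
The logic can be sketched with simulated data (hypothetical numbers): because assignment is a coin flip, the treatment is independent of the potential outcomes, and the simple difference in group means recovers the true average treatment effect:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical potential outcomes: headache severity without and with the pill.
y0 = rng.normal(5.0, 1.0, n)        # Y_i(0)
y1 = y0 - 2.0                       # Y_i(1): the pill lowers severity by 2 for everyone

treated = rng.random(n) < 0.5       # coin-flip assignment: the randomization step
y_obs = np.where(treated, y1, y0)   # we only ever observe one potential outcome per subject

# Difference in group means: unbiased for the ATE because assignment is random.
ate_hat = y_obs[treated].mean() - y_obs[~treated].mean()
print(round(ate_hat, 2))            # close to the true ATE of -2
```

Contrast this with self-selected treatment: randomization is what breaks the link between who gets treated and what their outcomes would have been anyway.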

Quasi-Experimental Designs

Controlled experiments are not always possible (e.g., changing a person's sex or origin to study discrimination) or ethical (e.g., exposing humans to lethal doses of a pollutant to study respiratory disease). Moreover, randomized experiments tend to have very strong internal validity but weaker external validity. Internal validity means the study measures causality precisely within its own scope, while external validity refers to the ability to extrapolate the results beyond that scope.

One of the main limitations of controlled experiments is external validity. For example, medical research relies extensively on inbred strains of rats and mice. These animals have almost identical genetic codes, live the same lab life, eat the same food, and so on. Hence, a controlled experiment with such animals comes very close to the parallel-world situation: you are working almost with clones. However, external validity is weaker because of the homogeneity of the study subjects. Moreover, in controlled experiments the whole environment is often controlled, which in some cases makes the situation a bit unrealistic and reduces the usefulness of the results.

One way to address this issue is by relying on other methods, known as quasi-experimental designs. The idea is to observe a quasi-random allocation between groups in natural settings. ‘Quasi-random’ means that the allocation is effectively as good as random once we isolate or control for potential systematic differences.

Regression Discontinuity Design (RDD)

To illustrate the concept of quasi-experimental design, I will explain the intuition behind one such method, Regression Discontinuity Design (RDD), using a study of the impact of alcohol consumption on mortality. The idea behind an RDD is to exploit a discontinuity in treatment allocation (e.g., a geographical border, an age-related administrative rule, etc.) where individuals or places that are very similar to each other receive different treatments based on a cutoff point. For instance, in the study “The Effect of Alcohol Consumption on Mortality” by Carpenter and Dobkin (2009), the authors used the discontinuity at the minimum legal drinking age to examine the immediate effect of alcohol consumption.
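
As a sketch with simulated data (not the paper's data), the RDD estimate is the jump between two local linear fits on either side of the cutoff, here an assumed legal drinking age of 21:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
age = rng.uniform(19.0, 23.0, n)        # running variable
above = age >= 21.0                     # crosses the (assumed) legal-drinking-age cutoff

# Simulated outcome: a smooth trend in age plus a jump of 0.3 at the cutoff.
y = 1.0 + 0.05 * (age - 21.0) + 0.3 * above + rng.normal(0.0, 0.2, n)

# Fit a local linear regression on each side of the cutoff, within a bandwidth.
h = 1.0
left = (age >= 21.0 - h) & (age < 21.0)
right = (age >= 21.0) & (age <= 21.0 + h)
b_left = np.polyfit(age[left] - 21.0, y[left], 1)    # returns [slope, intercept]
b_right = np.polyfit(age[right] - 21.0, y[right], 1)

# The RDD estimate: difference between the two intercepts at the cutoff.
jump = b_right[1] - b_left[1]
print(round(jump, 2))                   # close to the simulated jump of 0.3
```

The key assumption is that people just below and just above the cutoff are comparable, so the only systematic difference between the two local fits is the treatment itself.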
