Nearest Neighbor Learner for Causal Inference: A Comprehensive Overview

Most machine learning methods focus on predicting outcomes rather than understanding the underlying causal relationships. While machine learning algorithms excel at identifying correlations in data, they often struggle to determine causation. This limitation restricts their applicability to inferring causal relationships within biological networks and other dynamical systems, and in settings such as medical intervention strategies and clinical outcomes. For those seeking to understand the mechanisms driving system dynamics and predict how networks respond to external stimuli, tools capable of discerning causal relationships are essential.

The increasing availability of big data, coupled with the application of machine learning techniques and high-performance computing, is revolutionizing fields like biology, medicine, and healthcare. Machine learning algorithms are designed to improve with experience and data utilization. These techniques are now integrated with bioinformatics methods, curated databases, and biological networks to enhance training, validate findings, identify interpretable features, and facilitate model investigation.

In molecular biology, machine learning aids in analyzing genome sequencing data, including the annotation of sequence elements and epigenetic, proteomic, or metabolomic data. These methods predict the sequence specificities of DNA- and RNA-binding proteins, enhancers, and other regulatory regions using data generated by omics approaches like DNase-seq, FAIRE-seq, ATAC-seq, and STARR-seq. Machine learning can also build models to predict regulatory elements and non-coding variant effects de novo from DNA sequences. Furthermore, machine learning approaches are used in population and evolutionary genetics to identify regions under purifying selection or selective sweep.

Beyond genetics and genomics, machine learning has rapidly expanded into medicine, addressing disease diagnosis, classification, risk assessment, preventative measures, and personalized treatment. However, precision medicine requires not only predicting risks and outcomes but also predicting how outcomes change under alternative treatments, necessitating the correct specification of cause and effect and the calculation of counterfactual scenarios.

The Challenge of Causal Inference in Observational Studies

While causal relationships are ideally determined through controlled experiments, many questions in biomedical research can only be answered with observational studies. Unlike controlled experiments or randomized clinical trials, observational studies are susceptible to biases, including selection bias, information bias, measurement error, and confounders, compromising the reliability of results. Under these conditions, causal inference becomes challenging without substantial prior knowledge.


Even data-driven prediction models derived from experiments with minimized bias should be interpreted cautiously, as their parameters and predictions may not necessarily have a causal interpretation. This problem extends to molecular biology, where researchers have long sought computational methods for inferring biological networks, such as gene regulatory, protein-protein, metabolic, and signaling networks.

The development of experiments that collect large amounts of heterogeneous data, combined with high-performance computing capabilities, has enabled the application of machine learning and deep learning methods to deduce causality relationships in biological networks.

However, some experts in artificial intelligence argue that machine learning still lags behind animal intelligence in crucial areas like transfer learning and generalization across different problems, primarily because machine learning often disregards factors like interventions, domain shifts, and temporal structure, which are critical for animals.

Identifying Causal Variables: A Strategic Imperative

Methods for identifying causal variables are crucial in any computational pipeline dedicated to causal inference. Studies aimed at identifying cancer biomarkers, such as the method proposed by Zhang et al. (2020a) for high-throughput identification of cancer biomarkers in human body fluids, provide a solid foundation for developing new methods. Biomarkers play a vital role in defining the causal pathway of a disease. Zhang et al.'s method integrates physicochemical properties, weighted observed percentages, and position-specific scoring matrix profiles to enhance attributes reflecting the evolutionary conservation of body fluid-related proteins. Least absolute shrinkage and selection operator (LASSO) feature selection is used to generate the optimal feature subset. Additional research by Zhang et al. focuses on data collection needed for identifying cancer biomarkers and structure-trained predictors for predicting protein-binding residues. Furthermore, Zhang et al. presented a method to detect bioluminescent proteins, which can serve as easily detectable biomarkers in biomedical research.

Towards Causal Machine Learning: Overcoming Limitations

This paper explores possible scenarios for developing machine learning approaches capable of inferring causal relationships within biological systems. It focuses on the analysis of issues and perspectives in structural causal model inference, acknowledging that a comprehensive coverage of all issues is currently impossible due to the recent ubiquity of machine learning techniques and the emerging understanding of their limitations.


The following sections delve into the mathematics of structural models and the current state-of-the-art in machine learning methods for learning causal graphs, proposing a strong coupling between meta-modeling and meta-learning to overcome current limitations in causal discovery. Furthermore, popular machine learning algorithms reformulated for causal discovery are examined, deepening the perspectives presented and proposing modular meta-learning upstream of meta-modeling as a future direction for causal inference in machine learning.

Machine Learning: Learning from Data

A machine learning algorithm is a computer program capable of learning from data. Mitchell defined learning as follows: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." Machine learning algorithms can be built using a variety of experiences, tasks, and performance measures.

Learning is the ability to perform tasks, typically described in terms of how the machine learning system should process an example. An example is a set of features quantitatively measured from an object or event. Machine learning can perform tasks like classification, regression, transcription, machine translation, anomaly detection, imputation of missing values, de-noising, and probability density estimation. However, causal inference remains a challenge because current systems generalize from one data point to the next rather than from one problem to the next; Schölkopf et al. (2021) refer to this problem-to-problem form of generalization as "out-of-distribution" generalization.

To understand the problem more fully, a formal description of a causal model is necessary.

Formal Description of Causal Models

Consider a set of variables (V = \{X_{1}, \ldots, X_{n}\}), where each variable (X_{i}) is associated with a function (f_{i}) such that:


(X_{i} = f_{i}(Pa(X_{i}), \epsilon_{i}))

Here, (Pa(X_{i})) represents the parents of (X_{i}) in the causal graph, and (\epsilon_{i}) is a noise term independent of all other noise terms. These assignments represent the causal relationships responsible for statistical dependencies among the variables.

The functions (f_i) describe how each variable depends on its parents. Causal reasoning allows us to draw conclusions on the effect of interventions and potential outcomes if we have a causal model. However, deducing the graph is generally dependent on deducing the functions.
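To make these assignments concrete, here is a minimal simulation of a three-variable chain SCM (the variable names, coefficients, and the intervention are illustrative, not taken from the text), showing how an intervention replaces one assignment while the downstream mechanism stays intact:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Toy SCM with graph X1 -> X2 -> X3.
# Each assignment has the form X_i = f_i(Pa(X_i), eps_i)
# with mutually independent noise terms eps_i.
eps1, eps2, eps3 = rng.normal(size=(3, n))
x1 = eps1                      # X1 has no parents
x2 = 2.0 * x1 + eps2           # X2 = f2(X1, eps2)
x3 = -1.5 * x2 + eps3          # X3 = f3(X2, eps3)

# The intervention do(X2 := 0) replaces X2's assignment, severing its
# dependence on X1; X3 still follows its own (unchanged) mechanism.
x2_do = np.zeros(n)
x3_do = -1.5 * x2_do + rng.normal(size=n)
```

In the observational data X3 is strongly correlated with X1 (through X2), but after the intervention that correlation vanishes, which is exactly the kind of conclusion a purely predictive model cannot draw.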

A common method to infer the graph from data is performing conditional independence tests, i.e., testing whether two random variables X and Y are independent, given Z. The Causal Markov Condition states that every (X \in V) is independent of (V \setminus (Descendants(X) \cup Parents(X))) given (Parents(X)). This condition holds regardless of the complexity of the functions in a structural causal model, making conditional independence tests advantageous. However, testing for conditional independence is a challenging statistical problem.
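As a sketch of such a test under a linear-Gaussian assumption (partial correlation is just one possible test statistic; the helper and data below are illustrative, not the text's procedure):

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing out z."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)            # common parent of X and Y
x = z + 0.5 * rng.normal(size=n)
y = -z + 0.5 * rng.normal(size=n)

# X and Y are marginally dependent (through Z) but independent given Z.
marginal = np.corrcoef(x, y)[0, 1]
conditional = partial_corr(x, y, z)
```

For non-linear mechanisms or non-Gaussian noise, partial correlation can miss dependence entirely, which is one reason conditional independence testing remains statistically hard.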

K-Nearest Neighbor Matching (K-NNM) in Causal Inference

Causal inference is essential for understanding data generation mechanisms across various real-world domains, including economics, government policy evaluation, and fairness. A key focus is estimating the causal effect of a treatment on an outcome of interest. A central challenge is addressing confounding bias, which occurs when covariates influencing both treatment and outcome have different distributions between the treated and control groups.

While randomized controlled trials (RCTs) are the gold standard for establishing causality, they are often impractical due to high costs, time demands, and ethical concerns. Consequently, estimating causal effects from observational data has become a pragmatic alternative in many applications.

Matching is a cornerstone strategy in causal effect estimation, aiming to alleviate confounding bias by creating comparable treated and control groups, thereby balancing the distribution of confounding variables. Commonly employed matching methods include exact matching, propensity score matching (PSM), full matching, genetic matching (GenMatch), and Mahalanobis distance matching. K-Nearest Neighbor Matching (K-NNM) is a widely used technique in causal inference that seeks to pair each treated subject with K control subjects who share the closest covariate values, thereby forming comparable groups.

The number of nearest neighbors, or the K parameter, plays a critical role in determining the quality of K-NNM for causal effect estimation. Selecting the appropriate K is challenging: choosing too small a K makes the estimate sensitive to outliers, while selecting too large a K reduces the similarity among matched samples, thereby failing to adequately mitigate confounding bias. Traditional approaches often use a fixed K, which can lead to poor estimates in real-world applications by overlooking heterogeneity within different data subsets.
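The sensitivity to K can be demonstrated with a minimal NumPy sketch (the data-generating process and estimator here are illustrative, not from the text): a very large K pools dissimilar controls and reintroduces the confounding bias that matching was meant to remove.

```python
import numpy as np

def knn_match_att(X, t, y, k):
    """K-NNM estimate of the ATT: pair each treated unit with its k
    nearest control units in covariate space (Euclidean distance)."""
    Xt, Xc = X[t == 1], X[t == 0]
    yt, yc = y[t == 1], y[t == 0]
    effects = []
    for xi, yi in zip(Xt, yt):
        d = np.linalg.norm(Xc - xi, axis=1)
        nn = np.argsort(d)[:k]               # indices of k closest controls
        effects.append(yi - yc[nn].mean())   # imputed counterfactual
    return float(np.mean(effects))

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-X[:, 0]))               # confounded treatment assignment
t = (rng.uniform(size=n) < p).astype(int)
y = X.sum(axis=1) + 2.0 * t + 0.1 * rng.normal(size=n)   # true effect = 2

att_small_k = knn_match_att(X, t, y, k=1)    # close matches, low bias
att_large_k = knn_match_att(X, t, y, k=800)  # pools distant controls
```

With k = 1 the estimate lands near the true effect of 2, while k = 800 drags in controls whose covariates differ systematically from the treated units, inflating the estimate.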

To illustrate the limitations of a fixed K in K-NNM, consider an example where counterfactual outcomes are calculated for three samples. With K = 2 fixed, each sample is matched to exactly two nearest neighbors. However, the second sample from the treatment group may have more than two genuinely close neighbors, while the third sample may have only one. This underscores the importance of allowing different samples to be matched with varying numbers of nearest neighbors. Consequently, there is a clear need for a method that dynamically determines the optimal K-value, thereby enhancing K-NNM's accuracy and efficiency in causal effect estimation.

Dynamic K-NNM (DK-NNM): A Sparse Representation Learning Approach

To determine an optimal K-value for each individual in K-NNM, a sparse representation learning method is proposed that reconstructs a sparse coefficient matrix while simultaneously learning a graph matrix to preserve local information and sample similarity. Consequently, the optimal K-value for each individual is derived from this learned sparse representation space. No prior work has explored the role of local structure information around samples in determining K-values in K-NNM for causal inference. Moreover, no established strategy exists for addressing confounding bias due to high-dimensional covariates when determining the optimal K-value for each individual. Both propensity and prognostic scores are employed to address confounding bias and reduce high-dimensional covariates, thereby circumventing the curse of dimensionality in the matching process. A sparse learning-based method reconstructs all samples and identifies the optimal K value for each individual.

Related Work in Causal Inference and Matching Methods

Two prominent frameworks address confounding bias caused by covariates: the potential outcome framework and the structural causal model.

In practical applications, matching methods play a crucial role in causal inference by aiming to identify groups with comparable or balanced covariate distributions. The concept of optimal matching involves selecting matches by minimizing a global distance metric across all possible pairs. Rosenbaum introduced minimax and quantile constraints for dimensionality reduction. Rubin proposed propensity score matching (PSM), which projects all covariates into a single dimension. Imbens refined this approach by adding regression adjustment. Diamond and Sekhon advanced these methods with GenMatch, which optimizes covariate balance by learning weights for covariates, building on both PSM and Mahalanobis distance matching. Rubin and Thomas integrated propensity scores for prognostic covariates, underscoring the effectiveness of considering prognostic factors to reduce bias. Later simulation studies by Leacy and Stuart highlighted the advantages of combining propensity and prognostic scores to improve the quality of matching methods.

The standard K-NNM method is widely used, with subsequent advancements enhancing its capability for estimating causal effects. Luna et al. proposed two resampling strategies to improve estimation accuracy in K-NN matching estimators. Wager and Athey introduced a tree-based K-NN approach using random forests to determine weights for neighboring observations, conceptualized as an adaptation of K-NN with an adaptive neighborhood metric. However, these methods uniformly use a fixed K value. When confronted with intricate scenarios, such as substantial differences between individuals, the adoption of a fixed K value may result in considerable deviations in causal effect estimation.

Potential Outcome Framework

The potential outcome framework is used as the basic model. A binary treatment variable (T_{i}) is considered, where samples receiving treatment ((T_{i}=1)) are referred to as treated samples, whereas those not receiving treatment ((T_{i}=0)) are termed control samples. (\textbf{X}) represents a set of pre-treatment covariates, which include Pa(T) and Pa(Y). This assumption ensures that (\textbf{X}) contains only relevant confounders, with no irrelevant noise variables. The observed outcome for sample i is denoted by (Y_{i}). Here, (Y_{i}(1)) and (Y_{i}(0)) represent the potential outcomes for sample i if assigned to the treated group and the control group, respectively. Thus, the pair ((Y_{i}(1), Y_{i}(0))) captures the potential outcomes for each sample. In real-world cases, only one of (Y_{i}(1)) and (Y_{i}(0)) can be observed for an individual; this limitation presents the primary challenge in causal inference.

A central objective in causal inference is to infer the impact of a treatment T on an outcome of interest Y using observational data. The aim is to estimate the Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated group (ATT). Moreover, the prognostic score, denoted as (p(\textbf{X})), has also been used in treatment effect estimation.
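In the potential-outcome notation above, the two estimands can be written explicitly (these are the standard definitions, not anything specific to DK-NNM), together with the two scores used for dimensionality reduction — the propensity score and the prognostic score, the latter commonly defined as the expected control outcome:

```latex
\mathrm{ATE} = \mathbb{E}\left[\, Y(1) - Y(0) \,\right],
\qquad
\mathrm{ATT} = \mathbb{E}\left[\, Y(1) - Y(0) \,\middle|\, T = 1 \,\right],
\qquad
e(\textbf{X}) = \Pr(T = 1 \mid \textbf{X}),
\qquad
p(\textbf{X}) = \mathbb{E}\left[\, Y(0) \mid \textbf{X} \,\right]
```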

DK-NNM Methodology: A Detailed Explanation

The DK-NNM method for causal effect estimation first uses the local structure of (\textbf{X}) to learn a personalized K value for each individual, ensuring adaptive and flexible neighbor selection. Then, DK-NNM performs matching based on propensity scores and prognostic scores, which are estimated using the treatment (T) and outcome (Y), respectively. These two scores project the high-dimensional covariates into a low-dimensional space, effectively eliminating confounding bias introduced by high-dimensional covariates while also improving estimation accuracy by incorporating information from both the treatment and the outcome.
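A rough sketch of this two-score projection (the logistic and least-squares estimators below are generic stand-ins for whatever models the method actually fits; data and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 5
X = rng.normal(size=(n, d))
t = (rng.uniform(size=n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)
y = X @ np.ones(d) + 1.5 * t + 0.1 * rng.normal(size=n)

# Propensity score e(X) = P(T = 1 | X), fit here by plain gradient-ascent
# logistic regression (a minimal stand-in for any off-the-shelf classifier).
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))
    w += 0.1 * X.T @ (t - p) / n
propensity = 1 / (1 + np.exp(-X @ w))

# Prognostic score: predicted control outcome E[Y | X, T = 0],
# fit by least squares on the control group only.
Xc, yc = X[t == 0], y[t == 0]
beta = np.linalg.lstsq(Xc, yc, rcond=None)[0]
prognostic = X @ beta

# Matching can now be carried out in this 2-D (propensity, prognostic)
# space instead of the original d-dimensional covariate space.
scores = np.column_stack([propensity, prognostic])
```

Projecting onto the pair of scores retains information from both the treatment and the outcome model, which is the rationale the text gives for combining them rather than using the propensity score alone.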

Determining the Optimal K Value

The idea of dynamically selecting the number of neighbors K is conceptually similar to adaptive bandwidth selection in kernel regression, where the smoothing parameter is adjusted based on local data density, allowing the bandwidth h to vary for each sample and thus balancing bias against variance. The DK-NNM method uses sparse learning to construct neighborhoods and adaptively selects the number of matched samples K for each sample based on the local neighborhood structure, thereby optimizing the quality of the matches and reducing bias in causal effect estimation.

Sparse representation learning through self-representation is proposed to reconstruct the space of (\textbf{X}\in {\mathbb {R}}^{d\times n}), where d and n stand for the numbers of covariates and samples, respectively, and each sample (x_{i}) is a column of (\textbf{X}). In the linear model, each sample is represented as (x_{i} = \textbf{X}z_{i} + \varepsilon_{i}), where (z_{i}) denotes the dictionary coefficients for sample (x_{i}), and (\varepsilon_{i}) is the error term associated with this representation. The goal of this self-representation is to minimize the reconstruction error; the primary objective is to derive the coefficient matrix (\textbf{Z}\in {\mathbb {R}}^{n\times n}), whose columns are the vectors (z_{i}).

Expanding on this, the expression for (\textbf{Z}) is derived as (\textbf{Z} = (\textbf{X}^{\textit{T}}\textbf{X})^{-1}\textbf{X}^{\textit{T}}\textbf{X}). However, in practical scenarios, the matrix (\textbf{X}^{\textit{T}}\textbf{X}) may not be invertible. To circumvent this issue, an (\ell_{2})-norm regularization term is introduced:

(\min_{\textbf{Z}} \left\| {\textbf{XZ} - \textbf{X}} \right\|_F^2 + \mu \left\| {\textbf{Z}} \right\|_2^2)

where (\mu) is a tuning parameter and (\left\| {\textbf{Z}} \right\|_2^2) represents the (\ell_{2})-norm regularization.

This regularized problem can be solved in closed form as (\textbf{Z} = (\textbf{X}^{\textit{T}}\textbf{X} + \mu \textbf{I})^{-1}\textbf{X}^{\textit{T}}\textbf{X}), where (\textbf{I}\in {\mathbb {R}}^{n\times n}) is an identity matrix. Nevertheless, numerous studies have shown that this solution, (\textbf{Z}), lacks sparsity.
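A quick numerical check of this closed form (assuming samples are stored as columns so that the product (\textbf{XZ}) is conformable; the dimensions are illustrative) confirms that the ridge solution exists despite the rank deficiency of (\textbf{X}^{T}\textbf{X}) but is dense:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 10, 50
X = rng.normal(size=(d, n))      # samples stored as columns so XZ ~ X conforms
mu = 1.0

# Closed-form ridge solution of min ||XZ - X||_F^2 + mu * ||Z||^2.
Z = np.linalg.solve(X.T @ X + mu * np.eye(n), X.T @ X)

# X^T X has rank at most d = 10 < n = 50, so it is singular on its own,
# yet the regularized system is solvable. The solution, however, is dense:
# essentially no coefficient is driven exactly to zero.
sparsity = np.mean(np.abs(Z) < 1e-8)
```

The near-zero sparsity of this solution is precisely the motivation for replacing the (\ell_{2}) penalty with an (\ell_{1}) penalty in the next step.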

To identify the optimal K value for each sample, each sample should be represented by those individuals exhibiting strong correlations with it, while the coefficients of weakly correlated individuals are compressed to zero:

(\min_{\textbf{Z}} \left\| {\textbf{XZ} - \textbf{X}} \right\|_F^2 + \alpha \left\| {\textbf{Z}} \right\|_1, \quad \text{s.t. } \textbf{Z} \geqslant 0)

where (\left\| {\textbf{Z}} \right\|_1) is the (\ell_{1})-norm regularization, which induces sparsity, and the constraint (\textbf{Z} \geqslant 0) ensures each value in (\textbf{Z}) remains non-negative. The parameter (\alpha) acts as the tuning parameter for the (\ell_{1})-norm, governing the sparsity level of (\textbf{Z}): a higher value of (\alpha) leads to increased sparsity within the matrix.
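This non-negative (\ell_{1}) problem has no closed form, but a standard projected proximal-gradient (ISTA-style) iteration solves it. The sketch below is a generic solver for the stated objective, not the paper's algorithm (samples as columns; data, dimensions, and (\alpha) are illustrative):

```python
import numpy as np

def sparse_self_representation(X, alpha, iters=2000):
    """Projected proximal-gradient (ISTA-style) solver for
    min ||XZ - X||_F^2 + alpha * ||Z||_1   s.t.   Z >= 0."""
    n = X.shape[1]
    G = X.T @ X
    step = 1.0 / (2 * np.linalg.eigvalsh(G)[-1])    # 1 / Lipschitz constant
    Z = np.zeros((n, n))
    for _ in range(iters):
        grad = 2 * (G @ Z - G)                      # gradient of ||XZ - X||_F^2
        # Prox of alpha*||.||_1 plus the non-negativity constraint:
        # shrink by step*alpha, then clip at zero.
        Z = np.maximum(Z - step * (grad + alpha), 0.0)
    return Z

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 40))     # d = 8 covariates, n = 40 samples as columns
Z = sparse_self_representation(X, alpha=5.0)
sparsity = np.mean(Z < 1e-8)
```

Unlike the ridge solution, most entries of the returned matrix are exactly zero, so each sample is reconstructed from only a handful of strongly correlated samples.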

To adapt the algorithm to complex high-dimensional data, a non-linear dimensionality reduction technique, i.e., Locality Preserving Projections (LPP) is integrated. Unlike variance-based methods such as Principal Component Analysis (PCA), which emphasize global structure, LPP focuses on preserving local geometric relationships in the data. This ensures that samples with similar covariate patterns remain close in the transformed space, improving the stability of nearest-neighbor selection in high-dimensional settings.

LPP is formally defined as (\varphi (\textbf{Z}) = \text{Tr}(\textbf{Z}^{T}\textbf{X}^{T}\textbf{LXZ})), where (\textbf{L} \in {\mathbb {R}}^{d \times d}) is the Laplacian matrix, capturing local feature similarities. The Laplacian matrix is constructed as (\textbf{L} = \textbf{D} - \textbf{S}), with (\textbf{S} \in {\mathbb {R}}^{d \times d}) representing feature-wise similarities and (\textbf{D}) the diagonal degree matrix.
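Constructing such a Laplacian is mechanical. The sketch below follows the text's (d \times d) convention (similarities between the d features), uses a Gaussian kernel for (\textbf{S}) (one common choice; the text does not specify the kernel), and checks the standard quadratic-form identity that makes the LPP term a locality penalty:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(8, 40))     # d = 8 features (rows), n = 40 samples

# Similarity between features via a Gaussian (RBF) kernel -- an assumed,
# common choice; any local similarity matrix S would fit the formula L = D - S.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
S = np.exp(-sq_dists / sq_dists.mean())
np.fill_diagonal(S, 0.0)

D = np.diag(S.sum(axis=1))       # diagonal degree matrix
L = D - S                        # graph Laplacian

# Key property: z^T L z = 1/2 * sum_ij S_ij (z_i - z_j)^2 >= 0, so the
# LPP term Tr(Z^T X^T L X Z) penalizes mapping similar items far apart.
z = rng.normal(size=8)
quad = z @ L @ z
```

Because the quadratic form weights squared differences by similarity, minimizing it keeps strongly similar items close in the learned representation, which is the locality-preservation the text appeals to.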

(\min_{\textbf{Z}} \left\| {\textbf{XZ} - \textbf{X}} \right\|_F^2 + \beta \varphi(\textbf{Z}) + \alpha \left\| {\textbf{Z}} \right\|_1, \quad \text{s.t. } \textbf{Z} \geqslant 0)

where the tuning parameter (\beta) balances (\varphi(\textbf{Z})) against the reconstruction term (\left\| {\textbf{XZ} - \textbf{X}} \right\|_F^2).

To prevent underfitting due to excessive regularization, the parameters (\alpha) and (\beta) are set within the empirical range of (10^{-3} \sim 10^{-6}). Within this range, a grid search combined with cross-validation is employed to identify the optimal parameter values. The search set is defined as (\alpha, \beta \in \{10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}).

Upon successful optimization, the optimal solution (\textbf{Z}^*) is obtained. Each element (z_{ij}) quantifies the relative contribution of sample j in reconstructing sample i, thereby capturing the intrinsic correlation between observations. To ensure the consistency and plausibility of these relationships, constraints are enforced.
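The learned matrix then yields a personalized K per sample. The toy decoding below is our assumed reading of the method, not a quote from it: since (z_{ij}) is the contribution of sample j to reconstructing sample i, the nonzero entries in row i name sample i's matched neighbors and their count is that sample's K (the matrix values are illustrative).

```python
import numpy as np

# Toy learned coefficient matrix Z* (values illustrative only).
Z_star = np.array([
    [0.0, 0.6, 0.0, 0.2],
    [0.7, 0.0, 0.4, 0.0],
    [0.3, 0.3, 0.0, 0.5],
    [0.1, 0.0, 0.0, 0.0],
])

tol = 1e-8
# Row-wise nonzero counts give each sample its own K ...
K_per_sample = (Z_star > tol).sum(axis=1)
# ... and the nonzero positions identify which samples it is matched to.
neighbors = [np.flatnonzero(Z_star[i] > tol) for i in range(Z_star.shape[0])]
```

Here the four samples end up with K values of 2, 2, 3, and 1 respectively, illustrating how the sparse representation lets K vary across individuals instead of being fixed globally.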
