Puma: Efficient Continual Graph Learning with Graph Condensation

Continual graph learning (CGL) is a crucial paradigm for addressing scenarios where graph data is encountered in a streaming fashion. The primary challenge in CGL is the phenomenon of "catastrophic forgetting," where models tend to overwrite previously acquired knowledge when trained on new data. This paper introduces the PsUdo-label guided Memory bAnk (PUMA) framework, an advancement over the Condense and Train (CaT) framework, designed to enhance both the efficiency and effectiveness of replay-based CGL methods. PUMA tackles limitations in existing approaches by leveraging pseudo-labels for more comprehensive graph condensation, adopting a training-from-scratch strategy for balanced learning, and employing optimized encoding mechanisms for improved speed.

The Catastrophic Forgetting Problem in Continual Graph Learning

The core purpose of Continual Graph Learning (CGL) is to enable graph models to continuously update themselves with graph data that is fed in a streaming manner. A significant hurdle in this process is the tendency for the model to forget previously learned knowledge as it trains on new, incoming data. This issue, known as catastrophic forgetting, has become the central focus of CGL research.

Traditional Graph Neural Networks (GNNs), while powerful for graph-based tasks, are not inherently designed for continual learning. When directly applied to streaming graph data, their performance on historical data distributions often deteriorates due to the distribution shift between past and present graphs. This is particularly problematic due to hardware limitations in storage and computation, which make it impractical to access the entire historical graph during training.

Replay-Based Methods and Their Limitations

Recent advancements in CGL have explored replay-based methods to mitigate catastrophic forgetting. These methods update the model using a combination of: (1) the entire new-coming data, and (2) a sampling-based memory bank that stores replayed graphs. The goal of the memory bank is to approximate the distribution of historical data, allowing the model to revisit and reinforce past knowledge. After each model update, a new replayed graph, sampled from the incoming graph, is added to the existing memory bank.

While intuitive and effective, these replay-based methods suffer from two significant issues:


  1. Inadequate Historical Distribution Capture: Most sampling-based methods struggle to fully capture the distribution of historical data, especially when the storage budget is tight. This means the memory bank may not accurately represent the nuances of past data, leading to incomplete learning.
  2. Data Imbalance: A substantial data imbalance arises from the disparity in scale between complex, new-coming graph data and the typically lightweight memory bank. This imbalance results in unbalanced training, where the model may overemphasize the incoming data at the expense of historical knowledge. Figure 1(a) illustrates this imbalanced learning problem, showing how the incoming graph is often significantly larger than the replayed graphs. Consequently, as depicted in Figures 1(b) and 1(c) for Arxiv and Reddit datasets, the prediction accuracy on data from previous tasks can drastically decrease when the incoming graph is much larger than the replayed graphs.

The Condense and Train (CaT) Framework: A Step Towards Balance

To address these limitations, the Condense and Train (CaT) framework was proposed. CaT introduces a novel approach by condensing the new-coming graph into a small yet informative synthesized replayed graph prior to each model update. This condensed graph is then stored in a Condensed Graph Memory (CGM) alongside historical replay graphs.

In the continual learning phase, CaT employs a "Training in Memory" (TiM) scheme. This approach updates the model directly using the Condensed Graph Memory rather than the entire new-coming graph. By ensuring the condensed synthetic graph has a similar size to other replayed graphs in the memory bank, TiM effectively alleviates the data imbalance problem. Extensive experiments on benchmark datasets have demonstrated the superior performance of the CaT framework in terms of both effectiveness and efficiency.

The CaT framework operates on the principle that if a synthesized replayed graph derived from condensation methods can support model training without sacrificing performance, then the need to use the entire incoming graph for model updates can be bypassed, allowing reliance solely on synthetic graphs. This approach aims to generate small but informative replayed graphs, leveraging recent graph condensation techniques that condense a graph into a smaller synthetic graph using differentiable methods.

Limitations of the CaT Framework and the Genesis of PUMA

Despite the advancements made by CaT, further analysis revealed three key areas for improvement:

  1. Limited Condensation Scope: The graph condensation process in CaT primarily focuses on labelled nodes, potentially neglecting valuable information embedded within unlabelled nodes. This restricted focus can lead to a less comprehensive representation of the original graph's distribution.
  2. Overemphasis on Past Knowledge: The continual training scheme of CaT, while balancing data, tends to overemphasize previously learned knowledge. This can limit the model's capacity to effectively learn from newly added memories, hindering its adaptability.
  3. Time-Consuming Processes: Both the graph condensation and the replaying processes within the CaT framework can be time-consuming, impacting the overall efficiency of the CGL system.

These limitations motivated the development of the PUMA (PsUdo-label guided Memory bAnk) framework, an extension of CaT designed to enhance its efficiency and effectiveness.


The PUMA Framework: Enhancing Efficiency and Effectiveness

PUMA builds upon the foundational principles of CaT but introduces several key innovations to overcome its weaknesses:

1. Pseudo-label Guided Graph Condensation

To address the limited scope of condensation in CaT, PUMA expands the coverage of nodes during graph condensation to include both labelled and unlabelled nodes. This is achieved through the utilization of pseudo-labels. By inferring labels for unlabelled nodes, PUMA can incorporate a richer set of information into the condensation process. This ensures that the synthesized replayed graph more accurately approximates the distribution of the original graph, capturing a more complete picture of historical data.
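The pseudo-labelling step can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact procedure: it assumes a trained classifier's logits are available and uses a confidence threshold (the `threshold` parameter and function name are illustrative) to decide which unlabelled nodes join the condensation pool.

```python
import numpy as np

def pseudo_label_nodes(logits, labels, labelled_mask, threshold=0.8):
    """Assign pseudo-labels to unlabelled nodes from model predictions.

    Unlabelled nodes whose softmax confidence exceeds `threshold`
    receive their argmax class, so they can join the labelled nodes
    in the condensation pool."""
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    conf = probs.max(axis=1)
    pseudo = probs.argmax(axis=1)

    merged = labels.copy()
    use = (~labelled_mask) & (conf >= threshold)
    merged[use] = pseudo[use]
    keep_mask = labelled_mask | use  # nodes taking part in condensation
    return merged, keep_mask
```

Low-confidence unlabelled nodes are simply left out, so noisy pseudo-labels do not distort the condensed distribution.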

The goal of graph condensation within PUMA is to generate a small yet informative synthesized replayed graph that maintains a similar data distribution to the original incoming graph. This is achieved by minimizing the distance between the distributions of the original and synthesized graphs; Maximum Mean Discrepancy (MMD) is employed to empirically estimate this distance. The optimal replayed graph, denoted as $\tilde{\mathcal{G}}_{k}^{*} = \{\tilde{\boldsymbol{A}}_{k}^{*}, \tilde{\boldsymbol{X}}_{k}^{*}, \tilde{\boldsymbol{Y}}_{k}^{*}\}$, is the one whose distribution is closest to that of the incoming graph $\mathcal{G}_{k} = \{\boldsymbol{A}_{k}, \boldsymbol{X}_{k}, \boldsymbol{Y}_{k}\}$, as measured by a distribution distance function $\text{Dist}(\cdot,\cdot)$.
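An empirical MMD-style distance can be sketched in a few lines. This is a simplified NumPy version (the function name is illustrative) that reduces the class-conditional MMD to the squared distance between per-class mean embeddings, which is the form commonly used in distribution-matching condensation:

```python
import numpy as np

def mmd_distance(emb_real, emb_syn, labels_real, labels_syn):
    """Class-wise empirical distribution distance between real and
    synthetic node embeddings, simplified to the squared distance
    between per-class mean embeddings."""
    dist = 0.0
    for c in np.unique(labels_real):
        mu_real = emb_real[labels_real == c].mean(axis=0)
        mu_syn = emb_syn[labels_syn == c].mean(axis=0)
        dist += np.sum((mu_real - mu_syn) ** 2)
    return dist
```

Because the distance is differentiable in the synthetic embeddings, it can serve directly as the condensation objective.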

2. Training-From-Scratch Strategy for Balanced Learning

To tackle the issue of overemphasizing previously learned knowledge and to promote more balanced training, PUMA introduces a "training-from-scratch" strategy. This approach upgrades the previous continual learning scheme, allowing for a more balanced learning process between historical and new graphs. Instead of solely relying on incremental updates, this strategy can periodically refresh the model's understanding by retraining from a foundational state, but with the benefit of the enriched memory bank. This helps to prevent the model from becoming overly specialized to recent data and improves its plasticity.
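The control flow of training from scratch on the memory bank can be sketched as below. This is a hedged NumPy illustration, not the paper's implementation: the "model" is a plain softmax regression re-initialised on every call, standing in for a freshly initialised GNN, and the memory bank is a list of (features, labels) pairs, one condensed graph per task.

```python
import numpy as np

def train_from_scratch(memory_bank, n_classes, epochs=500, lr=0.5):
    """Retrain a fresh linear classifier on the whole memory bank.

    The model is re-initialised every time, so old and new memories
    contribute only in proportion to their (similar) sizes."""
    X = np.vstack([x for x, _ in memory_bank])
    y = np.concatenate([t for _, t in memory_bank])
    Y = np.eye(n_classes)[y]  # one-hot targets
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
    for _ in range(epochs):  # plain softmax regression
        z = X @ W
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (p - Y) / len(X)
    return W
```

Because every replayed graph in the bank is condensed to a similar budget, the fresh model sees all tasks at comparable scale, which is exactly the balance the strategy is after.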

In the context of Continual Graph Learning (CGL), particularly in node classification problems, a model is typically required to handle a sequence of tasks, denoted as $\{\mathcal{T}_{1},\mathcal{T}_{2},\dots,\mathcal{T}_{K}\}$. For the $k$-th task, $\mathcal{T}_{k}$, an incoming graph $\mathcal{G}_{k} = \{\boldsymbol{A}_{k}, \boldsymbol{X}_{k}, \boldsymbol{Y}_{k}\}$ arrives, where $\boldsymbol{X}_{k} \in \mathbb{R}^{n_{k} \times d}$ is the feature matrix for $n_{k}$ nodes, and $\boldsymbol{A}_{k} \in \mathbb{R}^{n_{k} \times n_{k}}$ is the adjacency matrix. The model, parameterized by $\theta$, needs to be updated using $\mathcal{G}_{k}$ while retaining its performance on all previously encountered graphs.


CGL settings are broadly categorized into task-incremental learning (task-IL) and class-incremental learning (class-IL). In task-IL, the model only needs to distinguish nodes within the same task. In class-IL, however, the model must classify nodes from all tasks collectively.
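The practical difference between the two settings shows up at prediction time. A minimal sketch (function name and calling convention are illustrative): task-IL restricts the argmax to the current task's classes, while class-IL ranks all classes seen so far.

```python
import numpy as np

def predict(logits, setting, task_classes=None):
    """Prediction under the two CGL settings: task-IL masks out all
    classes outside the current task; class-IL considers every class."""
    if setting == "task-IL":
        masked = np.full_like(logits, -np.inf)
        masked[:, task_classes] = logits[:, task_classes]
        return masked.argmax(axis=1)
    return logits.argmax(axis=1)  # class-IL
```

This is why class-IL is the harder setting: the model must keep old classes separable from new ones without a task identifier at test time.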

For replay-based CGL methods, a memory bank $\mathcal{M}_{k-1}$ stores representative graphs from previous tasks, subject to a budget $b$ limiting the maximum number of nodes per replayed graph. During training, both the incoming graph $\mathcal{G}_{k}$ and the replayed graphs in $\mathcal{M}_{k-1}$ are used. Current methods often employ a weighted loss to manage data imbalance:
$$ \mathcal{L}(\mathcal{G}_{k}, \mathcal{M}_{k-1}; \theta_{k}) = \mathcal{L}(\mathcal{G}_{k}; \theta_{k}) + \alpha \mathcal{L}(\mathcal{M}_{k-1}; \theta_{k}) $$
where $\mathcal{L}(\mathcal{G}_{k}; \theta_{k})$ is the loss on the current task and $\mathcal{L}(\mathcal{M}_{k-1}; \theta_{k})$ is the loss on the replayed graphs. The weighting factor $\alpha$ is adjusted based on the node counts $n_{k}$ (incoming graph) and $n_{1:k-1}$ (total nodes in the memory bank); when $n_{1:k-1} < n_{k}$, the memory term $\mathcal{L}(\mathcal{M}_{k-1}; \theta_{k})$ is typically assigned a larger scale factor. Both balancing directions face training challenges. In class-IL, for instance, the imbalance often stems from differing class sizes: methods like ER-GNN may restrict learning on the memory bank if its total size significantly exceeds that of the incoming graph, while SSM may compromise current-task performance by directly scaling down the training loss, especially when the task is sensitive to it.
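The weighted loss above can be written as a one-liner. Note that the specific rule for choosing $\alpha$ below (the node-count ratio) is an illustrative assumption consistent with the description in the text, not a formula taken from any particular method:

```python
def combined_replay_loss(loss_incoming, loss_memory, n_incoming, n_memory):
    """Weighted replay loss L(G_k) + alpha * L(M_{k-1}).

    As described in the text, the memory term is up-weighted when the
    memory bank holds fewer nodes than the incoming graph; the exact
    ratio used here is an illustrative choice."""
    alpha = n_incoming / n_memory if n_memory < n_incoming else 1.0
    return loss_incoming + alpha * loss_memory
```

The sensitivity of this scheme to the raw size ratio is precisely what PUMA's training-from-scratch strategy sidesteps.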

PUMA's training-from-scratch strategy, in contrast, aims to provide a more robust and balanced learning experience by re-evaluating and potentially re-integrating past knowledge with a fresh perspective, rather than solely relying on weighted losses that can be sensitive to scale differences.

3. Accelerated Graph Condensation and Encoding

To improve the overall efficiency, PUMA incorporates optimizations for both the graph condensation and graph encoding processes. This includes employing "one-time propagation" and utilizing "wide graph encoders." These techniques accelerate the graph condensation and encoding stages, reducing computational overhead and making the CGL framework more practical for real-world applications. The acceleration ensures that the benefits of advanced condensation and balanced training are not offset by excessive processing times.
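One-time propagation can be illustrated with an SGC-style precomputation: the normalised adjacency is applied to the features once up front, so later condensation and encoding steps operate on plain feature matrices without repeated message passing. This dense NumPy sketch (function name illustrative) assumes small graphs; real implementations would use sparse matrices.

```python
import numpy as np

def precompute_propagation(adj, feats, hops=2):
    """One-time feature propagation: apply the symmetrically normalised
    adjacency `hops` times ONCE, caching the result for later steps."""
    a = adj + np.eye(adj.shape[0])        # add self-loops
    d = a.sum(axis=1)
    a_norm = a / np.sqrt(np.outer(d, d))  # D^-1/2 (A+I) D^-1/2
    out = feats
    for _ in range(hops):
        out = a_norm @ out
    return out
```

After this step, every subsequent condensation epoch avoids touching the graph structure, which is where the speed-up comes from.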

The graph condensation process in PUMA, aiming for distribution matching, can be formally represented as minimizing the distance between the distribution of the incoming graph $\mathcal{G}_{k}$ and the synthesized graph $\tilde{\mathcal{G}}_{k}$:
$$ \min_{\tilde{\boldsymbol{A}}_{k}, \tilde{\boldsymbol{X}}_{k}, \tilde{\boldsymbol{Y}}_{k}} \text{Dist}(\mathcal{G}_{k}, \tilde{\mathcal{G}}_{k}) $$
where $\tilde{\mathcal{G}}_{k} = \{\tilde{\boldsymbol{A}}_{k}, \tilde{\boldsymbol{X}}_{k}, \tilde{\boldsymbol{Y}}_{k}\}$ is the condensed graph. This optimization is performed using differentiable methods, so that gradients can be backpropagated to learn the parameters of the synthesized graph.
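A toy version of this optimization loop, under strong simplifying assumptions (identity encoder, per-class mean matching, manual gradient step; the function name is illustrative), looks like:

```python
import numpy as np

def condense_features(x_real, y_real, x_syn, y_syn, steps=100, lr=0.1):
    """Gradient-based distribution matching on raw features: shift the
    synthetic node features so each class mean matches the real class
    mean. An identity encoder stands in for a GNN embedding."""
    x_syn = x_syn.copy()
    for _ in range(steps):
        for c in np.unique(y_syn):
            real_c, syn_c = y_real == c, y_syn == c
            diff = x_real[real_c].mean(axis=0) - x_syn[syn_c].mean(axis=0)
            # descend the squared mean-distance loss for this class
            # (the per-row 2/n_c factor is folded into the learning rate)
            x_syn[syn_c] += lr * diff
    return x_syn
```

In the full framework the encoder is a GNN and the labels of the synthetic nodes are fixed in advance, but the structure of the loop is the same: measure the distribution distance, backpropagate, and update the synthetic graph.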

Methodological Comparisons and PUMA's Advantages

Figure 2 visually contrasts the typical replay-based CGL framework with the proposed PUMA framework, particularly in a class-incremental learning (class-IL) setting. Traditional replay-based methods train a model $\text{GNN}_{k-1}$ using both the memory bank $\mathcal{M}_{k-2}$ and the incoming graph $\mathcal{G}_{k-1}$. A sampled graph $\tilde{\mathcal{G}}_{k-1}$ is then added to update $\mathcal{M}_{k-1}$.

In contrast, PUMA, building upon CaT, maintains a Condensed Graph Memory (CGM) that stores condensed synthetic graphs. Prior to model update, the incoming graph is condensed. The training strategy, TiM, updates the model using only the memory bank. PUMA enhances this by:

  • Broader Condensation: Incorporating both labelled and unlabelled nodes via pseudo-labels for richer historical data representation.
  • Balanced Training: Employing a training-from-scratch strategy to prevent over-reliance on past knowledge and ensure better integration of new memories.
  • Efficiency Boost: Utilizing optimized condensation and encoding methods for faster processing.

These improvements collectively aim to achieve superior performance in terms of both effectiveness (accuracy in retaining knowledge and learning new information) and efficiency (computational speed and resource utilization).

Experimental Validation and Performance

Extensive experiments were conducted on four benchmark datasets to validate the efficacy of the PUMA framework. The results consistently demonstrated that PUMA outperforms existing state-of-the-art CGL methods. The framework's ability to effectively mitigate catastrophic forgetting while maintaining high efficiency was clearly evident across these diverse datasets. The superior performances highlight PUMA's potential as a robust solution for real-world continual graph learning challenges.
