Topological Deep Learning: Bridging Topology and Deep Learning for Enhanced Data Analysis
Artificial intelligence (AI) has revolutionized various industries, assisting organizations in making data-driven decisions and gaining a competitive edge. Deep learning (DL), a subset of machine learning (ML), is a preeminent driving force behind AI. In recent years, the intersection of DL and topology has led to the rise of topological deep learning (TDL). TDL combines the expressive power of deep neural networks (DNNs) with the mathematical rigor of topology. It leverages topological features and representations to overcome the limitations of traditional ML methods in capturing the underlying geometric and structural properties of data.
The Essence of Topological Deep Learning
TDL integrates topological concepts into DL. Initially, in 2017, it primarily denoted the integration of topological features within the input pipeline of a DNN. TDL encompasses a collection of ideas and methodologies that pertain to the use of topological concepts in DL.
TDL methods can be employed in observational or interventional modes. In an observational capacity, these methods enhance the comprehension of existing DL models and their topological formulations; in an interventional capacity, they inform the design of new, topology-aware models. Central to TDL is the notion of persistent homology. Persistent homology is an algebraic topology technique that uses filtration to bridge abstract topology and intricate geometry, resulting in a multiscale analysis of the shape of data. It identifies and quantifies topological features, such as connected components, loops, voids, and higher-dimensional structures within datasets, via their topological invariants.
By incorporating persistent homology into DL frameworks, TDL can extract meaningful topological signatures from data and exploit them for various ML tasks, including classification, regression, clustering, and anomaly detection. It can also implement and/or incorporate topological structures and topological spaces (domains) in DNNs, activation functions, loss functions, and sophisticated DL architectures like transformers, autoencoders, and reinforcement learning models, ultimately leading to topological transformers and simplicial neural networks.
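To make the idea of a filtration concrete, 0-dimensional persistent homology of a point cloud can be computed with a simple union-find over edges sorted by length. The sketch below is a minimal pure-Python illustration (real pipelines use dedicated libraries such as GUDHI or Ripser); the function name is our own:

```python
import math
from itertools import combinations

def persistence_h0(points):
    """0-dimensional persistence pairs (birth, death) of a point cloud
    under the Vietoris-Rips filtration.

    Every connected component is born at filtration value 0; a component
    dies when the growing balls first merge it into another component.
    The last surviving component has infinite persistence."""
    n = len(points)
    # All pairwise distances, sorted: the filtration order of edges.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    diagram = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # two components merge
            parent[ri] = rj
            diagram.append((0.0, dist))   # one component dies here
    diagram.append((0.0, math.inf))       # one component never dies
    return diagram

# Two well-separated clusters: the long-lived pair reveals them.
cloud = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(persistence_h0(cloud))
```

The short-lived pairs reflect points merging within each cluster, while the single long-lived finite pair records the merge of the two clusters — exactly the kind of multiscale signature TDL feeds into downstream models.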
Advantages of Topological Deep Learning
TDL models offer greater interpretability than black box DL models. TDL algorithms extract topological signatures that provide intuitive insights about the underlying structure of data, thereby facilitating better understanding and interpretation of model predictions. Additionally, TDL is inherently robust to noise and can handle high-dimensional datasets more effectively than traditional methods. By focusing on the intrinsic topological structure of data, the field is able to uncover essential patterns and relationships that may be obscured by noise or irrelevant features.
Applications of Topological Deep Learning
TDL finds broad applicability across various domains, including biology, chemistry, materials science, neuroscience, social networks, and computer vision. Its versatility enables it to analyze diverse data types and address a wide range of real-world problems, such as disease diagnosis, data compression, image recognition, drug design, chip design, and graph analysis.
Challenges and Opportunities in Topological Deep Learning
Despite its promise, TDL is still a relatively young field with several challenges and opportunities for further research. Key areas for future exploration include the effective integration of domain-specific knowledge and constraints to enhance the performance and interpretability of TDL models. Additionally, the desire to evolve TDL methods underscores the necessity of profound advancements in topological theories. This direction encompasses the examination of topological formulations beyond homology, such as Laplacian and Dirac operators, as well as the extension of topological domains to include cell complexes, path complexes, directed flag complexes, directed graphs, hypergraphs, hyperdigraphs, cellular sheaves, knots, links, and curves.
Topological Deep Learning in Detail
Traditional deep learning models excel at processing data on regular grids and sequences. However, scientific and real-world data often live on more intricate domains, including point clouds, meshes, time series, scalar fields, graphs, or general topological spaces like simplicial complexes and CW complexes. TDL addresses this by incorporating topological concepts to process data with higher-order relationships, such as interactions among multiple entities and complex hierarchies. This approach leverages structures like simplicial complexes and hypergraphs to capture global dependencies and qualitative spatial properties, offering a more nuanced representation of data.
The mathematical foundations of TDL are algebraic topology, differential topology, and geometric topology. Traditional deep learning techniques often operate under the assumption that a dataset resides in a highly structured or Euclidean space. An independent perspective on different types of data originated from topological data analysis, which proposed a new framework for describing the structural information of data, i.e., their "shape," that is inherently aware of multiple scales in data, ranging from local to global information. While at first restricted to smaller datasets, subsequent work developed new descriptors that efficiently summarized topological information of datasets to make them available for traditional machine-learning techniques, such as support vector machines or random forests.
Topological Domains
One of the core concepts in topological deep learning is the domain upon which data is defined and supported. In the case of Euclidean data, such as images, this domain is a grid, upon which the pixel values of the image are supported. In a more general setting, this domain might be a topological domain.
Given a finite set S of abstract entities, a neighborhood function on S is an assignment that attaches to every point in S a subset of S or a relation. Such a function can be induced by equipping S with an auxiliary structure. Edges provide one way of defining relations among the entities of S. More specifically, edges in a graph allow one to define a notion of neighborhood using, for instance, the one-hop neighborhood. Edges, however, are limited in their modeling capacity, as they can only model binary relations: each edge typically connects exactly two entities. In many applications, it is desirable to permit relations that incorporate more than two entities. This idea of using relations involving more than two entities is central to topological domains.
A rank function on a higher-order domain X is an order-preserving function rk: X → Z that attaches a non-negative integer value to each relation x in X, preserving set inclusion: x ⊆ y implies rk(x) ≤ rk(y). Relations in a higher-order domain are called set-type relations if the existence of a relation is not implied by another relation in the domain. Hypergraphs constitute examples of higher-order domains equipped with set-type relations.
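These notions are easy to make concrete. The sketch below (illustrative data structures; the names are our own, not a library API) contrasts one-hop neighborhoods induced by binary edges with a hyperedge-based relation that groups more than two entities at once, and shows a rank function given by cardinality:

```python
# Binary edges versus set-type (hyperedge) relations on a small
# entity set, plus an order-preserving rank function.
entities = {"a", "b", "c", "d"}
edges = {("a", "b"), ("b", "c")}                 # binary relations
hyperedges = [{"a", "b", "c"}, {"c", "d"}]       # set-type relations

def one_hop(v):
    """Neighbors of v reachable through a single edge."""
    return {u for e in edges if v in e for u in e if u != v}

def hyper_neighbors(v):
    """Neighbors of v that share at least one hyperedge with it."""
    return {u for h in hyperedges if v in h for u in h if u != v}

def rank(relation):
    """Order-preserving rank: |x| - 1, so inclusion implies rank <=."""
    return len(relation) - 1

print(one_hop("b"))            # {'a', 'c'}
print(hyper_neighbors("c"))    # {'a', 'b', 'd'}
print(rank({"a", "b", "c"}))   # 2
```

Note how the hyperedge {"a", "b", "c"} relates three entities in a single relation, something no collection of binary edges can express without losing the group structure.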
Topological Neural Networks
In practice, to perform the aforementioned tasks, deep learning models designed for specific topological spaces must be constructed and implemented. Central to TDL are topological neural networks (TNNs), specialized architectures designed to operate on data structured in topological domains. Unlike traditional neural networks tailored for grid-like structures, TNNs are adept at handling more intricate data representations, such as graphs, simplicial complexes, and cell complexes.
Let X be a topological domain, and define a set N = {N_1, …, N_k} of neighborhood functions on X. Consider a cell x and let y ∈ N_i(x) for some N_i ∈ N. A message m_{x,y} between cells x and y is a computation dependent on these two cells or the data supported on them. Denote by N(x) the multi-set of all neighbors of x across the chosen neighborhood functions, and let h_x^(l) represent the data supported on cell x at layer l.
The message m_{x,y} is influenced by both the data h_x^(l) and h_y^(l) associated with cells x and y, respectively. Additionally, it incorporates characteristics specific to the cells themselves, such as orientation in the case of cell complexes. Messages from neighboring cells are first aggregated within each neighborhood; the results from different neighborhoods are then combined, and the aggregated messages determine the state h_x^(l+1) of the cell in the next layer.
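A single step of this multi-neighborhood message passing can be sketched in a few lines. The following is a minimal illustration, not a specific published architecture: features live on vertices, and two neighborhoods are used, adjacency through edges and adjacency through a shared triangle (a 2-cell); the message, aggregation, and update functions are deliberately simple stand-ins:

```python
import numpy as np

# Vertex features and two neighborhood functions on a tiny complex
# with triangle {0, 1, 2} and a pendant edge {2, 3}.
features = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]),
            2: np.array([1.0, 1.0]), 3: np.array([2.0, 0.0])}
edge_nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # via edges
tri_nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: []}  # via the triangle

def message(h_x, h_y):
    # Message depends on the data of both cells; here simply their mean.
    return 0.5 * (h_x + h_y)

def step(h):
    out = {}
    for x in h:
        agg = np.zeros_like(h[x])
        for nbrs in (edge_nbrs, tri_nbrs):
            msgs = [message(h[x], h[y]) for y in nbrs[x]]
            if msgs:
                agg += np.mean(msgs, axis=0)   # within-neighborhood
        out[x] = np.tanh(h[x] + agg)           # combine and update
    return out

updated = step(features)
print(updated[0])
```

Vertex 3 receives messages only through the edge neighborhood, since it belongs to no triangle; vertices 0 through 2 aggregate over both neighborhoods before the update.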
While the majority of TNNs follow the message passing paradigm from graph learning, several models have been suggested that do not follow this approach. Maggs et al. leverage geometric information from embedded simplicial complexes, i.e., simplicial complexes with high-dimensional features attached to their vertices. This offers interpretability and geometric consistency without relying on message passing.
Motivated by the modular nature of deep neural networks, initial work in TDL drew inspiration from topological data analysis and aimed to make the resulting descriptors amenable to integration into deep-learning models. This led to work defining new layers for deep neural networks. Pioneering work by Hofer et al., for instance, introduced a layer that permitted topological descriptors like persistence diagrams or persistence barcodes to be integrated into a deep neural network. This was achieved by means of end-to-end-trainable projection functions, permitting topological features to be used to solve shape classification tasks, for instance. Follow-up work expanded on the theoretical properties of such descriptors and integrated them into the field of representation learning. Other such topological layers include layers based on extended persistent homology descriptors, persistence landscapes, or coordinate functions. In parallel, persistent homology also found applications in graph-learning tasks.
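The core idea behind such trainable projection layers can be sketched as follows. This is an illustration in the spirit of those layers, not a reproduction of any specific published formulation: each diagram point contributes a Gaussian response at a set of centers, turning a variable-size persistence diagram into a fixed-size, differentiable feature vector. In a real layer the centers and widths would be learned end to end; here they are fixed stand-ins:

```python
import numpy as np

def diagram_layer(diagram, centers, sigma=1.0):
    """Project a persistence diagram onto fixed-size features.

    Each point (birth, death) is mapped to (birth, persistence), then
    evaluated against Gaussian bumps at the given centers; responses
    are summed over all diagram points, so the output size depends
    only on the number of centers, not on the diagram size."""
    pts = np.array([(b, d - b) for b, d in diagram])
    diff = pts[:, None, :] - centers[None, :, :]   # (n_pts, n_centers, 2)
    resp = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2))
    return resp.sum(axis=0)                        # (n_centers,)

diagram = [(0.0, 1.0), (0.0, 1.1), (0.2, 2.5)]     # three persistence pairs
centers = np.array([[0.0, 1.0], [0.0, 2.0]])       # stand-in "learned" centers
print(diagram_layer(diagram, centers))
```

Because every operation is smooth, gradients flow through the projection, which is what makes diagrams usable inside an end-to-end-trained network.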
The Frontier of Relational Learning
TDL is a rapidly growing field that seeks to leverage topological structure in data and facilitate learning from data supported on topological objects, ranging from molecules to 3D shapes. Most TDL architectures can be unified under the framework of higher-order message-passing (HOMP), which generalizes graph message-passing to higher-order domains.
TDL may complement graph representation learning and geometric deep learning by incorporating topological concepts, and can thus provide a natural choice for various machine learning settings.
TDL plays a critical role in the encoding, modeling and analysis of relational data. The topology of the underlying data space determines the choice of possible neural network architectures. Topological domains enable the modeling of data containing multi-way interactions (also known as higher-order relations). TDL captures regularities inherent to manifolds, such as ‘remeshing symmetry’. TDL captures topological equivariances in the data. In summary, TDL takes into account topological characteristics that appear in relational data, and therefore is a natural choice for various machine learning problems.
Topological spaces enrich deep learning methods from a variety of perspectives:
- A topological space, modeled as a cell complex, enables a flexible molecular representation. This representation can improve the performance of a deep learning model supported on this space.
- Topological neural networks enable the processing of data, for example via higher-order message-passing schemes on a topological space. These networks find a wide range of applications, from computer graphics to drug discovery.
- Topological spaces allow hierarchical representations of the underlying data that naturally correspond to pooling operations in deep learning.
- The topological characteristics of the underlying data are crucial when selecting a neural network architecture.
In the early stages, the term ‘TDL’ was often used to refer to the incorporation of features generated by persistent homology within the input pipeline of a deep neural network (DNN). However, the term ‘TDL’ refers to the collection of ideas and methods related to the use of topological concepts in deep learning.
In TDL, multi-way interactions between entities constitute useful features capable of embedding topological structure via deep learning algorithms. Higher-order relations capture long-distance or seemingly disparate connections in a system, providing scope for effective or robust message-passing schemes.
Beyond higher-order relations, the topological view may also capture the regularities inherent to manifolds. Examples include ‘remeshing symmetry’ over manifolds, such as being invariant to different triangulations of a sphere or inducing similar behaviors at different meshing resolutions.
TDL is a natural approach to capture ‘topological equivariances’. For example, if a classification algorithm is meant to identify different knots, then it is useful to understand the stabilizer group involving isotopies of the complement. In general, GNNs and GDL are based on ‘standard groups’. For instance, GNNs adopt the permutation group, and applications of GDL in molecular modeling use subgroups of the Euclidean group, such as the special Euclidean group SE(n) or the special orthogonal group SO(n). TDL incorporates more complex homeomorphism groups that act on a space, while trying to preserve some embedding information. This embedding information can include, for example, the arrangement of critical points in Morse-Smale complexes or the nesting of circles, which leads to the concept of ‘tree of shapes’ in image processing.
Relational data constitute a main modality of data emerging from natural and artificial systems. In such systems, sets of objects are interconnected via binary or higher-order relations, and relational data encode features of these interconnections.
Applications and Success Stories
A natural use case of TDL involves attributed graphs, which combine structural information with feature information. Such graphs arise in numerous domains, for example, involving protein structures, drug design, virus analysis, or structural representations of molecules and materials. However, applications that require higher-order topological structures have been limited to intrinsically complex data, such as those arising in the biological sciences. Perhaps the most compelling applications of TDL, consistently demonstrating its advantages over existing methods, are its victories in the D3R Grand Challenges, the discovery of SARS-CoV-2 evolution mechanisms, and the successful forecasting of the SARS-CoV-2 variants BA.2, BA.4, and BA.5 about two months in advance.
Compelling applications of TDL have been developed, in which the topological properties of the data empirically demonstrate a competitive edge. Several application areas are plausible candidates for TDL to shine, since their underlying domains give rise to topological structures. Topological structures emerge in several scientific areas, including data compression, natural language processing (NLP), computer vision and computer graphics, chemistry, biological imaging, virus evolution, drug design, neuroscience, protein engineering, chip design, semantic communications, satellite imagery, and materials science. Synergies with researchers from these scientific disciplines are encouraged to develop real-world or impactful applications of TDL.
In addition to showcasing the practical benefits of TDL, applications can play an important role in the development and deployment of TDL models. In particular, standardized datasets derived from applications are instrumental in driving TDL research; the Open Graph Benchmark, for example, is a set of benchmark datasets developed to facilitate reproducible graph machine learning research. Higher-order data, by contrast, remain scarce. Such data can either be collected, with the TDL application areas being natural candidates, or synthesized: a graph can be lifted to a higher-order domain, and lifting procedures thus supply mechanisms for generating synthetic topological data. A survey of lifting procedures and their associated message-passing schemes, together with a systematic assessment and generalization of graph-lifting and rewiring algorithms, is therefore a plausible path towards synthetic higher-order datasets.
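One common lifting procedure is the clique lift, in which every clique of a graph becomes a cell of the resulting simplicial complex, so a triangle in the graph yields a 2-dimensional cell. A minimal sketch (the function name is our own):

```python
from itertools import combinations

def clique_lift(nodes, edges, max_dim=2):
    """Lift a graph to its clique complex up to dimension max_dim:
    every (k+1)-subset of nodes whose pairs are all edges becomes
    a k-cell of the resulting simplicial complex."""
    edge_set = {frozenset(e) for e in edges}
    cells = [frozenset([v]) for v in nodes]        # 0-cells: the vertices
    for k in range(2, max_dim + 2):                # k-node cliques
        for subset in combinations(sorted(nodes), k):
            if all(frozenset(p) in edge_set
                   for p in combinations(subset, 2)):
                cells.append(frozenset(subset))
    return cells

# Triangle 0-1-2 plus a pendant edge 2-3: the lift adds one 2-cell.
cells = clique_lift([0, 1, 2, 3], [(0, 1), (1, 2), (0, 2), (2, 3)])
print(cells)
```

Applied to any graph dataset, such a lift produces a synthetic higher-order dataset on which simplicial or cellular message-passing models can be benchmarked.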
The curation of a collection of higher-order datasets can pave the way for TDL benchmarks. The design of open source and reproducible benchmark suites for TDL requires a minimal collection of higher-order benchmark datasets, as well as implementations of graph-lifting algorithms for generating synthetic datasets in higher-order domains. To ease user experience, a taxonomy of higher-order datasets is a recommended feature, organizing benchmarks, for example, by dataset size and type of learning task. Benchmark suites for TDL are expected to have a comprehensive set of performance metrics that extend beyond predictive performance. The basic components of TDL benchmark suites include higher-order datasets, graph-lifting algorithms, and predictive and stability metrics.
Software Packages for Topological Deep Learning
There are several graph-based learning software packages, such as NetworkX, KarateClub, PyG, and DGL. NetworkX facilitates computations on graphs, and KarateClub implements algorithms for unsupervised learning on graph-structured data. Four software packages provide functionality on higher-order structures, namely HyperNetX, XGI, DHG, and TopoX. HyperNetX enables computations on hypergraphs, while XGI provides similar functionality on hypergraphs and simplicial complexes. DHG is a deep learning package for graphs and hypergraphs. TopoX is a suite of Python packages designed to compute and learn with topological neural networks. The suite consists of three packages: TopoNetX, TopoEmbedX, and TopoModelX. TopoNetX supports computations on graphs and higher-order domains, including colored hypergraphs, simplicial complexes, cell complexes, path complexes, and combinatorial complexes. TopoEmbedX provides methods to embed higher-order domains into Euclidean domains. TopoModelX implements the majority of topological neural networks surveyed in Papillon et al. pytorch-topological combines several state-of-the-art packages for TDA, including giotto-tda and Ripser, thus enabling the creation of topology-driven algorithms that work on point clouds or structured data such as images. Similarly, torch-ph and TopologyLayer support persistent homology computations, as well as differentiating through these computations to facilitate the development of topology-informed loss functions. However, despite numerous…

