Machine Learning Deciphers the High-Dimensional Organization of the *E. coli* Chromosome

The three-dimensional organization of the bacterial chromosome plays a crucial role in regulating gene expression, DNA replication, and other essential cellular processes. Understanding this organization, however, is challenging due to the high-dimensional nature of the data generated by chromosome conformation capture techniques, such as Hi-C. Recent advances in machine learning (ML) offer powerful tools for extracting meaningful information from these complex datasets. This article explores how ML techniques are being used to unravel the intricate structural and dynamic organization of the Escherichia coli chromosome.

Introduction

The archetypal bacterium Escherichia coli contains a supercoiled circular DNA molecule, 1.6 mm in length and 4.64 Mb in size, that is confined within a spherocylinder measuring only 2-4 μm long. Initially, the E. coli chromosome was viewed as a complex blob of DNA, proteins, and RNA. However, subsequent research has revealed a well-organized structure with distinct domains known as macrodomains (MDs). Chromosome conformation capture techniques, particularly Hi-C, have provided crucial insights into the spatial organization of the genome and its higher-order structures. Hi-C generates a high-resolution contact map, the Hi-C matrix, which captures the proximity and contact frequency between different regions of the E. coli chromosome. These Hi-C matrices, when generated for different mutants, offer valuable insights into the functions of nucleoid-associated proteins (NAPs) and their role in maintaining the nucleoid's structure. The resulting chromosomal organization significantly influences the dynamic behavior of chromosomal loci, which move subdiffusively at rates that depend on their genomic coordinates; their dynamics are therefore heterogeneous. Integrating Hi-C data into polymer-based models has enabled data-informed integrative studies, providing a wealth of structural and dynamical detail about the E. coli chromosome.

The Challenge of High-Dimensionality

The Hi-C-derived chromosomal contact map is a high-dimensional interaction matrix. Even for a prokaryotic cell like E. coli, the Hi-C matrix can be as large as 928 × 928 at 5 kb resolution. The high dimensionality of this interaction map makes it challenging to discern meaningful information through visual inspection.

Machine Learning to the Rescue

Machine learning techniques have emerged as powerful tools for the automated extraction of valuable insights from high-dimensional data. While most ML-based investigations have focused on eukaryotic chromosomes, ML studies of prokaryotic chromosomes are less developed, primarily due to the lower resolution and smaller quantities of available data.

This article addresses what underlies an ML-derived low-dimensional representation of the Hi-C map of E. coli. First, an artificial neural network (ANN) based framework known as an Autoencoder is employed to uncover crucial structural insights embedded within this large Hi-C matrix. A latent-space representation of the Hi-C map identifies the various MDs with a high degree of accuracy relative to experimentally derived MDs. In a complementary approach, Hi-C contacts are integrated into a polymer-based model, and the diffusive dynamics of a large number of chromosomal loci are predicted using a supervised machine learning technique called Random Forest (RF) regression. The regression model successfully recovers the coordinate-dependent, heterogeneous subdiffusion of chromosomal loci. Moreover, the input features most crucial to this dynamical behavior are extracted, and incorporating only these important Hi-C contacts into the polymer model successfully reproduces the loci dynamics.


Unsupervised Machine Learning Identifies Intrinsic Structural Patterns

An unsupervised machine learning algorithm known as Autoencoder is employed to unveil the essential structural insights embedded within the Hi-C matrix.

Autoencoders: A Powerful Tool for Dimensionality Reduction

The Autoencoder is a type of unsupervised deep neural network characterized by a dual structure comprising an encoder and a decoder, with a bottleneck in between. The encoder converts the input data from a high-dimensional space to a lower-dimensional representation known as the latent space. Subsequently, the decoder reconstructs the initial input data from this latent space. This process involves the adjustment of model parameters, primarily weights and biases. Each dimension in the latent space corresponds to a latent variable.

In the ML model, the input comprises a single Hi-C probability matrix with dimensions 928 × 928 (4640 kb/5 kb = 928). The Autoencoder architecture is structured with a total of nine sequential layers featuring neuron counts of 928, 500, 200, 100, Ld, 100, 200, 500 and 928, respectively, where Ld denotes the dimension of the latent space.
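The nine-layer architecture above can be sketched as follows. This is a minimal illustration assuming a PyTorch implementation; the framework, activation functions, and output non-linearity are illustrative choices, not taken from the original work.

```python
# Sketch of the 928-500-200-100-Ld-100-200-500-928 Autoencoder
# (assumed PyTorch implementation; activations are illustrative).
import torch
import torch.nn as nn

N_BINS = 928  # 4640 kb / 5 kb = 928 bins

class HiCAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(N_BINS, 500), nn.ReLU(),
            nn.Linear(500, 200), nn.ReLU(),
            nn.Linear(200, 100), nn.ReLU(),
            nn.Linear(100, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 100), nn.ReLU(),
            nn.Linear(100, 200), nn.ReLU(),
            nn.Linear(200, 500), nn.ReLU(),
            nn.Linear(500, N_BINS), nn.Sigmoid(),  # contact probabilities in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)        # latent-space coordinates
        return self.decoder(z), z  # reconstruction + latent representation

model = HiCAutoencoder(latent_dim=3)
hic = torch.rand(N_BINS, N_BINS)   # placeholder: each row is one 5 kb bin's contact profile
recon, latent = model(hic)
print(recon.shape, latent.shape)   # torch.Size([928, 928]) torch.Size([928, 3])
```

Treating each row of the matrix as one sample gives 928 points in the latent space, one per 5 kb bin, which is what the clustering step later operates on.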

Choosing the Right Latent Dimension

After setting up the Autoencoder architecture, Ld needs to be chosen judiciously. The fraction of variance explained (FVE) is assessed as a function of the latent dimension. A latent dimension of Ld = 3 is chosen, as it achieves an FVE of at least 0.85, meaning that the Autoencoder's reconstruction accounts for at least 85% of the variance in the input Hi-C data.
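The FVE criterion can be sketched with the standard definition FVE = 1 − SS_res / SS_tot (the study's exact definition may differ). The matrices below are synthetic stand-ins for the experimental and reconstructed maps.

```python
# Fraction of variance explained (FVE) between an input matrix and
# its reconstruction, using the standard 1 - SS_res/SS_tot form.
import numpy as np

def fve(x_true: np.ndarray, x_recon: np.ndarray) -> float:
    ss_res = np.sum((x_true - x_recon) ** 2)          # residual sum of squares
    ss_tot = np.sum((x_true - x_true.mean()) ** 2)    # total variance of the input
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = rng.random((928, 928))                            # stand-in Hi-C matrix
noisy = x + 0.05 * rng.standard_normal(x.shape)       # stand-in reconstruction
print(round(fve(x, noisy), 2))
```

Scanning `fve` over reconstructions from Ld = 1, 2, 3, … and picking the smallest Ld that clears the 0.85 threshold reproduces the selection logic described above.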

Training the Autoencoder

To assess training robustness across various latent space dimensions (Ld) with respect to the number of epochs, the training loss is plotted as a function of epochs for different Ld. The curves clearly show that beyond 25 epochs, the training loss saturates for all Ld > 1. This implies that any number of epochs greater than 25 is a prudent choice. In the model, 100 epochs are used.
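The epoch-versus-loss check can be sketched as a simple training loop. The tiny stand-in model, MSE reconstruction loss, and Adam optimizer here are assumptions for illustration; only the 100-epoch count comes from the text.

```python
# Illustrative training loop: record the reconstruction loss per epoch
# to check where it saturates (model and optimizer are stand-ins).
import torch
import torch.nn as nn

torch.manual_seed(0)
data = torch.rand(928, 928)  # placeholder for the Hi-C probability matrix

# Minimal stand-in autoencoder so the loop runs standalone.
model = nn.Sequential(nn.Linear(928, 3), nn.ReLU(), nn.Linear(3, 928))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(100):                # 100 epochs, as in the text
    optimizer.zero_grad()
    loss = loss_fn(model(data), data)   # reconstruction error
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Plotting `losses` per Ld reproduces the saturation check described above.
print(losses[0] > losses[-1])
```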


Validating the Autoencoder's Performance

A comparison between the input (experimental) and output (reconstructed) matrices is conducted: the experimental genome-wide contact probability map is compared against the ML-reconstructed matrix (Autoencoder with Ld = 3), along with a histogram of the difference between the two. The findings reveal a Pearson correlation coefficient (PCC) of 0.94 between the experimental and ML contact probability matrices. Additionally, the absolute difference in the mean values is 0.023, indicating strong agreement between experimental and ML-derived chromosomal interactions.
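This validation step can be sketched as below. The matrices are synthetic stand-ins, so the numbers will not match the reported PCC of 0.94 or mean difference of 0.023.

```python
# Pearson correlation and mean difference between the flattened
# experimental and reconstructed contact matrices (synthetic data).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
experimental = rng.random((928, 928))                               # stand-in Hi-C map
reconstructed = experimental + 0.02 * rng.standard_normal((928, 928))  # stand-in output

pcc, _ = pearsonr(experimental.ravel(), reconstructed.ravel())
mean_diff = abs(experimental.mean() - reconstructed.mean())
print(round(pcc, 2), round(mean_diff, 4))
```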

Macrodomain Identification

The biological significance of the lower-dimensional representation (Ld = 3) of the input data is assessed. To achieve this, a scatter plot of the latent space data is generated and clustering is conducted using the K-means algorithm. The hypothesis is that each cluster corresponds to a specific domain of the bacterial chromosome, inherently encoded in the Hi-C matrix. Biologically, these large-scale, structurally distinct domains are the macrodomains (MDs).
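The clustering step can be sketched with scikit-learn's K-means. The latent coordinates below are synthetic, and the choice of six clusters (matching the six named MDs discussed later) is an assumption.

```python
# K-means clustering of the 3-D latent coordinates into
# macrodomain-like groups (synthetic latent points; k = 6 assumed).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
latent = rng.random((928, 3))  # stand-in: one 3-D latent point per 5 kb bin

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(latent)
labels = kmeans.labels_        # one cluster (candidate MD) label per bin
print(len(labels), len(set(labels)))  # 928 6
```

Because each latent point maps back to a genomic bin, the cluster labels translate directly into base-pair ranges, which is what enables the comparison with experimentally denoted MD boundaries.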

Macrodomains: Organizing Principles of the Bacterial Chromosome

The actual molecular mechanisms governing macrodomain organization remain incompletely understood, and the precise boundaries of these MDs vary across different reports. For example, in 2000, Niki et al. identified four macrodomains: Ori, Right, Ter and Left. Later experimental studies by Valens et al. and Espéli et al. identified two more macrodomains, NSR and NSL. The variations in macrodomain boundaries observed across studies are primarily attributed to differences in the methods applied.

Comparing ML-Derived and Experimentally Determined Macrodomains

A scatter plot illustrates the three-dimensional (Ld = 3: χ1, χ2 and χ3) latent space, with distinct color-coded clusters representing the various MDs of the chromosome. Experimental studies provide a priori knowledge of the base pairs belonging to each macrodomain, and the clustering yields the corresponding base-pair assignments. A detailed comparison between experimentally denoted and ML-derived MDs is then facilitated by schematically drawing the DNA as a circle: the inner and outer circles, featuring various color-coded regions, represent the experimentally denoted and ML-derived macrodomains, respectively, with base-pair positions annotated in kilobases (kb). Visual inspection indicates substantial agreement between the MDs, barring discrepancies in the NSR, Right, and Ter MDs. A quantitative comparison between actual (experimentally denoted) and predicted (ML-derived) MDs is provided by the confusion matrix.

Evaluating the Accuracy of Macrodomain Prediction

The F1-score for each MD is calculated as F1 = 2TP / (2TP + FP + FN), where TP, FP and FN stand for 'True Positive', 'False Positive' and 'False Negative', respectively; TN ('True Negative') enters the overall accuracy, (TP + TN) / (TP + TN + FP + FN). F1-scores exceeding 0.92 for the Left and NSL MDs suggest a strong match between actual and predicted classes. Conversely, lower F1-scores for the other three MDs indicate a moderate alignment. Nevertheless, the overall accuracy across all classes stands at 0.82, indicative of a robust correlation between experimentally denoted and ML-predicted MDs.
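The scoring step can be sketched with scikit-learn's metrics. The labels below are toy data with a fixed corruption rate, so the scores will not match the reported per-MD F1-scores or the 0.82 accuracy.

```python
# Per-class F1 and overall accuracy for actual vs. predicted MD labels
# (toy labels; ~18% of predictions are randomly corrupted).
import numpy as np
from sklearn.metrics import f1_score, accuracy_score

rng = np.random.default_rng(3)
mds = np.array(["Ori", "Right", "Ter", "Left", "NSR", "NSL"])
actual = rng.choice(mds, size=928)       # stand-in experimental assignment per bin
predicted = actual.copy()
flip = rng.random(928) < 0.18            # mimic an imperfect prediction
predicted[flip] = rng.choice(mds, size=flip.sum())

per_class_f1 = f1_score(actual, predicted, labels=mds, average=None)
accuracy = accuracy_score(actual, predicted)
print(dict(zip(mds, per_class_f1.round(2))), round(accuracy, 2))
```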


In summary, the unsupervised ML model (Autoencoder) offers a potent automated approach for MD identification, demonstrating a high degree of accuracy against experimentally derived MDs.

Machine Learning Identifies Genomic Contacts Crucial for E. coli Dynamics

In the preceding section, the intrinsic structural properties of the E. coli chromosome embedded within the Hi-C matrix were explored, leading to automated discovery of segmented macrodomains in an ML-derived low-dimensional subspace. In this section, the question is posed: can we identify the crucial subset of chromosomal contacts in the Hi-C map that holds the key to the heterogeneous, coordinate-dependent diffusivities of chromosomal loci? Towards this end, an ML-based protocol, Random Forest regression, is employed to extract dynamical information by leveraging structural properties of the chromosome, such as the pairwise distances between chromosomal beads.

Random Forest Regression: A Supervised Learning Approach

This supervised, tree-based algorithm, initially proposed by Breiman, is a potent tool widely used for both classification and regression tasks. Random Forest regression operates by drawing random subsets of the training data (features and labels), building an ensemble of decision trees (a forest) on them, and averaging the trees' predictions.
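A minimal sketch of the approach, using scikit-learn's `RandomForestRegressor`: the features stand in for pairwise distances derived from Hi-C contacts and the target stands in for a per-locus diffusivity (all data here is synthetic). The fitted model's `feature_importances_` attribute is the standard route to ranking which input contacts matter most, mirroring the feature-extraction step described earlier.

```python
# Random Forest regression of per-locus diffusivity from
# distance-like features (synthetic data for illustration).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n_loci, n_features = 500, 20
X = rng.random((n_loci, n_features))                 # stand-in pairwise-distance features
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(n_loci)  # stand-in diffusivity: feature 0 matters

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Rank features by importance; only the top-ranked contacts would be
# fed back into the polymer model in the workflow described above.
top = np.argsort(rf.feature_importances_)[::-1][:3]
print(int(top[0]))
```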

