Predicting Type 1 Diabetes: An Advanced Bioinformatics Framework Integrating Synthetic Gene Expression
Type 1 diabetes (T1D) is a common chronic disease in children caused by an autoimmune response against pancreatic cells. Despite active research, its exact causes remain unknown and no cure is known. Predicting the outcome of T1D plays a vital role in identifying novel risk factors, ensuring early patient care, and designing cohort studies. This article explores an advanced bioinformatics framework for gene expression imputation and islet autoimmunity (IA) prediction in the context of T1D research. IA, which precedes the clinical onset of T1D, can be used as a marker to study the progression toward T1D.
The TEDDY Study
The Environmental Determinants of Diabetes in the Young (TEDDY) is a longitudinal prospective study that uses a nested case-control cohort to identify risk factors associated with T1D. It is designed to identify the environmental risk factors that affect the development of IA and the onset of T1D. TEDDY enrolls 8676 high-risk children based on the HLA genotype of the children and their first-degree relatives. Follow-up for each child starts at 3 months and lasts until 15 years of age. Children are tested for islet autoantibodies (IAA, GADA, IA-2A, ZnT8A) at each visit, and gene expression is also measured. Visits are 3 months apart for the first 4 years. After that, participants with any positive islet autoantibody test continue to visit every 3 months, and the rest visit every 6 months. The outcome of interest in this study, IA, is defined as two consecutive positive tests for any particular islet autoantibody. In other words, a participant is considered IA positive if at least one of the four autoantibodies tests positive at consecutive visits. Many risk factors, including HLA genotype, SNPs, dietary factors, family history, sex and seroconversion age, have been investigated in previously published works, which narrows down the candidates for a predictive study.
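The IA definition above can be read as a simple rule over a participant's ordered visit history. The following is a minimal sketch of that rule, not TEDDY's actual data pipeline: the function name, the dictionary layout of a visit, and the choice to treat an untested autoantibody as not positive are all assumptions made for illustration.

```python
from typing import Dict, List

AUTOANTIBODIES = ("IAA", "GADA", "IA-2A", "ZnT8A")


def is_ia_positive(visits: List[Dict[str, bool]]) -> bool:
    """Return True if any single autoantibody is positive at two consecutive visits.

    `visits` is ordered by visit date; each entry maps an autoantibody name
    to its test result (True = positive). A missing key is treated as
    not positive, which is an assumption of this sketch.
    """
    for antibody in AUTOANTIBODIES:
        for previous, current in zip(visits, visits[1:]):
            if previous.get(antibody, False) and current.get(antibody, False):
                return True
    return False


# Hypothetical visit histories, for illustration only.
persistent = [{"GADA": False}, {"GADA": True}, {"GADA": True}]
transient = [{"GADA": True}, {"IAA": True}, {"GADA": True}]

print(is_ia_positive(persistent))
print(is_ia_positive(transient))
```

In the second history, positives occur at every visit but never for the same autoantibody twice in a row, so the rule does not flag the participant.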
TEDDY is a longitudinal cohort study that collects a vast amount of multi-omics and clinical data from its participants to explore the progression and markers of T1D. However, missing data in the omics profiles make outcome prediction difficult. TEDDY collected time series gene expression for less than 6% of enrolled participants. Additionally, for the participants whose gene expression was collected, 79% of time steps are missing. This limits the opportunity for a comprehensive integrative study involving all participants.
The Challenge of Missing Data
Gene expression changes throughout the timeline of chronic diseases such as diabetes, hypertension, obesity and heart disease; therefore, a periodically measured gene expression may better explain the underlying mechanisms of these diseases compared with cross-sectional gene expression collected once per participant [1]. Some prospective longitudinal cohort studies collect that information. However, these studies tend to suffer from loss to follow-up, which means the time series data will have missing values if participants are absent during scheduled visits when data are collected. Moreover, limited by cost and logistics, data are often collected for a subset of participants, i.e. some participants will have no gene expression data available. An effective data imputation technique is necessary to use the gene expression for downstream analyses.
Researchers have investigated computational methods for handling the missing value problem in gene expression, and several algorithms have been proposed to impute gene expression. The missing gene expression problem can be broadly divided into two groups: (1) participants with partially available gene expression, and (2) participants with no available gene expression. Many frameworks have been developed to solve the first problem; they consider global or local relations among genes, domain knowledge and other omics data for imputation. The second problem is more apparent in multi-omics analysis, where some participants are present in another omics type but absent from gene expression. For such conditions, several frameworks have been developed that use other omics data to guide the imputation of gene expression.
As most studies evaluate gene expression profiles at a single time point, most of the available imputation frameworks are also designed to impute such gene expression datasets. The imputation of time series data poses additional challenges because of the time dependency among the time steps from the same participant. A handful of frameworks have been proposed for the imputation of time series gene expression data, but they handle neither multi-omics data nor participants with completely missing gene expression. More recently, some advanced algorithms have been proposed for time series data imputation in other domains.
The Proposed Bioinformatics Framework
This study introduces an advanced bioinformatics framework for gene expression imputation and islet autoimmunity (IA) prediction. The imputation model generates synthetic data for participants with partially or entirely missing gene expression. The prediction model integrates the synthetic gene expression with other risk factors to achieve better predictive performance. The primary objective of this work is to propose a model that imputes partially or entirely missing gene expression with synthetic data. We employ a deep learning-based model to generate synthetic gene expression from SNP data and available gene expression. We demonstrate that the synthetic gene expression contains a competitive predictive signal compared with the true gene expression and improves on state-of-the-art prediction results. We also explore the importance of time series gene expression in capturing the underlying mechanisms of T1D.
The framework has two main components: a deep learning-based imputation model and a long short-term memory (LSTM)-based classifier. Synthetic gene expression is first generated for missing time steps through the imputation model using SNPs and the available gene expression. Although gene expression is either partially or completely missing for every participant, SNP data are available for all of them. Therefore, our proposed imputation model is trained to map the SNP data to gene expression and generate the values for missing time steps. The imputation is carried out for each time step separately, i.e. the model is retrained for imputing every time step.
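Because the model is retrained for every time step, the participants must be partitioned afresh at each step. The sketch below illustrates only that partitioning, using the 70-30 training/validation ratio given in the description of the imputation model; it does not implement the deep learning model itself. The data layout (a dictionary of per-participant series with `None` for missing steps), the function name and the fixed seed are assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: expression[pid][t] holds the gene values measured for
# participant `pid` at time step `t`, or None if that time step is missing.
Expression = Dict[str, List[Optional[List[float]]]]


def split_for_time_step(
    expression: Expression, t: int, train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[str], List[str], List[str]]:
    """Partition participants for imputing time step `t`.

    Participants observed at `t` are shuffled and divided into training and
    validation sets; everyone else becomes the test set to be imputed.
    """
    observed = sorted(pid for pid, series in expression.items() if series[t] is not None)
    missing = sorted(pid for pid, series in expression.items() if series[t] is None)
    rng = random.Random(seed)
    rng.shuffle(observed)
    n_train = round(train_fraction * len(observed))
    return observed[:n_train], observed[n_train:], missing


# Toy data with two genes and two time steps, for illustration only.
toy: Expression = {
    "p1": [[0.1, 0.2], None],
    "p2": [[0.3, 0.1], [0.4, 0.0]],
    "p3": [None, None],
    "p4": [[0.5, 0.9], None],
}

for t in range(2):
    train, val, test = split_for_time_step(toy, t)
    # A model mapping SNPs to gene expression would be trained on `train`,
    # tuned on `val`, and then used to generate values for `test`.
    print(t, train, val, test)
```

Note that a participant such as `p3`, who has no gene expression at all, lands in the test set at every time step, which is how entirely missing profiles are covered.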
Methodology
Many TEDDY-identified risk factors have been previously explored, and family history, HLA genotype and SNPs were shown to be better predictors of IA status. Based on the literature, we include 12 SNPs, HLA genotype and family history in this study. Details about the 12 SNPs can be found in [22]. We performed an exhaustive search for the best SNP combination and found rs4597342, rs12708716, rs4948088 and rs1143678, combined with HLA genotype and family history, to be the best-performing combination for IA status prediction. Therefore, we include these variables in the further analyses of this study.
Risk factors are binarized before they are fed into the models. Family history is categorized as first-degree relatives having T1D versus no T1D. SNPs are categorized as major (no copy of the minor allele) versus minor (one or two copies of the minor allele). The HLA genotype is defined as DR3/DR4 versus others.
The gene expression in TEDDY is a time series with 2013 time steps belonging to 401 children. Gene expression is collected until 72 months at 3- or 6-month intervals. Approximately 79% of time steps are missing for the 401 participants, which significantly limits its use in a time series study. In the cohort of 6812 participants, the missing rate rises to 98.77%, as the other 6411 (94.11% of 6812) participants have no available gene expression. Therefore, the gene expression is unusable for downstream analyses involving the cohort of 6812 participants. The number of available participants at each time step is presented in Figure 1, which illustrates that the rate of missing participants increases in later time steps.
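The missing rates quoted above can be checked with a short calculation. The sketch assumes that every participant is expected to contribute one measurement per slot on a regular 3-month grid up to 72 months (24 slots); that grid is an assumption of this example, chosen because it reproduces the reported percentages.

```python
OBSERVED_TIME_STEPS = 2013     # time steps with measured gene expression
PROFILED_PARTICIPANTS = 401    # children with at least one measurement
COHORT_PARTICIPANTS = 6812     # participants in the analysis cohort
MONTHS_FOLLOWED = 72
VISIT_INTERVAL_MONTHS = 3      # assumption: count every slot on a 3-month grid

slots_per_participant = MONTHS_FOLLOWED // VISIT_INTERVAL_MONTHS


def missing_rate(observed: int, participants: int, slots: int) -> float:
    """Percentage of participant-by-time-step slots with no measurement."""
    return 100.0 * (1.0 - observed / (participants * slots))


profiled_missing = missing_rate(OBSERVED_TIME_STEPS, PROFILED_PARTICIPANTS, slots_per_participant)
cohort_missing = missing_rate(OBSERVED_TIME_STEPS, COHORT_PARTICIPANTS, slots_per_participant)
unprofiled = COHORT_PARTICIPANTS - PROFILED_PARTICIPANTS
unprofiled_share = 100.0 * unprofiled / COHORT_PARTICIPANTS

print(f"missing among profiled children: {profiled_missing:.2f}%")    # about 79%
print(f"missing across the cohort:       {cohort_missing:.2f}%")      # 98.77%
print(f"participants with no expression: {unprofiled} ({unprofiled_share:.2f}%)")
```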
Moreover, after 48 months, some participants visited every 6 months instead of 3, resulting in an even lower data availability rate. As available data are necessary to train the imputation model, a lower data availability rate disrupts the model training and thus degrades the quality of the synthetic gene expression. To reduce the impact of missing data and maintain a regular interval of 3 months between consecutive time steps, we set a cutoff of 48 months for gene expression in this study. Therefore, the gene expression of each participant consists of 16 time steps corresponding to 3 to 48 months at 3-month intervals. Although setting a cutoff lowers the missing rate to 98.22%, it is still impractical to use gene expression with predictive algorithms without effective data imputation. Therefore, we propose a deep learning-based imputation model, described in the following subsection, that can generate synthetic gene expression at missing time steps from SNPs. We keep 17 039 protein-coding genes in the gene expression. Using all 17 039 features may cause the model to overfit or impose a computational burden with redundant information, so we use forward feature selection to find the number of genes that gives the best prediction results. Once we have the optimal number of genes, gene expression, family history, HLA genotype and SNPs are merged into a single time series dataset. For family history, HLA genotype and SNPs, the same value for a participant is replicated at every time step.
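The merging step can be pictured as appending a participant's binarized static risk factors to the gene expression vector of every time step. The following is a minimal sketch under stated assumptions: the function names, the feature ordering and the string encoding of the HLA genotype are hypothetical, and the toy participant has only two genes.

```python
from typing import List

# 3, 6, ..., 48 months: the 16 time steps kept after the 48-month cutoff.
TIME_STEPS_MONTHS = list(range(3, 49, 3))


def binarize_static(
    relative_with_t1d: bool, hla_genotype: str, minor_allele_counts: List[int]
) -> List[int]:
    """Binarize family history, HLA genotype (DR3/DR4 vs others) and SNPs
    (no copy of the minor allele vs one or two copies)."""
    family_history = int(relative_with_t1d)
    hla = int(hla_genotype == "DR3/DR4")  # hypothetical string encoding
    snps = [int(count >= 1) for count in minor_allele_counts]
    return [family_history, hla] + snps


def build_sequence(
    expression_by_step: List[List[float]], static_features: List[int]
) -> List[List[float]]:
    """Replicate the static features at every time step of the series."""
    return [list(step) + list(static_features) for step in expression_by_step]


# Toy participant: two genes per time step, four SNPs (illustrative values).
static = binarize_static(True, "DR3/DR4", [0, 1, 2, 0])
expression = [[0.0, 1.0] for _ in TIME_STEPS_MONTHS]
sequence = build_sequence(expression, static)

print(len(sequence), len(sequence[0]))
```

Each participant then becomes a fixed-length sequence of identical width, which is the shape a recurrent classifier such as an LSTM expects.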
Deep Learning-Based Imputation Model
Incomplete gene expression is imputed from SNPs in the imputation model (DNN). The completed gene expression, SNPs, HLA genotype and family history are then fed into the classifier (LSTM) to predict IA-positive and IA-negative participants. To impute a time step, participants with available gene expression at that time step are separated and randomly divided into training and validation sets with a 70-30 split ratio. All other participants, who lack gene expression at that time step, are treated as the test set. The training samples' SNP data and gene expression are used as input and output, respectively, to train the model.
SNP data are available for all participants in the study, whereas gene expression is available for only a subset of them. For the imputation of a given time step, we first train an autoencoder to find a lower-dimensional representation, of a fixed embedding size, of the transposed gene expression observed at that time step. This embedding captures the properties of each feature in the observed data at that time step and is later used in equation (7) to guide the synthetic data generation. Both the encoder and the decoder are five-layer feed-forward neural networks. The encoder finds the lower-dimensional embedding of the original data, and the decoder tries to reconstruct the original data from that embedding. The autoencoder is trained with the reconstruction loss for 100 epochs using the Adam optimizer and a learning rate of 0.0001. The outputs of the encoder and decoder are given by equations (1) and (2), respectively, and the network is trained following equation (3). Incomplete gene expression is imputed using autoencoders and a multilayer perceptron (MLP). Then we move forward to the imputation of missing values. It consi…
Experimental Results
Comprehensive experiments on TEDDY datasets show that:
- Our pipeline can effectively integrate synthetic gene expression with family history, HLA genotype and SNPs to better predict IA status at 2 years (sensitivity 0.622, AUC 0.715) compared with the individual datasets and state-of-the-art results in the literature (AUC 0.682).
- The synthetic gene expression contains predictive signals as strong as the true gene expression, reducing reliance on expensive and long-term longitudinal data collection.
- Time series gene expression is crucial to the proposed improvement and shows significantly better predictive ability than cross-sectional gene expression.
- Our pipeline is robust to limited data availability.
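For readers less familiar with the two metrics reported above, the sketch below shows how sensitivity and AUC can be computed from labels and predicted scores. The labels, scores and the 0.5 decision threshold are made up for illustration and are unrelated to the TEDDY results; the AUC uses the pairwise-ranking formulation (the probability that a random positive scores higher than a random negative, with ties counted as half).

```python
from typing import Sequence


def sensitivity(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of truly positive cases that are predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return true_pos / (true_pos + false_neg)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = IA positive) and classifier scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1]
predictions = [int(s >= 0.5) for s in scores]  # assumed threshold of 0.5

print(f"sensitivity = {sensitivity(labels, predictions):.3f}")
print(f"AUC = {auc(labels, scores):.3f}")
```

Both functions assume that at least one positive and one negative case are present; otherwise the denominators are zero.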

