Deep Learning for 3D Denoising: A Comprehensive Guide with Arterial Spin Labeling (ASL) Application
Introduction
The field of image denoising has witnessed significant advancements with the advent of deep learning (DL), particularly in the realm of 3D imaging. This article delves into the application of DL techniques, specifically convolutional neural networks (CNNs) and transformers, for denoising 3D images, with a focus on arterial spin labeling (ASL) MRI data. ASL is a non-invasive MRI technique that uses blood water as an endogenous tracer to measure cerebral blood flow (CBF). However, a major challenge in ASL is the low signal-to-noise ratio (SNR) due to the small amount of blood delivered per unit of time. DL-based denoising methods offer a promising avenue to improve ASL image quality and reduce scan time.
Arterial Spin Labeling (ASL) and the Need for Denoising
In a typical ASL experiment, two types of images are acquired: label and control. Label images are acquired with the arterial magnetization labeled, while control images are acquired with the arterial magnetization unperturbed. A delay is typically required to allow the arterial blood to reach the tissue/capillaries before image acquisition. After subtracting the label and control images, the tissue signal is canceled out, and the signal difference, or ASL signal, is proportional to the amount of arterial blood delivered during the measurement.
One of the major challenges in ASL is the low signal‐to‐noise ratio (SNR), because only a small amount of blood is delivered per unit of time. Even in the brain, which is a highly perfused organ due to high metabolic demands, only about 1% of the volume is replaced by freshly delivered arterial blood, equivalent to a 1% MR signal change. As a result, when the tissue signal is subtracted out, its fluctuations remain and scale with the tissue signal before subtraction, leading to an SNR‐challenging situation. ASL acquisition is usually repeated several times for averaging, which requires a relatively long scan time (e.g., 3-5 min) to reach an acceptable SNR level.
Deep Learning-Based Denoising Pipeline for ASL Data
The overall denoising processing for ASL data typically involves the following steps:
Raw ASL Image Generation: Raw ASL images are generated after subtraction of the label and control images, resulting in Nt time points as the input.
Pre-Averaging: Every Nav time points are averaged in the pre-averaging step to produce ASL images of different levels of SNR. Averaging of ASL images before DL denoising was more advantageous than averaging after.
DL Denoising: These Nt/Nav images are used to train different DL denoising models and to produce the corresponding individually denoised images in the testing/inference phase.
Post-Averaging: The DL-denoised images will be further averaged in the post-averaging step, to produce the final ASL image.
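The four steps above can be sketched end to end. The snippet below is a minimal illustration, not the study's implementation; `denoise_model` is a hypothetical stand-in for a trained network, and the array shapes are illustrative.

```python
import numpy as np

def asl_denoising_pipeline(raw_asl, denoise_model, n_av):
    """Pre-averaging -> per-image DL denoising -> post-averaging.

    raw_asl: array of shape (Nt, X, Y, Z), label/control difference images.
    denoise_model: callable mapping one 3D image to its denoised version
                   (hypothetical stand-in for a trained network).
    n_av: number of time points averaged in the pre-averaging step.
    """
    n_t = raw_asl.shape[0]
    n_groups = n_t // n_av
    # Pre-averaging: average every n_av consecutive time points.
    pre_avg = raw_asl[:n_groups * n_av].reshape(
        n_groups, n_av, *raw_asl.shape[1:]).mean(axis=1)
    # DL denoising: denoise each pre-averaged image individually.
    denoised = np.stack([denoise_model(img) for img in pre_avg])
    # Post-averaging: average the denoised images into the final ASL image.
    return denoised.mean(axis=0)

# Example with an identity "model" on synthetic data:
rng = np.random.default_rng(0)
raw = rng.normal(size=(32, 8, 8, 4))
final = asl_denoising_pipeline(raw, lambda x: x, n_av=4)
print(final.shape)  # (8, 8, 4)
```
With the identity model, pre- and post-averaging together reduce to a plain average over all time points; the DL step is what breaks that equivalence in practice.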
Key Considerations for Optimizing DL-Based Denoising
Several factors influence the performance of DL-based denoising for ASL data, including:
The Impact of Including Calibration Scans (M0)
Including M0 was almost always beneficial, with a dependence on the SNR of the input ASL images. Including additional contrasts such as T1w and T2w anatomical information is advantageous in end‐to‐end DL tasks. In this study, we limited the use of anatomical information to only include the M0 images (i) to maximize the method's utility, as almost all ASL scans have M0, but other anatomical images (e.g., T1w) may not be readily available; and (ii) to minimize the confounding contribution from non-ASL information in examining the denoising behaviors of DL models. The DL models were trained and tested without and with the M0 images included, under different averaging strategies.
Averaging Strategies: Windowed vs. Interleaved
Windowed averaging outperformed interleaved averaging, supporting the practice of reducing scan time. To gain a better understanding of the effects of averaging strategies and corresponding temporal noise patterns, we further explored two different averaging approaches at the pre‐averaging stage: interleaved and windowed averaging, as shown in Figure 1B, with Nav_pre = 2, 4, 8, and 16. Interleaved averaging was studied to test whether including the information of perfusion signal fluctuations spanning a wider range of time is beneficial, although windowed averaging is more reflective of the actual acquisition practice, especially with scan time reduction. Comparing these two would help determine whether and how much reducing scan time in practice, such as with windowed averaging, will compromise the accuracy of perfusion estimation when the observation period is limited.
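The two groupings differ only in which time points are averaged together. The index construction below is a minimal sketch of that difference (the grouping logic is inferred from the description above, not taken from the study's code):

```python
import numpy as np

def averaging_indices(n_t, n_av, mode):
    """Return index groups for pre-averaging.

    'windowed'    groups consecutive time points (mimics a shortened scan);
    'interleaved' groups time points spread evenly across the whole scan.
    """
    idx = np.arange(n_t)
    n_groups = n_t // n_av
    if mode == "windowed":
        return idx[:n_groups * n_av].reshape(n_groups, n_av)
    elif mode == "interleaved":
        return idx[:n_groups * n_av].reshape(n_av, n_groups).T
    raise ValueError(mode)

print(averaging_indices(8, 4, "windowed"))
# [[0 1 2 3]
#  [4 5 6 7]]
print(averaging_indices(8, 4, "interleaved"))
# [[0 2 4 6]
#  [1 3 5 7]]
```
Each interleaved group spans the full observation period, whereas each windowed group corresponds to a contiguous chunk of scan time, which is why windowed averaging is the realistic model for scan-time reduction.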
SNR Matching Between Training and Inference Data
Matching the SNR levels of the images in training and inferencing was important for optimal performance. The effect of including calibration scans (M0) was studied and compared across images of different levels of signal‐to‐noise ratio (SNR). To more closely study the effect of SNR on denoised image‐quality improvement, the ASL images were also dichotomized into low‐SNR and high‐SNR groups using k‐means clustering on SSIM and normalized mean absolute error (NMAE) of the input images at the single‐time‐point level (Figure 1C) for separate comparisons. The results from this experiment were also used to guide the following experimental design.
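The dichotomization step clusters per-image (SSIM, NMAE) pairs into two groups. A minimal two-cluster k-means, written from scratch here for self-containment (the study would typically use a library implementation), could look like this:

```python
import numpy as np

def kmeans_2(features, n_iter=50, seed=0):
    """Minimal 2-cluster k-means, sketching the SNR dichotomization.

    features: (N, 2) array of per-image (SSIM, NMAE) values (illustrative).
    Returns labels in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=2, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers; keep old center if a cluster goes empty.
        new = np.array([features[labels == k].mean(axis=0)
                        if (labels == k).any() else centers[k]
                        for k in (0, 1)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

# Two synthetic groups: high-SNR (high SSIM, low NMAE) vs low-SNR.
feats = np.vstack([
    np.random.default_rng(1).normal([0.9, 0.1], 0.02, size=(10, 2)),
    np.random.default_rng(2).normal([0.5, 0.4], 0.02, size=(10, 2)),
])
labels = kmeans_2(feats)
print(labels)
```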
Data Augmentation
Data augmentation was used in the training phase, including flipping along x and y, in‐plane transpose, random in‐plane shift (x and y), and rotation (−45° to 45°).
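The listed transforms can be composed per training volume. The sketch below uses `scipy.ndimage` for the shift and rotation; the ±4-voxel shift range and the 50% flip probabilities are illustrative assumptions, not parameters from the study.

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment(img, rng):
    """Random augmentation matching the list above (illustrative sketch).

    img: 3D array (x, y, z); all transforms act in-plane (x, y) only.
    """
    if rng.random() < 0.5:                    # flip along x
        img = img[::-1, :, :]
    if rng.random() < 0.5:                    # flip along y
        img = img[:, ::-1, :]
    if rng.random() < 0.5:                    # in-plane transpose
        img = img.transpose(1, 0, 2)
    dx, dy = rng.integers(-4, 5, size=2)      # random in-plane shift (assumed range)
    img = shift(img, (dx, dy, 0), order=1, mode="nearest")
    angle = rng.uniform(-45, 45)              # in-plane rotation, -45 to 45 degrees
    img = rotate(img, angle, axes=(0, 1), reshape=False, order=1, mode="nearest")
    return img

rng = np.random.default_rng(0)
vol = rng.normal(size=(64, 64, 32))
print(augment(vol, rng).shape)  # (64, 64, 32)
```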
Normalization
The ASL images were normalized by a subject-wise scaling factor before training; the scaling factor was recorded and could be reapplied to the denoised ASL images to recover the original intensity levels. This normalization method (i) preserves the quantification capability of the ASL data; (ii) removes potential internal mean shift in the input distribution (i.e., when combined with the batch normalization method and the nonlinearity of the activation functions in DL training, it helps improve the training stability); and (iii) allows direct pooling and comparison of the results across subjects, as the cerebral blood flow (CBF) across subjects and the raw ASL signal acquired on different scanners can vary significantly.
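The key property is invertibility: the factor is stored so quantification can be recovered after denoising. The exact statistic used for scaling is an assumption in this sketch (mean absolute intensity); the source does not specify it here.

```python
import numpy as np

def normalize(asl, eps=1e-12):
    """Scale ASL intensities to a common range and record the factor.
    The scaling statistic (mean absolute intensity) is an assumption."""
    scale = np.abs(asl).mean() + eps
    return asl / scale, scale

def denormalize(denoised, scale):
    """Reapply the recorded factor to restore original intensity levels,
    preserving the quantification capability of the ASL data."""
    return denoised * scale

rng = np.random.default_rng(0)
asl = 37.0 * rng.normal(size=(8, 8, 4))   # arbitrary scanner-dependent scale
norm, s = normalize(asl)
restored = denormalize(norm, s)
print(np.allclose(restored, asl))  # True
```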
Network Architectures for 3D Denoising
3D U-Net
The 3D U-Net architecture is a popular choice for 3D image denoising tasks. It consists of an encoder path that downsamples the input image to extract features at multiple scales, and a decoder path that upsamples the features to reconstruct the denoised image. Skip connections link corresponding layers in the encoder and decoder paths, which helps to preserve fine-grained details. In the encoder, blocks of 3D convolution layers were used before a max-pooling layer, which reduces the dimension of the 3D images by half along each dimension. The initial number of output channels is set to 16 for the first 3D convolution at the highest resolution and then doubles for the first layer of 3D convolution at each reduced resolution. These convolution/pooling blocks were repeated 4 times, resulting in a tensor size of 4 × 4 × 2 × 128 at the central layer, where a residual connection was applied after a three-layer convolution. Symmetrically along the decoder arm, three layers of 3D convolution were applied before the resolution is doubled using a strided (transposed) convolution with a stride size of 2. The up-scaled tensors were then concatenated, via skip connections, with the corresponding tensors from the encoder arm before the next up-scaling block. Dropout layers were included with a dropout rate of 0.05. The final activation function was linear after combining all the channels (64 × 64 × 32 × 16) to produce the final denoised images at a resolution of 64 × 64 × 32. When M0 images (64 × 64 × 32) were included in the models, they were treated as an additional input channel.
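The spatial sizes stated above can be checked with simple arithmetic: starting from a 64 × 64 × 32 input, each max-pooling halves every dimension, so four poolings give the 4 × 4 × 2 central resolution.

```python
# Spatial-size walkthrough of the encoder described above.
shape = (64, 64, 32)
for level in range(1, 5):               # conv block + max-pooling, 4 times
    shape = tuple(s // 2 for s in shape)
    print(f"after pooling {level}: {shape}")
# after pooling 4: (4, 4, 2)
```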
Vision Transformers (ViT)
Vision Transformers (ViT) have emerged as a powerful alternative to CNNs for various computer vision tasks, including image denoising. ViTs divide an image into patches and treat them as tokens, similar to words in a sentence. Self-attention mechanisms are then used to learn relationships between the tokens, allowing the model to capture long-range dependencies in the image.
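The tokenization step extends naturally to volumes: a 3D image is split into non-overlapping 3D patches, each flattened into a token vector. The patch size below is illustrative, not from the source.

```python
import numpy as np

def patchify_3d(vol, patch=(8, 8, 8)):
    """Split a 3D volume into non-overlapping patches and flatten each into
    a token vector, as in a (3D) ViT front end. Patch size is illustrative."""
    px, py, pz = patch
    x, y, z = vol.shape
    assert x % px == 0 and y % py == 0 and z % pz == 0
    tokens = (vol.reshape(x // px, px, y // py, py, z // pz, pz)
                 .transpose(0, 2, 4, 1, 3, 5)      # gather patch axes together
                 .reshape(-1, px * py * pz))       # one row per token
    return tokens

vol = np.zeros((64, 64, 32))
tokens = patchify_3d(vol)
print(tokens.shape)  # (256, 512) -- 8*8*4 tokens of 8*8*8 voxels each
```
Self-attention then operates on these 256 tokens, so every token can attend to every other, which is how the model captures long-range dependencies across the volume.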
Evaluation Metrics
The quality of the ASL images can be evaluated using several commonly used metrics, including:
Structural Similarity Index (SSIM): SSIM measures the similarity between two images in terms of luminance, contrast, and structure.
Peak Signal-to-Noise Ratio (PSNR): PSNR measures the ratio between the maximum possible power of a signal and the power of corrupting noise.
Normalized Mean Absolute Error (NMAE): NMAE measures the average absolute difference between the predicted and ground truth values, normalized by the mean of the ground truth values.
We did not perform CBF quantification because of the large variability of absolute CBF in different subjects. Instead, NMAE can serve as the indicator for perfusion quantification error, where the percentage perfusion error can be directly calculated (i.e., NMAE is equivalent to a scaled absolute CBF percentage error, given the normalization described above).
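The three metrics above are straightforward to compute. In the sketch below, PSNR and NMAE follow their standard definitions; the SSIM shown is a single-window (global) simplification, whereas practical SSIM is computed over local windows.

```python
import numpy as np

def psnr(gt, pred):
    """Peak SNR in dB, using the ground-truth maximum as the peak value."""
    mse = np.mean((gt - pred) ** 2)
    return 10 * np.log10(gt.max() ** 2 / mse)

def nmae(gt, pred):
    """Mean absolute error normalized by the mean of the ground truth,
    i.e. a scaled absolute percentage error of the perfusion signal."""
    return np.mean(np.abs(gt - pred)) / np.mean(np.abs(gt))

def ssim_global(gt, pred, c1=1e-4, c2=9e-4):
    """Single-window (global) SSIM -- a simplified sketch of the metric."""
    mu_x, mu_y = gt.mean(), pred.mean()
    var_x, var_y = gt.var(), pred.var()
    cov = ((gt - mu_x) * (pred - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2) /
            ((mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)))

rng = np.random.default_rng(0)
gt = rng.random((8, 8, 4))
noisy = gt + 0.05 * rng.normal(size=gt.shape)
print(f"PSNR={psnr(gt, noisy):.1f} dB, "
      f"NMAE={nmae(gt, noisy):.3f}, SSIM={ssim_global(gt, noisy):.3f}")
```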
Generalizability of DL-Based Denoising
The generalizability of DL-based denoising refers to its ability to perform well on data that is different from the data it was trained on. In the context of ASL denoising, generalizability is important because ASL data can vary significantly depending on the scanner, acquisition parameters, and patient population.
To examine the generalizability of DL denoising, experiments can be performed using low-SNR ground truth in training. Can the DL denoising approach improve ASL image quality beyond that of the ground truth used in training? To answer this, the training data and the “ground truth” image were restricted to a subset of the data (i.e., using 25% and 50% of the total time points, corresponding to GT25% and GT50%, or 8 and 16 time points, respectively); in the testing phase, the previous GT images, GT25% and GT50% were used as the input, and then the denoised images were compared with the real GT (GT100%) to test whether the DL denoising performance extends beyond the training (i.e., the generalizability of DL‐based denoising).
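Constructing the reduced ground truths amounts to averaging a subset of the time points. In the sketch below, taking the *first* 25% or 50% of time points is an assumption for illustration; the source only specifies the subset sizes.

```python
import numpy as np

def ground_truths(raw_asl):
    """Build GT25%, GT50%, and GT100% references by averaging the first
    25%, 50%, and all time points (subset selection is an assumption).
    With Nt = 32, this uses 8, 16, and 32 time points, respectively."""
    n_t = raw_asl.shape[0]
    return {f"GT{p}%": raw_asl[: int(n_t * p / 100)].mean(axis=0)
            for p in (25, 50, 100)}

rng = np.random.default_rng(0)
raw = rng.normal(size=(32, 8, 8, 4))
gts = ground_truths(raw)
print(sorted(gts))  # ['GT100%', 'GT25%', 'GT50%']
```
Training then targets GT25% or GT50%, and the test-phase denoised outputs are compared against GT100% to see whether performance extends beyond the training target.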
Dual Denoising: A Robust Framework for 3D Representation Learning
Recent research has focused on learning robust and well-generalized 3D representations from pre-trained vision language models such as CLIP. One promising approach is Dual Denoising, a novel framework that combines a denoising-based proxy task with a novel feature denoising network for 3D pre-training. Additionally, parallel noise inference can be utilized to enhance the generalization of point cloud features under cross-domain settings.
The overview of the robust distillation algorithm is shown in Figure 4. It is composed of two branches. The upper branch, named the Point Denoising AutoEncoder (PointDAE), denoises the point cloud following a pre-defined noise scheduler and serves as the proxy task. The lower branch performs feature denoising: it converts standard Gaussian noise into CLIP features hierarchically, under the guidance of the point cloud features from the upper branch. The two branches are densely connected by multiple cross-attention modules, and a stop-gradient operation [36] is applied between them during pre-training to avoid representation collapse.
Diffusion Models for Image Denoising
Diffusion models are a class of generative models that have shown remarkable results in image generation and denoising. They work by gradually adding noise to an image until it becomes pure noise, and then learning to reverse this process to reconstruct the original image. Diffusion models are fundamentally different from all the previous generative methods. Intuitively, they aim to decompose the image generation process (sampling) in many small “denoising” steps. The intuition behind this is that the model can correct itself over these small steps and gradually produce a good sample.
Forward Diffusion Process
The forward diffusion process involves gradually adding Gaussian noise to an input image x0 through a series of T steps. This process can be modeled as a Markov chain, where each step depends only on the previous one.
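Because each step adds independent Gaussian noise, x_t can be sampled directly from x_0 in closed form: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, where ᾱ_t is the cumulative product of (1 − β_t). A minimal sketch with the common linear β schedule (schedule values are illustrative):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # a common linear schedule
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))
x_mid = forward_diffuse(x0, 500, betas, rng)
x_end = forward_diffuse(x0, T - 1, betas, rng)
# By t = T-1 nearly all signal is gone (alpha_bar is ~4e-5 here),
# so x_end is essentially pure Gaussian noise.
print(np.cumprod(1 - betas)[-1] < 1e-4)  # True
```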
Reverse Diffusion Process
The reverse diffusion process involves learning to reverse the noising process to reconstruct the original image. This is typically done using a neural network that is trained to predict the noise added at each step of the forward process.
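One ancestral sampling step of the reverse process can be sketched as follows. The noise predictor here is a zero-returning placeholder, standing in for the trained network, so the loop shows only the mechanics of the update; σ_t² = β_t is one common variance choice.

```python
import numpy as np

def reverse_step(x_t, t, betas, eps_pred, rng):
    """One reverse step:
    x_{t-1} = (x_t - beta_t/sqrt(1-alpha_bar_t) * eps_hat) / sqrt(alpha_t)
              + sigma_t * z,   with sigma_t^2 = beta_t (one common choice).
    eps_pred stands in for the trained noise-prediction network."""
    alpha_t = 1.0 - betas[t]
    alpha_bar_t = np.cumprod(1.0 - betas)[t]
    eps_hat = eps_pred(x_t, t)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_t)
    if t == 0:                      # no noise is added at the final step
        return mean
    z = rng.normal(size=x_t.shape)
    return mean + np.sqrt(betas[t]) * z

# Placeholder "network" predicting zero noise, to show the mechanics only.
dummy_eps = lambda x, t: np.zeros_like(x)
betas = np.linspace(1e-4, 0.02, 1000)
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))         # start from pure noise
for t in range(999, -1, -1):
    x = reverse_step(x, t, betas, dummy_eps, rng)
print(x.shape)  # (8, 8)
```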
Conditional Image Generation: Guided Diffusion
A crucial aspect of image generation is conditioning the sampling process to manipulate the generated samples. Here, this is also referred to as guided diffusion. To turn a diffusion model $p_\theta$ into a conditional diffusion model, we can add conditioning information $y$ at each diffusion step:

$$p_\theta(\mathbf{x}_{0:T} \mid y) = p_\theta(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t, y)$$

The fact that the conditioning is seen at each timestep may be a good justification for the excellent samples from a text prompt.
tags: #3d #denoising #machine #learning #vit #tutorial

