3D Denoising with Machine Learning: A ViT and Hugging Face Tutorial
Diffusion models have emerged as a powerful class of generative models, capable of producing diverse and high-resolution images. These models have garnered significant attention, especially after successful large-scale training endeavors by organizations such as OpenAI, Nvidia, and Google. Architectures like GLIDE, DALLE-2, Imagen, and the open-source Stable Diffusion are based on diffusion models.
The core idea behind diffusion models involves breaking down the image generation (sampling) process into a series of small "denoising" steps. This iterative approach allows the model to refine its representation gradually, leading to better sample quality. While this refinement strategy shows promise, the iterative nature of diffusion models can make them slower at sampling compared to other generative models like GANs.
The Diffusion Process
The fundamental concept behind diffusion models is relatively straightforward. The process begins with an input image $\mathbf{x}_0$, to which Gaussian noise is gradually added over a series of $T$ steps. This is referred to as the forward process. It's important to note that this forward process is distinct from the forward pass in a neural network. The forward process generates the targets for the neural network: the image after applying $t$ noise steps.
Subsequently, a neural network is trained to reverse the noising process and recover the original data. By effectively modeling this reverse process, the model can generate new data. This is known as the reverse diffusion process, which forms the basis for sampling in generative models.
Forward Diffusion: A Markov Chain
Diffusion models can be understood as latent variable models, utilizing a hidden continuous feature space. While they share similarities with variational autoencoders (VAEs), diffusion models are typically formulated using a Markov chain of T steps. The Markov chain assumption implies that each step depends only on the preceding one. Unlike flow-based models, diffusion models are not restricted to a specific type of neural network.
Given a data point $\mathbf{x}_0$ sampled from the real data distribution $q(x)$ ($\mathbf{x}_0 \sim q(x)$), the forward diffusion process is defined by adding noise. At each step of the Markov chain, Gaussian noise with variance $\beta_t$ is added to $\mathbf{x}_{t-1}$, producing a new latent variable $\mathbf{x}_t$ with distribution $q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$.
Mathematically, this can be expressed as:

$$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1}, \beta_t\mathbf{I})$$
Since we are in the multi-dimensional scenario, $\mathbf{I}$ is the identity matrix, indicating that each dimension has the same variance $\beta_t$. Note that $q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$ is still a normal distribution, defined by its mean $\boldsymbol{\mu}_t = \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1}$ and variance $\boldsymbol{\Sigma}_t = \beta_t\mathbf{I}$. $\boldsymbol{\Sigma}$ will always be a diagonal matrix of variances (here $\beta_t$).
The posterior probability of the whole trajectory is defined as:

$$q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$$
Using this formulation, one can move from the input data $\mathbf{x}_0$ to $\mathbf{x}_T$ in a tractable manner. Defining $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, a sample $\mathbf{x}_t$ can be produced directly from $\mathbf{x}_0$ using the following distribution:

$$\mathbf{x}_t \sim q(\mathbf{x}_t \vert \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})$$
Since $\beta_t$ is a hyperparameter, $\alpha_t$ and $\bar{\alpha}_t$ can be precomputed for all timesteps. This enables sampling noise at any timestep $t$ and obtaining $\mathbf{x}_t$ in a single step.
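The closed-form sampling above can be sketched in a few lines of numpy. This is a minimal illustration, assuming the linear $\beta$ schedule used in DDPM; the variable names are my own.

```python
import numpy as np

# Precompute the schedule quantities: beta_t, alpha_t = 1 - beta_t, and
# alpha_bar_t = prod_{s<=t} alpha_s (linear DDPM schedule assumed).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Draw x_t ~ q(x_t | x_0) in a single step (t is 0-indexed)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = np.ones((8, 8))           # toy "image"
x_noisy = q_sample(x0, t=999)  # at the last step, close to pure Gaussian noise
```

Because $\bar{\alpha}_t$ shrinks toward zero as $t$ grows, the sample is dominated by noise at late timesteps, which is exactly why the reverse process can start from pure Gaussian noise.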
Variance Schedule
The variance parameter $\beta_t$ can be either fixed to a constant value or defined as a schedule over the $T$ timesteps. This variance schedule can take various forms, such as linear, quadratic, or cosine. The original DDPM authors employed a linear schedule, increasing from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$.
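Two of these schedules can be sketched as follows. The linear form matches the DDPM values quoted above; the cosine form follows the later "improved DDPM" proposal (an assumption here, not part of the original DDPM paper), which defines $\bar{\alpha}_t$ via a cosine curve and recovers per-step $\beta_t$ from consecutive ratios.

```python
import numpy as np

def linear_beta_schedule(T, beta_1=1e-4, beta_T=0.02):
    """Linear schedule from the DDPM paper."""
    return np.linspace(beta_1, beta_T, T)

def cosine_beta_schedule(T, s=0.008):
    """Cosine schedule: define alpha_bar directly, then back out betas."""
    steps = np.arange(T + 1)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0, 0.999)  # clip to avoid degenerate final steps

lin = linear_beta_schedule(1000)
cos = cosine_beta_schedule(1000)
```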
Reverse Diffusion: Reversing the Noise
As $T \to \infty$, the latent variable $\mathbf{x}_T$ approaches an isotropic Gaussian distribution. The goal is to learn the reverse distribution $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ so that we can sample $\mathbf{x}_T$ from $\mathcal{N}(\mathbf{0},\mathbf{I})$, run the reverse process, and obtain a sample from $q(\mathbf{x}_0)$, effectively generating a new data point from the original data distribution.
Approximating the Reverse Process with a Neural Network
In practice, $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ is intractable, as estimating it would require computations involving the entire data distribution. Therefore, we approximate it with a parameterized model $p_\theta$, such as a neural network.
By applying the reverse formula for all timesteps ($p_\theta(\mathbf{x}_{0:T})$, also called the trajectory), we can move from $\mathbf{x}_T$ back to the data distribution:

$$p_\theta(\mathbf{x}_{0:T}) = p_\theta(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$$
By conditioning the model on the timestep $t$, it learns to predict the Gaussian parameters (the mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ and the covariance matrix $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$) for each timestep.
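A single reverse step can be sketched as below. This assumes the common DDPM parameterization in which the network predicts the noise $\boldsymbol{\epsilon}_\theta$ and the variance is fixed to $\beta_t$ (the simpler of the two choices in the paper); `eps_theta` here is a placeholder for a trained network.

```python
import numpy as np

# Schedule quantities (linear DDPM schedule assumed).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def p_sample(x_t, t, eps_theta, rng=np.random.default_rng(0)):
    """One reverse step x_t -> x_{t-1}, given the predicted noise eps_theta."""
    # Posterior mean implied by the predicted noise:
    mean = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_theta) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```

A full sampler would start from $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ and apply `p_sample` for $t = T-1, \dots, 0$.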
Training a Diffusion Model
The combination of the forward process qq and the reverse process pp resembles a variational autoencoder (VAE). The model can be trained by optimizing the negative log-likelihood of the training data.
The loss function can be decomposed into three terms:
- Reconstruction Term: $\mathbb{E}_{q(\mathbf{x}_1 \vert \mathbf{x}_0)} [\log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)]$, similar to the reconstruction term of the ELBO in a VAE. This term is learned using a separate decoder.
- Gaussian Distance: $D_{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \,\|\, p(\mathbf{x}_T))$, which measures how close $\mathbf{x}_T$ is to the standard Gaussian. This term has no trainable parameters and is often ignored during training.
- Denoising Steps: $\sum_{t=2}^{T} L_{t-1}$, also referred to as $L_t$, where each term measures the difference between the tractable denoising step $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$ and the learned one $p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$. Maximizing the likelihood involves learning these denoising steps $L_t$.
By conditioning on $\mathbf{x}_0$, the reverse diffusion step $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$ becomes tractable. This allows sampling $\mathbf{x}_t$ at noise level $t$ conditioned on $\mathbf{x}_0$.
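Conditioning on $\mathbf{x}_0$ yields a Gaussian posterior with a closed-form mean and variance. The sketch below computes those quantities following the standard DDPM algebra; the coefficient names are my own.

```python
import numpy as np

# Schedule quantities, plus alpha_bar_{t-1} with the convention alpha_bar_0 = 1.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
alpha_bars_prev = np.concatenate([[1.0], alpha_bars[:-1]])

def q_posterior(x0, x_t, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) (t is 0-indexed)."""
    coef0 = np.sqrt(alpha_bars_prev[t]) * betas[t] / (1 - alpha_bars[t])
    coef_t = np.sqrt(alphas[t]) * (1 - alpha_bars_prev[t]) / (1 - alpha_bars[t])
    mean = coef0 * x0 + coef_t * x_t
    var = (1 - alpha_bars_prev[t]) / (1 - alpha_bars[t]) * betas[t]
    return mean, var
```

As a sanity check, at $t = 0$ the posterior collapses onto $\mathbf{x}_0$ with zero variance.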
Simplified Loss Function
Instead of directly predicting the mean of the distribution, the model can predict the noise ϵ\boldsymbol{\epsilon} at each timestep tt. A simplified loss term can be used:
$$L_t^\text{simple} = \mathbb{E}_{\mathbf{x}_0, t, \boldsymbol{\epsilon}} \Big[ \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\, t) \|^2 \Big]$$
Optimizing this simplified objective has been shown to outperform optimizing the original ELBO.
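One training step under this simplified objective can be sketched as follows, using a toy stand-in for the noise-prediction network $\boldsymbol{\epsilon}_\theta$ (a real implementation would use a U-Net and backpropagate through the loss).

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    """Placeholder network: always predicts zero noise."""
    return np.zeros_like(x_t)

def simple_loss(x0, rng=np.random.default_rng(0)):
    """One Monte Carlo sample of L_t^simple for a single example."""
    t = rng.integers(T)                       # sample a random timestep
    eps = rng.standard_normal(x0.shape)       # sample the target noise
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - eps_theta(x_t, t)) ** 2)

loss = simple_loss(np.ones((4, 4)))
```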
Architecture: U-Net
The model architecture typically used in diffusion models is a U-Net. A U-Net is a symmetric architecture with input and output of the same spatial size. It uses skip connections between encoder and decoder blocks of corresponding feature dimension. The input image is first downsampled and then upsampled to reach its initial size.
In the original DDPM implementation, the U-Net consists of Wide ResNet blocks, group normalization, and self-attention blocks. The diffusion timestep tt is specified by adding a sinusoidal position embedding into each residual block.
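The sinusoidal timestep embedding can be sketched as below, analogous to the positional encoding of the original Transformer; the half-sine/half-cosine split and the frequency base of 10000 are common conventions, assumed here.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000):
    """Map a scalar timestep t to a dim-dimensional sinusoidal embedding."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = timestep_embedding(t=50, dim=128)  # one 128-d vector per timestep
```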
Conditional Image Generation: Guided Diffusion
Conditional image generation, also known as guided diffusion, involves conditioning the sampling process to manipulate the generated samples. This can be achieved by incorporating image embeddings into the diffusion process to "guide" the generation. Guidance involves conditioning a prior data distribution p(x)p(\textbf{x}) with a condition yy, such as a class label or an image/text embedding, resulting in p(x∣y)p(\textbf{x}|y).
To turn a diffusion model $p_\theta$ into a conditional diffusion model, conditioning information $y$ can be added at each diffusion step:

$$p_\theta(\mathbf{x}_{0:T} \vert y) = p_\theta(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t, y)$$
Guided diffusion models aim to learn $\nabla \log p_\theta(\mathbf{x}_t \vert y)$. With a guidance scale $s$, this can be expressed as:

$$\nabla \log p_\theta(\mathbf{x}_t \vert y) = \nabla \log p_\theta(\mathbf{x}_t) + s \cdot \nabla \log p_\theta(y \vert \mathbf{x}_t)$$
Classifier Guidance
Classifier guidance uses a second model, a classifier $f_\phi(y \vert \mathbf{x}_t, t)$, to steer the diffusion toward the target class $y$. The classifier is trained on noisy images $\mathbf{x}_t$ to predict their class $y$; at sampling time, the gradients $\nabla \log f_\phi(y \vert \mathbf{x}_t)$ are used to guide the diffusion.
A class-conditional diffusion model can be built with mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t \vert y)$ and variance $\boldsymbol{\Sigma}_\theta(\mathbf{x}_t \vert y)$. The mean is perturbed by the gradient of $\log f_\phi(y \vert \mathbf{x}_t)$ for class $y$, resulting in:

$$\hat{\boldsymbol{\mu}}(\mathbf{x}_t \vert y) = \boldsymbol{\mu}_\theta(\mathbf{x}_t \vert y) + s \cdot \boldsymbol{\Sigma}_\theta(\mathbf{x}_t \vert y)\, \nabla_{\mathbf{x}_t} \log f_\phi(y \vert \mathbf{x}_t, t)$$
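The mean perturbation amounts to one line of arithmetic, sketched below. `classifier_grad` stands in for $\nabla_{\mathbf{x}_t} \log f_\phi(y \vert \mathbf{x}_t, t)$, which a real implementation would obtain by backpropagating through the classifier.

```python
import numpy as np

def guided_mean(mu, sigma_diag, classifier_grad, s=1.0):
    """Shift the predicted mean by the scaled classifier gradient.

    sigma_diag is the diagonal of Sigma_theta (the covariance is diagonal).
    """
    return mu + s * sigma_diag * classifier_grad

mu = np.zeros(4)
sigma_diag = 0.01 * np.ones(4)          # placeholder diagonal variances
grad = np.array([1.0, -1.0, 0.5, 0.0])  # placeholder classifier gradient
shifted = guided_mean(mu, sigma_diag, grad, s=2.0)
```

Larger $s$ pushes samples harder toward class $y$, typically trading diversity for fidelity.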
Classifier-Free Guidance
Classifier-free guidance can be defined as:

$$\nabla \log p(\mathbf{x}_t \vert y) = s \cdot \nabla \log p(\mathbf{x}_t \vert y) + (1-s) \cdot \nabla \log p(\mathbf{x}_t)$$
This approach avoids the need for a separate classifier model. Instead, a conditional diffusion model $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t \vert y)$ is trained together with an unconditional model $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t \vert 0)$ using the same neural network.
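At sampling time, the two predictions are combined. The sketch below uses the widely adopted noise-space form, which extrapolates from the unconditional prediction toward the conditional one by the guidance scale $s$ (equivalent to the score-space formula above up to the parameterization).

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, s):
    """Classifier-free guidance: blend conditional and unconditional noise."""
    return eps_uncond + s * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])  # placeholder conditional prediction
eps_u = np.array([0.0, 0.0])  # placeholder unconditional prediction
```

Note that $s = 1$ recovers the purely conditional prediction, $s = 0$ the unconditional one, and $s > 1$ amplifies the conditioning signal.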
Stable Diffusion
Stable Diffusion is a latent diffusion model that generates images from text prompts. It operates in a lower-dimensional latent space, reducing memory and compute requirements compared to pixel-space diffusion models.
The key components of Stable Diffusion are:
- VAE (Variational Autoencoder): The VAE has an encoder and a decoder. The encoder transforms images into latent representations, and the decoder converts denoised latents back into images.
- U-Net: The U-Net iteratively denoises latent image representations conditioned on text embeddings.
- Text-Encoder: The text-encoder transforms the input prompt into an embedding space that the U-Net can understand.
The Stable Diffusion model takes both a latent seed and a text prompt as input. The U-Net iteratively denoises the random latent image representation conditioned on the text embeddings. The output of the U-Net, being the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm.
Inference with Stable Diffusion
The Stable Diffusion model can be run in inference using the StableDiffusionPipeline. The number of inference steps and the guidance scale can be adjusted to control the quality and adherence to the prompt.
- num_inference_steps: Controls the number of denoising steps. A smaller number of steps results in faster generation but potentially lower quality.
- guidance_scale: Increases the adherence to the text prompt, potentially at the cost of image quality or diversity. Values between 7 and 8.5 are typically recommended.
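A minimal inference sketch with the Hugging Face `diffusers` library is shown below. The model id and defaults are illustrative assumptions; the imports are deferred inside the function so the heavy dependencies (torch, diffusers, and the model weights) are only needed when it is actually called.

```python
def generate_image(prompt, num_inference_steps=50, guidance_scale=7.5):
    """Sketch: text-to-image inference with StableDiffusionPipeline.

    Assumes a CUDA device and the (illustrative) model id below.
    """
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")
    result = pipe(
        prompt,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
    )
    return result.images[0]  # a PIL image
```

Lowering `num_inference_steps` (e.g. to 25) roughly halves generation time at some cost in quality, and `guidance_scale` trades prompt adherence against diversity, as described above.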
Vision Transformer (ViT) Models
Vision Transformers (ViTs) are transformer models adapted for image data. In ViTs, an image is divided into patches, which are then treated as tokens and fed into a transformer encoder. The self-attention mechanism in the transformer allows the model to learn relationships between different parts of the image.
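The "patchify" step that turns an image into tokens can be sketched as a pair of reshapes; the 224×224 image and 16×16 patch sizes below are the common ViT defaults, used here as an example.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an HxWxC image into non-overlapping flattened patches (tokens)."""
    H, W, C = img.shape
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    # Group the patch-grid axes together, then flatten each patch to a vector.
    tokens = img.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return tokens  # shape: (num_patches, patch*patch*C)

img = np.zeros((224, 224, 3))
tokens = patchify(img)  # 14*14 = 196 tokens, each of dimension 16*16*3 = 768
```

Each token is then linearly projected and combined with a positional embedding before entering the transformer encoder.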
Transformers.js
Transformers.js is a JavaScript library that allows running pre-trained transformer models in the browser. It uses ONNX Runtime to execute the models and supports various data types for optimization.

