Sparse Dictionary Learning: A Comprehensive Guide
Sparse dictionary learning, also known as sparse coding, is a powerful technique used in machine learning and signal processing. It aims to represent data as a sparse linear combination of basis elements, also known as atoms, from a learned dictionary. This article provides a comprehensive guide to sparse dictionary learning, covering its theoretical foundations, algorithms, applications, and benefits.
Introduction
Dictionary learning is a concept that is often misunderstood in the machine learning world, despite its frequent presence in research papers. The idea of a 'dictionary' comes from building bases for a given vector space in linear algebra. The matrix of basis vectors is also called the dictionary matrix, and the algorithm intended to generate a dictionary matrix gives "dictionary learning" its name.
Theoretical Background
Dictionary learning is a sparse encoding scheme. The joint problem of learning the dictionary and the sparse codes is not convex, so there is no guarantee of reaching a global optimum. In practice, the problem is split into two alternating subproblems, each of which is tractable on its own: sparse coding with the dictionary fixed (convex under an $\ell_1$ relaxation), and the dictionary update with the codes fixed.
To understand dictionary learning, familiarity with the linear-algebra concepts of bases, spanning sets, and rank is helpful. Simply put, dictionary learning is a numerical method for computing basis-like vectors for a given set of vectors. We cannot rely on an analytical solution because exact sparse coding under an $\ell_0$ constraint is NP-hard from a computational complexity perspective. Hence, there is no guarantee of finding the exact solution, especially for a large matrix.
A dictionary (basis matrix) consists of atoms (basis-like vectors). Atoms need not be orthogonal, and the dictionary may be an over-complete spanning set (violating the linear-independence property of a basis).
Dictionary learning is trained as a constrained optimization problem, where the constraints push the sparsity of the coefficients above a given threshold. The optimization minimizes the difference between the input data matrix $X$ and the product of the learned dictionary matrix $D$ with its coefficient matrix $R$, i.e. $\min_{D,R} \|X - DR\|_F^2$ subject to a sparsity constraint on the columns of $R$.
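As a minimal sketch of these quantities (with toy, hypothetical matrix sizes, not from any real dataset), the objective and the sparsity constraint can be written in NumPy as:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 50))    # input data: 50 signals of dimension 8
D = rng.standard_normal((8, 12))    # dictionary D: 12 atoms (over-complete, 12 > 8)
R = np.zeros((12, 50))              # coefficient matrix R, one column per signal
R[rng.integers(0, 12, size=50), np.arange(50)] = 1.0  # one active atom per column

objective = np.linalg.norm(X - D @ R, 'fro') ** 2   # reconstruction error to minimize
nonzeros_per_signal = np.count_nonzero(R, axis=0)   # sparsity, measured per column of R
```

Real algorithms alternate between updating $R$ (sparse coding) and $D$ (dictionary update); this snippet only evaluates the objective for fixed matrices.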
Sparsity
In this tutorial, we will discuss the notion of sparsity. Neuroscience and AI both use notions of sparsity to describe efficient representations of the world. Sparse activity means that only a small number of neurons or units are active at any given time. Computationally, this helps reduce energy consumption and focuses computational efforts on the most salient features of the data. Sparse connections refer to the selective interaction between neurons or nodes, for example, through a graph that is far from fully connected. Computationally, this can focus processing power where it is most needed. In the brain, we see sparsity of both types, and researchers have proposed many theories about its benefits. This tutorial will explore a few of these benefits.
Sparsity can be measured using different metrics:
- $\ell_0$ pseudo-norm, the count of non-zero values: $\|\mathbf{x}\|_{\ell_0}=\sum_i \mathbb{1}_{x_i\neq 0}$, where $\mathbb{1}$ is an indicator function equal to 1 if and only if its argument is true.
- Kurtosis, a fourth-order measure that quantifies the "tails" of the distribution: $\kappa=\mathbb{E}_x\left[\left(\frac{x-\mu}{\sigma}\right)^4\right]$.
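Both metrics are easy to compute directly. As a small illustration on a hand-made vector (the values are arbitrary):

```python
import numpy as np

x = np.array([0.0, 0.0, 3.0, 0.0, -1.5, 0.0, 0.0, 0.2])

# l0 pseudo-norm: count of non-zero entries
l0 = np.count_nonzero(x)

# kurtosis: mean fourth power of the standardized values
mu, sigma = x.mean(), x.std()
kurtosis = np.mean(((x - mu) / sigma) ** 4)
```

Here `l0` is 3, since only three entries are non-zero.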
Sparsity in Real Life
Under what scenarios do we encounter sparsity in real life? In this first section, we will explore various contexts and methods through which natural sparsity manifests in real-world signals. For the first exercise, we will analyze a video of a bird in San Francisco to extract temporal sparsity. Try gradually increasing the threshold parameter $\theta$ from 0 to 240 in intervals of 20, and plot the result for each value. You might notice that the kurtosis value first decreases (until around $\theta = 140$) and then drastically increases, reflecting the desired sparsity property. Looking closely at the kurtosis formula, it measures the expected value (average) of the standardized data values raised to the 4th power. A data point lying within one standard deviation of the mean contributes almost nothing (a number smaller than 1 raised to the fourth power is small); most of the contribution comes from extreme outliers lying far outside that range. So the main characteristic kurtosis measures is the tailedness of the data: it is high when the contribution of the outliers outweighs that of the "simple" points, since kurtosis averages over all points.
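The threshold sweep described above can be sketched as follows. Since the bird video is not included here, random 8-bit pixel values stand in for the frames; the thresholding and kurtosis logic is the same:

```python
import numpy as np

def kurtosis(x):
    """Mean fourth power of the standardized values."""
    x = np.asarray(x, dtype=float).ravel()
    return np.mean(((x - x.mean()) / x.std()) ** 4)

# Stand-in for the video: random 8-bit pixel values (the real exercise uses video frames).
rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(30, 64, 64))

kurt_per_theta = []
for theta in range(0, 241, 20):
    thresholded = np.where(frames > theta, frames, 0)  # zero out sub-threshold pixels
    kurt_per_theta.append(kurtosis(thresholded))
```

At high thresholds almost all pixels become zero and only a few extreme values survive, so the kurtosis of the thresholded frames is much larger than at $\theta = 0$.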
Temporal Sparsity
In this section, you will increase temporal sparsity in a natural 1D time series by temporal differencing. Denote the pixel value at time $t$ by $\text{pixel}_t$. In code, compute temporal_diff by applying np.diff to the signal sig and then np.abs to take absolute values. The NumPy function np.diff takes the signal as well as an additional parameter n, which compares the value of element i with element i-n (the default is n=1). Create an array of 10 different $\tau$ values, taus = [1, 11, 21, …], and compare the histograms of temporal differences for each $\tau$. Then, instead of differences separated by a single delay $\tau$, we will compute differences between one value and the average over a range of delays; this is closer to what brains actually do. We define a diff_box function, which accepts a signal and a window size as inputs. Internally, it computes the difference between the signal and the average signal over a delayed time window. Observe the results for window = 10.
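A possible implementation of these two steps might look like this; the random-walk signal is a stand-in for a real pixel trace, and the trailing-window form of diff_box is one plausible reading of "average signal over a delayed time window":

```python
import numpy as np

rng = np.random.default_rng(0)
sig = np.cumsum(rng.standard_normal(200))  # random walk standing in for pixel_t over time

# Absolute temporal differences with delay n (default n=1)
temporal_diff = np.abs(np.diff(sig, n=1))

def diff_box(sig, window):
    """Absolute difference between each sample and the mean of its trailing window."""
    kernel = np.ones(window) / window
    box_avg = np.convolve(sig, kernel, mode='valid')  # trailing-window moving average
    return np.abs(sig[window - 1:] - box_avg)

out = diff_box(sig, window=10)
```

For a 200-sample signal, np.diff yields 199 differences, and diff_box with window = 10 yields 191 values (one per complete window).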
Sparse Coding
Sparse coding is a coding strategy in which inputs are represented by a linear combination of features, most with zero coefficients but a few that are nonzero, or active. Often, this is applied with an overcomplete basis set: we use more features than are necessary to cover the input space.
In 1996, Olshausen and Field [2] demonstrated that sparse coding could be a good model of the early visual system, particularly in V1.
The $\ell_0$ pseudo-norm is defined as the number of non-zero features in the signal. In particular, let $h \in \mathbb{R}^{J}$ be a vector with $J$ "latent activity" features. Assume a simple linear model in which we want to capture the observations $y$ using the linear model $D$ (which we will later call the dictionary). $D$'s features (columns) can have sparse weights, denoted by $h$. For instance, in the brain, $D$ can represent a basis of neuronal networks while $h$ captures their sparse, time-changing contributions to the overall brain activity. To enforce that $h$ is sparse, we penalize the number of non-zero features with a penalty $\lambda$, giving the objective $\min_h \|y - Dh\|_2^2 + \lambda \|h\|_{\ell_0}$.
One method to find a sparse solution for a linear decomposition with (\ell_0) regularization is OMP (Orthogonal Matching Pursuit) [4]. In this context, a “dictionary” is a collection of features that we use to represent the target signal.
Orthogonal Matching Pursuit (OMP)
Now, we will explore the effect of increasing the number of non-zero coefficients. We define the number of non-zero coefficients to be the cardinality. Below, please run OMP with increasing numbers of non-zero coefficients (increasing cardinality), from 1 to 101 in intervals of 5. We will then compare the accuracy and reconstruction performance of each fitted model to examine the relationship between cardinality and signal reconstruction. You should already have an idea of what the performance might look like for a very low cardinality and a very high cardinality.
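The cardinality sweep above can be sketched with scikit-learn's OrthogonalMatchingPursuit on a synthetic dictionary (the dictionary, the 10-sparse target, and all sizes here are illustrative choices, not the tutorial's actual data):

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
n_dims, n_atoms = 120, 300
D = rng.standard_normal((n_dims, n_atoms))
D /= np.linalg.norm(D, axis=0)                      # unit-norm atoms

h_true = np.zeros(n_atoms)
support = rng.choice(n_atoms, size=10, replace=False)
h_true[support] = rng.standard_normal(10)
y = D @ h_true                                      # noiseless 10-sparse signal

errors = []
for k in range(1, 102, 5):                          # cardinality 1, 6, ..., 101
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
    omp.fit(D, y)
    errors.append(np.linalg.norm(y - omp.predict(D)))
```

Reconstruction error drops rapidly as the allowed cardinality approaches the true sparsity of the signal, and is essentially zero beyond it in this noiseless setting.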
While OMP is a common and simple approach for finding a sparse representation over a dictionary of features, it presents multiple challenges. The $\ell_0$ norm is not convex, making it hard to identify the sparse vector. Another challenge is computational complexity. Approaches addressing these issues exist.
Dictionary Learning Algorithms
Dictionary learning algorithms are often built on an optimization process that iterates between two stages: sparse approximation and dictionary update. The optimization formulation and algorithms for both stages will be discussed. The links and differences among these algorithms, such as the well-established Method of Optimal Directions (MOD) and K-SVD, the more recently proposed Simultaneous Codeword Optimisation (SimCO), and Large-step Gradient Descent (LGD), will be detailed. Extensions, for example to analysis operator learning, will also be addressed.
Method of Optimal Direction (MOD)
The Method of Optimal Directions (MOD) is efficient in low-dimensional spaces and is guaranteed to find the optimal dictionary update when the number of samples exceeds the minimal sample size. However, it is computationally heavy for higher-dimensional input vector spaces because each update computes the pseudo-inverse of the coefficient matrix R.
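The MOD dictionary update itself is a single least-squares step. A sketch of just that step in NumPy, with toy sizes, and with codes R assumed to come from a prior sparse-approximation stage (the crude thresholding here only stands in for that stage):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 40))    # data matrix
R = rng.standard_normal((12, 40))   # coefficient matrix from the sparse-coding stage
R[np.abs(R) < 1.0] = 0.0            # crude sparsification, purely for illustration

# MOD update: D = X R^+ (pseudo-inverse of R), minimizing ||X - D R||_F over D
D = X @ np.linalg.pinv(R)

# Renormalize atoms to unit norm (a common post-processing step)
D /= np.linalg.norm(D, axis=0, keepdims=True)
```

Computing `np.linalg.pinv(R)` is exactly the expensive part the text refers to: its cost grows quickly with the dimensions of R.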
K-SVD
K-SVD is another method, based on a generalization of k-means: it updates the dictionary one atom at a time, replacing each atom (and its coefficients) with the rank-1 approximation, obtained via an SVD, of the residual restricted to the signals that use that atom.
Simultaneous Codeword Optimization (SimCO) and Large Step Gradient Descent (LGD)
These algorithms represent more recent advances in dictionary learning.
Practical Example of Dictionary Learning
For implementation purposes, I used scikit-learn's built-in DictionaryLearning module. The module supports many more configurations of dictionary learning than are used here, so I encourage you to consult the scikit-learn documentation for more insight.
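A minimal sketch with DictionaryLearning on random data (the sizes and the sparsity level are arbitrary choices for illustration, not the original experiment's settings):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))          # 100 samples of dimension 20

dl = DictionaryLearning(n_components=30,    # over-complete: 30 atoms for 20 dims
                        transform_algorithm='omp',
                        transform_n_nonzero_coefs=3,
                        random_state=0)
R = dl.fit_transform(X)                     # sparse codes, shape (100, 30)
D = dl.components_                          # dictionary, shape (30, 20); X is approx. R @ D
```

Note the scikit-learn convention: `components_` holds the dictionary with one atom per row, so the reconstruction is `R @ D` rather than `D @ R`.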
The encoding vector space generated by our implementation is sparser than the R matrix illustrated above. You can also glimpse that, as sparsity increases, the encoding vectors become simpler and more precise.
Why Sparsity is Important
Broadly speaking, the higher the sparsity of the encoding, the more resilient the resulting dictionary is to noise. An algorithm fed this vector space can therefore gain accuracy while reducing processing time. However, the importance of sparsity in ML is a topic for another day.
Dictionary learning is a sparse encoding algorithm, and in machine learning, algorithms are almost always backed by a representation learning or data encoding system. Dictionary learning comes with some bonuses as well: if k exceeds the minimal sample size and the vectors are not affected by noise, the resulting dictionary is guaranteed to give an optimal encoding for any point in the trained vector space, even for vectors not seen at training time. Also, encodings that are sparser and closer to orthogonal are well suited for use in a neural network or a general ML algorithm.
Autoencoders and Bottlenecks
Suppose we have some neural network weights that describe how to do a task. We would like to know how the network solved the task, and whether it follows guidelines we care about. Two obstacles stand in the way. First, neural networks are enormous. Second, in a typical neural network, the individual units do not directly show us the concepts the network sees.
We want to recover the underlying features. If $g$ is the map from those features to neuron activations, dictionary learning can be seen as trying to find the inverse map to $g$.
Consider the AutoEncoder set-up on a 1-Layer Transformer. The autoencoder features are sparse. The autoencoder takes the original data as input, passes it through a bottleneck, and tries to reconstruct the inputs. The most surprising thing about this approach is that it works so well.
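As a forward-pass-only sketch of this set-up (hypothetical sizes, training loop omitted), the sparse features live in a wider ReLU hidden layer, with an L1 term that would push its activations toward sparsity during training:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64         # hypothetical sizes; hidden layer is over-complete

W_enc = 0.1 * rng.standard_normal((d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = 0.1 * rng.standard_normal((d_hidden, d_model))

def sae_forward(x):
    h = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU feature activations (the sparse code)
    x_hat = h @ W_dec                       # reconstruction of the input
    return h, x_hat

x = rng.standard_normal((8, d_model))       # stand-in for transformer activations
h, x_hat = sae_forward(x)
recon_loss = np.mean((x - x_hat) ** 2)      # reconstruction term
l1_penalty = np.abs(h).mean()               # sparsity term added to the training loss
```

Training minimizes `recon_loss` plus a coefficient times `l1_penalty`; the rows of `W_dec` then play the role of dictionary atoms.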
Monosemanticity
Sparse dictionary learning suggests that the features it recovers are approximately monosemantic: each responds to a single interpretable concept. In some sense, this is the more fundamental part.
Scaling and Universality
Different random initialisations learn similar features. This is a step towards universality but doesn’t show the whole thesis.
Applications of Dictionary Learning
Dictionary learning has a wide range of applications across various fields:
- Image Denoising: Dictionary learning can be used to remove noise from images by representing the image as a sparse combination of atoms from a learned dictionary.
- Image Processing: Dictionary learning is used in image processing for tasks such as image compression, super-resolution, and inpainting.
- Video Processing: Similar to image processing, dictionary learning is applied to video processing tasks such as video compression, denoising, and object tracking.
- Audio-Visual Joint Tracking: Dictionary learning can be used for tracking objects in videos by jointly learning dictionaries for both audio and visual data.
- Blind Source Separation: Dictionary learning can be used to separate mixed signals into their individual source components.
- EEG Analysis: Dictionary learning is used to analyze EEG signals, for example motor-imagery recordings.
- Signal Processing: Dictionary learning is widely used in signal processing for tasks such as signal reconstruction, classification, and feature extraction.
- Remote Sensing: Dictionary learning is used in remote sensing for tasks such as land cover classification and change detection.
- Data Compression: Dictionary learning is used in data compression to represent data using a smaller number of coefficients.
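To make the image-denoising application concrete, here is a sketch using scikit-learn's MiniBatchDictionaryLearning on patches of a synthetic image; the image, patch size, and sparsity level are all illustrative choices:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d

rng = np.random.default_rng(0)
clean = np.kron(rng.random((8, 8)), np.ones((8, 8)))        # 64x64 piecewise-constant image
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

patches = extract_patches_2d(noisy, (6, 6))                  # all overlapping 6x6 patches
data = patches.reshape(len(patches), -1)
dc = data.mean(axis=1, keepdims=True)                        # per-patch DC offset
data = data - dc

dico = MiniBatchDictionaryLearning(n_components=50, batch_size=64,
                                   transform_algorithm='omp',
                                   transform_n_nonzero_coefs=2, random_state=0)
codes = dico.fit(data).transform(data)                       # sparse code per patch
denoised_patches = (codes @ dico.components_ + dc).reshape(patches.shape)
denoised = reconstruct_from_patches_2d(denoised_patches, noisy.shape)
```

Each patch is approximated by only two atoms plus its mean, which discards much of the noise; averaging the overlapping reconstructed patches yields the denoised image.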
Advantages of Dictionary Learning
Dictionary learning offers several advantages over other representation learning techniques:
- Sparsity: Dictionary learning promotes sparsity in the representation, which can lead to more efficient and interpretable models.
- Adaptability: Dictionary learning learns the dictionary directly from the data, allowing it to adapt to the specific characteristics of the data.
- Overcomplete Representation: Dictionary learning can use an overcomplete dictionary, which can capture more complex features in the data.
- Robustness: Sparse representations are more robust to noise and outliers in the data.
- Interpretability: The learned dictionary can provide insights into the underlying structure of the data.
Dictionary Learning in the Context of Mechanistic Interpretability
Dictionary learning can be seen as a step very close to the dream of Mechanistic Interpretability. Mechanistic Interpretability seeks to understand how neural networks solve tasks by identifying the features and algorithms they learn.
By finding a "dictionary" which tells us how much of each feature goes into each neuron, we can understand what the network is "thinking" about. For example, if certain neurons activate (or "fire") then the network is thinking about cats. If other neurons fire, then the network is thinking about the color blue.
By manipulating the features the model uses to "think" in neuron space, you can influence the model's behaviour.
Challenges and Future Directions
While dictionary learning has shown great promise, there are still several challenges to overcome:
- Computational Complexity: Dictionary learning can be computationally expensive, especially for large datasets.
- Non-Convexity: The optimization problem in dictionary learning is non-convex, which can make it difficult to find the global optimum.
- Dictionary Size: Choosing the appropriate dictionary size can be challenging.
- Scalability: Scaling dictionary learning to very large datasets is an ongoing area of research.
Future research directions include developing more efficient algorithms, addressing the non-convexity of the optimization problem, and exploring new applications of dictionary learning.
tags: #sparse #dictionary #learning #tutorial

