Deep Learning SIREN Applications: A Comprehensive Overview
The increasing level of automation in various sectors, particularly in road vehicles, demands robust systems capable of interpreting the environment. This is achieved through multi-modal sensors, including acoustic sensors, which capture the rich urban soundscape. Sound detection plays a vital role in monitoring and identifying emergency situations. Deep learning models, especially sinusoidal representation networks (SIRENs) and related models with sinusoidal activation functions, have emerged as powerful tools for applications ranging from siren identification to implicit neural representations. Let's delve into these diverse applications, exploring their strengths, weaknesses, and potential for future development.
Siren Identification with Frequency Tracking
Siren identification algorithms have seen significant advancements, with deep learning models achieving state-of-the-art performance and proving robust to the diversity of siren signals and to prominent background noise. Most state-of-the-art solutions rely on a spectrogram-based time-frequency representation of the sound signal fed to 2D convolutional neural networks (CNNs). In this context, siren identification systems face several challenges, including the need for low-complexity on-vehicle models and strong generalization ability to cope with the diverse urban soundscape. Moreover, the amount of available data can be limited, and datasets are unlikely to capture the full diversity of the target scenes.
Novel Features for Data-Efficient Siren Identification
Aiming for data efficiency and low complexity, novel features for siren identification based on frequency tracking have been proposed. Adopting a single-parameter adaptive notch filter (ANF) design, a CNN model uses two features: the tracked fundamental frequency and the power ratio between the tracked sinusoidal component, extracted by the ANF, and the full audio signal. The task is solved with the proposed architecture (ANFNet), which is compared against the spectrogram-based baseline [13], denoted VGGSiren.
The baseline VGGSiren is a VGG-inspired [19] 2D-CNN composed of three blocks, each containing two 2D convolutional layers and a max pooling operation, followed by a 10-neuron FC layer and a single-neuron output layer.

The ANF is a second-order IIR notch filter with time-varying transfer function

H(q⁻¹, n) = (1 − a(n)q⁻¹ + q⁻²) / (1 − ρa(n)q⁻¹ + ρ²q⁻²),

where n is the time index; q denotes the discrete-time shift operator, defined such that, for an input signal y(n), q⁻ᵏy(n) = y(n−k); ρ < 1 is a fixed hyperparameter denoting the radius of the complex-conjugate pole pair; and a(n) = 2cos[2πf(n)/fs] is the single filter parameter, f(n) being the notch frequency and fs the sampling frequency. The given N-sample input signal y(n) is filtered by the direct-form II of H(q⁻¹, n−1), using a joint delay line for the feedforward and feedback paths of the IIR filter. The estimation procedure consists in the recursive update of the covariance of the prediction error p̂(n), the Kalman gain k(n), and the parameter estimate â(n). These steps involve only scalar operations and require a memory of two past samples.
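As a rough illustration of the filtering step described above, the following sketch implements the direct-form II recursion for a notch filter with a fixed parameter a (the full KalmANF additionally updates a(n) at every sample via the Kalman recursion, which is omitted here); the signal values and constants are illustrative:

```python
import math

def notch_filter(y, f0, fs, rho=0.99):
    """Direct-form II notch filter with a fixed notch frequency f0.

    Zeros on the unit circle at angles +-2*pi*f0/fs reject the sinusoid;
    poles at radius rho < 1 keep the notch narrow.
    """
    a = 2.0 * math.cos(2.0 * math.pi * f0 / fs)  # single filter parameter
    s1 = s2 = 0.0  # joint delay line: only two past samples are stored
    e = []
    for yn in y:
        s = yn + rho * a * s1 - rho * rho * s2  # feedback (denominator) path
        e.append(s - a * s1 + s2)               # feedforward (numerator) path
        s1, s2 = s, s1
    return e

# A 1 kHz sinusoid at fs = 16 kHz is suppressed once the transient decays.
fs, f0 = 16000, 1000.0
y = [math.sin(2 * math.pi * f0 * n / fs) for n in range(4000)]
e = notch_filter(y, f0, fs)
```

After the transient (time constant roughly 1/(1−ρ) samples), the notch output carries almost none of the tracked sinusoid's power.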
ANFNet Architecture and Training
At each time step, the estimated parameter â(n) carries information on the frequency tracked by the ANF, retrieved as f̂(n) = (fs/2π)arccos[â(n)/2], which is used as the first feature for the siren identification network. It is important to notice that the tracked frequency is not necessarily the fundamental frequency, but the component with the highest energy. We then expand the above formulation to introduce a second feature, the power ratio, expressing the ratio between the power of the suppressed sinusoidal component and that of the input signal:

P_ratio(n) = [P_y(n) − P_e(n)] / P_y(n),

with the input power P_y(n) and notch-output power P_e(n) obtained by recursive averaging, e.g. P_y(n) = λP_y(n−1) + (1−λ)y²(n), where λ = e^(−1/(τfs)), with τ constituting a first additional hyperparameter representing the time constant of the recursive averaging.
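The two features can be sketched in a few lines; the power-averaging helper below is a hypothetical reconstruction following the recursive-averaging definition of λ above, not the authors' code:

```python
import math

def tracked_frequency(a_hat, fs):
    # f_hat(n) = (fs / (2*pi)) * arccos(a_hat(n) / 2)
    return fs / (2.0 * math.pi) * math.acos(a_hat / 2.0)

def power_ratio(y, e, tau, fs):
    """Recursive power ratio between the suppressed component and the input.

    y: input samples; e: notch-filter output (sinusoid removed);
    tau: averaging time constant in seconds.
    """
    lam = math.exp(-1.0 / (tau * fs))
    p_y = p_e = 1e-12  # tiny init to avoid division by zero at start-up
    ratio = []
    for yn, en in zip(y, e):
        p_y = lam * p_y + (1 - lam) * yn * yn
        p_e = lam * p_e + (1 - lam) * en * en
        ratio.append(max(p_y - p_e, 0.0) / p_y)
    return ratio

# The filter parameter of a 1 kHz sinusoid maps back to 1 kHz.
fs = 16000
a = 2.0 * math.cos(2.0 * math.pi * 1000.0 / fs)
```

For a pure sinusoid that the notch removes completely, the power ratio converges towards 1; for broadband noise it stays low.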
In Fig. 1 the f̂ and P_ratio features are shown for a noise sample and two different siren samples (wail and yelp) extracted from the sireNNet dataset [24], which will be used for the experimental evaluation. In the top row, the tracked f̂ is overlaid on the full spectrogram; we remark, however, that the frequency estimate is obtained directly from the time-domain signal without computing the spectrogram. We solve the siren identification problem using the ANFNet network, which processes the f̂ and P_ratio features extracted from a single-channel, 2 s-long audio sample: the two features are stacked into a 2-channel vector provided as input to the first layer. The architecture (see Tab. 1) contains three 1D convolutional layers (Conv1D) with, respectively, 10, 20 and 40 filters having kernel sizes 16, 8 and 4. Each of the first two Conv1D layers is followed by a max pooling operation (MaxPool) for dimensionality reduction. After the third one, a global average pooling operation (GlobAvgPool) serves as the interface between the convolutional part and the classification head, composed of two fully connected (FC) layers with 40 and 20 neurons, respectively, and a single-neuron output layer. We use the ReLU activation function in each hidden layer and the sigmoid activation in the output layer, and introduce dropout layers with 0.25 drop probability after each FC layer to prevent overfitting.
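Since every layer size is given above, the parameter budget can be checked with a quick back-of-the-envelope count (assuming biases in every convolutional and FC layer; pooling and dropout layers contribute no parameters):

```python
def conv1d_params(c_in, c_out, k):
    return c_in * c_out * k + c_out  # weights + biases

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

# Two input channels: tracked frequency f_hat and power ratio P_ratio.
total = (conv1d_params(2, 10, 16)
         + conv1d_params(10, 20, 8)
         + conv1d_params(20, 40, 4)
         + fc_params(40, 40)   # GlobAvgPool outputs one value per filter
         + fc_params(40, 20)
         + fc_params(20, 1))
print(total)  # 7671, i.e. roughly the 7.7k parameters reported for ANFNet
```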
Datasets and Implementation Details
For training, we use the sireNNet dataset [24], containing a total of 421 noise and 1254 siren samples covering different types of sirens. All samples have a duration of 3 s, and since half of the siren files are artificially generated for data augmentation purposes, we exclude them and use only the 627 non-augmented siren samples. We divide this dataset into training, validation and test data with ratios [0.8, 0.1, 0.1]. In order to perform a data-efficient evaluation, we split the training set into subsets of different sizes, similarly to [25]: in particular, we create subsets containing an increasing percentage of the full training set, with ratios 0.25%, 0.5%, 1%, 2%, 4%, 8%, 16%, 32%, 64% and 100% (i.e., the entire training set). 10 folds are randomly generated for each subset, in order to compute the mean and standard deviation of the results. The validation and test sets are always used without additional splitting. To further evaluate the generalization performance in a cross-dataset setting, we also use for testing a subset of 210 audio files randomly extracted from the dataset of [26] (which we will call LSSiren); this dataset contains siren and noise files with lengths between 3 s and 15 s. All files of both datasets have been re-sampled to 16 kHz and converted to mono; moreover, since we use 2 s samples as input, we take only the first two seconds of each file of the sireNNet dataset, and divide the LSSiren files into non-overlapping 2 s segments.
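The segmentation step can be sketched as follows (a minimal helper, not the authors' code; file loading and resampling are assumed to have already happened):

```python
def split_segments(x, fs=16000, seg_dur=2.0):
    """Split a mono signal into non-overlapping fixed-length segments,
    dropping any incomplete trailing segment."""
    n = int(seg_dur * fs)
    return [x[i:i + n] for i in range(0, len(x) - n + 1, n)]

# A 5 s LSSiren-style file yields two complete 2 s segments; for a
# sireNNet file we would keep only the first segment (first two seconds).
five_seconds = [0.0] * (5 * 16000)
segs = split_segments(five_seconds)
```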
We implement the proposed ANFNet and the baseline VGGSiren using PyTorch Lightning [27]. For the KalmANF algorithm, we set the hyperparameters ρ = 0.99, σw = 10⁻⁵, σe = 0.66, qdown = 5, τ = 0.02, all chosen by manual tuning based on the best loss obtained on the validation set. For VGGSiren, we compute the mel-spectrogram using a 1024-sample Hann window with 512 samples of overlap and 128 mel channels. As a result, VGGSiren has a total of 53.9k 32-bit floating-point parameters, making it 7 times larger than the proposed ANFNet. For VGGSiren, we apply peak normalization to the mel-spectrograms, whereas for ANFNet we normalize the f̂ feature by fs/2 (the P_ratio feature is normalized by definition). In all experiments we train both models for 400 epochs using the binary cross-entropy loss function and the Adam optimizer [28], with a learning rate between 0.001 and 0.005 and a batch size between 2 and 32, both depending on the size of the training split, and select the best model based on the validation loss. We train both models on the 10 folds of each sireNNet subset. Note that the 0.25%, 0.5% and 1% splits contain, respectively, only 2, 4 and 9 samples, making the problem extremely challenging and comparable to few-shot learning (without pre-training).
Evaluation and Results
First, we evaluate in-domain performance on the sireNNet test set and report in Fig. 2 the average and standard deviation (shaded area) of the F1-score. In Tab. 2 we report the average F1-score and AUPRC obtained with the two models for each training split. We then evaluate the models on the LSSiren data (cross-dataset setting) and report the results in the bottom plot of Fig. 2 and in Tab. 3. Again, the performance of both decreases as the training dataset size decreases. In this case, the proposed ANFNet significantly outperforms the baseline on all subsets.
These results indicate that the proposed features help the network capture the difference between the siren and noise classes even when limited data is available, suggesting their potential for data-efficient learning. Moreover, the evaluation highlights that the proposed features provide enhanced robustness to domain shift compared to the mel-spectrogram. Fig. 2 also shows that the standard deviation is reduced compared to VGGSiren, indicating that ANFNet is less sensitive to the choice of training samples. Finally, ANFNet has a lower complexity, with a 7 times smaller network size (7.7k parameters vs. the 53.9k of VGGSiren).
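For reference, the F1-score used throughout the evaluation can be computed as follows (a standard textbook implementation, not tied to the paper's code):

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 3 siren clips (label 1) and 2 noise clips (label 0).
score = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])  # precision = recall = 2/3
```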
Implicit Neural Representations (INRs) and SIRENs
Implicit neural representations (INRs) have emerged as a powerful technique for representing various signals, entities, or systems. INRs map a set of coordinates to signal values, providing a continuous and differentiable representation. This continuity allows INRs to capture fine details as a function of input coordinates, making them suitable for tasks where data resolution is a limitation. Among existing INR-based works, multi-layer perceptrons with sinusoidal activation functions (SIRENs) see widespread application. The sinusoidal activation function itself can also replace more traditional ReLU- or sigmoid-family activations in other modern neural networks across a range of applications.
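To make the idea concrete, here is a minimal, dependency-free sketch of a SIREN forward pass for a 1-D input coordinate; the layer sizes, the choice ω0 = 30, and the simplified initialization bounds are illustrative assumptions, not a definitive implementation:

```python
import math, random

def siren_forward(x, weights, omega0=30.0):
    """Forward pass of a tiny SIREN: sinusoidal activations in the hidden
    layers (scaled by omega0 at the first layer), linear output layer."""
    h = [x]
    for i, (W, b) in enumerate(weights[:-1]):
        scale = omega0 if i == 0 else 1.0
        h = [math.sin(scale * (sum(w * v for w, v in zip(row, h)) + bi))
             for row, bi in zip(W, b)]
    W, b = weights[-1]
    return [sum(w * v for w, v in zip(row, h)) + bi for row, bi in zip(W, b)]

def init_layer(n_in, n_out, first=False):
    # SIREN-style initialization: U(-1/n, 1/n) at the first layer,
    # sqrt(6/n)/omega0 bounds afterwards (simplified here).
    bound = 1.0 / n_in if first else math.sqrt(6.0 / n_in) / 30.0
    W = [[random.uniform(-bound, bound) for _ in range(n_in)]
         for _ in range(n_out)]
    b = [random.uniform(-bound, bound) for _ in range(n_out)]
    return W, b

random.seed(0)
net = [init_layer(1, 16, first=True), init_layer(16, 16), init_layer(16, 1)]
y = siren_forward(0.5, net)  # scalar coordinate -> scalar signal value
```

Because every hidden activation lies in [−1, 1] and the output weights are small at initialization, the untrained network starts near zero and remains fully differentiable in its input.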
Addressing Limitations of Sinusoidal Activation Functions
Recent investigations have suggested that the use of sinusoidal activation functions can be sub-optimal due to their limited supported frequency set as well as their tendency to generate over-smoothed solutions. Various recent researchers have either pointed out or directly demonstrated that, while an INR with a sinusoidal activation function is able to learn smooth representations, the fitted representation can be too smooth in some scenarios: in particular, fine details are smoothed out in the learned representations. A simple solution to mitigate this issue is to change the activation function at the first layer from sin(x) to sin(sinh(2x)).
A recent effort attempts to address this issue of sinusoidal activation functions by introducing a second-order term, such that the output of the linear combination at each layer is activated by sin(x(1+|x|)) rather than sin(x). Since the frequency of the activation function varies with the initialized bias, they also propose initializing the bias within a larger range than is common practice. It can be shown, however, that the inclusion of the second-order term causes the network to gradually bias towards higher frequencies over the layers, which is not preferable since such bias can lead to overfitting. This overfitting behavior can be demonstrated via a simple experiment.
H-SIREN: A Modified Activation Function
The design of SIREN largely preserves the supported frequency set over the layers. Therefore, it is possible to adjust the supported frequency set of SIREN by modifying only the activation function at the first layer. For that first layer, we extend the idea of Liu et al. to infinite orders instead of truncating at second order, eventually arriving at the activation function sin(sinh(2x)) after some stabilization and simplification. We will demonstrate the effectiveness of the H-SIREN activation function in a series of tasks that include implicit representation of signals. In particular, we report results on image fitting, video fitting, video super-resolution, fitting signed distance functions, neural radiance fields, and graph neural network-based fluid flow simulation.
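A quick numerical sanity check of why sin(sinh(2x)) enlarges the first layer's supported frequency set: since sinh grows exponentially, the activation oscillates faster away from the origin. The sketch below simply counts zero crossings over two intervals (the interval choices are arbitrary):

```python
import math

def hsiren_first(x):
    # H-SIREN first-layer activation: sin(sinh(2x))
    return math.sin(math.sinh(2.0 * x))

def zero_crossings(lo, hi, n=2000):
    """Count sign changes of the activation sampled over [lo, hi]."""
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    vals = [hsiren_first(x) for x in xs]
    return sum(1 for a, b in zip(vals, vals[1:]) if a * b < 0)

# d/dx sinh(2x) = 2*cosh(2x) >= 2, so the local oscillation rate grows
# with |x|: far from the origin the activation crosses zero far more often.
near = zero_crossings(0.0, 1.0)
far = zero_crossings(2.0, 3.0)
```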
Overfitting and Frequency Control
With the original SIREN activation function, the frequency spectrum at initialization is largely consistent across layers, and the maximum frequency grows only slowly over the layers. When we extend to orders higher than one, however, the spectrum gradually evolves towards higher frequencies as the layer number increases. Such an increasing bias towards higher frequencies will eventually cause the network to overfit when used to learn low-frequency representations. To demonstrate this, we construct a mini-experiment, where SIREN and FINER with 5 hidden layers of width 64 are used to fit the function y = f(x) = sin(30x)/(10x). A total of 200 points are sampled equidistantly within x ∈ [−1, 1], and the fitted function after 1000 iterations of gradient descent is evaluated at 2000 equidistant points within x ∈ [−1, 1]. It is clear that the bias towards high frequencies causes severe overfitting. In order to control the frequency increase, we apply Eq. 6 only to the first layer of the network, whilst keeping the sinusoidal activation function for the other layers. The resulting network preserves the frequency distribution over the layers and supports a large range of frequencies. It is also biased towards low frequencies, which makes it more robust to overfitting, as demonstrated in Fig.
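The sampling setup of the mini-experiment can be reproduced as follows (note that an even number of equidistant points over [−1, 1] never includes x = 0, so the target is well defined at every sample):

```python
import math

def target(x):
    # Low-frequency target of the mini-experiment: y = sin(30x) / (10x)
    return math.sin(30.0 * x) / (10.0 * x)

def equidistant(n, lo=-1.0, hi=1.0):
    """n equally spaced points covering [lo, hi], endpoints included."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

train_x = equidistant(200)   # training samples
eval_x = equidistant(2000)   # denser grid used to evaluate the fit
train_y = [target(x) for x in train_x]
```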
Applications of SIRENs
SIRENs find applications in various domains, including:
- Image Fitting: SIRENs can accurately represent images, capturing fine details and gradients.
- Video Representation: SIRENs are capable of representing videos, offering a continuous mapping from pixel locations to pixel values.
- Video Super-Resolution: SIRENs can be used to enhance the resolution of videos, generating high-quality frames from low-resolution inputs.
- Neural Radiance Fields (NeRF): SIRENs play a crucial role in NeRF, enabling novel view synthesis by learning a mapping from location and view angle to color and opacity.
- Representing Shapes with Signed Distance Functions: SIRENs can be trained to represent shapes using signed distance functions (SDFs), providing a metric for measuring the distance of a point from the boundary of the shape.
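As an illustration of the last item, the following sketch builds the kind of coordinate-to-SDF supervision pairs an INR would fit, using a circle as a toy shape (the grid size and radius are arbitrary choices):

```python
import math

def circle_sdf(x, y, cx=0.0, cy=0.0, r=0.5):
    """Signed distance to a circle: negative inside, zero on the
    boundary, positive outside -- the target an INR learns to fit."""
    return math.hypot(x - cx, y - cy) - r

# Sample a 32x32 coordinate grid in [-1, 1]^2 as (input, target) pairs.
grid = [(-1 + 2 * i / 31, -1 + 2 * j / 31)
        for i in range(32) for j in range(32)]
data = [((x, y), circle_sdf(x, y)) for x, y in grid]
```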
Advantages of SIRENs
- Accurate Representation: SIRENs allow for an accurate representation of images and other signals.
- Continuous Mapping: SIRENs provide a continuous mapping from pixel locations to pixel values, enabling the learning of gradients and Laplacians.
- Flexibility: SIRENs can be used for various applications, including image fitting, video representation, and NeRF.
- Constraint Satisfaction: SIRENs can satisfy complex constraints, making them suitable for challenging datasets.
- Derivative Property: The derivative of a SIREN is also a SIREN, since cosine is just a shifted sine. No other commonly used non-linearity, such as tanh or ReLU, has this property. This becomes useful when we want to match not just the image, but also its derivatives.
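This property is easy to verify numerically for a single SIREN unit sin(wx + b); the weight and bias below are arbitrary:

```python
import math

def siren_unit(x, w=2.0, b=0.3):
    return math.sin(w * x + b)

def siren_unit_grad(x, w=2.0, b=0.3):
    # d/dx sin(wx + b) = w*cos(wx + b) = w*sin(wx + b + pi/2):
    # the derivative is again a scaled, phase-shifted sine unit.
    return w * math.sin(w * x + b + math.pi / 2)

# Central finite difference confirms the shifted sine is the derivative.
x, h = 0.7, 1e-6
fd = (siren_unit(x + h) - siren_unit(x - h)) / (2 * h)
```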

