Applications of Deep Learning Models in VEGAS and Social Intelligence
Deep learning models are increasingly used across fields such as video analysis, social intelligence, and medical diagnostics. This article looks at two such applications: the VEGAS framework for visually grounded social intelligence, and deep learning tools for diagnosing Parkinson's disease (PD).
The Challenge of Social Intelligence
Artificial Social Intelligence (ASI) aims to improve how machines comprehend and interact in complex social situations. The Social-IQ benchmark is a prime example of this challenge: it uses a multiple-choice question (MCQ) task to assess a model's ability to reason about social dynamics by integrating video, audio, and subtitles. Initially, Social-IQ posed difficulties even for advanced VideoQA methods, and relying solely on the question and answer options (BlindQA) yielded near-random accuracy. Recent multimodal approaches that incorporate Large Language Models (LLMs) have more than doubled this accuracy. However, much of that improvement stems from an LLM shortcut effect: the models exploit spurious correlations between questions and options and often disregard the visual context when answering. This raises doubts about whether such models truly understand multimodal social traits and whether their selected answers reflect genuine reasoning.
VEGAS: Visually Grounded Answering
To address these limitations, the VEGAS framework was developed to strengthen visual engagement and reduce reliance on language shortcuts. VEGAS integrates video, image, and audio encoders to process inputs from the corresponding modalities, along with a word embedding layer for text. The framework consists of two main components: Language Guided Sampling (LGS) and Grounded Inference Fine-Tuning (GIFT).
Language Guided Sampling (LGS)
The LGS module equips the model to sample question-relevant video frames from social interactions, guided by language cues in the form of explicit descriptions, causal questioning, and nuanced differentiation. Because timestamp annotations are unavailable, a dedicated sampling strategy supplies the model with more relevant visual frames during LGS supervision. The LGS module first encodes all candidate frames with a video encoder, then selects the frames that best align with the language hint (either the question alone or its fusion with an answer). A Temporal Alignment Module (TAM) then restores the temporal relationships among the sampled frames.
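The paper does not spell this routine out, but the selection step can be pictured as a similarity ranking between frame embeddings and the embedded language hint. The sketch below assumes CLIP-style encoders that map both modalities into a shared space; the function name and top-k strategy are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def language_guided_sample(frame_feats, text_feat, k=8):
    """Pick the k frames whose embeddings best match a language hint.

    frame_feats: (T, D) per-frame embeddings from the video encoder.
    text_feat:   (D,)   embedding of the question (or question-answer fusion).
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    scores = frame_feats @ text_feat                 # cosine similarity, (T,)
    # torch.topk returns indices in descending-score order, so the sampled
    # frames come out temporally disordered -- repairing this is the TAM's job.
    return torch.topk(scores, k=min(k, scores.numel())).indices
```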
Sampling alone leaves the selected frame features in a disordered temporal state, which can lead to misinterpretations of social activities. The TAM restores order in the sampler's output by constructing new temporal relationships with a CLIP attention module.
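Architecturally, that can be approximated with positional embeddings keyed to each frame's original index plus a self-attention block. The module below is a schematic rendering under those assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class TemporalAlignmentModule(nn.Module):
    """Schematic TAM: re-injects each sampled frame's original time step
    via positional embeddings, then lets self-attention rebuild the
    relationships among the frames."""

    def __init__(self, dim, max_frames=256, heads=8):
        super().__init__()
        self.pos = nn.Embedding(max_frames, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats, orig_indices):
        # feats: (B, k, D) sampled features; orig_indices: (B, k) source positions
        x = feats + self.pos(orig_indices)
        attn_out, _ = self.attn(x, x, x)
        return self.norm(x + attn_out)   # temporally re-contextualized features
```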
To train the LGS module effectively despite the lack of timestamp annotations, targeted training data is crafted for three capabilities:
- Basic localization ability based on explicit description: A composite video is created by randomly selecting, shuffling, and merging video clips from the Video-ChatGPT dataset. The training sample is then defined as (V, Q1, A1), where V is the composite video and (Q1, A1) is the QA pair from clip1, used for video-captioning-style training (see the sketch after this list).
- Ability to capture frames that have causal relations with language hints: QA samples from the temporal subset of NExT-QA are used.
- Ability to distinguish key frames based on nuanced details in less varied social interactions: The TVQA dataset is used, and a self-refinement strategy pushes the model to recall and locate these subtle dynamics from its own memory.
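As a concrete illustration of the first capability, composite samples of this kind could be assembled as follows; the data layout and function name are assumptions made for the sketch, not the authors' pipeline.

```python
import random

def build_composite_sample(clips, num_clips=3):
    """Merge randomly chosen, shuffled clips into one composite video V and
    keep the QA pair of one designated clip ("clip1") as supervision.

    clips: list of dicts like {"frames": [...], "question": str, "answer": str}
    """
    chosen = random.sample(clips, num_clips)
    random.shuffle(chosen)
    composite_frames = [f for clip in chosen for f in clip["frames"]]
    # Treat one merged clip as "clip1": to answer (Q1, A1) the model must
    # first locate that clip's frames inside the shuffled composite.
    clip1 = chosen[0]
    return composite_frames, clip1["question"], clip1["answer"]
```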
GIFT: Grounded Inference Fine-Tuning
The goal of GIFT is to learn an effective understanding of the sampled visual features, which demands stronger abilities from the subsequent reasoning modules. To achieve this, a Social Traits Projector (STP) is integrated that learns to transform fundamental emotional traits from video, image, and audio into the language space.
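In the spirit of LLaVA-style projectors, the STP can be pictured as a learned linear map into the LLM's token space. The class below is a minimal sketch under that assumption; the names and dimensions are illustrative, not the released code.

```python
import torch.nn as nn

class SocialTraitsProjector(nn.Module):
    """Schematic STP: a linear map that lifts encoder features from each
    modality (video, image, audio) into the LLM's embedding space, where
    they can be consumed as soft tokens alongside text."""

    def __init__(self, feat_dim, llm_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats):
        # feats: (B, N, feat_dim) modality features -> (B, N, llm_dim)
        return self.proj(feats)
```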
The STP is initialized from the linear visual projector of Video-LLaVA, which provides a strong initial alignment between the visual and language modalities learned through captioning tasks, and is then fine-tuned with emotion recognition as a proxy task. For audio, mel-spectrogram features are extracted from the tracks and encoded with a Vision Transformer (ViT). Finally, the STP and the LLM are jointly fine-tuned on an extensive multimodal social interaction dataset, which supports joint reasoning and human-aligned answering.
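The audio path can be reproduced in a few lines with torchaudio; the exact spectrogram settings used by VEGAS are not stated, so the parameters below are reasonable defaults rather than the paper's configuration.

```python
import torchaudio

def audio_to_log_mel(path, sample_rate=16000, n_mels=128):
    """Convert an audio track into a log-mel spectrogram, a 2-D 'image'
    that a Vision Transformer can encode like any other visual input."""
    wav, sr = torchaudio.load(path)
    if sr != sample_rate:
        wav = torchaudio.functional.resample(wav, sr, sample_rate)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_mels=n_mels
    )(wav.mean(dim=0))                        # mixdown to mono: (n_mels, time)
    return torchaudio.transforms.AmplitudeToDB()(mel)
```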
Experimental Results and Analysis
The VEGAS framework has been evaluated on the Social-IQ-2.0 benchmark. The results show that VEGAS substantially suppresses language shortcuts and improves visual-context utilization, and its clear accuracy gains indicate that its reasoning is consistent with the answers it selects.
Open-ended QA examples demonstrate the effectiveness of VEGAS in understanding and reasoning about social interactions. VEGAS-generalist, which is trained on a more diverse dataset, also shows promising results.
Deep Learning for Parkinson's Disease Diagnosis
Deep learning models are also being applied in the medical field, particularly for the diagnosis of Parkinson's disease (PD). PD is a neurodegenerative disorder that affects millions of people worldwide. Early diagnosis is crucial for effective interventions, but it is often delayed due to a shortage of neurologists. Computer-aided diagnostic (CAD) tools based on deep learning can help automate the diagnosis of PD.
A systematic review of studies published between January 2011 and July 2021 identified 63 studies proposing deep learning models for the automated diagnosis of PD. These models draw on two broad groups of modalities: brain analysis (SPECT, PET, MRI, and EEG) and motor symptoms (gait, handwriting, speech, and EMG).
Brain Analysis
Deep learning models have been used to analyze brain images from SPECT, PET, MRI, and EEG to detect PD. These models can identify patterns and biomarkers that are indicative of the disease.
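As a generic illustration of this family of models (not any specific reviewed study), a volumetric scan classifier might look like a small 3-D CNN:

```python
import torch.nn as nn

class ScanClassifier(nn.Module):
    """Minimal 3-D CNN for PD vs. control classification on volumetric
    scans (e.g. SPECT or MRI); an illustrative sketch only."""

    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_ch, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                     # x: (B, 1, D, H, W) volume
        return self.head(self.features(x).flatten(1))
```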
Motor Symptoms
Deep learning models have also been used to analyze motor signals such as gait, handwriting, speech, and EMG recordings to diagnose PD. These models can detect subtle changes in movement and speech that may be indicative of the disease.
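Likewise, a minimal time-series classifier for multichannel gait recordings might be a 1-D CNN; again, this illustrates the general approach rather than a specific published model.

```python
import torch.nn as nn

class GaitClassifier(nn.Module):
    """Minimal 1-D CNN over multichannel gait sensor signals."""

    def __init__(self, channels=16, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):                     # x: (B, channels, time)
        return self.net(x)
```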
Advantages of Deep Learning
Deep learning models offer several advantages over traditional machine learning models for PD diagnosis. They can learn directly from high-dimensional data without manual feature extraction and selection, which removes the dependence on expert feature engineering and makes the models more accessible to healthcare professionals.
Limitations and Future Directions
Despite the promising results, there are still limitations to the use of deep learning models for PD diagnosis. These include the need for large datasets, the lack of interpretability of the models, and the potential for bias. Future research should focus on addressing these limitations and developing more robust and reliable deep learning models for PD diagnosis.
VEGAS Deep Learning Models Software
Unrelated to the research framework above, VEGAS Deep Learning Models is a Windows program developed by MAGIX Software GmbH that accompanies its VEGAS video-editing software for editing and analysis tasks. It can be uninstalled through the standard Windows uninstall process or with specialized uninstallers such as Advanced Uninstaller PRO.

