Ali Alman & Carnegie Mellon University: Advancing Neural Information Processing
Carnegie Mellon University (CMU) consistently stands at the forefront of artificial intelligence and machine learning research. Its contributions are exemplified by the many papers it presents at leading venues such as the Conference on Neural Information Processing Systems (NeurIPS). At the 38th NeurIPS conference in 2024, CMU presented 194 papers, showcasing the breadth and depth of its research. This article delves into some of the research areas explored in these papers, highlighting key advancements across several subfields of machine learning.
Stylus: Composing Adapters for Customized Image Generation
Generating high-fidelity, customized images can be computationally expensive, often requiring scaling base models with more data or parameters. An alternative is to use fine-tuned adapters: small modules that adapt a base model to a specific task or style. The open-source community has amassed a vast collection of these adapters, exceeding 100,000, but many lack clear descriptions and are highly customized, which hinders their effective use.
To tackle this challenge, CMU researchers introduced Stylus, a system designed to match prompts with relevant adapters and automatically compose them for better image generation. Stylus builds upon the idea of combining multiple adapters and uses a three-stage process:
- Summarizing adapters: Improving descriptions and embeddings of adapters.
- Retrieving relevant adapters: Matching prompts with suitable adapters.
- Composing adapters: Combining adapters based on prompt keywords to ensure a strong match.
The authors also presented StylusDocs, a curated dataset of 75,000 adapters with pre-computed embeddings, for evaluation purposes. Stylus represents a significant step toward making the vast repository of adapters more accessible and usable for customized image generation.
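The three-stage flow can be sketched in miniature. Everything below (the adapter names, the toy three-dimensional embeddings, the composition rule) is hypothetical and only illustrates the retrieve-then-compose pattern, not the Stylus implementation:

```python
# Illustrative sketch of a Stylus-like retrieve-and-compose flow.
# All names and embeddings are invented for this example.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Stage 1 (summarize): each adapter carries an improved description embedding.
adapters = {
    "watercolor-style": [0.9, 0.1, 0.0],
    "cyberpunk-city":   [0.1, 0.9, 0.2],
    "portrait-detail":  [0.2, 0.2, 0.9],
}

def retrieve(prompt_embedding, top_k=2):
    # Stage 2 (retrieve): rank adapters by similarity to the prompt.
    ranked = sorted(adapters.items(),
                    key=lambda kv: cosine(prompt_embedding, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

def compose(names):
    # Stage 3 (compose): merge the selected adapters for generation.
    return " + ".join(sorted(names))

prompt = [0.8, 0.0, 0.3]  # stand-in for an embedded user prompt
selected = retrieve(prompt)
print(compose(selected))  # the two best-matching adapters, composed
```

The stage-3 "composition" here is just string joining; in the real system it would merge adapter weights at generation time.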
Federated Q-Learning: Balancing Sample and Communication Complexity
Federated Q-learning involves multiple agents collaboratively learning the optimal Q-function of an unknown Markov decision process. This paper studies the fundamental limits of that setting and proposes a novel algorithm that optimizes both sample complexity and communication cost.
The authors first established a fundamental limitation: any federated Q-learning algorithm that achieves linear speedup in sample complexity relative to the number of agents must incur a communication cost of at least Ω(1/(1−γ)), where γ is the discount factor. They then introduced a new algorithm, Fed-DVR-Q, the first to achieve optimal sample complexity and communication complexity simultaneously. This result paves the way for more efficient and scalable federated reinforcement learning.
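The setting itself can be illustrated with a minimal sketch: several agents run ordinary Q-learning on the same toy MDP and periodically average their Q-tables, which is the communication step whose cost the paper bounds. This is the generic federated-averaging scheme, not Fed-DVR-Q; the MDP and all hyperparameters are invented for illustration.

```python
# Minimal federated Q-learning sketch: local updates, then averaging.
import random

GAMMA, ALPHA = 0.9, 0.5
N_STATES, N_ACTIONS = 2, 2

def step(s, a):
    # Toy deterministic MDP: action 1 in state 0 pays off and moves to state 1.
    reward = 1.0 if (s == 0 and a == 1) else 0.0
    return (s + a) % N_STATES, reward

def local_update(q, rng, n_steps=50):
    s = 0
    for _ in range(n_steps):
        a = rng.randrange(N_ACTIONS)          # exploratory behavior policy
        s2, r = step(s, a)
        target = r + GAMMA * max(q[s2])
        q[s][a] += ALPHA * (target - q[s][a])
        s = s2
    return q

def federated_round(qs, seed=0):
    rng = random.Random(seed)
    qs = [local_update(q, rng) for q in qs]
    # Communication step: average all agents' tables entry-wise.
    avg = [[sum(q[s][a] for q in qs) / len(qs) for a in range(N_ACTIONS)]
           for s in range(N_STATES)]
    return [[row[:] for row in avg] for _ in qs]

agents = [[[0.0] * N_ACTIONS for _ in range(N_STATES)] for _ in range(3)]
for t in range(20):
    agents = federated_round(agents, seed=t)
print(agents[0][0][1] > agents[0][0][0])  # action 1 in state 0 is preferred
```

Each averaging round is one unit of communication; the paper's lower bound says such rounds cannot be made too rare without losing the linear speedup.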
Aligner-Encoder: Simplifying Automatic Speech Recognition
Traditional automatic speech recognition (ASR) models often involve complex alignment processes between audio input and text output. This paper introduces a new transformer-based approach that simplifies this alignment process. The proposed “Aligner-Encoder” model combines efficient training techniques and a lightweight decoder, resulting in significantly faster performance while maintaining competitive accuracy.
Unlike traditional models, the encoder itself aligns audio information internally, reducing the complexity of decoding. This innovative approach streamlines the ASR process, making it more efficient and practical for real-world applications.
Streaming Algorithms for Top Eigenvector Approximation
This work focuses on streaming algorithms for approximating the top eigenvector of a matrix when its rows are presented in a random order. The authors introduce a new algorithm that works efficiently when there is a sufficient gap between the largest and second-largest eigenvalues of the matrix. Their approach uses a small amount of memory, depending on the number of “heavy rows” (rows with large norms), and produces highly accurate results. This has implications for various applications, including data analysis and dimensionality reduction.
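A classic baseline for this setting is Oja's rule, which maintains a single d-dimensional vector while streaming over rows. The sketch below shows that baseline on synthetic rows in random order; the paper's algorithm, with its heavy-row bookkeeping, is more involved, so this only conveys the streaming, small-memory flavor of the problem.

```python
# Streaming top-eigenvector estimation via Oja's rule (baseline sketch).
import math, random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def oja(rows, eta=0.01):
    w = normalize([1.0, 1.0])
    for x in rows:                       # one pass; O(d) memory
        dot = sum(a * b for a, b in zip(x, w))
        w = normalize([a + eta * dot * b for a, b in zip(w, x)])
    return w

# Rows drawn so the covariance's top eigenvector is close to the x-axis,
# with a large gap between the top two eigenvalues.
rng = random.Random(0)
rows = [[rng.gauss(0, 3.0), rng.gauss(0, 0.5)] for _ in range(2000)]
rng.shuffle(rows)                        # random arrival order, as in the paper
w = oja(rows)
print(abs(w[0]) > 0.9)                   # aligned with the dominant direction
```

The eigengap assumption in the paper plays the same role it does here: the larger the gap, the faster a one-pass estimate locks onto the top direction.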
C-JEPA: Enhancing Unsupervised Visual Representation Learning
Recent advancements in unsupervised visual representation learning have highlighted the Joint-Embedding Predictive Architecture (JEPA) as an effective method for extracting visual features from unlabeled images using masking strategies. However, JEPA faces two key challenges: its reliance on Exponential Moving Average (EMA) fails to prevent model collapse, and its predictions struggle to accurately capture the average representation of image patches.
To address these issues, this work introduces C-JEPA, a new framework that combines JEPA with a variance-invariance-covariance regularization strategy called VICReg. This approach improves stability, prevents collapse, and ensures better learning of consistent representations. C-JEPA offers a more robust and reliable approach to unsupervised visual representation learning.
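The regularizer's three terms can be written down directly. The weights and toy embeddings below are illustrative, and this simplified version is a sketch of the VICReg idea rather than C-JEPA's exact formulation:

```python
# Simplified VICReg-style regularizer: invariance pulls paired embeddings
# together, variance keeps each dimension spread out (preventing collapse),
# covariance decorrelates dimensions. Weights are illustrative.
def vicreg_loss(za, zb, var_target=1.0, eps=1e-4):
    n, d = len(za), len(za[0])
    # Invariance: mean squared distance between paired embeddings.
    inv = sum(sum((a - b) ** 2 for a, b in zip(ra, rb))
              for ra, rb in zip(za, zb)) / n
    def var_cov(z):
        means = [sum(row[j] for row in z) / n for j in range(d)]
        centered = [[row[j] - means[j] for j in range(d)] for row in z]
        # Variance: hinge on each dimension's standard deviation.
        var = sum(max(0.0, var_target -
                      (sum(c[j] ** 2 for c in centered) / (n - 1) + eps) ** 0.5)
                  for j in range(d)) / d
        # Covariance: squared off-diagonal entries of the covariance matrix.
        cov = sum((sum(c[i] * c[j] for c in centered) / (n - 1)) ** 2
                  for i in range(d) for j in range(d) if i != j) / d
        return var, cov
    va, ca = var_cov(za)
    vb, cb = var_cov(zb)
    return 25.0 * inv + 25.0 * (va + vb) + 1.0 * (ca + cb)

# Well-spread embeddings incur less loss than collapsed (identical) ones.
spread = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
collapsed = [[0.5, 0.5]] * 4
print(vicreg_loss(spread, spread) < vicreg_loss(collapsed, collapsed))
```

The variance hinge is what directly penalizes the collapse mode that EMA alone fails to prevent in plain JEPA.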
CooHOI: Cooperative Human-Object Interaction for Humanoid Robots
Enabling humanoid robots to collaborate on tasks like moving large furniture requires coordination between multiple robots. Existing methods struggle due to a lack of motion capture data for multi-humanoid collaboration and the inefficiency of training multiple agents together.
To overcome this, the authors introduce Cooperative Human-Object Interaction (CooHOI), a framework that uses a two-phase learning approach: first, individual humanoids learn object interaction skills from human motion data, and then they learn to work together using multi-agent reinforcement learning. By focusing on shared object dynamics and decentralized execution, the robots achieve coordination through implicit communication. CooHOI represents a significant step toward enabling more effective and natural collaboration between humanoid robots.
DiffTORI: Differentiable Trajectory Optimization for Reinforcement Learning
This paper presents DiffTORI, a framework that uses differentiable trajectory optimization as a policy representation for reinforcement and imitation learning. Trajectory optimization, a common tool in control, is parameterized by a cost and a dynamics function, and recent advances now allow gradients of the loss to be computed with respect to these parameters.
This enables DiffTORI to learn cost and dynamics functions end-to-end, addressing the “objective mismatch” in previous model-based RL methods by aligning the dynamics model with task performance. DiffTORI offers a more principled and effective approach to reinforcement learning by leveraging the power of differentiable trajectory optimization.
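The key mechanism, backpropagating a task loss through the argmin of an inner optimization, can be shown on a one-dimensional toy problem where the inner argmin has a closed form. This is a sketch of the idea only, not the paper's implementation:

```python
# Toy version of differentiable trajectory optimization: the policy's action
# is the argmin of a learned cost, and the imitation loss is backpropagated
# through that inner optimization. The quadratic inner problem makes the
# argmin and its derivative available in closed form.

def inner_argmin(theta):
    # Inner "trajectory optimization": a* = argmin_a (a - theta)^2 = theta.
    return theta

def outer_loss(theta, a_expert):
    a_star = inner_argmin(theta)
    return (a_star - a_expert) ** 2

def grad_outer(theta, a_expert):
    # d a*/d theta = 1 for this quadratic, so the chain rule gives:
    return 2.0 * (inner_argmin(theta) - a_expert)

theta, a_expert, lr = 0.0, 3.0, 0.25
for _ in range(30):
    theta -= lr * grad_outer(theta, a_expert)
print(round(theta, 3))  # the cost parameter moves so a*(theta) imitates the expert
```

In DiffTORI the inner problem is a full trajectory optimization over a learned cost and dynamics, and the derivative of the argmin comes from differentiable optimization machinery rather than a hand-written formula, but the end-to-end flow of gradients is the same.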
Run-Length Tokenization: Efficient Video Transformer Training
Video transformers are notoriously slow to train due to the large number of input tokens, many of which are repeated across frames. Existing methods to remove redundant tokens often introduce significant overhead or require dataset-specific tuning, limiting their practicality.
This work introduces Run-Length Tokenization (RLT), a simple and efficient method inspired by run-length encoding, which identifies and removes repeated patches in video frames before inference. By replacing repeated patches with a single token and a positional encoding to reflect its duration, RLT reduces redundancy without requiring tuning or adding significant computational cost. RLT significantly accelerates video transformer training without sacrificing accuracy.
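The idea can be sketched in a few lines: track each patch position across frames and emit one token per run of unchanged values. The sketch records run lengths explicitly, whereas the paper folds duration into the positional encoding; the frames and threshold here are toy stand-ins.

```python
# Run-length tokenization sketch: a patch repeated across consecutive
# frames collapses to one token plus a run length.
def run_length_tokenize(frames, threshold=0.0):
    # frames: list of frames, each a list of patch values (pixel stand-ins).
    tokens = []  # (start_frame, patch_index, value, run_length)
    n_patches = len(frames[0])
    for p in range(n_patches):
        start, value = 0, frames[0][p]
        for t in range(1, len(frames)):
            if abs(frames[t][p] - value) > threshold:   # patch changed
                tokens.append((start, p, value, t - start))
                start, value = t, frames[t][p]
        tokens.append((start, p, value, len(frames) - start))
    return tokens

# Static background (patch 0) collapses to one token; moving patch 1 does not.
frames = [[7, 1], [7, 2], [7, 3], [7, 3]]
toks = run_length_tokenize(frames)
print(len(toks), "tokens instead of", len(frames) * len(frames[0]))
```

The saving grows with the amount of static content, which is why RLT helps most on videos with largely unchanging backgrounds.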
In-Context Abstraction Learning: Improving Task Examples for LLMs and VLMs
This work introduces In-Context Abstraction Learning (ICAL), a method that enables large-scale language and vision-language models (LLMs and VLMs) to generate high-quality task examples from imperfect demonstrations. ICAL uses a vision-language model to analyze and improve inefficient task trajectories by abstracting key elements like causal relationships, object states, and temporal goals, with iterative refinement through human feedback.
These improved examples, when used as prompts, enhance decision-making and reduce reliance on human input over time, making the system more efficient. ICAL represents a promising approach to improving the performance and efficiency of LLMs and VLMs.
Place3D: Optimizing LiDAR Placement for Reliable Driving Perception
This work focuses on improving the reliability of driving perception systems under challenging and unexpected conditions, particularly with multi-LiDAR setups. Most existing datasets rely on single-LiDAR systems and are collected in ideal conditions, making them insufficient for real-world applications.
To address this, the authors introduce Place3D, a comprehensive pipeline that optimizes LiDAR placement, generates data, and evaluates performance. Their approach includes three key contributions: a new metric called the Surrogate Metric of the Semantic Occupancy Grids (M-SOG) for assessing multi-LiDAR configurations, an optimization strategy to improve LiDAR placements based on M-SOG, and the creation of a 280,000-frame dataset capturing both clean and adverse conditions. Place3D provides valuable tools and resources for developing more robust and reliable driving perception systems.
Learn-To-be-Efficient: Training Efficient Large Language Models
Large Language Models (LLMs), while powerful, are computationally expensive. This paper explores how LLMs can be made more efficient by maximizing activation sparsity, where only a subset of model parameters is used during inference. The authors propose a novel training algorithm, Learn-To-be-Efficient (LTE), that encourages LLMs to activate fewer neurons, striking a balance between efficiency and performance. LTE offers a promising approach to reducing the computational cost of LLMs without sacrificing their capabilities.
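Activation sparsity itself is easy to illustrate with a fixed top-k gate. Note that LTE learns which neurons to activate rather than applying a fixed rule like this, so the sketch only shows the kind of sparsity being exploited:

```python
# Fixed top-k gate illustrating activation sparsity: only the k most active
# "neurons" in a layer contribute to the output, so the remaining neurons'
# downstream computation can be skipped at inference time.
def sparse_layer(activations, k):
    # Zero out all but the k largest-magnitude activations.
    kept = set(sorted(range(len(activations)),
                      key=lambda i: abs(activations[i]), reverse=True)[:k])
    return [a if i in kept else 0.0 for i, a in enumerate(activations)]

acts = [0.1, -2.0, 0.05, 1.5, -0.2, 0.3]
out = sparse_layer(acts, k=2)
print(out)                               # only the two strongest activations survive
print(sum(1 for a in out if a != 0.0))   # → 2
```

The efficiency gain comes from skipping the matrix rows attached to zeroed neurons, which is why learning *where* to be sparse, as LTE does, matters more than the gating mechanics.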
Learning Social Welfare Functions from Policymaker Decisions
This work explores whether it is possible to understand or replicate a policymaker’s reasoning by analyzing their past decisions. The problem is framed as learning social welfare functions from the family of power mean functions. Two learning tasks are considered: one uses utility vectors of actions and their corresponding social welfare values, while the other uses pairwise comparisons of welfares for different utility vectors. The authors demonstrate that power mean functions can be learned efficiently, even when the social welfare data is noisy. This research has implications for understanding and potentially automating policy decision-making.
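The power mean family is concrete enough to write down: W_p(u) = ((1/n) Σᵢ uᵢ^p)^(1/p), where the single parameter p interpolates between familiar notions of welfare. A small sketch:

```python
# The power mean welfare family: W_p(u) = ((1/n) * sum(u_i^p))^(1/p).
# p = 1 is the utilitarian average; p -> -inf approaches the Rawlsian minimum.
def power_mean_welfare(utilities, p):
    n = len(utilities)
    if p == 0:
        # Limit p -> 0 is the geometric mean (Nash welfare).
        prod = 1.0
        for u in utilities:
            prod *= u
        return prod ** (1.0 / n)
    return (sum(u ** p for u in utilities) / n) ** (1.0 / p)

u = [1.0, 4.0]
print(power_mean_welfare(u, 1))    # utilitarian mean: 2.5
print(power_mean_welfare(u, 0))    # geometric mean: 2.0
print(power_mean_welfare(u, -10))  # approaches min(u) as p decreases
```

Learning the function therefore reduces to estimating p (and any per-agent weights) from the observed welfare values or pairwise comparisons, which is what makes the problem tractable even with noise.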
Group Representation Theory for Machine Learning
Authors: Timothy Chu, Josh Alman, Gary L. The authors introduce a linear-algebraic tool based on group representation theory and apply it to important problems in machine learning. First, they investigate fast attention algorithms for large language models and prove that only low-degree polynomials can produce the low-rank matrices required for subquadratic attention, showing that polynomial-based approximations are essential. Second, they extend the classification of positive definite kernels from Euclidean distances to Manhattan distances, offering a broader foundation for kernel methods. This work demonstrates the power of such mathematical tools in advancing machine learning research.
Differentially Private Learning of Gaussian Mixtures
This work examines the problem of learning mixtures of Gaussians while ensuring approximate differential privacy. The authors demonstrate that it is possible to learn a mixture of k arbitrary d-dimensional Gaussians with significantly fewer samples than previous methods, achieving optimal performance when the dimensionality d is much larger than the number of components k. For univariate Gaussians, they establish the first optimal bound, showing that the sample complexity scales linearly with k, improving upon earlier methods that required a quadratic dependence on k. This research contributes to the development of privacy-preserving machine learning techniques.
Sequoia: Scalable Speculative Decoding for Large Language Models
As the use of large language models (LLMs) increases, serving them quickly and efficiently has become a critical challenge. Speculative decoding offers a promising solution, but existing methods struggle to scale with larger workloads or adapt to different settings. This paper introduces Sequoia, a scalable and robust algorithm for speculative decoding. By employing a dynamic programming algorithm, Sequoia optimizes the tree structure for speculated tokens, improving scalability. It also introduces a novel sampling and verification method that enhances robustness across various decoding temperatures. Sequoia represents a significant advancement in efficient LLM serving.
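The base speculative-decoding loop can be sketched with two hypothetical models. Real systems verify all drafted tokens in one batched forward pass and use probabilistic acceptance, and Sequoia further replaces the single draft run with an optimized tree of speculated tokens, so this greedy version only conveys the core idea:

```python
# Greatly simplified speculative decoding (greedy variant): a cheap draft
# model proposes a run of tokens and the target model's choices determine
# the longest accepted prefix. Both "models" below are invented toys.
def draft_model(context, k):
    # Hypothetical cheap model: guesses the next k tokens by counting upward.
    return [(context[-1] + 1 + i) % 10 for i in range(k)]

def target_model(context):
    # Hypothetical expensive model: ground-truth next token.
    return (context[-1] + 1) % 10

def speculative_step(context, k=4):
    proposed = draft_model(context, k)
    accepted = []
    for tok in proposed:
        # A real system verifies all k tokens in one batched forward pass;
        # calling target_model per token here is just for clarity.
        if tok == target_model(context + accepted):
            accepted.append(tok)          # draft agreed with the target
        else:
            break
    if len(accepted) < k:                 # on mismatch, take the target's token
        accepted.append(target_model(context + accepted))
    return accepted

out = speculative_step([3])
print(out)  # several tokens per target-model pass instead of one
```

Because the toy draft model always agrees with the target here, all four speculated tokens are accepted in a single step; Sequoia's tree structure and sampling scheme are about keeping that acceptance rate high when the models genuinely differ.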
The Impact of Data Corruption on Diffusion Model Training
Diffusion models have demonstrated impressive capabilities in generating high-quality images, audio, and videos, largely due to pre-training on large datasets that pair data with conditions, such as image-text or image-class pairs. However, even with careful filtering, these datasets often include corrupted pairs where the conditions do not accurately represent the data. This paper provides the first comprehensive study of how such corruption affects diffusion model training. By synthetically corrupting datasets like ImageNet-1K and CC3M, the authors show that slight corruption in pre-training data can surprisingly enhance image quality, diversity, and fidelity across various models. They also provide theoretical insights, demonstrating that slight condition corruption increases entropy and reduces the 2-Wasserstein distance to the ground truth distribution. This research sheds light on the surprising robustness of diffusion models to data corruption.
Generalization Bounds for Large Language Models via Martingales
Large language models (LLMs) with billions of parameters are highly effective at predicting the next token in a sequence. While recent research has computed generalization bounds for these models using compression-based techniques, these bounds often fail to apply to billion-parameter models or rely on restrictive methods that produce low-quality text. Existing approaches also tie the tightness of bounds to the number of independent documents in the training set, ignoring the larger number of dependent tokens, which could offer better bounds. This work uses properties of martingales to derive generalization bounds that leverage the vast number of tokens in LLM training sets. This research provides a more refined understanding of the generalization capabilities of large language models.
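The standard concentration tool behind martingale-based bounds is an inequality of the Azuma–Hoeffding type, shown below as a sketch of the kind of statement involved (the paper's actual bounds differ in form):

```latex
% Azuma–Hoeffding: if (X_i) is a martingale with bounded differences
% |X_i - X_{i-1}| \le c_i, then for every t > 0,
P\bigl( |X_n - X_0| \ge t \bigr) \le 2 \exp\!\left( - \frac{t^2}{2 \sum_{i=1}^{n} c_i^2} \right)
```

Treating the token sequence as a dependent process controlled by such martingale differences is what lets the bound scale with the number of tokens n rather than the much smaller number of independent documents.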

