Deep Reinforcement Learning for Recommender Systems: A Comprehensive Tutorial

Recommender systems have become increasingly vital in today's information-saturated world, playing a crucial role in mitigating the problem of information overload, particularly in user-oriented online services. These systems aim to identify a set of objects (i.e., items) that best match users’ explicit or implicit preferences by utilizing user and item interactions to improve matching accuracy. With the rapid advancement of deep neural networks (DNNs) in recent decades, recommendation techniques have achieved promising performance. However, most existing DNN-based methods suffer from certain drawbacks in practice, which advanced techniques like Deep Reinforcement Learning (DRL), Automated Machine Learning (AutoML), and Graph Neural Networks (GNNs) aim to address.

The Growing Importance of Recommender Systems

In today's digital landscape, users are overwhelmed with an abundance of online content. Recommender systems help users navigate this vast sea of information by understanding their preferences, past decisions, and characteristics through the analysis of interaction data such as impressions, clicks, likes, and purchases.

Consider these scenarios:

  • Netflix: By one estimate, a user wanting to watch every series on Netflix would need about 10.27 years of continuous viewing, assuming no breaks and watching 24 hours a day.
  • In general: users need relevant suggestions so they can find what they want without wasting time.

Recommender systems provide personalized suggestions, recommending favorite songs, desired products, or web series to watch. These platforms use machine learning models to generate relevant recommendations for each user.

Feedback Data in Recommender Systems

Recommender systems rely on feedback data, such as ratings, clicks, likes, or dislikes, to learn user preferences. This feedback can be categorized into two types:


  • Explicit Feedback: Direct user ratings indicating how much they like something, such as star ratings for products or thumbs up/down for videos.
  • Implicit Feedback: User actions like purchases, browsing history, or listening habits. This feedback is abundant and easy to collect but can be less precise and potentially noisy.

This feedback is used to build a user-item rating matrix with entries r_ui, which are numerical values (for explicit feedback) or boolean values (for implicit feedback), indicating the interaction between user u and item i.
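As a minimal sketch, such a matrix can be built from an interaction log; the log below and the matrix dimensions are hypothetical:

```python
import numpy as np

# Hypothetical interaction log: (user_id, item_id, rating) triples.
feedback = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0)]

n_users, n_items = 3, 3
R = np.zeros((n_users, n_items))      # explicit ratings r_ui
for u, i, r in feedback:
    R[u, i] = r

# Implicit view: boolean matrix marking any observed interaction.
B = (R > 0).astype(int)
```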

Types of Recommendation Systems

Several types of recommendation systems exist, each with its approach and techniques:

Content-Based Recommender Systems

Content-based systems focus on the characteristics of items and users’ preferences as expressed through their interactions with those items. For example, a movie recommendation system might tag movies with genres like "action" or "comedy," and users are profiled based on their personal details or previous interactions with movies. The system then recommends items by matching the attributes of items with users’ preferences.

  • Item Profiles: Each item is described by a profile that captures its essential characteristics, such as cast, director, release year, and genres for movies.
  • User Profiles: User profiles are built from data on how users have interacted with items, summarizing a user’s likes and dislikes based on past behavior.
  • Similarity Measurement: Cosine similarity calculates the cosine of the angle between the item profile vector and the user profile vector to find similar items.

These systems use similarity search to find the feature vectors nearest a given input, producing relevant recommendations. Term Frequency-Inverse Document Frequency (TF-IDF) weighting measures the importance of a word in a document, helping characterize users and items.
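A minimal sketch of content-based scoring with cosine similarity; the genre features and profile vectors below are made up for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = a.b / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical genre features: [action, comedy, drama]
user_profile = np.array([0.9, 0.1, 0.4])   # aggregated from liked movies
item_a = np.array([1.0, 0.0, 0.5])         # an action/drama movie
item_b = np.array([0.0, 1.0, 0.1])         # a comedy

# Rank items by similarity to the user profile.
scores = {name: cosine_similarity(user_profile, v)
          for name, v in [("item_a", item_a), ("item_b", item_b)]}
```

The action/drama movie scores higher because its profile vector points in nearly the same direction as this user's profile.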

Collaborative Filtering (CF)

Collaborative filtering predicts a user’s preferences by using the collective knowledge and behaviors of a large pool of users. CF systems can be classified based on various approaches:


  • Memory-Based CF: Uses the entire user-item interaction matrix to make direct recommendations based on similarities between users or items. It is straightforward but can struggle with large, sparse matrices and generally deals with implicit feedback.
  • Model-Based CF: Uses machine learning models to predict interactions between users and items. Techniques like matrix factorization, clustering, SVD, and deep learning are used to learn latent features from the data, improving prediction accuracy.

There are two main types of collaborative filtering:

  • User-Based CF: Recommends items to a user based on the preferences of similar users in the database.
  • Item-Based CF: Finds similar items to recommend based on the user’s past preferences.

K-Nearest Neighbors (KNN)

To make a recommendation, KNN identifies the “neighbors” of a user or item, which are most similar based on certain features or past behavior. For example, KNN might compare users based on their movie ratings. The process involves computing the similarity between users or items and selecting the top ‘k’ most similar neighbors.
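A user-based KNN sketch using cosine similarity over rating vectors; the ratings matrix is hypothetical:

```python
import numpy as np

def top_k_neighbors(target, others, k):
    """Return indices of the k rows of `others` most similar to `target`
    by cosine similarity, plus the similarity scores."""
    sims = others @ target / (
        np.linalg.norm(others, axis=1) * np.linalg.norm(target) + 1e-12)
    return np.argsort(sims)[::-1][:k], sims

# Hypothetical ratings over 4 movies (0 = unrated).
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user 0
    [4.0, 5.0, 1.0, 1.0],   # user 1 (similar tastes to user 0)
    [1.0, 0.0, 5.0, 4.0],   # user 2 (opposite tastes)
])
# Find user 0's nearest neighbor among the other users.
neighbors, sims = top_k_neighbors(ratings[0], ratings[1:], k=1)
```

The neighbors' ratings on items the target user has not seen can then be aggregated (e.g., similarity-weighted averages) to produce recommendations.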

Session-Based and Sequence-Based Recommender Systems

Session-based and sequence-based recommender systems predict the next user action by analyzing past interactions. Session-based systems focus on the current session’s actions, while sequence-based systems use the order of all past interactions. These systems face challenges like varying session lengths, action types, and user anonymity. Conventional methods include K-nearest neighbors and Markov chains, while advanced approaches use deep learning models like RNNs and attention mechanisms.

Advancements in Recommendation Systems

Over the years, recommender systems have seen numerous advancements, with matrix factorization being a key technique.

Matrix Factorization

Matrix factorization decomposes the user-item interaction matrix into two lower-dimensional matrices representing users and items. These matrices capture latent features that describe the preferences of users and the characteristics of items. These latent features are then used to predict user recommendations.


  • SVD (Singular Value Decomposition): Adapted for implicit feedback by weighting each observation with a confidence value. The confidence c_ui is defined as c_ui = 1 + α·t_ui, where t_ui represents how much of movie i user u has watched and α is a constant.
  • Probabilistic Matrix Factorization: For explicit feedback, a linear model represents user-item interactions, and the algorithm learns latent vectors for users and items by minimizing a regularized mean squared error (MSE) loss over known ratings. Stochastic gradient descent (SGD) and alternating least squares (ALS) are common optimization methods.
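The SGD variant can be sketched on a toy set of ratings; the data, latent dimension, and hyperparameters below are illustrative, not tuned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed explicit ratings as (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]
n_users, n_items, k = 3, 3, 2

P = 0.5 * rng.standard_normal((n_users, k))   # user latent vectors
Q = 0.5 * rng.standard_normal((n_items, k))   # item latent vectors
lr, reg = 0.01, 0.02

for epoch in range(2000):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                   # prediction error on r_ui
        P[u] += lr * (err * Q[i] - reg * P[u])  # SGD step on regularized MSE
        Q[i] += lr * (err * P[u] - reg * Q[i])

pred = float(P[0] @ Q[0])   # should approach the observed rating of 5.0
```

ALS instead alternates closed-form least-squares solves for P with Q fixed and vice versa, which parallelizes better on large matrices.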

The SVD++ algorithm can handle both explicit and implicit feedback simultaneously, which is useful because users often interact with many items but rate only a few.

Logistic Regression

Logistic regression is a commonly used linear model in recommendation systems, especially for predicting click-through rates (CTR). This model predicts the probability of an event occurring, with values between 0 and 1, and can use side information such as user demographics, past behaviors, item attributes, and contextual information, helping to address the cold start problem.

Traditional logistic regression assigns a weight to each feature but does not consider how features interact with each other. Additional techniques like the degree-2 polynomial (Poly2) model and Factorization Machines (FMs) are used to capture these interactions. Field-aware Factorization Machines (FFMs) extend the basic FM model by grouping features into different fields, allowing the model to capture more nuanced interactions between features from different fields.
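The degree-2 FM score can be sketched as follows; the `fm_score` helper and its inputs are illustrative, and the pairwise term uses the standard O(nk) reformulation:

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Degree-2 Factorization Machine:
    y = w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j,
    with the pairwise term computed in O(nk) via
    0.5 * sum_f [(sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2]."""
    linear = w0 + float(w @ x)
    s = V.T @ x                        # shape (k,): sum_i x_i V_i
    pairwise = 0.5 * float(np.sum(s ** 2) - np.sum((V ** 2).T @ (x ** 2)))
    return linear + pairwise
```

Because the interaction weight for a feature pair is a dot product of two learned vectors, FMs can estimate interactions between pairs that never co-occur in training; FFMs extend this by keeping a separate vector per field.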

Deep Learning Models in Recommendation Systems

Deep learning (DL) models have significantly advanced recommender systems by leveraging vast amounts of data and complex architectures. Unlike traditional machine learning methods, DL models improve as more data is introduced, increasing accuracy and flexibility, making them ideal for personalized recommendations.

Training and Inference Phases

The training phase involves teaching the model to predict user-item interaction probabilities by using historical data on user interactions. Techniques like backpropagation and gradient descent are used to optimize the model. In the inference phase, the trained model predicts new user-item interactions through candidate generation, candidate ranking, and filtering.

Embeddings

Embeddings are a core component of DL recommender systems, transforming high-cardinality categorical data into dense, lower-dimensional vector representations. The model captures similarities between entities, such as users and items, in this shared vector space: users with similar preferences end up with similar embedding vectors.
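As an illustration, an embedding table is just a matrix indexed by entity ID; the rows here are random, whereas in a real model they are learned by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, dim = 4, 8

# One dense row per user ID (random here; trained in practice).
user_embeddings = rng.standard_normal((n_users, dim))

def embed(user_id):
    # A categorical ID becomes a dense vector by row lookup.
    return user_embeddings[user_id]

def similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```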

Network Architectures and Models

DL recommender systems utilize various network architectures, including feedforward neural networks, multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs). Popular models include Neural Collaborative Filtering, Variational Autoencoders, Google’s Wide and Deep model, and Meta’s Deep Learning Recommendation Model.

  • Neural Collaborative Filtering (NCF): Generalizes matrix factorization by replacing the inner product with a multi-layer neural network, adding non-linearity to collaborative filtering.
  • Variational Autoencoders for Collaborative Filtering (VAE-CF): An encoder-decoder network that reconstructs a user’s interaction vector from a learned latent representation of their preferences.
  • Wide & Deep: A class of networks with two parts working in parallel (a wide linear model and a deep model) whose outputs are summed to create an interaction probability.
  • Deep Learning Recommendation Model (DLRM): Computes the feature interaction explicitly while limiting the order of interaction to pairwise interactions.
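DLRM's pairwise interaction step can be sketched as follows, assuming the embedded categorical features and the processed dense features are stacked into one matrix:

```python
import numpy as np

def pairwise_interactions(features):
    """DLRM-style interaction: dot products of all distinct pairs of
    feature vectors. `features` has shape (num_features, dim); returns
    the upper-triangular (i < j) dot products as a flat vector."""
    gram = features @ features.T
    iu = np.triu_indices(len(features), k=1)
    return gram[iu]

# Hypothetical: 3 embedded categorical features + 1 dense feature vector.
feats = np.arange(16, dtype=float).reshape(4, 4)
z = pairwise_interactions(feats)   # 4 choose 2 = 6 pairwise terms
```

The resulting interaction vector is concatenated with the dense features and fed to a final MLP that predicts the interaction probability.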

Reinforcement Learning for Recommender Systems

Most recommendation systems learn user preferences and item popularity from historical data and retrain models at periodic intervals. However, they are designed to maximize the immediate reward of making users click or purchase and don’t consider long-term rewards such as user activeness. Furthermore, they tend to adopt a greedy approach and overemphasize item popularity, neglecting to explore new items (i.e., cold-start problem). Reinforcement Learning (RL) can learn to optimize for long-term rewards, balance exploration and exploitation, and continuously learn online.

Contextual Bandits

Multi-armed bandits are a form of classical reinforcement learning that balances exploration and exploitation. Contextual bandits take it a step further by collecting and observing the context before each action and choosing actions based on the context. In recommendations and search, the context would be data about the customer and environment.
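As a minimal sketch, an epsilon-greedy contextual bandit with a linear reward estimate per action; the simulated environment and hyperparameters are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, d = 3, 2      # 3 candidate items, 2 context features
eps = 0.2

# One linear reward estimate per action, learned online.
W = np.zeros((n_actions, d))

def choose(context):
    # Explore with probability eps, otherwise exploit current estimates.
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(W @ context))

def update(action, context, reward, lr=0.1):
    # SGD step on squared error between predicted and observed reward.
    err = reward - W[action] @ context
    W[action] += lr * err * context

# Simulated environment (an assumption for illustration): each action's
# reward is linear in the context, and action 1 pays best when the
# first context feature is high.
true_W = np.array([[0.1, 0.0], [0.9, 0.0], [0.2, 0.5]])
for _ in range(3000):
    ctx = rng.random(d)
    a = choose(ctx)
    update(a, ctx, float(true_W[a] @ ctx))
```

Production systems typically use confidence-based exploration such as LinUCB or Thompson sampling instead of a fixed epsilon, but the observe-context, act, update loop is the same.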

Value-Based Methods

Value-based methods learn the optimal value function, which maps either a state to a value or a state-action pair to a value. Using the value function, the agent acts by choosing the action with the highest value in each state. Deep Q-Networks (DQN) are a common choice; some recommendation variants incorporate negative feedback by treating skipped items as negative signals about user preferences.
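The underlying Q-learning update can be sketched in tabular form on a toy problem (a DQN replaces this table with a neural network over state features; the dynamics below are invented):

```python
import numpy as np

# Tabular Q-learning on a toy 2-state, 2-action problem.
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))
gamma, lr = 0.9, 0.1

def step(state, action):
    # Hypothetical dynamics: action 1 in state 0 moves to state 1 with
    # reward 1; everything else returns to state 0 with reward 0.
    if state == 0 and action == 1:
        return 1, 1.0
    return 0, 0.0

rng = np.random.default_rng(0)
state = 0
for _ in range(5000):
    action = int(rng.integers(n_actions))    # pure exploration for the sketch
    next_state, reward = step(state, action)
    # Q-learning target: r + gamma * max_a' Q(s', a')
    td_target = reward + gamma * Q[next_state].max()
    Q[state, action] += lr * (td_target - Q[state, action])
    state = next_state
```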

Policy-Based Methods

Policy-based methods learn the policy function that maps the state to action directly, without having to learn Q-values. As a result, they perform better in continuous and stochastic environments and tend to be more stable, given a sufficiently small learning rate. REINFORCE is used for YouTube recommendations, where the input is the sequence of user historical interactions, and the output predicts the next action to take.
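A minimal REINFORCE sketch with a softmax policy over two actions and one-step episodes; the rewards and hyperparameters are illustrative, not from the YouTube system:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 2
theta = np.zeros(n_actions)   # logits of a softmax policy (single state)
lr = 0.1

def policy():
    z = np.exp(theta - theta.max())   # numerically stable softmax
    return z / z.sum()

# One-step episodes with hypothetical rewards; action 1 pays more.
rewards = np.array([0.2, 1.0])

for _ in range(2000):
    p = policy()
    a = int(rng.choice(n_actions, p=p))
    G = rewards[a]                # the episode return
    # For a softmax policy, grad log pi(a) = one_hot(a) - p.
    grad_log = -p
    grad_log[a] += 1.0
    theta += lr * G * grad_log    # REINFORCE update
```

Over training, probability mass shifts toward the higher-reward action; practical systems add a baseline to reduce the variance of this gradient estimate.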

Actor-Critic Methods

Actor-critic combines the best of value-based and policy-based methods by splitting the model into two: one for computing the action based on state and another to produce the Q-value of the state-action. The actor takes state as input and outputs the best action, while the critic evaluates the action by computing the value function and providing feedback to the actor.

NVIDIA Merlin: A Framework for Deep Recommender Systems

To meet the computational demands of large-scale deep learning recommender systems, NVIDIA introduced Merlin, a framework for deep recommender systems. NVIDIA teams have used it successfully in RecSys competitions. Merlin accelerates the deep learning recommendation life cycle, which splits into training and inference, and exploits data parallelism through columnar data processing, providing higher performance and cost savings.

  • NVTabular: A feature engineering and preprocessing library for recommender systems.
  • HugeCTR: A GPU-accelerated deep neural network training framework designed to distribute training across multiple GPUs and nodes.

Scalable Deep Learning Recommender Systems on Databricks

Scaling recommender systems can be challenging, especially with millions of users or thousands of products. Databricks offers essential components for data processing, feature engineering, model training, monitoring, governance, and serving, which can be combined to create a state-of-the-art recommender system.

A common approach to address scalability issues involves a two-stage process: an efficient "broad search" followed by a more computationally intensive "narrow search" on the most relevant items. Databricks uses Mosaic Streaming as the data loader and TorchDistributor as the orchestrator for distributed training.

Two Tower Model

The Two Tower model is an efficient architecture for large-scale recommender systems, comprising two parallel neural networks: the "query tower" for users and the "candidate tower" for products. Each tower processes its input to generate dense embeddings, and the model predicts user-item interactions by computing the similarity between these embeddings.
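A sketch of the architecture with single-layer towers; real towers stack several nonlinear layers, and all dimensions and weights here are illustrative and untrained:

```python
import numpy as np

rng = np.random.default_rng(0)
d_user, d_item, d_emb = 6, 8, 4

# Hypothetical, untrained tower weights (one linear layer each for brevity).
W_user = rng.standard_normal((d_user, d_emb))
W_item = rng.standard_normal((d_item, d_emb))

def query_tower(user_features):
    return np.tanh(user_features @ W_user)       # user embedding

def candidate_tower(item_features):
    return np.tanh(item_features @ W_item)       # item embedding

def score(user_features, item_features):
    # Interaction score = similarity of the two embeddings.
    return float(query_tower(user_features) @ candidate_tower(item_features))

u = rng.random(d_user)
items = rng.random((5, d_item))
scores = [score(u, it) for it in items]          # rank 5 candidate items
```

Because the two towers are independent, candidate embeddings can be precomputed and served from an approximate-nearest-neighbor index, which is what makes the broad-search stage cheap at serving time.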

Deep Learning Recommendation Model (DLRM)

The Deep Learning Recommendation Model (DLRM) by Meta is a sophisticated architecture designed for large-scale recommendation systems. It efficiently handles both categorical (sparse) and numerical (dense) features, using lookup tables to embed categorical features and processing these embeddings along with numerical features through a feature interaction layer.

Training Recommendation Models

Training recommendation models involves data preprocessing and loading with Mosaic Streaming, and parallelizing model training with TorchRec and the TorchDistributor. Mosaic Streaming optimizes the training process on large datasets stored in cloud environments, addressing challenges in distributed data loading. TorchRec, built on PyTorch, provides sparsity and parallelism primitives for large-scale recommender systems, while TorchDistributor facilitates distributed training with PyTorch on Databricks.

Logging with MLflow

MLflow is used to log key items, like model hyperparameters, metrics, and the model’s state_dict.
