Unsupervised Embedding Learning with BERT: A Comprehensive Tutorial

Introduction

In recent years, the field of Natural Language Processing (NLP) has witnessed rapid advancements, largely driven by sophisticated machine learning models like BERT (Bidirectional Encoder Representations from Transformers). BERT's ability to grasp the nuances of language has made a significant impact, even influencing the core algorithms behind Google Search. This article aims to provide a comprehensive understanding of BERT, exploring its architecture, training methodologies, and practical applications in unsupervised embedding learning.

Understanding BERT

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. Developed by Google AI in 2018, it represents a significant leap in NLP. BERT is designed to pre-train deep bidirectional representations from unlabeled text, conditioning on both the left and right contexts. This bidirectional approach allows BERT to capture contextual information more effectively than previous models that processed text in only one direction.

The Importance of Contextual Understanding

Language representation is what allows machines to grasp the meaning of text. Context-free models such as word2vec or GloVe generate a single embedding for each word in the vocabulary, so the word “crane” has the same representation in “crane in the sky” and in “crane to lift heavy objects.” Contextual models, in contrast, represent each word based on the other words in the sentence. BERT builds upon recent work and clever ideas in pre-training contextual representations, including Semi-supervised Sequence Learning, Generative Pre-Training (the OpenAI Transformer), ELMo, ULMFiT, and the Transformer.
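To make the context-free limitation concrete, here is a minimal sketch in plain Python. The two-dimensional embedding table is entirely made up for illustration; the point is only that a static lookup ignores the surrounding sentence:

```python
# A toy context-free embedding table (vectors are made up for illustration).
static_embeddings = {
    "crane": [0.7, -0.2],
    "sky": [0.1, 0.9],
    "lift": [0.4, 0.3],
}

def embed(word):
    """Context-free lookup: the sentence around the word is ignored."""
    return static_embeddings[word]

# The same vector comes back for both senses of "crane".
vec_in_sky_sentence = embed("crane")    # from "crane in the sky"
vec_in_lift_sentence = embed("crane")   # from "crane to lift heavy objects"
assert vec_in_sky_sentence == vec_in_lift_sentence
```

A contextual model like BERT would instead produce two different vectors for “crane”, one per sentence.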

BERT Architecture and Functionality

Transformer-Based Architecture

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based language model architecture. It consists of multiple layers of self-attention and feed-forward neural networks. A basic Transformer consists of an encoder that reads the text input and a decoder that produces a prediction for the task. Because BERT's goal is to produce a language representation model, only the encoder is needed.

Bidirectional Contextual Encoding

BERT utilizes a bidirectional approach to capture contextual information from preceding and following words in a sentence. Unlike its predecessors, which read text in one direction, BERT comprehends words in sentences by considering both their left and right context.


BERT's Training Process: Unsupervised Pre-training and Supervised Fine-tuning

BERT works by leveraging the power of unsupervised pre-training followed by supervised fine-tuning. During pre-training, the model is trained on a large dataset to extract patterns. During fine-tuning, the model is trained for downstream tasks such as classification, text generation, language translation, and question answering.

BERT is a self-supervised model: it generates inputs and labels from the raw corpus without explicit human labeling. Remember that the data it is trained on is unstructured. BERT was pre-trained on two tasks: Masked Language Modeling and Next Sentence Prediction. The former feeds the model masked input such as “the man [MASK] to the store” instead of “the man went to the store” and asks it to predict the missing word. Because the model must rely on the words on both sides of the mask to fill it in, it learns deeply bidirectional representations, making it flexible and reliable for many downstream tasks.
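A hedged sketch of how such masked inputs can be generated, in plain Python: it selects roughly 15% of tokens at random (the rate used in the BERT paper) and replaces them with [MASK]. The whitespace tokenization here is a simplification purely for demonstration; real BERT works on WordPiece tokens.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=None):
    """Replace roughly mask_rate of the tokens with [MASK]; return the
    masked sequence and the gold labels for the masked positions."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            labels[i] = tok  # the model is trained to predict this token
        else:
            masked.append(tok)
    return masked, labels

tokens = "the man went to the store".split()
masked, labels = mask_tokens(tokens, seed=1)
print(masked)   # ['[MASK]', 'man', 'went', 'to', 'the', 'store']
print(labels)   # {0: 'the'}
```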

During training, BERT uses special tokens such as [CLS], [MASK], and [SEP] that let it distinguish where a sequence begins, which words are masked, and where two sentences are separated.
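A minimal sketch (plain Python, naive whitespace tokenization instead of WordPiece) of how a sentence pair is packed with these special tokens:

```python
def build_input(sentence_a, sentence_b=None):
    """Pack one or two sentences into BERT's input format:
    [CLS] tokens_a [SEP] (tokens_b [SEP])."""
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # everything so far is Sentence A
    if sentence_b is not None:
        b = sentence_b.split() + ["[SEP]"]
        tokens += b
        segment_ids += [1] * len(b)          # Sentence B gets segment id 1
    return tokens, segment_ids

tokens, segments = build_input("my dog is cute", "he likes playing")
print(tokens)
# ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'playing', '[SEP]']
print(segments)
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```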

Input and Output Embeddings

The input to the BERT encoder is a stream of tokens that is first converted into vectors. Each token's final input embedding is the sum of three embeddings: a token embedding, a segment embedding marking whether the token belongs to Sentence A or Sentence B, and a position embedding encoding the token's position in the sequence.
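The three-way sum can be sketched in plain Python. The embedding tables below are tiny, two-dimensional, and entirely made up; real BERT uses learned 768-dimensional tables:

```python
# Made-up 2-dimensional embedding tables, for illustration only.
token_emb    = {"[CLS]": [0.1, 0.0], "hello": [0.5, 0.2], "[SEP]": [0.0, 0.1]}
segment_emb  = {0: [0.01, 0.01], 1: [0.02, 0.02]}    # Sentence A vs Sentence B
position_emb = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]]  # one entry per position

def input_embedding(tokens, segment_ids):
    """Each token's input vector is the element-wise sum of its token,
    segment, and position embeddings."""
    out = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        vec = [t + s + p for t, s, p in zip(
            token_emb[tok], segment_emb[seg], position_emb[pos])]
        out.append(vec)
    return out

vectors = input_embedding(["[CLS]", "hello", "[SEP]"], [0, 0, 0])
print(vectors)  # three 2-d vectors, one per input token
```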

Implementing BERT

Implementing BERT (Bidirectional Encoder Representations from Transformers) involves loading a pre-trained BERT model and fine-tuning it on a specific task. This includes tokenizing the text data, encoding the sequences, defining the model architecture, training the model, and evaluating its performance. BERT's implementation offers powerful language-modeling capabilities for natural language processing tasks such as text classification and sentiment analysis.


Practical Example: Spam Detection with BERT

The objective is to create a system that can classify SMS messages as spam or non-spam. This system aims to improve user experience and prevent potential security threats by accurately identifying and filtering out spam messages.

The problem: we have a collection of SMS messages, some of which are spam. Our goal is to create a system that can instantly determine whether or not an incoming text is spam.

Steps for Implementation

  1. Importing Libraries and Datasets: Imports the necessary libraries and datasets for the task at hand.
  2. Data Splitting: Dividing the dataset into train, validation, and test sets. The resulting sets, namely train_text, val_text, and test_text, are accompanied by their respective labels: train_labels, val_labels, and test_labels.
  3. Loading Pre-trained BERT Model and Tokenizer: The BERT-base pre-trained model is loaded using the AutoModel.from_pretrained() function from the Hugging Face Transformers library. The BERT tokenizer is loaded with BertTokenizerFast.from_pretrained(). The tokenizer is responsible for converting input text into tokens that BERT understands; for tokenization, BERT uses WordPiece.
  4. Tokenization and Encoding: We use the BERT tokenizer to tokenize and encode the sequences in the training, validation, and test sets. For uniformity in sequence length, a maximum length of 25 is established for each set. With pad_to_max_length=True, the sequences are padded or truncated accordingly.
  5. Conversion to Tensors: The tokenized sequences and corresponding labels are converted into PyTorch tensors. For each set (training, validation, and test), the tokenized input sequences are converted with torch.tensor(tokens_train['input_ids']), and the attention masks with torch.tensor(tokens_train['attention_mask']).
  6. Data Loaders: The creation of data loaders using PyTorch’s TensorDataset, DataLoader, RandomSampler, and SequentialSampler classes. We use the RandomSampler to randomly sample the training set, ensuring diverse data representation during training. To facilitate efficient iteration and batching of the data during training and validation, we employ the DataLoader.
  7. Defining the Model Architecture: The BERT_Arch class extends nn.Module and takes the BERT model as a parameter. By setting the BERT parameters not to require gradients (param.requires_grad = False), we ensure that only the parameters of the added layers are trained. The architecture consists of a dropout layer, a ReLU activation function, two dense layers (a 768-to-512 layer and a 512-to-2 output layer for the two classes), and a softmax activation function. We instantiate BERT_Arch with the pre-trained BERT model as an argument and move the model to the GPU by calling the to() method with the desired device (device) to leverage GPU acceleration.
  8. Optimizer Definition: The AdamW optimizer is imported from the Hugging Face Transformers library. The optimizer is defined by passing the model parameters (model.parameters()) and a learning rate (lr) of 1e-5 to the AdamW constructor. The class weights are converted to a tensor, moved to the GPU, and used to define a loss function weighted by class.
  9. Training Function: A training function iterates over batches of data, performs forward and backward passes, updates the model parameters, and computes the training loss. The gradients are clipped to 1.0, the model predictions are accumulated on the GPU, and the predictions are reshaped from (number of batches, batch size, number of classes) to (number of samples, number of classes).
  10. Evaluation Function: An evaluation function evaluates the model on the validation data. It computes the validation loss, stores the model predictions reshaped to (number of samples, number of classes), and returns the average loss and the predictions.
  11. Model Training Loop: To train the model for the specified number of epochs. It tracks the best validation loss, saves the model weights if the current validation loss is better, and appends the training and validation losses to their respective lists.
  12. Prediction on Test Data: To make predictions on the test data using the trained model and converts the predictions to NumPy arrays.
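The core of steps 7–9 can be sketched as follows. This is a hedged, self-contained PyTorch sketch: DummyBERT stands in for the frozen pre-trained encoder (it just returns a 768-dimensional pooled vector) so the snippet runs without downloading a checkpoint, torch.optim.AdamW stands in for the Transformers AdamW, and LogSoftmax is paired with NLLLoss so the loss is computed correctly over log-probabilities. With the real Hugging Face model, bert(input_ids, attention_mask=mask) would supply the pooled output instead.

```python
import torch
import torch.nn as nn

class DummyBERT(nn.Module):
    """Stand-in for the pre-trained encoder: maps token ids to a
    768-dimensional 'pooled' sentence vector."""
    def __init__(self, vocab_size=100, hidden=768):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
    def forward(self, input_ids, attention_mask=None):
        return self.emb(input_ids).mean(dim=1)   # (batch, 768)

class BERT_Arch(nn.Module):
    """Classification head on top of a frozen encoder, as in step 7."""
    def __init__(self, bert):
        super().__init__()
        self.bert = bert
        for param in self.bert.parameters():
            param.requires_grad = False          # freeze the encoder
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, 2)             # spam / not spam
        self.softmax = nn.LogSoftmax(dim=1)      # pairs with NLLLoss below
    def forward(self, input_ids, mask):
        pooled = self.bert(input_ids, attention_mask=mask)
        x = self.relu(self.fc1(pooled))
        x = self.dropout(x)
        return self.softmax(self.fc2(x))

model = BERT_Arch(DummyBERT())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.NLLLoss()

# One training step on a fake batch (batch of 4, sequence length 25).
input_ids = torch.randint(0, 100, (4, 25))
mask = torch.ones(4, 25, dtype=torch.long)
labels = torch.tensor([0, 1, 0, 1])

preds = model(input_ids, mask)                   # (4, 2) log-probabilities
loss = loss_fn(preds, labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradients to 1.0
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```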

Advantages of BERT

Contextual Understanding

BERT captures contextual information, allowing it to understand the meaning of words in different contexts. It handles polysemy (words with multiple meanings) and captures complex linguistic patterns, improving performance on various NLP tasks compared to traditional word embeddings.

Versatility

Fine-tuning BERT enables its application in various tasks, such as sequence labeling, text generation, text summarization, and document classification, among others. It has a wide range of applications beyond just text classification.

BERT vs. Traditional Language Models

Traditional language models, such as word2vec or GloVe, generate fixed-size word embeddings. In contrast, BERT generates contextualized word embeddings by considering the entire sentence context, allowing it to capture more nuanced meaning and context in language.

Read also: Hidden Patterns in Data

Applications of BERT

Real-World Applications

BERT’s versatility empowers its application to real-world problems across industries, encompassing customer sentiment analysis, chatbots, recommendation systems, and more.

Use Cases

BERT helps Google better surface (English) results for nearly all searches. Here is an example of how BERT helps Google better understand a specific search: pre-BERT, Google surfaced general information about getting a prescription filled. Post-BERT, Google understands that “for someone” in the query relates to picking up a prescription for someone else, and the search results now help answer that.

Challenges and Solutions

Addressing the Mismatch Between Pre-training and Fine-tuning

The authors noted, however, that since the [MASK] token only ever appears during pre-training and never in live data (at inference time), there would be a mismatch between pre-training and fine-tuning. To mitigate this, not all selected words are replaced with the [MASK] token: 80% of the time the word is replaced with [MASK], 10% of the time with a random word, and 10% of the time it is left unchanged.
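A sketch of this replacement rule in plain Python (an illustration only; real implementations operate on WordPiece ids rather than word strings):

```python
import random

def corrupt(token, vocab, rng):
    """Apply BERT's replacement rule to a token already chosen for masking:
    80% -> [MASK], 10% -> random word, 10% -> keep the original token."""
    r = rng.random()
    if r < 0.8:
        return "[MASK]"
    elif r < 0.9:
        return rng.choice(vocab)
    return token

rng = random.Random(0)
vocab = ["store", "man", "went", "dog"]
outcomes = [corrupt("went", vocab, rng) for _ in range(1000)]
print(outcomes.count("[MASK]") / 1000)   # roughly 0.8
```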

Sentence Relatedness with BERT

The richness of BERT's representations can be a double-edged sword. In experiments, BERT embeddings can be misleading when compared with conventional similarity metrics such as cosine similarity.
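For reference, cosine similarity between two embedding vectors can be computed as below (plain Python; in practice one would use numpy or torch). The caveat above is that a high cosine score between BERT vectors does not necessarily track human judgments of semantic similarity.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [1.0, 2.0]))   # ~1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal)
```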

Alternatives to BERT

DistilBERT

Demand for smaller BERT models is increasing so that BERT can run in constrained computational environments (such as cell phones and personal computers). 23 smaller BERT models were released in March 2020. DistilBERT offers a lighter version of BERT; it runs 60% faster while maintaining over 95% of BERT's performance.
