Logistic Regression in Machine Learning: A Comprehensive Guide
Logistic Regression is a cornerstone algorithm in Machine Learning, renowned for its efficacy on classification problems. Unlike Linear Regression, which predicts continuous values, Logistic Regression forecasts binary outcomes such as yes/no, 0/1, or true/false. As a statistical model, it predicts the probability of a binary outcome from one or more independent variables, and it is widely used in machine learning and data analysis for classification tasks.
Understanding Logistic Regression
At its core, Logistic Regression is a supervised machine learning algorithm that predicts the probability that a given input belongs to a specific class. The output, constrained between 0 and 1, represents the probability of the default class (e.g., the first class). Because this estimate is a probability, it must be converted into a binary value (0 or 1) to make a class prediction, typically by applying a threshold. Beyond machine learning, logistic regression remains highly relevant for statistical testing in behavioral and social science research and in the data science field at large.
Cost Function: The Guiding Light
A cost function, synonymous with a loss function, quantifies the error between actual values (ground truth) and model-predicted values. A primary objective in most machine learning algorithms is to minimize this cost, thereby enhancing prediction accuracy.
Linear Regression: Mean Squared Error (MSE)
In Linear Regression, the Mean Squared Error (MSE) is conventionally employed as the cost function. It computes the average squared difference between predicted and actual values, facilitating adjustments to the model’s parameters to minimize error.
The formula for MSE is:
MSE = (1/n) Σᵢ (yᵢ - ŷᵢ)²
Where:
- n represents the number of training examples.
- yᵢ denotes the actual (ground-truth) value for the i-th training example.
- ŷᵢ signifies the predicted value for the i-th training example.
Since the graph of this cost function is convex (U-shaped), we can use Gradient Descent to find the optimal model parameters. Gradient Descent helps find the global minimum of the function, ensuring we have the best model.
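The MSE cost and its gradient descent update can be sketched in a few lines of NumPy. The toy data below (a line y = 2x + 1 with a little noise) and the learning rate are illustrative assumptions, not values from the text:

```python
import numpy as np

# Toy data: y = 2x + 1 plus a little noise (assumed for illustration)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + rng.normal(0, 0.05, size=x.shape)

w, b = 0.0, 0.0   # parameters to learn
lr = 0.5          # learning rate (assumed value)

for _ in range(2000):
    y_hat = w * x + b
    error = y_hat - y
    mse = np.mean(error ** 2)          # the cost being minimized
    dw = 2 * np.mean(error * x)        # gradient of MSE w.r.t. w
    db = 2 * np.mean(error)            # gradient of MSE w.r.t. b
    w -= lr * dw
    b -= lr * db

print(round(w, 2), round(b, 2))  # close to the true 2.0 and 1.0
```

Because the MSE surface is convex in w and b, these updates converge to the global minimum regardless of the starting point.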
Why MSE Falls Short for Logistic Regression
Unlike Linear Regression, Logistic Regression is tailored for classification tasks, yielding a probability between 0 and 1. Logistic Regression employs the sigmoid function to predict these probabilities.
Attempting to use the Mean Squared Error for Logistic Regression presents several challenges:
- Nonlinearity of the Sigmoid Function: The sigmoid function introduces nonlinearity, rendering the cost function non-convex when plugged into the MSE formula. This can lead to multiple local minima, complicating the search for the optimal solution via Gradient Descent.
- Squaring Errors: The MSE squares the difference between the predicted probability and the actual class label (0 or 1). When the prediction is far from the actual value, the error gets magnified. However, because the outputs of Logistic Regression are probabilities (values between 0 and 1), squaring these small differences can make it difficult for the model to learn effectively.
The Solution: Log Loss (Cross-Entropy)
Instead of MSE, Logistic Regression uses a different cost function called Log Loss (or Cross-Entropy). Log Loss penalizes incorrect predictions more effectively than MSE and helps the model learn to improve. It works by taking the logarithm of predicted probabilities, allowing the model to focus on making confident predictions closer to the true class labels. The Log Loss function is:
Log Loss = -(1/m) Σᵢ [yᵢ log(h_θ(xᵢ)) + (1 - yᵢ) log(1 - h_θ(xᵢ))]
Where:
- m is the number of training examples
- yᵢ is the true class label for the i-th example (either 0 or 1).
- h_θ(xᵢ) is the predicted probability for the i-th example, as calculated by the logistic regression model.
- θ represents the model's parameters.
This cost function is convex, meaning it has a single global minimum. Gradient Descent can easily optimize this function, helping the model find the best parameters for classification.
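The formula above translates directly into code. The probabilities below are made-up values chosen to show how Log Loss rewards confident correct predictions and heavily penalizes confident wrong ones:

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over the m examples."""
    p = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
confident_right = np.array([0.9, 0.1, 0.8, 0.95])  # close to the true labels
confident_wrong = np.array([0.1, 0.9, 0.2, 0.05])  # far from the true labels

print(log_loss(y_true, confident_right))  # small loss
print(log_loss(y_true, confident_wrong))  # much larger loss
```

The clipping step is a standard numerical safeguard: a predicted probability of exactly 0 or 1 would make the logarithm blow up.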
Types of Logistic Regression
Logistic regression can be classified into three main types based on the nature of the dependent variable:
- Binomial Logistic Regression: This type is used when the dependent variable has only two possible categories. Examples include Yes/No, Pass/Fail or 0/1. It is the most common form of logistic regression and is used for binary classification problems.
- Multinomial Logistic Regression: This is used when the dependent variable has three or more possible categories that are not ordered. For example, classifying animals into categories like "cat," "dog" or "sheep." It extends the binary logistic regression to handle multiple classes.
- Ordinal Logistic Regression: This type applies when the dependent variable has three or more categories with a natural order or ranking. Examples include ratings like "low," "medium" and "high." It takes the order of the categories into account when modeling.
Multiclass Classification: Extending Logistic Regression
In multi-class classification, the goal is to categorize data into more than two categories (e.g., classify animals into cat, dog, or bird). Most machine learning algorithms, however, are designed for binary classification, where the goal is to classify data into one of two categories (like “spam” vs “not spam”).
But don’t worry! We can adapt these binary classification models for multi-class classification using two common techniques: One-Vs-Rest (OvR) and One-Vs-One (OvO). Let’s break down these methods in a simple way so we can understand how they work and when to use each one.
One-Vs-Rest (OvR)
OvR is a technique where we take a multi-class classification problem and split it into multiple binary classification problems. For each class, we create a binary classifier that tries to distinguish one class from the rest.
Let’s say we have three classes: “Red,” “Blue,” and “Green.” With OvR, we will create 3 separate binary classification problems:
- Problem 1: Is it “Red” or not Red (i.e., Blue or Green)?
- Problem 2: Is it “Blue” or not Blue (i.e., Red or Green)?
- Problem 3: Is it “Green” or not Green (i.e., Red or Blue)?
For each binary problem, a model will be trained. When we want to make a prediction for a new data point, each model will give a probability for its specific class (e.g., how confident it is that the data belongs to the “Red” class). We then choose the class with the highest probability (i.e., the model that is the most confident).
Example:
Imagine we are trying to classify a fruit as either an apple, orange, or banana based on features like color, size, and shape, using One-Vs-Rest:
- Model 1 will classify whether the fruit is an apple or not (could be an orange or banana).
- Model 2 will classify whether the fruit is an orange or not (could be an apple or banana).
- Model 3 will classify whether the fruit is a banana or not (could be an apple or orange).
During prediction, if the “apple” model predicts with 0.7 probability, the “orange” model with 0.2, and the “banana” model with 0.1, the classifier will predict the fruit is an apple, as 0.7 is the highest probability.
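In scikit-learn this strategy is available off the shelf. The sketch below uses the Iris dataset (three classes) as a stand-in for the fruit example; `OneVsRestClassifier` trains one binary logistic regression per class and picks the most confident one at prediction time:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Iris has 3 classes, so OvR will fit 3 binary classifiers
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_tr, y_tr)

print(len(ovr.estimators_))    # one binary model per class -> 3
print(ovr.score(X_te, y_te))   # held-out accuracy
```

At prediction time, each of the three fitted estimators scores its own class against the rest, and the class with the highest score wins.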
Advantages:
- Simple to implement.
- Works well with many binary classifiers, like Logistic Regression or Support Vector Machines (SVM).
Disadvantages:
- We need one model per class. So, if we have 100 classes, we will need to train 100 models.
- Some models may be slower when applied to large datasets because we need to fit one model per class.
One-Vs-One (OvO)
In OvO instead of creating one binary classifier per class, we create a classifier for each pair of classes. This means we break the problem into binary classification tasks for every pair of classes.
Let’s take the same example of three classes: “Red,” “Blue,” and “Green.” Instead of three binary classification problems (as in OvR), OvO would create the following binary classification problems:
- Problem 1: “Red” vs “Blue”
- Problem 2: “Red” vs “Green”
- Problem 3: “Blue” vs “Green”
For each pair of classes, we train a separate binary classifier. Then, during prediction, each model makes its prediction, and the class that gets the most “wins” across all the models is selected as the final prediction. This is like a voting system, where each binary classifier votes for its preferred class, and the class with the most votes wins.
Example:
For the fruit classification problem (apple, orange, banana), OvO would create classifiers for:
- Apple vs Orange
- Apple vs Banana
- Orange vs Banana
When predicting the class for a new fruit:
- The “Apple vs Orange” model votes for either apple or orange.
- The “Apple vs Banana” model votes for either apple or banana.
- The “Orange vs Banana” model votes for either orange or banana.
The final classification is based on which fruit gets the most votes.
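The voting logic can be sketched directly. The pairwise votes below are hypothetical outcomes for a single new fruit, chosen to mirror the example above:

```python
from collections import Counter
from itertools import combinations

classes = ["apple", "orange", "banana"]

# Hypothetical votes cast by each pairwise model for one new fruit
pair_votes = {
    ("apple", "orange"): "apple",
    ("apple", "banana"): "apple",
    ("orange", "banana"): "orange",
}

# OvO trains n(n-1)/2 pairwise models: 3*2/2 = 3 for three classes
assert len(list(combinations(classes, 2))) == len(pair_votes)

# Majority vote across all pairwise classifiers
tally = Counter(pair_votes.values())
prediction = tally.most_common(1)[0][0]
print(prediction)  # apple (2 votes vs 1)
```

Ties are possible with this simple majority scheme; real implementations usually break them using the classifiers' confidence scores.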
Advantages:
- Each binary classifier deals with only two classes, which makes the classifiers simpler and sometimes more accurate.
- Can handle large datasets well, especially when binary classifiers are fast to train.
Disadvantages:
- The number of models grows quickly as the number of classes increases. For n classes, the number of binary classifiers is n(n - 1)/2.
- More models means more storage and longer training time.
Creating Extra Features to Improve Models
When building machine learning models, it’s common to create new features from the original ones to improve the model’s performance. Let’s say we have 3 features (or variables) in our data: f1, f2, and f3. These could represent anything like the height, weight, or age of a person.
Example: First-Degree Features
If we keep things simple, we could create an equation where these features are only multiplied by some numbers (called coefficients), like this:
a⋅f1+b⋅f2+c⋅f3+d
In this equation, a, b, c, and d are just numbers that the model learns during training. This equation is called first-degree because the features are used as they are - there are no squares or multiplied terms.
Adding More Features: Second-Degree Terms
Now, to improve the model, we can add second-degree features. These are new features created by multiplying the original ones with themselves or with each other. For example:
- f1⋅f1
- f2⋅f2
- f3⋅f3
- f1⋅f2 (interaction between f1 and f2)
- f1⋅f3
- f2⋅f3
So, now we have 9 features in total instead of 3. These new features help the model capture more complex patterns in the data, like curves, which a simple straight-line model can’t.
We can also add even more complex features, like third-degree terms, such as:
f1⋅f2⋅f3
As we add more of these higher-degree terms, the model can fit more complicated shapes to the data, like curves or parabolas. This helps the model better understand the data and create more flexible decision boundaries.
The Problem of Overfitting
However, there’s a downside to adding too many features, especially very high-degree ones. The model might start to fit the training data too well. It may capture not only the general patterns but also the noise or random quirks in the data. This is called overfitting.
Overfitting means the model becomes so good at predicting the training data that it performs poorly when given new, unseen data (test data).
The model is “tricked” into thinking the noise is important, so when it encounters new data, it makes mistakes because the patterns it learned aren’t general enough.
With a degree 1 equation (a straight line), the decision boundary is too simple and doesn't fit the data well - many points are misclassified (underfitting).
With a degree 2 equation, the curve fits the data much better and balances fitting the data without overcomplicating things. This is the optimal solution because it captures the overall trend even though a few points are misclassified.
With a degree 5 equation, the decision boundary becomes very complex and tries to fit every single point perfectly (overfitting).
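One way to observe this trade-off numerically is to compare training and test accuracy as the polynomial degree grows. The sketch below uses scikit-learn's `make_moons` dataset as a stand-in for nonlinearly separable data (an illustrative assumption):

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Nonlinearly separable toy data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for degree in (1, 2, 5):
    model = make_pipeline(
        PolynomialFeatures(degree),
        LogisticRegression(max_iter=5000),
    )
    model.fit(X_tr, y_tr)
    # (train accuracy, test accuracy) for each degree
    scores[degree] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(degree, scores[degree])
```

Typically the degree-1 model underfits both sets, while higher degrees push training accuracy up; when training accuracy keeps climbing but test accuracy stalls or drops, the model has started to memorize noise.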
Logistic Regression vs. Linear Regression
Though different, logistic regression and linear regression often show up in similar contexts, as they are part of a larger, related mathematical toolset. For example, to predict the most likely temperature for a day in the future, a linear regression model would be a good tool for the job. Logistic regression models, by contrast, predict the probability of each of two or more options from a fixed list of choices.

Since they are built to address separate use cases, the two models make different assumptions about the statistical properties of the values they're predicting and are implemented with different statistical tools. Logistic regression typically assumes a distribution suited to discrete values, such as the Bernoulli distribution, while linear regression typically assumes Gaussian errors. Logistic regression often requires larger datasets to produce stable estimates, while linear regression is usually more sensitive to influential outliers.

These differences cause each model to perform better in its ideal use case: logistic regression is more accurate for predicting categorical values, and linear regression is more accurate for predicting continuous values. The two techniques are often confused with each other, though, because they are closely related - logistic regression passes a linear model's output through the sigmoid transformation, so the underlying linear quantity (the log odds) can be recovered from a logistic model's probabilities with a straightforward calculation.
Assumptions of Logistic Regression
Understanding the assumptions behind logistic regression is important to ensure the model is applied correctly. The main assumptions are:
- Independent observations: Each data point is assumed to be independent of the others; there should be no correlation or dependence between input samples.
- Binary dependent variable: The dependent variable must be binary, i.e., it can take only two values. For more than two categories, multinomial logistic regression with a softmax function is used.
- Linear relationship between independent variables and log odds: The model assumes that the predictors affect the log odds of the dependent variable in a linear way.
- No extreme outliers: The dataset should not contain extreme outliers, as they can distort the estimated logistic regression coefficients.
- Large sample size: Logistic regression requires a sufficiently large sample size to produce reliable and stable results.
Understanding Sigmoid Function
The sigmoid function is an important part of logistic regression: it converts the raw output of the model into a probability between 0 and 1.
The function takes any real number and maps it into the range 0 to 1, forming an "S"-shaped curve called the sigmoid curve or logistic curve. Because probabilities must lie between 0 and 1, the sigmoid function is a natural fit for this purpose.
In logistic regression, we use a threshold value usually 0.5 to decide the class label.
- If the sigmoid output is at or above the threshold, the input is classified as Class 1.
- If it is below the threshold, the input is classified as Class 0.

This approach transforms continuous input values into meaningful class predictions. The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology: it rises quickly and then levels off at the carrying capacity of the environment. It is defined as

sigmoid(value) = 1 / (1 + e^(-value))

where e is the base of the natural logarithms (Euler's number, or the EXP() function in your spreadsheet) and value is the actual numerical value that you want to transform.
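The sigmoid and the 0.5 thresholding rule take only a few lines; the input values below are arbitrary examples:

```python
import math

def sigmoid(value):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-value))

print(sigmoid(0))    # 0.5 — exactly at the usual threshold
print(sigmoid(4))    # ~0.982 -> Class 1
print(sigmoid(-4))   # ~0.018 -> Class 0

# Thresholding turns the probability into a class label
threshold = 0.5
label = 1 if sigmoid(2.0) >= threshold else 0
print(label)  # 1
```

Note that large positive inputs saturate near 1 and large negative inputs near 0, which is exactly the "S" shape described above.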
How does Logistic Regression work?
The logistic regression model transforms the continuous output of a linear regression function into a categorical output using the sigmoid function, which maps any real-valued combination of the independent variables to a value between 0 and 1. This function is known as the logistic function.
Suppose we have input features represented as a matrix:
X = \begin{bmatrix} x_{11} & \dots & x_{1m} \\ x_{21} & \dots & x_{2m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{nm} \end{bmatrix}
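Putting the pieces together, the linear combination of the inputs and the weights is passed through the sigmoid to get probabilities, and the weights are fitted by gradient descent on the Log Loss. A minimal from-scratch sketch on synthetic data (the data-generating weights and learning rate are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: n = 200 examples, m = 2 features
n, m = 200, 2
X = rng.normal(size=(n, m))
true_w = np.array([2.0, -3.0])                      # assumed "true" weights
y = (X @ true_w + 0.5 + rng.normal(0, 0.5, n) > 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.zeros(m)
b = 0.0
lr = 0.1

for _ in range(3000):
    p = sigmoid(X @ w + b)          # predicted probabilities h_θ(x)
    grad_w = X.T @ (p - y) / n      # gradient of the Log Loss w.r.t. w
    grad_b = np.mean(p - y)         # gradient of the Log Loss w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

# Threshold at 0.5 to turn probabilities into class labels
preds = (sigmoid(X @ w + b) >= 0.5).astype(float)
print(np.mean(preds == y))  # training accuracy
```

The gradient expression (p - y) is the same for every example because the Log Loss and the sigmoid combine into a particularly clean derivative, which is one reason this pairing is standard.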

