The Mathematical Underpinnings of Machine Learning: A Comprehensive Guide

Mathematics serves as the bedrock upon which machine learning algorithms are built. A strong grasp of mathematical concepts is essential for understanding how these algorithms learn from data, optimize their performance, and make informed decisions. This article explores the key mathematical areas that underpin machine learning, providing a roadmap for aspiring data scientists and machine learning engineers.

Why Mathematics Matters in Machine Learning

Mathematics provides the theoretical foundation for understanding how machine learning algorithms work. Concepts like calculus and linear algebra enable fine-tuning of models for better performance. Knowing the math helps troubleshoot issues in models and algorithms. Topics like deep learning, NLP and reinforcement learning require strong mathematical foundations.

Data Representation and Transformation

Mathematical concepts are crucial for effectively representing and transforming data, a fundamental step in machine learning.

Algorithm Training and Optimization

Mathematics provides the tools necessary to train and optimize machine learning algorithms, ensuring they learn efficiently and effectively.

Decision-Making Under Uncertainty

Many real-world problems involve uncertainty. Mathematics provides the framework for making informed decisions in the face of incomplete or noisy data.

The Essential Mathematical Toolkit

The amount of math required for machine learning depends on your goals. The core mathematical areas for machine learning include:

  • Linear Algebra
  • Calculus
  • Probability and Statistics

Linear Algebra: The Language of Data

Linear algebra is the backbone of machine learning and data science. It provides the tools to represent and manipulate high-dimensional data efficiently. Some consider linear algebra to be the mathematics of the 21st century: it acts as the stage on which nearly every machine learning algorithm performs its computations.

Key Concepts in Linear Algebra

  • Vectors: Represent features of a dataset.
  • Matrices: Store large amounts of data (e.g., pixel values in images).
  • Dot Product & Matrix Multiplication: Essential for neural networks. Matrix multiplication is the composition of linear transformations.
  • Eigenvalues & Eigenvectors: Used in Principal Component Analysis (PCA) for dimensionality reduction.

Linear algebra also provides a systematic framework for representing and solving systems of simultaneous linear equations.

Linear Algebra in Action: Representing a Dataset

Every machine learning dataset can be represented as a matrix, making it easy to process using matrix operations.

import numpy as np

# Example dataset (3 samples, 2 features)
X = np.array([[1.2, 2.3], [2.1, 3.1], [0.8, 1.5]])

# Transpose the matrix
X_T = X.T

print("Original Matrix:\n", X)
print("Transposed Matrix:\n", X_T)

Vector Spaces

To have a good understanding of linear algebra, start with vector spaces. The textbook definition is intimidating and abstract, so let’s talk about a special case first. You can think of each point in the plane as a tuple x = (x₁, x₂), represented by an arrow pointing from the origin to (x₁, x₂).

You can add these vectors together and multiply them by scalars. Algebraically, both operations work componentwise,

x + y = (x₁ + y₁, x₂ + y₂),  ax = (ax₁, ax₂),

but it’s easier to visualize: addition chains the arrows head to tail, while scaling stretches or shrinks an arrow.

The Euclidean plane is the prototypical model of a vector space. Tuples of n elements form n-dimensional vectors, making up the n-dimensional Euclidean space ℝⁿ.

In general, a set of vectors V is a vector space over the real numbers if you can add and scale vectors in a straightforward way.

When thinking about vector spaces, it helps to mentally model them as tuples in Euclidean space.
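These operations can be tried out directly with NumPy arrays; the following is a minimal sketch of componentwise addition and scaling on the plane (the particular vectors are illustrative choices):

```python
import numpy as np

# Two vectors in the plane, represented as 2-tuples
x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# Vector addition is componentwise: (1 + 3, 2 + (-1)) = (4, 1)
print(x + y)

# Scalar multiplication scales each component: (2.5 * 1, 2.5 * 2) = (2.5, 5)
print(2.5 * x)
```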

Measuring Distance in Vector Spaces

When you have a good understanding of vector spaces, the next step is to understand how to measure distance in vector spaces.

By default, a vector space in itself gives no tools for this. How would you do it on the plane? There, we have the famous Euclidean norm, defined by

‖x‖ = √(x₁² + x₂²).

Although the vector notation and the square root symbol make this feel intimidating, the magnitude is just the Pythagorean theorem in disguise.

This can be generalized further: in three dimensions, the Euclidean norm is the repeated application of the Pythagorean theorem.

This is a special case of a norm. In general, a vector space V is normed if there is a function ‖ ⋅ ‖: V → [0, ∞) such that

‖x‖ = 0 if and only if x = 0,
‖ax‖ = |a| ‖x‖ for every real number a,
‖x + y‖ ≤ ‖x‖ + ‖y‖ (the triangle inequality),

where x and y are any two vectors.

Again, this might look scary, but it is a simple and essential concept. There are many norms out there, and the most important is the p-norm family, defined for any p ∈ [1, ∞) by

‖x‖ₚ = (|x₁|ᵖ + |x₂|ᵖ + … + |xₙ|ᵖ)^(1/p)

(with p = 2 giving the Euclidean norm), and the supremum norm

‖x‖∞ = max(|x₁|, |x₂|, …, |xₙ|).

Norms can be used to define a distance by taking the norm of the difference: d(x, y) = ‖x − y‖.

The 1-norm is called the Manhattan norm (or taxicab norm), because the distance between two points depends on how many “grid jumps” you have to perform to get from x to y. Sometimes, as for p = 2, the norm comes from a so-called inner product: a bilinear function 〈 ⋅, ⋅ 〉: V × V → ℝ that is symmetric (〈x, y〉 = 〈y, x〉) and positive definite (〈x, x〉 ≥ 0, with 〈x, x〉 = 0 only for x = 0).
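A quick way to get a feel for these norms is NumPy’s np.linalg.norm, shown here for the point (3, 4); the values are illustrative choices:

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([0.0, 0.0])

# p = 2: the Euclidean norm (Pythagorean theorem: sqrt(3^2 + 4^2) = 5)
print(np.linalg.norm(x - y, ord=2))       # 5.0

# p = 1: the Manhattan norm (|3| + |4| = 7)
print(np.linalg.norm(x - y, ord=1))       # 7.0

# Supremum norm: the largest absolute component
print(np.linalg.norm(x - y, ord=np.inf))  # 4.0
```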

A vector space with an inner product is called an inner product space. An example is the classical Euclidean product 〈x, y〉 = x₁y₁ + x₂y₂ + … + xₙyₙ.

On the other hand, every inner product can be turned into a norm by ‖x‖ = √〈x, x〉.

When the inner product for two vectors is zero, we say that the vectors are orthogonal to each other. (Try to come up with some concrete examples on the plane to understand the concept more deeply.)
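As a small check of these definitions, the Euclidean inner product, the induced norm, and orthogonality can be computed with NumPy (the vectors are illustrative choices):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([-2.0, 1.0])

# The Euclidean inner product is the dot product:
# 1*(-2) + 2*1 = 0, so u and v are orthogonal
print(np.dot(u, v))  # 0.0

# The induced norm: ||u|| = sqrt(<u, u>) = sqrt(5)
print(np.sqrt(np.dot(u, u)))
```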

Basis and Orthonormal Basis

Although vector spaces (in our case) contain infinitely many vectors, you can find a finite set of vectors that can be used to express all vectors in the space. For example, on the plane, every vector can be written as x = x₁e₁ + x₂e₂, where e₁ = (1, 0) and e₂ = (0, 1). This is a special case of a basis and, in fact, of an orthonormal basis.

In general, a basis is a minimal set of vectors v₁, v₂, …, vₙ ∈ V such that their linear combinations span the vector space: every x ∈ V can be written as x = a₁v₁ + a₂v₂ + … + aₙvₙ for some real numbers a₁, …, aₙ.

A basis always exists for any vector space. (It may not be a finite set, but that shouldn’t concern us now.) Without a doubt, a basis simplifies things greatly when talking about linear spaces.

When the vectors in a basis are orthogonal to each other, we call it an orthogonal basis. If each basis vector’s norm is 1 for an orthogonal basis, we say it is orthonormal.

Linear Transformations

The key objects related to vector spaces are linear transformations. If you have seen a neural network before, you know that the fundamental building blocks are layers of the form f(x) = σ(Ax + b), where A is a matrix, b and x are vectors, and σ is the sigmoid function. (Or any activation function, really.) Well, the part Ax is a linear transformation.

In general, the function L: V → W is a linear transformation between vector spaces V and W if

L(ax + y) = aL(x) + L(y)

holds for all x, y in V and every real number a.

To give a concrete example, rotations around the origin in the plane are linear transformations.
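As a sketch of this example, here is a 90-degree rotation of the plane written as a matrix, together with a check of the linearity property (the angle and test vectors are illustrative choices):

```python
import numpy as np

# Rotation by 90 degrees counterclockwise around the origin
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = np.array([1.0, 0.0])
print(R @ x)  # approximately (0, 1): the x-axis rotates onto the y-axis

# Linearity: R(ax + y) == a*R(x) + R(y)
y = np.array([0.0, 1.0])
a = 3.0
print(np.allclose(R @ (a * x + y), a * (R @ x) + R @ y))  # True
```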

Undoubtedly, the most crucial fact about linear transformations is that they can be represented with matrices, as you’ll see next in your studies.

Matrices

If linear transformations are clear, you can turn to the study of matrices. (Linear algebra courses often start with matrices, but I would recommend it this way for reasons to be explained later.)

The most important operation for matrices is matrix multiplication, also known as the matrix product. If A is an n × m matrix with entries aᵢⱼ and B is an m × l matrix with entries bᵢⱼ, then their product C = AB is the n × l matrix with entries

cᵢⱼ = aᵢ₁b₁ⱼ + aᵢ₂b₂ⱼ + … + aᵢₘbₘⱼ.

This might seem difficult to comprehend, but it is pretty straightforward: the element in the i-th row and j-th column of the product is the dot product of the i-th row of A with the j-th column of B.

Matrix multiplication is defined the way it is because matrices represent linear transformations between vector spaces, and the matrix product corresponds to the composition of those transformations.
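This correspondence can be verified numerically. The two matrices below, a scaling and a rotation, are illustrative choices:

```python
import numpy as np

# Two linear transformations of the plane
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])   # scaling: x by 2, y by 3
B = np.array([[0.0, -1.0],
              [1.0,  0.0]])  # 90-degree rotation

x = np.array([1.0, 1.0])

# Applying B first and then A agrees with multiplying by A @ B:
# B maps (1, 1) to (-1, 1), and A maps (-1, 1) to (-2, 3)
print(A @ (B @ x))
print((A @ B) @ x)
```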

Determinants

In my opinion, determinants are hands down one of the most challenging concepts to grasp in linear algebra. Depending on your learning resource, the determinant is usually introduced either by a recursive definition or by a sum over all permutations. Neither is tractable without significant experience in mathematics.

To summarize, the determinant of a matrix describes how the volume of an object scales under the corresponding linear transformation. If the transformation changes orientations, the sign of the determinant is negative.
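A small numerical illustration of this interpretation; the matrices are illustrative choices:

```python
import numpy as np

# A transformation that scales x by 2 and y by 3:
# areas are scaled by a factor of 2 * 3 = 6
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
print(np.linalg.det(A))  # approximately 6.0

# A reflection flips orientation, so its determinant is negative
F = np.array([[1.0, 0.0],
              [0.0, -1.0]])
print(np.linalg.det(F))  # approximately -1.0
```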

You will eventually need to understand how to calculate the determinant, but I wouldn’t worry about it now.

Eigenvalues and Eigenvectors

A standard first linear algebra course usually ends with eigenvalues/eigenvectors and some special matrix decompositions like the Singular Value Decomposition.

Let’s suppose that we have a matrix A. The number λ is an eigenvalue of A if there is a vector x (called an eigenvector) such that Ax = λx holds. In other words, the linear transformation represented by A is a scaling by λ for the vector x. This concept plays an essential role in linear algebra. (And practically in every field that uses linear algebra extensively.)
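A minimal NumPy sketch of this definition, using a diagonal matrix whose eigenvalues can be read off directly:

```python
import numpy as np

# For a diagonal matrix, the eigenvalues are the diagonal entries
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)

# Check the defining property A x = lambda x for the first eigenpair
lam = eigenvalues[0]
v = eigenvectors[:, 0]
print(np.allclose(A @ v, lam * v))  # True
```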

Matrix Decompositions

At this point, you are ready to familiarize yourself with a few matrix decompositions. If you think about it for a second, which matrices are the best from a computational perspective? Diagonal matrices! If a linear transformation has a diagonal matrix, it is trivial to compute its value on an arbitrary vector: each component is simply scaled,

diag(d₁, …, dₙ)(x₁, …, xₙ) = (d₁x₁, …, dₙxₙ).

Most special forms aim to decompose a matrix A into a product of matrices, where hopefully at least one of the factors is diagonal. The Singular Value Decomposition, or SVD for short, is the most famous one: it states that there are special matrices U, V and a diagonal matrix Σ such that A = U Σ Vᵀ holds. (U and V are so-called unitary matrices, which I won’t define here; suffice it to say that they form a special family of matrices.)

SVD is also used to perform Principal Component Analysis, one of the simplest and most well-known methods for dimensionality reduction.
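As a sketch, NumPy’s np.linalg.svd can verify the decomposition on a small centered data matrix; the data values below are illustrative choices:

```python
import numpy as np

# A small data matrix (4 samples, 2 features), centered for PCA
X = np.array([[ 2.0,  0.0],
              [ 0.0,  1.0],
              [-2.0,  0.0],
              [ 0.0, -1.0]])

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# The decomposition reconstructs X: X = U @ diag(S) @ Vt
print(np.allclose(X, U @ np.diag(S) @ Vt))  # True

# Projecting onto the first right singular vector gives the
# first principal component of the data
first_pc = X @ Vt[0]
print(first_pc)
```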

Resources for Learning Linear Algebra

Linear algebra can be taught in many ways. The path I outlined here was inspired by the textbook Linear Algebra Done Right by Sheldon Axler. For an online lecture, I would recommend the Linear Algebra course from MIT OpenCourseWare, an excellent resource.

Calculus: The Engine Behind Optimization

Machine learning models learn by optimizing a function - typically a loss function that needs to be minimized. Calculus helps with gradient-based optimization methods like Gradient Descent. Calculus is the study of differentiation and integration of functions. Essentially, a neural network is a differentiable function, so calculus will be a fundamental tool to train neural networks, as we will see.

Key Concepts in Calculus

  • Derivatives: Measure the rate of change of a function.
  • Gradients: Used to adjust model parameters in deep learning.
  • Chain Rule: Helps compute derivatives in multi-layered networks.

Calculus in Action: Finding the Minimum of a Function

Optimization algorithms like Gradient Descent rely on derivatives to update model weights and minimize error.

import sympy as sp

x = sp.Symbol('x')
f = x**2 - 4*x + 4  # A simple quadratic function
df = sp.diff(f, x)  # Compute derivative
critical_points = sp.solve(df, x)

print("Derivative of function:", df)
print("Critical Point:", critical_points)
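The derivative computed above can also drive an iterative method. Below is a minimal gradient descent sketch for the same quadratic; the learning rate, starting point, and step count are illustrative choices, not tuned values:

```python
# Minimal gradient descent, minimizing f(x) = (x - 2)**2
def gradient_descent(grad, x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # step in the direction opposite the gradient
    return x

# f'(x) = 2 * (x - 2); the minimum is at x = 2
minimum = gradient_descent(lambda x: 2 * (x - 2), x0=10.0)
print(round(minimum, 4))  # 2.0
```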

Differentiation

To familiarize yourself with the concepts, keep things simple and study functions of a single variable first. By definition, the derivative of a function f at the point x is the limit

f′(x) = lim_{h → 0} (f(x + h) − f(x)) / h,

where, for a given h, the ratio (f(x + h) − f(x)) / h is the slope of the line between the points (x, f(x)) and (x + h, f(x + h)).

In the limit, this is the slope of the tangent line at the point x.

Differentiation can be used to optimize functions: the derivative is zero at local maxima or minima. (However, this is not true in the other direction; see f(x) = x³ at 0.)

Points where the derivative is zero are called critical points. Whether a critical point is a minimum or a maximum can often be decided by looking at the second derivative: if f″(x) > 0, the critical point is a local minimum; if f″(x) < 0, it is a local maximum. (If f″(x) = 0, the test is inconclusive.)
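The second derivative test can be carried out symbolically in SymPy, reusing the quadratic from the earlier example:

```python
import sympy as sp

x = sp.Symbol('x')
f = x**2 - 4*x + 4

df = sp.diff(f, x)      # first derivative: 2*x - 4
d2f = sp.diff(f, x, 2)  # second derivative: 2

critical_points = sp.solve(df, x)
print(critical_points)  # [2]

# The second derivative is positive, so x = 2 is a local minimum
print(d2f.subs(x, critical_points[0]) > 0)  # True
```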
