The Mathematics of Deep Learning: An Overview

Deep learning algorithms are built upon core mathematical concepts. A proper elaboration of these concepts leads to a better understanding of the problems these algorithms are designed to solve. This article provides a detailed discussion of the mathematics required for deep learning.

Why Mathematics is Essential for Deep Learning

Mathematics plays a vital role in the field of deep learning. It helps in:

  • Selecting the correct algorithm by considering its complexity, training time, features, and accuracy.
  • Estimating appropriate confidence intervals and quantifying the uncertainty of predictions.
  • Defining an algorithm's acceptance criteria and choosing its parameter settings.

Applications of Deep Learning Algorithms

Deep learning algorithms have a wide range of applications, including:

  • Image Colorization: Deep neural networks can colorize black and white pictures and videos.
  • Image Super-Resolution: Pixel recursive super-resolution, developed by Google Brain researchers, can reconstruct a plausible detailed image from a low-resolution input.
  • Lip Reading: Deep learning neural networks developed at Oxford University can read a person's lips and convert the movements directly into text, without needing the sound of the person speaking.
  • Location Detection: Deep learning neural networks can detect the location where a picture was taken and display it on a map.
  • Endangered Species Detection: Convolutional neural networks can detect endangered whale species, aiding in their conservation.
  • Self-Driving Cars: Self-driving cars can detect traffic and choose an optimal path.

Deep learning algorithms are also being implemented in earthquake prediction, music composition, entertainment, healthcare, and robotics.

Deep Neural Networks (DNNs)

Deep learning, a subset of machine learning, revolves around neural networks and representation learning. A deep neural network (DNN) is an artificial neural network with multiple hidden layers between the input and output layers. The term "deep" refers to the use of multiple hidden layers in a network.

A Simple 3-Layer Network

Consider the task of building a classifier for binary digits that predicts whether a given input digit is 0 or 1. A network can be created with d neurons in the input layer, n neurons in the first hidden layer, p neurons in the second hidden layer, and a single neuron in the output layer, which is enough to separate the two classes in the model (whether the digit is 0 or 1).

The number of neurons in the input layer depends on the shape of the features in the input data (e.g., 28×28 pixels for images), while the number of neurons in the output layer is determined by the number of classes in the dataset. Each hidden layer receives the previous layer's output as input, computes a weighted sum of it, and applies its activation function to produce the input for the next layer. This process repeats until the end of the network.

Forward Propagation

Forward propagation is the computation that carries the input through the network, layer by layer. For a 3-layer network, the formulas are as follows:

  • Z1 = W1 · X + b1
  • A1 = f(Z1)
  • Z2 = W2 · A1 + b2
  • A2 = g(Z2)
  • Z3 = W3 · A2 + b3
  • A3 = Sigmoid(Z3)

Where:

  • A(l) is the activation vector for layer l.
  • Z(l) is the weighted sum of inputs for layer l.
  • W(l) is the weight matrix for the connections between layer (l-1) and layer l.
  • b(l) is the bias vector for layer l.
  • "g" and "f" are the two activation functions, such as the Swish or the Leaky ReLU activation function.

These formulas represent the forward propagation through the network.
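
These formulas translate almost directly into NumPy. The sketch below is illustrative only: the layer sizes, the random initialization, and the use of Leaky ReLU for both f and g are assumptions for the example, not choices fixed by the text above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

# Illustrative sizes: d input features, hidden layers of n and p neurons, m examples
rng = np.random.default_rng(0)
d, n, p, m = 4, 5, 3, 10
X = rng.standard_normal((d, m))
W1, b1 = rng.standard_normal((n, d)), np.zeros((n, 1))
W2, b2 = rng.standard_normal((p, n)), np.zeros((p, 1))
W3, b3 = rng.standard_normal((1, p)), np.zeros((1, 1))

Z1 = W1 @ X + b1     # weighted sum for layer 1
A1 = leaky_relu(Z1)  # f
Z2 = W2 @ A1 + b2
A2 = leaky_relu(Z2)  # g (same form chosen here for simplicity)
Z3 = W3 @ A2 + b3
A3 = sigmoid(Z3)     # output probabilities, shape (1, m)
```

Because the output layer uses the Sigmoid, every entry of A3 lands strictly between 0 and 1 and can be read as the probability of class 1.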

Loss Function

The next step in building any machine learning model is calculating the loss function. This helps assess the model's generalization ability and answers the question "How far is the prediction from the actual value?". Backpropagation is then used to adjust the weights and biases to reduce the loss. In general, "Loss" refers to the error of a single observation, while "Cost" represents the average error over the entire dataset (the average of the losses).

The formula for Binary Cross Entropy cost is:

J(θ) = −(y log(hθ(x)) + (1 − y) log(1 − hθ(x)))

Where:

  • J(θ) represents the "Cost function."
  • hθ(x) represents the "Hypothesis function," which produces the predicted output of the network given the input features x and the model parameters θ; in the network above, this is the output activation A3.
  • hθ​(x) is the probability that the input x belongs to the positive class (class 1).
  • 1−hθ(x) is the probability that the input x belongs to the negative class (class 0).
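
A small NumPy sketch of this formula, using hypothetical labels y and predicted probabilities y_hat (the clipping constant eps is an assumption added to keep the logs finite):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-8):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() away from 0
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])        # true labels
y_hat = np.array([0.9, 0.2, 0.6])    # hypothetical predicted probabilities
losses = binary_cross_entropy(y, y_hat)  # per-observation loss
cost = losses.mean()                     # average loss = cost
```

Note how the confident correct prediction (0.9 for class 1) incurs a much smaller loss than the uncertain one (0.6).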

Backward Propagation

The process of backpropagation involves taking derivatives from the loss function with respect to associated parameters (i.e., weights and biases), which requires a strong understanding of partial derivatives and the chain rule.

The forward-propagation formulas, together with the cost, are:

  • Z1 = W1 · X + b1
  • A1 = f(Z1)
  • Z2 = W2 · A1 + b2
  • A2 = g(Z2)
  • Z3 = W3 · A2 + b3
  • A3 = Sigmoid(Z3)
  • J(A3, y) = −(y log(A3) + (1 − y) log(1 − A3))

The process starts by taking the derivative of the loss function with respect to the last layer's parameter, in this case "W3". This is denoted ∂W3 (shorthand for ∂J/∂W3). W3 does not appear directly in the loss function but inside a chained function, Z3. Therefore, it is necessary to start from the loss and apply the chain rule step by step down to Z3: from the loss to A3, and finally to Z3.
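
Written out with the chain rule, the shorthand ∂W3 stands for the product of three factors:

```latex
\frac{\partial J}{\partial W_3}
  = \frac{\partial J}{\partial A_3}\,
    \frac{\partial A_3}{\partial Z_3}\,
    \frac{\partial Z_3}{\partial W_3},
\qquad
\frac{\partial J}{\partial A_3} = -\left(\frac{y}{A_3} - \frac{1-y}{1-A_3}\right),
\quad
\frac{\partial A_3}{\partial Z_3} = A_3(1 - A_3),
\quad
\frac{\partial Z_3}{\partial W_3} = A_2.
```

Multiplying the first two factors simplifies to A3 − y, which is why backpropagation implementations can start directly from dZ3 = A3 − y.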

Calculating Derivatives

The derivative of log(x) with respect to x is 1/x. The step before the loss involves the Sigmoid activation in the output layer. The Sigmoid has a derivative that is computationally cheap to evaluate and well suited to gradient descent: applying the Quotient Rule, it simplifies to A3(1 − A3), where A(l) is the activation vector for layer l computed by the Sigmoid.
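
The identity σ′(z) = σ(z)(1 − σ(z)) can be checked numerically with a central difference (the grid of z values and the step h below are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)
a = sigmoid(z)
analytic = a * (1.0 - a)                               # sigma'(z) = sigma(z)(1 - sigma(z))
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference
print(np.allclose(analytic, numeric))  # True
```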

The next step is to take the derivative of Z3 with respect to W3 and assemble ∂W3, the product of all the derivatives along the chain. A partial derivative differentiates a function with respect to one of its variables (in this case, W3) while holding all the other variables constant; since Z3 = W3 · A2 + b3, the term b3 drops out and the derivative with respect to W3 is A2.

Similarly, the derivative of the loss with respect to “W2” and “W1” are calculated using the appropriate formulas, following the chain rule.

Derivatives of Bias Terms

The variables “∂Z1”, “∂Z2”, and “∂Z3” are saved to calculate the derivatives of the Bias terms, which are “∂b1”, “∂b2”, and “∂b3”. These variables are used to compute the derivative of the biases element-wise.

Namely:

  • db1 = Sum(dZ1, axis=1)
  • db2 = Sum(dZ2, axis=1)
  • db3 = Sum(dZ3, axis=1)

Summing along axis=1 collapses the example dimension, leaving one entry per neuron so that each bias gradient matches the shape of its bias vector.
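
In NumPy terms this is a sum along axis=1 with keepdims=True, so the result keeps the (n, 1) shape of the bias vector; the sample values below are illustrative, and dividing by m averages over the examples:

```python
import numpy as np

dZ1 = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])  # hypothetical (n, m): 2 neurons, 3 examples
m = dZ1.shape[1]
db1 = np.sum(dZ1, axis=1, keepdims=True) / m  # (n, 1), same shape as b1
```

Each row (one neuron) is averaged over its m examples, giving db1 = [[2.], [5.]] here.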

Cost Function for Multiple Observations

In machine learning, instead of using the term "Loss," the average of the loss is commonly referred to as "Cost." By calculating the cost, we can assess the effectiveness of our models and optimize them accordingly. To calculate the cost, we need to add up all the individual losses and then divide the result by the number of observations.

Pseudocode for Backpropagation

The pseudocode below illustrates the backpropagation process:

# Perform linear transformations and activation functions
Z1 = (W1 * X) + b1
A1 = f(Z1)
Z2 = (W2 * A1) + b2
A2 = g(Z2)
Z3 = (W3 * A2) + b3
A3 = Sigmoid(Z3)

# Binary cross-entropy cost
epsilon = 1e-8                     # Keep the logs away from log(0)
A3 = clip(A3, epsilon, 1 - epsilon)
loss = -(y * log(A3) + (1 - y) * log(1 - A3))
cost = mean(loss)

m = X.shape[1]                     # Number of training examples

# Gradients (derivatives) for the output layer
dZ3 = A3 - y
dW3 = (dZ3 * A2.T) / m
db3 = Sum(dZ3, axis=1) / m

# Gradients for the second hidden layer
dZ2 = (W3.T * dZ3) * g_derivative(Z2)
dW2 = (dZ2 * A1.T) / m
db2 = Sum(dZ2, axis=1) / m

# Gradients for the first hidden layer
dZ1 = (W2.T * dZ2) * f_derivative(Z1)
dW1 = (dZ1 * X.T) / m
db1 = Sum(dZ1, axis=1) / m

Explanation of Shapes

Here d is the number of input features, m the number of training examples, and n, p, q the number of neurons in the three layers (for the binary classifier above, q = 1).

  • Z1 = (W1 ⋅ X) + b1
    • Shape: (n, m)
    • W1: (n, d)
    • X: (d, m)
    • b1: (n, 1)
  • A1 = f(Z1)
    • Shape: (n, m) (same as Z1)
  • Z2 = (W2 ⋅ A1) + b2
    • Shape: (p, m)
    • W2: (p, n)
    • A1: (n, m)
    • b2: (p, 1)
  • A2 = g(Z2)
    • Shape: (p, m) (same as Z2)
  • Z3 = (W3 ⋅ A2) + b3
    • Shape: (q, m)
    • W3: (q, p)
    • A2: (p, m)
    • b3: (q, 1)
  • A3 = Sigmoid(Z3)
    • Shape: (q, m) (same as Z3)
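
These shapes can be verified with a quick NumPy check (the sizes below are arbitrary illustrative values; each bias broadcasts across the m columns):

```python
import numpy as np

# Illustrative sizes: d input features, layers of n, p, q neurons, m examples
d, n, p, q, m = 4, 5, 3, 1, 10
X = np.zeros((d, m))
W1, b1 = np.zeros((n, d)), np.zeros((n, 1))
W2, b2 = np.zeros((p, n)), np.zeros((p, 1))
W3, b3 = np.zeros((q, p)), np.zeros((q, 1))

Z1 = W1 @ X + b1   # (n, m); b1 broadcasts over the m columns
Z2 = W2 @ Z1 + b2  # (p, m) -- A1 would share Z1's shape
Z3 = W3 @ Z2 + b3  # (q, m)
print(Z1.shape, Z2.shape, Z3.shape)  # (5, 10) (3, 10) (1, 10)
```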

Advanced Mathematical Topics

A deeper understanding of deep learning may require knowledge of advanced mathematical topics such as:

  • Hamiltonian Mechanics
  • Halley's Method
  • Complex Numbers
  • Quaternions
  • Sedenions
  • Quadratic Functions
  • NP Problems
  • Advanced Probabilities and Statistics

tags: #mathematics #deep-learning #overview
