Understanding Hyperparameters in Machine Learning
When venturing into the realm of machine learning (ML) and deep learning (DL), grasping the fundamental terminology is paramount. Among the initial concepts that can cause confusion are "parameters" and "hyperparameters." This article aims to clarify these terms, highlighting their significance and impact on model performance.
Parameters vs. Hyperparameters: Key Differences
In ML/DL, a model is represented by its model parameters, while hyperparameters are settings that control the learning process and, through it, determine the parameter values the learning algorithm ends up learning.
Model Parameters
Parameters are internal to the model and are learned from the training data during the training process; they define the model's predictions. Training typically starts with the parameters initialized to some values (random values, or zeros). As learning progresses, these initial values are updated by an optimization algorithm (e.g., gradient descent) to optimize the performance of the model. At the end of the learning process, the trained parameter values are, in effect, what we refer to as the model.
- Example: In a linear regression model, represented as
Ŷ = β₀ + β₁X, the parameters are the coefficients β₀ (intercept) and β₁ (slope).
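To make this concrete, here is a minimal pure-Python sketch (with made-up data) that computes both parameters using the ordinary least-squares formulas:

```python
# Fit Ŷ = b0 + b1*X by ordinary least squares on toy data.
# The coefficients b0 and b1 are the model's *parameters*:
# they are computed from the training data, not chosen by hand.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
b0 = mean_y - b1 * mean_x
print(b0, b1)  # → 1.0 2.0
```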
Hyperparameters
Hyperparameters, on the other hand, are set before training even begins. As a machine learning engineer designing a model, you choose the hyperparameter values that your learning algorithm will use. They are external to the training data and must be set manually or through automated tuning methods. The learning algorithm uses them while it learns, but they are not part of the resulting model.
- Examples:
- Learning rate in optimization algorithms
- Choice of activation function in a neural network (NN) layer
- Maximum depth of a decision tree
- Number of hidden layers in a neural network
- Regularization strength
Simply put, parameters in machine learning and deep learning are the values your learning algorithm can change independently as it learns, and these values are affected by the choice of hyperparameters you provide. So you set the hyperparameters before training begins and the learning algorithm uses them to learn the parameters.
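This division of roles can be sketched in a few lines of pure Python: the learning rate and epoch count are fixed before the loop starts, while the parameter w is what the loop updates (the data and objective are made up for illustration):

```python
# Hyperparameters: fixed before training begins.
learning_rate = 0.1
epochs = 50

# Parameter: initialized, then updated by the learning algorithm.
w = 0.0

# Toy objective: fit w so that w*x approximates y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
for _ in range(epochs):
    # Mean-squared-error gradient with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad  # gradient-descent update

print(round(w, 3))  # w converges toward 2.0
```

Changing learning_rate or epochs changes *how* w is learned; neither value appears in the final model, which consists only of w itself.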
The Importance of Hyperparameter Tuning
Setting the right hyperparameter values matters because it directly affects the performance of the model produced during training. Hyperparameter tuning is the practice of identifying and selecting the optimal hyperparameters for training a machine learning model. It is an experimental process: each iteration tests different hyperparameter values until the best ones are identified. Because hyperparameters govern the learning process, optimal configurations lead to strong model performance on the metrics for its intended task, both during evaluation and in the real world.
The goal of hyperparameter tuning is to balance the bias-variance tradeoff. Each machine learning algorithm has its own set of relevant hyperparameters, and the aim is not to push each one to its extreme but to find the combination that generalizes best.
Bias and Variance
- Bias is the divergence between a model’s predictions and reality. Models with low bias are accurate.
- Variance is the sensitivity of a model to new data. Models with low variance are consistent. A reliable model should deliver consistent results when migrating from its training data to other datasets.
Hyperparameter Tuning Methods
Hyperparameter tuning centers around the objective function, which analyzes a group, or tuple, of hyperparameters and calculates the projected loss. Optimal hyperparameter tuning minimizes loss according to the chosen metrics. Data scientists have a variety of hyperparameter tuning methods at their disposal, each with its respective strengths and weaknesses.
Grid search: Grid search is a comprehensive, exhaustive hyperparameter tuning method. After data scientists establish every candidate value for each hyperparameter, a grid search trains a model for every possible combination of those discrete values. In this way, grid search is similar to brute-forcing a PIN by inputting every potential combination of numbers until the correct sequence is discovered. In scikit-learn this is implemented by GridSearchCV. Because it is exhaustive, it is slow and computationally expensive, which makes it impractical for large datasets or many hyperparameters. It works in the following steps:
- Create a grid of potential values for each hyperparameter.
- Train the model for every combination in the grid.
- Evaluate each model using cross-validation.
- Select the combination that gives the highest score.
For example if we want to tune two hyperparameters C and Alpha for a Logistic Regression Classifier model with the following sets of values:
C = [0.1, 0.2, 0.3, 0.4, 0.5]
Alpha = [0.01, 0.1, 0.5, 1.0]
The grid search technique will construct multiple versions of the model with all possible combinations of C and Alpha, resulting in a total of 5 * 4 = 20 different models. The best-performing combination is then chosen.
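A minimal sketch of the enumeration step; the model-training and scoring calls are only indicated in comments, since they depend on the library in use:

```python
from itertools import product

C_values = [0.1, 0.2, 0.3, 0.4, 0.5]
alpha_values = [0.01, 0.1, 0.5, 1.0]

# Grid search trains one model per combination of hyperparameter values.
grid = list(product(C_values, alpha_values))
print(len(grid))  # 5 * 4 = 20 candidate models

# In practice, each combination would be trained and scored with
# cross-validation, e.g.:
# for C, alpha in grid:
#     score = cross_validate(make_model(C=C, alpha=alpha), X, y)
```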
Random search: RandomizedSearchCV picks random combinations of hyperparameters from the given ranges instead of checking every single combination like GridSearchCV. In each iteration it tries a new random combination of hyperparameter values and records the model's performance; after a fixed number of attempts it selects the best-performing set. Random search differs from grid search in that data scientists can provide statistical distributions rather than discrete values for each hyperparameter; the search pulls samples from each distribution and constructs a model for each sampled combination. Random search is preferable to grid search when the hyperparameter search space contains large or continuous ranges, where it would simply require too much effort to test each discrete value.
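A minimal pure-Python sketch of the idea, with a made-up objective function standing in for a real cross-validated score:

```python
import random

random.seed(0)

def objective(C, alpha):
    # Stand-in for a cross-validated score; in practice this would
    # train and evaluate a real model with these hyperparameters.
    return -(C - 0.3) ** 2 - (alpha - 0.5) ** 2

best_score, best_params = float("-inf"), None
for _ in range(25):  # fixed budget of random trials
    # Sample from continuous ranges rather than a discrete grid.
    C = random.uniform(0.1, 0.5)
    alpha = random.uniform(0.01, 1.0)
    score = objective(C, alpha)
    if score > best_score:
        best_score, best_params = score, (C, alpha)

print(best_params)  # best of 25 random draws
```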
Bayesian optimization: Bayesian Optimization takes a smarter approach. It treats hyperparameter tuning like a mathematical optimization problem and learns from past results to decide what to try next. Bayesian optimization is a sequential model-based optimization (SMBO) algorithm in which each iteration of testing improves the sampling method of the next. Based on prior tests, Bayesian optimization probabilistically selects a new set of hyperparameter values that is likely to deliver better results. The probabilistic model is referred to as a surrogate of the original objective function. The better the surrogate gets at predicting optimal hyperparameters, the faster the process becomes, with fewer objective function tests required.
- Build a probabilistic model (surrogate function) that predicts performance based on hyperparameters.
- Update this model after each evaluation.
- Use the model to choose the next best set to try.
- Repeat until the optimal combination is found.
The surrogate function models:
P(score y | hyperparameters x)
that is, the probability of observing a score y given hyperparameters x. By updating this model iteratively with each new evaluation, Bayesian optimization makes more informed decisions. Common surrogate models used in Bayesian optimization include:
- Gaussian Processes
- Random Forest Regression
- Tree-structured Parzen Estimators (TPE)
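The loop above can be sketched in pure Python. For brevity, this sketch uses a deliberately crude nearest-neighbor surrogate rather than a Gaussian process or TPE, and a made-up one-dimensional objective, so it illustrates the structure of SMBO rather than a production implementation:

```python
import random

random.seed(1)

def objective(x):
    # Expensive function we want to maximize (a stand-in for a
    # cross-validated model score over one hyperparameter x).
    return -(x - 0.7) ** 2

history = []  # (hyperparameter, observed score) pairs

def surrogate(x):
    # Crude surrogate: predict the score of the nearest past trial.
    # Real SMBO uses Gaussian processes, random forests, or TPE.
    nearest = min(history, key=lambda h: abs(h[0] - x))
    return nearest[1]

# Seed the history with a few random evaluations.
for _ in range(3):
    x = random.random()
    history.append((x, objective(x)))

for _ in range(20):
    # Propose candidates and pick the one the surrogate likes best...
    candidates = [random.random() for _ in range(50)]
    x = max(candidates, key=surrogate)
    # ...then pay for one real evaluation and update the surrogate's data.
    history.append((x, objective(x)))

best = max(history, key=lambda h: h[1])
print(round(best[0], 2))  # should land near the optimum at 0.7
```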
Successive Halving: Successive halving starts with a large pool of candidate configurations, each given a small training budget, and applies a form of "early stopping": after each round of training it whittles down the pool by removing the worst-performing half, increasing the budget for the survivors until a single configuration remains.
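A minimal sketch of the halving loop, with made-up configuration scores standing in for real training runs:

```python
import random

random.seed(2)

# Hypothetical configurations: each has a "true" quality, which we
# estimate more accurately as its training budget grows.
configs = {f"cfg{i}": random.random() for i in range(16)}

pool = list(configs)
budget = 1
while len(pool) > 1:
    # Evaluate every surviving configuration with the current budget
    # (here, the noisy estimate simply sharpens as budget grows).
    scores = {c: configs[c] + random.gauss(0, 0.1 / budget) for c in pool}
    # Keep the better half, then double the budget for the next round.
    pool = sorted(pool, key=lambda c: scores[c], reverse=True)[: len(pool) // 2]
    budget *= 2

print(pool[0])  # the single surviving configuration
```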
Key Hyperparameters in Popular Algorithms
Each machine learning algorithm has its own set of hyperparameters that can significantly impact performance. Here's a look at some key hyperparameters in popular algorithms:
1. Neural Networks
Neural networks take inspiration from the human brain and are composed of interconnected nodes that send signals to one another.
- Learning Rate: Learning rate sets the size of the steps a model takes when adjusting its parameters in each iteration. A high learning rate means that a model will adjust more quickly, but at the risk of overshooting the optimum and unstable training.
- Learning Rate Decay: Learning rate decay sets the rate at which the learning rate of a network drops over time, allowing the model to take large steps early in training and smaller, more precise steps as it converges.
- Batch Size: Batch size sets the number of samples the model will compute before updating its parameters. It has a significant effect on both the compute efficiency and the accuracy of the training process.
- Number of Hidden Layers: The number of hidden layers in a neural network determines its depth, which affects its complexity and learning ability. Fewer layers make for a simpler and faster model, while more layers (as in deep learning networks) allow it to capture more complex structure in the input data.
- Number of Nodes/Neurons per Layer: The number of nodes or neurons per layer sets the width of the model.
- Momentum: Momentum is the degree to which models update parameters in the same direction as previous iterations, rather than reversing course.
- Epochs: Epochs sets the number of times that a model is exposed to its entire training dataset during the training process.
- Activation Function: Activation function introduces nonlinearity into a model, allowing it to handle more complex datasets.
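As a small worked example of how batch size and epochs jointly determine the number of parameter updates (the dataset size here is made up):

```python
import math

n_samples = 10_000
batch_size = 32   # hyperparameter: samples per parameter update
epochs = 5        # hyperparameter: full passes over the dataset

# Each epoch performs one update per batch (the last batch may be partial).
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)  # → 313 1565
```

A smaller batch size means more (noisier) updates per epoch; a larger one means fewer, smoother updates per epoch.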
2. Support Vector Machine (SVM)
Support vector machine (SVM) is a machine learning algorithm specializing in data classification, regression and outlier detection.
- C: C controls the tradeoff between a wide decision margin and the number of misclassified training points. A lower C value establishes a smooth decision boundary with a higher error tolerance and more generic performance, but with a risk of incorrect data classification.
- Kernel: Kernel is a function that establishes the nature of the relationships between data points and separates them into groups accordingly. Depending on the kernel used, data points will show different relationships, which can strongly affect the overall SVM model performance. Linear, polynomial, radial basis function (RBF), and sigmoid are a few of the most commonly used kernels.
- Gamma: Gamma sets the level of influence support vectors have on the decision boundary. Support vectors are the data points closest to the hyperplane: the border between groups of data. Higher values pull strong influence from nearby vectors, while lower values limit the influence from more distant ones.
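The gamma effect can be seen directly from the RBF kernel formula, k(x, z) = exp(-gamma * ||x - z||²); a minimal one-dimensional sketch:

```python
import math

def rbf_kernel(x, z, gamma):
    # RBF kernel: similarity decays with squared distance, at a
    # rate controlled by the gamma hyperparameter.
    return math.exp(-gamma * (x - z) ** 2)

# Two points at distance 2: a higher gamma shrinks their similarity,
# so each support vector influences a smaller neighborhood.
print(rbf_kernel(0.0, 2.0, gamma=0.1))  # ≈ 0.67  (broad influence)
print(rbf_kernel(0.0, 2.0, gamma=5.0))  # ≈ 2e-9 (very local influence)
```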
3. XGBoost
XGBoost stands for “extreme gradient boosting” and is an ensemble algorithm that combines the predictions of multiple weak learners, typically decision trees, to produce a more accurate result.
- learning_rate: Similar to the learning rate hyperparameter used by neural networks, it controls the size of the correction made during each round of boosting.
- n_estimators: Sets the number of trees in the model.
- max_depth: Determines the architecture of each decision tree, setting the maximum number of nodes from the root to each leaf (the final classifier).
- min_child_weight: The minimum weight (the importance of a given class to the overall model training process) needed to spawn a new child node.
4. Linear Regression
While often considered a simpler algorithm, Linear Regression benefits from hyperparameters when dealing with multicollinearity or the risk of overfitting.
Regularization Parameter (alpha for Ridge/Lasso Regression): Regularization techniques like Ridge (L2) and Lasso (L1) add a penalty term to the cost function to shrink the model’s coefficients. This helps prevent the model from becoming too complex and fitting the noise in the training data.
alpha (in scikit-learn) controls the strength of the regularization.
- A higher alpha increases the penalty, leading to smaller coefficients and a simpler model, which can help with overfitting but might underfit if set too high.
- A lower alpha reduces the penalty, making the model more flexible and potentially leading to overfitting if not carefully managed.
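The shrinkage effect of alpha is visible in the simplest case: for one feature and no intercept, ridge regression has the closed-form coefficient sum(x·y) / (sum(x²) + alpha). A small sketch with made-up data:

```python
# One-feature ridge regression (no intercept): the penalty term
# alpha in the denominator directly shrinks the coefficient.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

def ridge_coef(alpha):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

print(ridge_coef(0.0))    # 2.0    (no regularization: the OLS solution)
print(ridge_coef(1.0))    # ≈ 1.87 (mild shrinkage)
print(ridge_coef(100.0))  # ≈ 0.25 (strong shrinkage; likely underfitting)
```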
5. Logistic Regression
Used for binary and multi-class classification, Logistic Regression also employs regularization to improve its generalization ability.
C (Inverse of Regularization Strength): Similar to alpha in linear regression, C controls the regularization strength; however, C is the inverse of the regularization parameter.
- A higher C means weaker regularization, allowing the model to fit the training data more closely and potentially leading to overfitting.
- A lower C means stronger regularization, forcing the model to have smaller coefficients and potentially underfitting.
Penalty (L1, L2): Specifies the type of regularization to be applied.
- L1 (Lasso): Can drive some feature coefficients to exactly zero, effectively performing feature selection.
- L2 (Ridge): Shrinks coefficients towards zero but rarely makes them exactly zero.
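The qualitative difference between the two penalties can be illustrated with the textbook coordinate-wise update rules: soft-thresholding for L1 and multiplicative shrinkage for L2 (the coefficient and penalty values here are made up):

```python
def l1_update(coef, penalty):
    # Soft-thresholding (the proximal step behind Lasso): small
    # coefficients are driven to exactly zero.
    sign = 1.0 if coef >= 0 else -1.0
    return sign * max(abs(coef) - penalty, 0.0)

def l2_update(coef, penalty):
    # Ridge-style shrinkage: coefficients shrink multiplicatively
    # but rarely reach exactly zero.
    return coef / (1.0 + penalty)

print(l1_update(0.3, 0.5))  # 0.0 → feature effectively removed
print(l2_update(0.3, 0.5))  # 0.2 → shrunk, but still nonzero
```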
6. Decision Tree
Decision Trees learn by recursively splitting the data based on feature values. Hyperparameters control the structure and complexity of these trees.
- max_depth: The maximum depth of the tree. A deeper tree can capture more complex relationships but is more prone to overfitting.
- min_samples_split: The minimum number of samples required to split an internal node. Higher values prevent the creation of very specific splits based on small subsets of data.
- min_samples_leaf: The minimum number of samples required to be at a leaf node. Similar to min_samples_split, this helps prevent the tree from becoming too sensitive to individual data points.
- criterion: The function used to measure the quality of a split (e.g., 'gini' for Gini impurity or 'entropy' for information gain in classification).
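As an illustration of the 'gini' criterion, the Gini impurity of a node's labels can be computed directly (the label lists are made up):

```python
def gini(labels):
    # Gini impurity: the probability that two samples drawn at random
    # from the node belong to different classes.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini(["a", "a", "a", "a"]))  # 0.0 → pure node, nothing to split
print(gini(["a", "a", "b", "b"]))  # 0.5 → maximally mixed (two classes)
```

A split is scored by how much it reduces impurity in the child nodes relative to the parent.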
7. K-Nearest Neighbors (KNN)
KNN is a non-parametric algorithm that classifies or regresses data points based on the majority class or average value of their nearest neighbors.
- n_neighbors: The number of neighboring data points to consider when making a prediction.
- A small n_neighbors can make the model sensitive to noise in the data.
- A large n_neighbors can smooth the decision boundaries but might miss local patterns.
- weights: The weight assigned to each neighbor.
- uniform: All neighbors are weighted equally.
- distance: Neighbors closer to the query point have a greater influence.
- metric: The distance metric to use (e.g., 'euclidean', 'manhattan', 'minkowski'). The choice of metric can significantly impact the results depending on the data distribution.
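A minimal pure-Python KNN sketch (one feature, Euclidean distance, made-up points) showing how n_neighbors changes the prediction:

```python
from collections import Counter

def knn_predict(train, query, n_neighbors):
    # train: list of (feature, label) pairs; 1-D distance for brevity.
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:n_neighbors]
    # Majority vote among the n_neighbors closest points.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [(1.0, "red"), (1.2, "red"), (3.0, "blue"), (3.2, "blue"),
         (1.1, "blue")]  # the point at 1.1 is a noisy outlier

# With n_neighbors=1 the single noisy "blue" point dominates;
# a larger n_neighbors smooths it out.
print(knn_predict(train, 1.08, n_neighbors=1))  # → blue (noise wins)
print(knn_predict(train, 1.08, n_neighbors=3))  # → red  (2-of-3 majority)
```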
Advantages of Hyperparameter Tuning
- Improved Model Performance: Finding the optimal combination of hyperparameters can significantly boost model accuracy and robustness.
- Reduced Overfitting and Underfitting: Tuning helps to prevent both overfitting and underfitting, resulting in a well-balanced model.
- Enhanced Model Generalizability: By selecting hyperparameters that optimize performance on validation data, the model is more likely to generalize well to unseen data.
- Optimized Resource Utilization: With careful tuning, resources such as computation time and memory can be used more efficiently, avoiding unnecessary work.
- Improved Model Interpretability: Properly tuned hyperparameters can make the model simpler and easier to interpret.
Challenges in Hyperparameter Tuning
- Dealing with High-Dimensional Hyperparameter Spaces: The larger the hyperparameter space, the more combinations need to be explored. This makes the search process computationally expensive and time-consuming, especially for complex models with many hyperparameters.
- Incorporating Domain Knowledge: Domain knowledge can help guide the hyperparameter search, narrowing down the search space and making the process more efficient.
Best Practices for Hyperparameter Tuning
- Start with Defaults (Scikit-learn’s defaults are often reasonable).
- Use Cross-Validation (Avoid overfitting with KFold or StratifiedKFold).
- Prioritize Impactful Hyperparameters (e.g., n_neighbors in KNN matters more than weights).
- Log Experiments (Track performance with tools like MLflow or Weights & Biases).
The Danger of Neglecting Hyperparameter Tuning
The failure to correctly tune and report hyperparameters has been identified as a key impediment to the accumulation of knowledge in computer science: without such tuning and reporting, it is difficult to fairly compare the performance of two different models. Such "hyperparameter deception" has muddied scientific progress in various subfields of computer science where machine learning plays a key role, including natural language processing, computer vision, and generative models.

