Linear Regression in Machine Learning: A Comprehensive Overview
Linear Regression is a cornerstone statistical method employed to model the linear relationship between a dependent variable and one or more independent variables. It's a fundamental technique for predicting outcomes by fitting a straight line to observed data points, offering ease of interpretation and application. This article provides an in-depth explanation of linear regression, covering its principles, assumptions, and practical implementation.
Introduction to Linear Regression
Linear Regression is a supervised learning algorithm that models the relationship between a dependent variable and one or more independent variables. The algorithm identifies the best-fit straight line relationship (linear equation) between these variables. This statistical method is then used to predict the outcome of future events and is quite useful for predictive analysis. For instance, consider determining a person's salary based on their years of work experience. Typically, as experience increases, so does salary. Here, years of experience is the independent variable.
Linear regression serves as the foundation for many advanced algorithms, such as logistic regression and neural networks. In simple linear regression, there is one independent variable and one dependent variable. The model estimates the slope and intercept of the line of best fit, which represents the relationship between the variables.
Linear regression is a relatively simple statistical regression technique used for predictive analysis in machine learning. It models the linear relationship between the independent (predictor) variable on the X-axis and the dependent (output) variable on the Y-axis. On a scatter plot of the output (y) against the predictor (X), the fitted line is referred to as the best-fit straight line.
Goal of Linear Regression
The goal of the linear regression algorithm is to obtain the optimal values for the intercept B₀ and the slope B₁ in the equation y = B₀ + B₁x, which together define the best-fit line. In simple terms, the best-fit line is the line that best approximates the given scatter plot.
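For simple linear regression, B₀ and B₁ have closed-form least-squares solutions. A minimal sketch with hypothetical experience-vs-salary data (the numbers are illustrative, not from any real dataset):

```python
import numpy as np

# Hypothetical data: years of experience (x) vs. salary in $1000s (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40.0, 45.0, 52.0, 58.0, 65.0])

# Closed-form ordinary-least-squares estimates for the best-fit line:
# B1 = cov(x, y) / var(x),  B0 = mean(y) - B1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)  # intercept and slope of the best-fit line
```

For this toy data the slope comes out to 6.3, i.e. each additional year of experience adds about 6.3 to the predicted salary.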
Gradient Descent and Optimization
Gradient Descent is an optimization algorithm that optimizes the cost function (objective function) to reach the optimal minimal solution. To find the optimum solution, we need to reduce the cost function (MSE) for all data points.
Imagine a U-shaped pit. You are standing at the uppermost point, and your motive is to reach the bottom. Suppose there is a treasure at the bottom, and you can only take a discrete number of steps to reach it. If you opt to take one small step at a time, you will eventually get to the bottom, but it will take a long time. If you decide to take larger steps each time, you may reach the bottom sooner, but there is a chance you will overshoot it and end up nowhere near the bottom.
To update B₀ and B₁, we take gradients from the cost function. We need to minimize the cost function J. One way to achieve this is to apply the batch gradient descent algorithm. In batch gradient descent, each iteration computes the gradients over the entire training set and then updates the values. The partial derivatives are the gradients, and they are used to update the values of B₀ and B₁.
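The update loop above can be sketched as follows. This is a minimal illustration on toy data generated from the line y = 2x + 1; the learning rate and iteration count are arbitrary choices, not values from the article:

```python
import numpy as np

# Toy data lying exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

b0, b1 = 0.0, 0.0   # initial guesses for intercept and slope
lr = 0.05           # learning rate (the "step size" in the pit analogy)
n = len(x)

for _ in range(2000):
    error = (b0 + b1 * x) - y
    # Partial derivatives of the MSE cost J = (1/n) * sum(error^2)
    grad_b0 = (2.0 / n) * error.sum()
    grad_b1 = (2.0 / n) * (error * x).sum()
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(round(b0, 3), round(b1, 3))  # converges toward 1.0 and 2.0
```

Because every iteration uses all n data points to compute the gradients, this is batch gradient descent; with a learning rate that is too large, the updates would overshoot instead of converging.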
Advantages of Linear Regression
Linear regression offers several key advantages:
- Simplicity and interpretability: It’s a relatively easy concept to understand and apply. The resulting simple linear regression model is a straightforward equation that shows how one variable affects another.
- Prediction: Linear regression allows you to predict future values based on existing data.
- Foundation for other techniques: It serves as a building block for many other data science and machine learning methods.
- Widespread applicability: Linear regression can be used in various fields, from finance and economics to science and social sciences.
In essence, linear regression provides a solid foundation for understanding data and making predictions.
Evaluating Linear Regression Models
The strength of any linear regression model can be assessed using various evaluation metrics:
- R-squared: R-squared measures the proportion of the variation in the response that is captured by the developed model. For an ordinary least squares fit with an intercept, it always ranges between 0 and 1.
- Residual Sum of Squares (RSS): RSS is defined as the sum of squares of the residual for each data point in the plot/data.
- Total Sum of Squares (TSS): TSS is defined as the sum of errors of the data points from the mean of the response variable.
- Root Mean Squared Error (RMSE): The Root Mean Squared Error is the square root of the variance of the residuals. It specifies the absolute fit of the model to the data i.e. how close the observed data points are to the predicted values.
- Residual Standard Error (RSE): To make the estimate of residual variance unbiased, the sum of the squared residuals is divided by the degrees of freedom rather than by the total number of data points. The resulting quantity is called the Residual Standard Error (RSE).
R-squared is often a better comparison measure than RMSE, because the value of RMSE depends on the units of the variables, whereas R-squared is unit-free.
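The metrics above can be computed directly from observed values and predictions. A minimal sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.8])

rss = np.sum((y_true - y_pred) ** 2)          # Residual Sum of Squares
tss = np.sum((y_true - y_true.mean()) ** 2)   # Total Sum of Squares
r_squared = 1 - rss / tss                     # fraction of variance explained
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rss, tss, round(r_squared, 4), round(rmse, 4))
```

Note the relationship R² = 1 − RSS/TSS: the closer the predictions track the observations, the smaller RSS is relative to TSS and the closer R² gets to 1.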
Assumptions of Linear Regression
Regression is a parametric approach, which means that it makes assumptions about the data for analysis. These assumptions include:
- Linearity: The relationship between the independent and dependent variables is linear.
- Independence of residuals: The error terms should not be dependent on one another (like in time-series data wherein the next value is dependent on the previous one). There should be no correlation between the residual terms.
- Normal distribution of residuals: The residuals should be normally distributed with a mean equal to zero or close to zero. This is checked to confirm whether the selected line is the line of best fit or not.
- Equal variance of residuals: The error terms must have constant variance. This phenomenon is known as Homoscedasticity. The presence of non-constant variance in the error terms is referred to as Heteroscedasticity.
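The zero-mean and constant-variance assumptions can be checked numerically from the fitted model's residuals. A rough sketch on hypothetical residuals (in practice you would also plot residuals against fitted values):

```python
import numpy as np

# Hypothetical residuals from a fitted regression model
residuals = np.array([0.2, -0.1, -0.3, 0.2, 0.0])

# Zero-mean check: the residual mean should be at or near zero
print(residuals.mean())

# Rough homoscedasticity check: compare the residual spread in the first
# and second halves of the (ordered) data; similar values suggest
# constant variance, a large gap hints at heteroscedasticity
half = len(residuals) // 2
print(residuals[:half].std(), residuals[half:].std())
```

This split-and-compare check is only a heuristic; formal tests (e.g. Breusch-Pagan) exist for diagnosing heteroscedasticity more rigorously.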
Hypothesis Testing in Linear Regression
Once you have fitted a straight line on the data, you need to ask, “Is this straight line a significant fit for the data?” or “Does the beta coefficient explain the variance in the plotted data?” This is where hypothesis testing on the beta coefficients comes in.
- F statistic: It is used to assess whether the overall model fit is significant or not. The small change from testing a single coefficient is that the null hypothesis now involves the betas for all the variables used.
Overfitting and Underfitting
- Overfitting: When more and more variables are added to a model, the model may become far too complex and usually ends up memorizing all the data points in the training set. This phenomenon is known as the overfitting of a model. When a model has low bias and higher variance it ends up memorizing the data and causing overfitting. Overfitting causes the model to become specific rather than generic.
- Underfitting: When the model fails to learn from the training dataset and is also unable to generalize to the test dataset, it is referred to as underfitting. When a model has high bias and low variance, it ends up not generalizing the data, causing underfitting. It is unable to find the hidden underlying patterns in the data. This usually leads to low training accuracy and very low test accuracy.
Multicollinearity
- Variance Inflation Factor (VIF): Pairwise correlations may not always be useful, since a single variable might not fully explain another variable on its own, yet several variables combined could do so. To check these sorts of relations between variables, one can use VIF. VIF captures the relationship of one independent variable with all the other independent variables. The common heuristic for VIF values is: if VIF > 10, the value is high and the variable should be dropped; if VIF is around 5, the variable may be acceptable but should be inspected first.
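VIF for predictor j is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing column j on all the other predictor columns. A minimal NumPy sketch on synthetic data (statsmodels also provides a ready-made `variance_inflation_factor`):

```python
import numpy as np

def vif(X):
    """VIF for each column of X, via the auxiliary-regression definition."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        # Intercept column plus the other predictors for the auxiliary fit
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - np.sum(resid ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Hypothetical data: x2 is nearly a copy of x1, so both get huge VIFs,
# while the independent x3 stays near the ideal value of 1
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
x3 = rng.normal(size=100)
print(vif(np.column_stack([x1, x2, x3])))
```

Under the VIF > 10 heuristic from the text, x1 and x2 would both be flagged here, and one of them would be dropped.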
Bias and Variance
- Bias: Bias is a measure of how accurate a model’s predictions are likely to be on future unseen data. Complex models, assuming enough training data is available, can make accurate predictions, whereas models that are too naive are very likely to perform badly. Generally, linear algorithms have high bias, which makes them fast to learn and easier to understand, but in general less flexible.
- Variance: Ideally, a model should have lower variance which means that the model doesn’t change drastically after changing the training data(it is generalizable).
Linear Regression in Python
This is the section where you’ll find out how to perform the regression in Python. We will use Advertising sales channel prediction data. ‘Sales’ is the target variable that needs to be predicted.
The first step is to fire up your Jupyter notebook and load all the prerequisite libraries in it. Let us now import the data into a DataFrame. A DataFrame is a tabular data structure provided by the pandas library. Let us plot the scatter plot for the target variable vs. the predictor variables in a single plot to get some intuition.
We can use sklearn or statsmodels to apply linear regression. After assigning the variables, you need to split the data into training and testing sets. You’ll perform this by importing train_test_split from sklearn.model_selection.
By default, the statsmodels library fits a line on the dataset which passes through the origin. But in order to have an intercept, you need to manually use the add_constant attribute of statsmodels.
Now, let’s see the evaluation metrics for this linear regression operation. As you can see, this code gives you a brief summary of the linear regression. The coefficient for TV is 0.054, with a very low p-value, so the coefficient is statistically significant. R-squared is 0.816, meaning that 81.6% of the variance in Sales is explained by TV. The F-statistic has a very low p-value (practically zero).
Now that you have simply fitted a regression line on your train dataset, it is time to make some predictions on the test data.
Apart from statsmodels, there is another package, namely sklearn, that can be used to perform linear regression. We will use the linear_model library from sklearn to build the model. There’s one small step we need to add, though: when there’s only a single feature, we need to reshape it into an additional column (a two-dimensional array) for the linear regression fit to be performed successfully.
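A minimal sklearn sketch of that reshape step, on hypothetical single-feature data (not the Advertising dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical single-feature data; sklearn expects a 2-D feature array,
# so the lone column is reshaped to shape (n_samples, 1)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

lm = LinearRegression()          # fits the intercept by default
lm.fit(x, y)
print(lm.intercept_, lm.coef_[0])
```

Unlike statsmodels, sklearn's LinearRegression fits the intercept by default, so no add_constant step is needed.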
Linear Regression: Questions and Answers
- Q1. What are the parameters of a linear regression?
- A. Linear regression has two main parameters: slope (weight) and intercept. The slope represents the change in the dependent variable for a unit change in the independent variable. The intercept is the value of the dependent variable when the independent variable is zero. The goal is to find the best-fitting line that minimizes the difference between predicted and actual values.
- Q2. What is the formula for a linear regression line?
- A. The formula for a linear regression equation is: y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope (weight), and b is the intercept. It represents the best-fitting straight line describing the relationship between the variables by minimizing squared differences between actual and predicted values.
- Q3. What is the application of linear regression?
- A. Linear regression is used in statistics, economics, finance, and more to analyze relationships between variables. Examples include predicting stock prices in finance or studying the impact of advertising on sales in marketing. A basic example involves predicting weight based on height, where a linear equation is used to estimate weight from height data.
- Q4. What is a basic example of linear regression?
- A. A basic linear regression example involves predicting a person’s weight based on height. In this scenario, height is the independent variable, while weight is the dependent variable. The relationship between height and weight is modeled using a simple linear equation, where the weight is estimated as a function of the height.
- Q5. What is the linear regression algorithm?
- A. Linear regression predicts the relationship between variables by fitting a straight line that minimizes the differences between predicted and actual values. It analyzes and forecasts data trends.
tags: #linear-regression #machine-learning

