Advanced – Linear Regression

Linear regression serves as a fundamental stepping stone into the world of machine learning, embodying both simplicity and the power of predictive analytics. Conceptually, it rests on a graceful mathematical framework that reveals both its potential and its limitations. This guide will walk you through the mathematical fundamentals, offering a clear exposition of its foundational principles. We will also consider how to confront and, where possible, circumvent these limitations, paving the way for more sophisticated analytical endeavors.

Envision a set of data points, each holding a story which can be better understood through the lens of linear regression. For i = 1, 2, 3, ..., k, consider each data point x_i as a vector within a d-dimensional space, \mathbb{R}^d, where each dimension corresponds to a different feature or measurement. Imagine the features of x_i as coordinates plotted on a graph; each one marks a unique position on the axes, sketching the narrative of your dataset in a geometrical tableau.

We can then compile these data points into a k-by-d data matrix X, akin to a spreadsheet where each row is a data point and each column a distinct feature.


Corresponding to each data point in our matrix, there is an outcome or response; collecting these gives a vector y \in \mathbb{R}^{k}, with one component per data point.
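To make this concrete, here is a minimal sketch in Python with NumPy of how such a data matrix and response vector might be laid out; the particular numbers are made up purely for illustration.

```python
import numpy as np

# A toy dataset with k = 5 observations and d = 2 features.
# Each row of X is one data point x_i; y holds the k responses.
X = np.array([
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 1.5],
    [4.0, 3.0],
    [5.0, 2.5],
])                                        # shape (k, d) = (5, 2)
y = np.array([3.1, 2.4, 4.6, 7.2, 7.9])   # shape (k,)    = (5,)
```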

Linear regression rests upon four critical assumptions that shape our analysis:

The first assumption is linearity: the relationship between the independent variables x_i and the dependent variable y_i is linear, meaning it can be represented by a straight line (or, when d > 1, a hyperplane) within the context of the model.

While the complexities of real-world phenomena often defy simple patterns, in the realm of linear regression, we distill these intricacies into a linear model that seeks the hyperplane most closely aligned with our data. We can describe this linear relationship using an equation that aims to determine the optimal “weights” or coefficients (w) of our variables, while accounting for possible errors or noise in the data (\xi_i):

y_i = (x_i^{T}, 1)\, w + \xi_i

= \tilde{y}_i + \xi_i

In this equation, the term (x_i^{T}, 1) represents the feature vector x_i after being transposed and augmented with a 1, so that the bias term can be folded into the weight vector w. This bias term accounts for any offset from the origin in the model. The inner product of this augmented vector with w gives us the predicted value \tilde{y}_i based on our model. Finally, \xi_i represents the error term for the i-th data point, accounting for the deviation of the predicted value from the actual observed value y_i. The goal in linear regression is to adjust the weights w to minimize these errors across all data points.

If the equation above seems intricate, imagine it as finding the best-fit line through a scatterplot. By appending a column of ones to our data matrix (so that X is now k-by-(d+1)), we compactly model all observations with

y = Xw + \xi = \tilde{y} + \xi,

where y encapsulates the outcomes, \tilde{y} = Xw collects the model’s fitted values (the points on the best-fit hyperplane), and \xi accounts for any discrepancies.
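Continuing the toy example from above (the weights below are arbitrary illustrative values, not fitted ones), the augmented matrix and the resulting predictions might look like this:

```python
import numpy as np

# Append a column of ones so the bias is absorbed into the weight
# vector; X_aug has shape (k, d + 1).
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

w = np.array([1.5, 0.2, 0.4])   # candidate weights; last entry is the bias
y_tilde = X_aug @ w             # predictions \tilde{y} = Xw
xi = y - y_tilde                # residuals \xi = y - Xw
```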

The second assumption is homoscedasticity: the variability of the errors, which measure the vertical distances from our data points to the best-fit line, should remain constant across all levels of our independent variables. This ensures that the model’s predictive accuracy does not depend on where in feature space a data point lies. Mathematically, we express this (together with normality, which we will lean on later) as the errors having a mean of zero and a constant variance, concisely written as \xi \sim N(0, \sigma^{2}I).
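A common way to eyeball this assumption is a residuals-versus-fitted plot. Here is a small sketch using Matplotlib, assuming the y_tilde and xi arrays from the toy example above (any fitted model’s residuals would do).

```python
import matplotlib.pyplot as plt

# Under homoscedasticity, the vertical spread of the residuals should
# look roughly constant as we move along the fitted values.
plt.scatter(y_tilde, xi)
plt.axhline(0.0, linestyle="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```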

The third assumption is independence of observations: each data point must not be influenced by any other, which ensures that our model’s inferences are sound and unbiased. Mathematically, we express this as p(y_i|y_j) = p(y_i) for i \neq j, meaning the value of one data point does not alter the distribution of any other. This assumption is critical to avoid issues like autocorrelation (common in time series data), which leaves the coefficient estimates inefficient and the standard errors, and hence our inferences, unreliable.
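For data with a natural ordering, one quick (if crude) diagnostic is the lag-1 autocorrelation of the residuals, which should sit near zero. This sketch assumes the residual vector xi from the earlier toy example.

```python
import numpy as np

# Correlation between consecutive residuals; values far from zero
# suggest serial dependence between observations.
lag1_corr = np.corrcoef(xi[:-1], xi[1:])[0, 1]
print("lag-1 residual autocorrelation:", lag1_corr)
```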

The fourth assumption is that our independent variables, the x_i's, exhibit minimal correlation with one another. When they are instead highly correlated, a condition termed multicollinearity, the individual effect of each variable is obscured, the variance of the estimated coefficients is inflated, and the interpretability of the model is compromised. Mathematically, we anticipate a correlation matrix of the independent variables whose off-diagonal elements are close to zero, indicating low multicollinearity. We’ll delve into strategies to diagnose and rectify multicollinearity later, but for our initial model, we proceed under the assumption that such correlations are negligible.
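Should you want to check this on your own data, two simple diagnostics are sketched below, assuming the (un-augmented) data matrix X from the toy example: the feature-feature correlation matrix, and the condition number of X, which grows large as columns become nearly linearly dependent.

```python
import numpy as np

# Off-diagonal entries of the correlation matrix should be small in
# magnitude; a very large condition number warns of near-collinearity.
corr = np.corrcoef(X, rowvar=False)
print("feature correlation matrix:\n", corr)
print("condition number of X:", np.linalg.cond(X))
```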

Ordinary Least Squares Method

To identify the best-fitting line as described by our model, we need to find the weight vector w that minimizes the discrepancy between the observed outcomes y and the predictions \tilde{y}=Xw. This discrepancy is encapsulated in the residuals \xi, given by:

\xi = y - Xw

Our goal is to minimize the sum of the squares of these residuals, which is equivalent to minimizing their Euclidean norm squared:

\| \xi \|^{2} = \| y - Xw\|^{2}

Minimizing this quantity involves finding the weight vector w_{LS} that provides the smallest possible value for the squared residuals:

w_{LS} = \text{argmin}_{w} \|\xi\|^2 = \text{argmin}_{w} \| y - Xw\|^2
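Written as code, the objective is a one-liner; this sketch simply transcribes \| y - Xw\|^{2}, assuming the augmented X_aug, the vector y, and the arbitrary candidate weights from the toy example above.

```python
import numpy as np

def squared_residual_norm(w, X, y):
    """Sum of squared residuals ||y - Xw||^2 for a candidate w."""
    resid = y - X @ w
    return resid @ resid

# Evaluate the objective at the illustrative candidate weights used earlier.
print(squared_residual_norm(np.array([1.5, 0.2, 0.4]), X_aug, y))
```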

Expanding the squared norm \| y - Xw\|^{2}, we aim to solve the optimization problem:

= \text{argmin}_{w} \{y^{T}y - 2w^{T}X^{T}y + w^{T}X^{T}Xw\}

We can eliminate the first term here, since it is not a function of w.

= \text{argmin}_{w}\{w^{T}X^{T}Xw - 2w^{T}X^{T}y\}

This formulation gives us the ordinary least squares (OLS) approach to linear regression. By the Gauss-Markov theorem, when the assumptions above hold, the OLS estimator is the best linear unbiased estimator: among all linear unbiased estimators, it has the smallest variance.

To determine the optimal weights w, we take advantage of the convex nature of our function. Convexity ensures that any local minimum is also a global minimum, and thus, the point where the gradient (or the first derivative with respect to w) of our function equals zero will give us the value of w that minimizes the function. We calculate this by setting the derivative of our squared error term with respect to w to zero:

0 = \nabla_{w} \{w^{T}X^{T}Xw - 2w^{T}X^{T} y\}

Simplifying this expression by taking the derivative, we obtain:

= 2X^{T}Xw - 2X^{T}y

By rearranging the terms, we get:

X^{T}Xw = X^{T}y

This system is known as the normal equations. When X^{T}X is invertible, we can solve for the weight vector w by multiplying both sides by the inverse of X^{T}X:

w_{LS} = (X^{T}X)^{-1}X^{T}y.

This solution w_{LS} is known as the Ordinary Least Squares (OLS) estimator. It gives us the set of weights that minimizes the squared residuals, thus providing the ‘best fit’ for our linear model under the least squares criterion. It’s important to note that this solution only exists when X^{T}X is invertible, which requires our data to be free of perfect multicollinearity, as discussed earlier.

An essential precondition for computing the matrix inverse (X^{T}X)^{-1} is that the matrix X has full column rank, meaning all columns (which represent our variables) are linearly independent. If the columns are exactly linearly dependent, X^{T}X is singular and no inverse exists; if they are merely highly correlated, as under multicollinearity, X^{T}X becomes nearly singular. In practical terms, this can lead to large numerical errors in the calculation of w_{LS} and unreliable regression estimates. Thus, confirming the absence of multicollinearity is not just a theoretical consideration but a computational necessity for the integrity of the OLS estimation.
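Here is a self-contained sketch of the OLS estimator on synthetic data (the “true” weights and noise level are made up for illustration). In practice one solves the normal equations with np.linalg.solve or np.linalg.lstsq rather than forming the explicit inverse, which is slower and numerically less stable.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 100, 2
X = rng.normal(size=(k, d))
X_aug = np.hstack([X, np.ones((k, 1))])   # append the bias column
w_true = np.array([1.5, -2.0, 0.7])       # last entry is the bias
y = X_aug @ w_true + rng.normal(scale=0.5, size=k)

# Solve X^T X w = X^T y (the normal equations) in two equivalent ways.
w_ls = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print("w_LS (normal equations):", w_ls)
print("w_LS (lstsq):           ", w_lstsq)

# At the minimizer the gradient 2 X^T X w - 2 X^T y should vanish.
grad = 2 * X_aug.T @ X_aug @ w_ls - 2 * X_aug.T @ y
print("gradient norm at w_LS:", np.linalg.norm(grad))
```

On this synthetic example, both routes recover weights close to w_true, and the gradient norm at the solution is numerically zero, as the derivation above predicts.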

Maximum Likelihood Estimation Approach

Building upon Assumption Two, we take a probabilistic approach to estimate w using Maximum Likelihood Estimation (MLE). We start by recognizing that the residual \xi is normally distributed as N(0, \sigma ^2 I). This leads us to the probability density function (pdf) for \xi:

p(\xi) = \frac{1}{(2 \pi \sigma^{2})^{k/2}}e^{-\frac{1}{2\sigma^2} \xi^{T} \xi}

With a substitution for \xi, the pdf becomes:

p(y | w, X) = \frac{1}{(2 \pi \sigma^{2})^{k/2}}e^{-\frac{1}{2\sigma^2} (y-Xw)^{T} (y-Xw)}

Here, Xw is posited as the mean around which the observed outcomes y are distributed. To find the optimal w, we turn to MLE, seeking the value of w that maximizes p(y | w, X). Because the logarithm is monotone, maximizing the likelihood is equivalent to maximizing \ln p(y | w, X), and the normalizing constant in front of the exponential, which does not depend on w, drops out along the way:

w_{ML} = \text{argmax}_{w} \ln p(y | w, X)

= \text{argmax}_{w} - \frac{1}{2 \sigma^2} (y - Xw)^{T}(y-Xw)

= \text{argmin}_{w} \| \xi\|^2

= w_{LS}

Remarkably, this illustrates the equivalence between MLE under the assumption of normally distributed errors and the method of minimizing the sum of squared residuals, as we have previously shown with OLS. This equivalence bridges our statistical assumptions with practical estimation techniques. For a deeper dive into the MLE process, I invite the reader to reference my prior blog post on the topic.
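As a numerical sanity check of this equivalence, the sketch below minimizes the negative log-likelihood with scipy.optimize.minimize and compares the result to the OLS solution. It reuses X_aug, y, and w_ls from the OLS sketch above, and fixes \sigma at an arbitrary value since it scales the objective without changing the maximizing w.

```python
import numpy as np
from scipy.optimize import minimize

sigma = 1.0  # arbitrary; affects the objective's scale, not its minimizer

def neg_log_likelihood(w):
    # Negative log-likelihood up to an additive constant independent of w.
    resid = y - X_aug @ w
    return resid @ resid / (2 * sigma**2)

w_ml = minimize(neg_log_likelihood, x0=np.zeros(X_aug.shape[1])).x
print("w_ML:", w_ml)
print("w_LS:", w_ls)   # the two should agree up to optimizer tolerance
```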
