Advanced – Ridge Regression Notes (Module 1)

In our previous discussions about Linear Regression (and the OLS estimator), we identified a key limitation: multicollinearity. When the predictor variables (columns of X) are highly correlated, the matrix X^{T}X becomes nearly singular, undermining the stability of the OLS estimator. Ridge Regression addresses this limitation by incorporating a parameter \lambda (lambda), called the Ridge parameter. This addition stabilizes the inversion of X^{T}X and, crucially, reduces the variance of our model at the expense of introducing some bias. However, before delving into the advantages of Ridge Regression over Linear Regression, it’s essential to lay some theoretical groundwork.

One fundamental aspect of understanding Ridge Regression lies in examining the Singular Value Decomposition (SVD) of the matrix X. Grasping the role of SVD here is pivotal, as it provides deep insight into how Ridge Regression stabilizes the solutions, addressing issues like multicollinearity and overfitting that often plague standard OLS models. SVD breaks the matrix X down into three components: U, \Sigma, and V. Here U and V are orthogonal matrices, and \Sigma is a diagonal matrix containing the singular values of X. In this context, X = U \Sigma V^{T}.
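
As a quick numerical illustration (a minimal sketch in NumPy, with an arbitrary made-up matrix), the decomposition can be computed and the factorization X = U \Sigma V^{T} verified directly:

import numpy as np

# An arbitrary small design matrix X, used only for illustration.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.1],
              [2.0, 1.0, 0.5]])

# Thin SVD: U is k x (d+1), S holds the singular values, Vt is V transposed.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Reconstruct X from its factors and confirm X = U Sigma V^T.
print(np.allclose(X, U @ np.diag(S) @ Vt))  # True
print(S)  # singular values of X, in decreasing order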

As was stated before, highly correlated columns of X cause problems when calculating the OLS estimator, specifically in inverting X^{T}X. In this case, the smallest singular values of X are approximately zero, and consequently \text{det}(X^{T}X) = \text{det}(V \Sigma^{2} V^{T}) is close to zero. Matrices with determinants that are nearly zero pose significant challenges for stable inversion: a nearly singular X^{T}X, in the presence of multicollinearity, is sensitive to small changes in its elements, so minor fluctuations in the data can produce large variations in the computed inverse, which in turn significantly affects the estimated regression coefficients. In simpler terms, when \text{det}(X^{T}X) is close to zero, the OLS solution becomes numerically unreliable.
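
The effect is easy to see numerically. Below is a minimal sketch (with synthetic, hypothetical data) in which two columns of X are nearly identical; the determinant of X^{T}X collapses toward zero and its condition number blows up:

import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)          # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])    # intercept column plus two predictors

XtX = X.T @ X
print(np.linalg.det(XtX))   # tiny relative to the scale of the entries
print(np.linalg.cond(XtX))  # enormous condition number: inversion is unstable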

We tackle this issue by adding an adjustable term \lambda I to \Sigma^{2} in the decomposition of X^{T}X, where I is the identity matrix. This adjustment changes the determinant calculation \text{det}(V \Sigma^{2} V^{T}) to \text{det}(V (\Sigma^{2} + \lambda I) V^{T}). Expanding this, we get:

\text{det}(V( \Sigma^{2} + \lambda I) V^{T}) = \text{det}(V \Sigma^{2} V^{T} + \lambda VV^{T})

and since V is an orthogonal matrix, VV^{T} equals I:

= \text{det}(V \Sigma^{2} V^{T} + \lambda I)

= \text{det}(X^{T}X + \lambda I).

As \lambda increases, the determinant moves away from zero, which ensures that (X^{T}X + \lambda I)^{-1} can be computed stably. This leads us to an adjustably stable estimator of our weights:

w_{Ridge} = (X^{T}X + \lambda I)^{-1}X^{T} y.
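
As a minimal sketch (reusing the hypothetical nearly collinear data from the earlier example, with a synthetic response y and an arbitrary choice of \lambda), the estimator can be computed directly from this formula:

import numpy as np

rng = np.random.default_rng(0)
n, lam = 100, 1.0                             # lam is the ridge parameter lambda
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)           # nearly collinear predictor
X = np.column_stack([np.ones(n), x1, x2])
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

# w_Ridge = (X^T X + lambda I)^{-1} X^T y, computed via a linear solve
# rather than an explicit matrix inverse, for numerical stability.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)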

Traditionally, Ridge Regression is formulated through the lens of the least squares method, enhanced with the regularization term \lambda \| w\|^{2}. This approach aims to mitigate the excessive variance often observed in the OLS model. The mathematical representation is:

w_{Ridge} = \text{argmin}_{w} \{ \| y - Xw \| ^{2} + \lambda \| w \|^{2} \}

To delve deeper, we can expand and simplify this equation, mirroring the process used in Linear Regression:

= \text{argmin}_{w} \{ (y - Xw)^{T}(y-Xw) + \lambda w^{T}w \}

= \text{argmin}_{w}\{y^{T}y - 2 w^{T}X^{T}y + w^{T}X^{T}Xw + \lambda w^{T}w\}

= \text{argmin}_{w} \{y^{T}y - 2 w^{T}X^{T} y + w^{T}(X^{T}X + \lambda I) w\}

To find the solution to this convex problem, we set its gradient, \nabla_{w}, equal to zero:

0 = \nabla_{w} \{y^{T}y - 2 w_{Ridge}^{T} X^{T} y + w_{Ridge}^{T}(X^{T}X + \lambda I) w_{Ridge}\}

= 2(X^{T}X + \lambda I) w_{Ridge} - 2 X^{T}y,

which leads us back to the previously stated solution for w_{Ridge}:

w_{Ridge} = (X^{T}X + \lambda I)^{-1}X^{T}y.

This derivation not only demonstrates the close relationship between Ridge Regression and OLS but also highlights the impact of the regularization term in refining the model.
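
As a sanity check on this derivation, the closed-form solution can be compared against a direct numerical minimization of the penalized objective; the sketch below (using scipy.optimize.minimize on the same kind of hypothetical data) assumes the two agree up to solver tolerance:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, lam = 100, 1.0
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x1 + 1e-6 * rng.normal(size=n)])
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

def ridge_objective(w):
    # ||y - Xw||^2 + lambda ||w||^2
    resid = y - X @ w
    return resid @ resid + lam * w @ w

w_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
w_numeric = minimize(ridge_objective, x0=np.zeros(X.shape[1])).x
print(np.allclose(w_closed, w_numeric, atol=1e-4))  # True, up to solver tolerance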

Let’s observe the meaning of \lambda in terms of what is happening to the predictors (or columns) of X. Again, assuming X is a k-by-(d+1) matrix, we know the hat matrix

H_{LS} = X(X^{T}X)^{-1}X^{T}

projects the response variable y onto the space of fitted values Xw. Assuming the columns of X are linearly independent, one can see that the rank of H_{LS} is equal to the number of columns of X, namely d+1. We can also confirm this by taking the trace of H_{LS}:

\text{trace}[H_{LS}] = \text{trace}[X(X^{T}X)^{-1}X^{T}]

= \text{trace}[(X^{T}X)^{-1}X^{T}X]

= \text{trace}[I]

= d+1

So we can see that \text{trace}[H_{LS}] is equal to the number of columns of X (the d predictors plus the intercept), provided (X^{T}X)^{-1} exists. In the case of multicollinearity, we can modify the hat matrix accordingly:

H_{Ridge} = X(X^{T}X + \lambda I)^{-1}X^{T}

and use its trace to define the effective degrees of freedom of the ridge fit:

df_{X}(\lambda) = \text{trace}[H_{Ridge}]

Using the SVD of X, this trace equals \sum_{i} \sigma_{i}^{2} / (\sigma_{i}^{2} + \lambda): it is d+1 when \lambda = 0 and shrinks toward zero as \lambda grows, so the ridge fit behaves as though it uses fewer than d+1 effective predictors.
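
A short sketch (with hypothetical data) makes this concrete: \text{trace}[H_{LS}] recovers d+1 exactly, while \text{trace}[H_{Ridge}] matches the SVD expression above and shrinks as \lambda grows:

import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # d + 1 = 3 columns
S = np.linalg.svd(X, compute_uv=False)                      # singular values of X

H_ls = X @ np.linalg.solve(X.T @ X, X.T)
print(np.trace(H_ls))                                       # 3, i.e. d + 1

for lam in [0.0, 1.0, 10.0, 100.0]:
    H_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    # trace[H_Ridge] and the SVD expression agree; both shrink as lambda grows.
    print(lam, np.trace(H_ridge), np.sum(S**2 / (S**2 + lam)))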

Now, let’s pivot our discussion from degrees of freedom to a broader framing of Ridge Regression, which, for the sake of variety and depth, we’ll refer to by its other name: Tikhonov Regularization. We are given a more generalized formula:

w_{Tikhonov} = \text{argmin}_{w} \| y - Xw \|^{2} + \| \Gamma w \| ^{2}

where \Gamma represents the Tikhonov matrix. In the specific case of Ridge Regression, this matrix simplifies to \sqrt{\lambda} I. This perspective is instrumental in analyzing the impact of the regularization on our model, particularly through the matrix \Gamma^{T} \Gamma.

Following the same procedure as before (expanding the objective and setting its gradient to zero), we deduce:

w_{Tikhonov} = (X^{T}X + \Gamma^{T} \Gamma)^{-1} X^{T} y
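
A final sketch (again with hypothetical data) applies this general formula and confirms that the choice \Gamma = \sqrt{\lambda} I reproduces the ridge estimator, while other choices of \Gamma penalize the coefficients differently:

import numpy as np

rng = np.random.default_rng(0)
n, lam = 100, 1.0
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, rng.normal(size=n)])
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

def tikhonov(X, y, Gamma):
    # w_Tikhonov = (X^T X + Gamma^T Gamma)^{-1} X^T y
    return np.linalg.solve(X.T @ X + Gamma.T @ Gamma, X.T @ y)

# Gamma = sqrt(lambda) I reproduces Ridge Regression exactly.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(np.allclose(w_ridge, tikhonov(X, y, np.sqrt(lam) * np.eye(X.shape[1]))))  # True

# A different (hypothetical) Gamma, e.g. one that leaves the intercept unpenalized.
Gamma = np.sqrt(lam) * np.diag([0.0, 1.0, 1.0])
print(tikhonov(X, y, Gamma))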
