Advanced – Ridge Regression Notes (Module 1)

In our previous discussions about Linear Regression (and the OLS estimator), we identified a key limitation: multicollinearity. When the predictor variables (columns of X) are highly correlated, the matrix X^{T}X becomes nearly singular, undermining the stability of the OLS estimator. Ridge Regression addresses this limitation by incorporating a parameter \lambda (lambda), called the Ridge parameter. This addition stabilizes the inversion of X^{T}X and, crucially, reduces the variance of our model at the expense of introducing some bias. However, before delving into the advantages of Ridge Regression over Linear Regression, it’s essential to lay some theoretical groundwork.

One fundamental aspect of understanding Ridge Regression lies in examining the Singular Value Decomposition (SVD) of the matrix X. Grasping the role of SVD here is pivotal, as it provides deep insight into how Ridge Regression stabilizes the solutions, addressing issues like multicollinearity and overfitting that often plague standard OLS models. SVD breaks the matrix X down into three components: U, \Sigma, and V. Here U and V are orthogonal matrices, and \Sigma is a diagonal matrix containing the singular values of X. In this context, X = U \Sigma V^{T}.
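
As a quick numerical illustration (a minimal sketch in NumPy, with an arbitrary made-up matrix), the decomposition can be computed and the factorization X = U \Sigma V^{T} verified directly:

import numpy as np

# An arbitrary small design matrix X, used only for illustration.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.1],
              [2.0, 1.0, 0.5]])

# Thin SVD: U is k x (d+1), S holds the singular values, Vt is V transposed.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Reconstruct X from its factors and confirm X = U Sigma V^T.
print(np.allclose(X, U @ np.diag(S) @ Vt))  # True
print(S)  # singular values of X, in decreasing order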

As was stated before, highly correlated columns of X cause problems when calculating the OLS estimator, specifically in inverting X^{T}X. In this case, the smallest singular values of X are approximately zero, and consequently \text{det}(X^{T}X) = \text{det}(V \Sigma^{2} V^{T}) is close to zero. Matrices with determinants that are nearly zero pose significant challenges for stable inversion: a nearly singular X^{T}X, in the presence of multicollinearity, is sensitive to small changes in its elements, so minor fluctuations in the data can produce large variations in the computed inverse, which in turn significantly affects the estimated regression coefficients. In simpler terms, when \text{det}(X^{T}X) is close to zero, the OLS solution becomes numerically unreliable.
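
The effect is easy to see numerically. Below is a minimal sketch (with synthetic, hypothetical data) in which two columns of X are nearly identical; the determinant of X^{T}X collapses toward zero and its condition number blows up:

import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)          # nearly collinear with x1
X = np.column_stack([np.ones(n), x1, x2])    # intercept column plus two predictors

XtX = X.T @ X
print(np.linalg.det(XtX))   # tiny relative to the scale of the entries
print(np.linalg.cond(XtX))  # enormous condition number: inversion is unstable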

We tackle this issue by adding an adjustable term \lambda I to \Sigma^{2} in the decomposition of X^{T}X, where I is the identity matrix. This adjustment changes the determinant calculation \text{det}(V \Sigma^{2} V^{T}) to \text{det}(V (\Sigma^{2} + \lambda I) V^{T}). Expanding this, we get:

\text{det}(V( \Sigma^{2} + \lambda I) V^{T}) = \text{det}(V \Sigma^{2} V^{T} + \lambda VV^{T})

and since V is an orthogonal matrix, VV^{T} equals I:

= \text{det}(V \Sigma^{2} V^{T} + \lambda I)

= \text{det}(X^{T}X + \lambda I).

As \lambda increases, the determinant moves away from zero, which ensures that (X^{T}X + \lambda I)^{-1} can be computed stably. This leads us to an adjustably stable estimator of our weights:

w_{Ridge} = (X^{T}X + \lambda I)^{-1}X^{T} y.
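
As a minimal sketch (reusing the hypothetical nearly collinear data from the earlier example, with a synthetic response y and an arbitrary choice of \lambda), the estimator can be computed directly from this formula:

import numpy as np

rng = np.random.default_rng(0)
n, lam = 100, 1.0                             # lam is the ridge parameter lambda
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)           # nearly collinear predictor
X = np.column_stack([np.ones(n), x1, x2])
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

# w_Ridge = (X^T X + lambda I)^{-1} X^T y, computed via a linear solve
# rather than an explicit matrix inverse, for numerical stability.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)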

Traditionally, Ridge Regression is formulated through the lens of the least squares method, enhanced with the regularization term \lambda \| w\|^{2}. This approach aims to mitigate the excessive variance often observed in the OLS model. The mathematical representation is:

w_{Ridge} = \text{argmin}_{w} \{ \| y - Xw \| ^{2} + \lambda \| w \|^{2} \}

To delve deeper, we can expand and simplify this equation, mirroring the process used in Linear Regression:

= \text{argmin}_{w} \{ (y - Xw)^{T}(y-Xw) + \lambda w^{T}w \}

= \text{argmin}_{w}\{y^{T}y - 2 w^{T}X^{T}y + w^{T}X^{T}Xw + \lambda w^{T}w\}

= \text{argmin}_{w} \{y^{T}y - 2 w^{T}X^{T} y + w^{T}(X^{T}X + \lambda I) w\}

To find the solution to this convex problem, we set its gradient, \nabla_{w}, equal to zero:

0 = \nabla_{w} \{y^{T}y - 2 w_{Ridge}^{T} X^{T} y + w_{Ridge}^{T}(X^{T}X + \lambda I) w_{Ridge}\}

= 2(X^{T}X + \lambda I) w_{Ridge} - 2 X^{T}y,

which leads us back to the previously stated solution for w_{Ridge}:

w_{Ridge} = (X^{T}X + \lambda I)^{-1}X^{T}y.

This derivation not only demonstrates the close relationship between Ridge Regression and OLS but also highlights the impact of the regularization term in refining the model.
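
As a sanity check on this derivation, the closed-form solution can be compared against a direct numerical minimization of the penalized objective; the sketch below (using scipy.optimize.minimize on the same kind of hypothetical data) assumes the two agree up to solver tolerance:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, lam = 100, 1.0
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x1 + 1e-6 * rng.normal(size=n)])
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

def ridge_objective(w):
    # ||y - Xw||^2 + lambda ||w||^2
    resid = y - X @ w
    return resid @ resid + lam * w @ w

w_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
w_numeric = minimize(ridge_objective, x0=np.zeros(X.shape[1])).x
print(np.allclose(w_closed, w_numeric, atol=1e-4))  # True, up to solver tolerance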

Let’s observe the meaning of \lambda in terms of what is happening to the predictors (or columns) of X. Again, assuming X is a k-by-(d+1) matrix, we know the hat matrix

H_{LS} = X(X^{T}X)^{-1}X^{T}

projects the response variable y onto the space of fitted values Xw. Assuming the columns of X are linearly independent, one can see that the rank of H_{LS} is equal to the number of columns of X, namely d+1. We can also confirm this by taking the trace of H_{LS}:

\text{trace}[H_{LS}] = \text{trace}[X(X^{T}X)^{-1}X^{T}]

= \text{trace}[(X^{T}X)^{-1}X^{T}X]

= \text{trace}[I]

= d+1

So we can see that \text{trace}[H_{LS}] is equal to the number of columns of X (the d predictors plus the intercept), provided (X^{T}X)^{-1} exists. In the case of multicollinearity, we can modify the hat matrix accordingly:

H_{Ridge} = X(X^{T}X + \lambda I)^{-1}X^{T}

and use its trace to define the effective degrees of freedom of the ridge fit:

df_{X}(\lambda) = \text{trace}[H_{Ridge}]

Using the SVD of X, this trace equals \sum_{i} \sigma_{i}^{2} / (\sigma_{i}^{2} + \lambda): it is d+1 when \lambda = 0 and shrinks toward zero as \lambda grows, so the ridge fit behaves as though it uses fewer than d+1 effective predictors.
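
A short sketch (with hypothetical data) makes this concrete: \text{trace}[H_{LS}] recovers d+1 exactly, while \text{trace}[H_{Ridge}] matches the SVD expression above and shrinks as \lambda grows:

import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # d + 1 = 3 columns
S = np.linalg.svd(X, compute_uv=False)                      # singular values of X

H_ls = X @ np.linalg.solve(X.T @ X, X.T)
print(np.trace(H_ls))                                       # 3, i.e. d + 1

for lam in [0.0, 1.0, 10.0, 100.0]:
    H_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    # trace[H_Ridge] and the SVD expression agree; both shrink as lambda grows.
    print(lam, np.trace(H_ridge), np.sum(S**2 / (S**2 + lam)))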

Now, let’s pivot our discussion from degrees of freedom to a broader framing of Ridge Regression, which, for the sake of variety and depth, we’ll refer to by its other name: Tikhonov Regularization. We are given a more generalized formula:

w_{Tikhonov} = \text{argmin}_{w} \| y - Xw \|^{2} + \| \Gamma w \| ^{2}

where \Gamma represents the Tikhonov matrix. In the specific case of Ridge Regression, this matrix simplifies to \sqrt{\lambda} I. This perspective is instrumental in analyzing the impact of the regularization on our model, particularly through the matrix \Gamma^{T} \Gamma.

Following the same procedure as before (expanding the objective and setting its gradient to zero), we deduce:

w_{Tikhonov} = (X^{T}X + \Gamma^{T} \Gamma)^{-1} X^{T} y
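
A final sketch (again with hypothetical data) applies this general formula and confirms that the choice \Gamma = \sqrt{\lambda} I reproduces the ridge estimator, while other choices of \Gamma penalize the coefficients differently:

import numpy as np

rng = np.random.default_rng(0)
n, lam = 100, 1.0
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, rng.normal(size=n)])
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.1, size=n)

def tikhonov(X, y, Gamma):
    # w_Tikhonov = (X^T X + Gamma^T Gamma)^{-1} X^T y
    return np.linalg.solve(X.T @ X + Gamma.T @ Gamma, X.T @ y)

# Gamma = sqrt(lambda) I reproduces Ridge Regression exactly.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(np.allclose(w_ridge, tikhonov(X, y, np.sqrt(lam) * np.eye(X.shape[1]))))  # True

# A different (hypothetical) Gamma, e.g. one that leaves the intercept unpenalized.
Gamma = np.sqrt(lam) * np.diag([0.0, 1.0, 1.0])
print(tikhonov(X, y, Gamma))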
