With a clear understanding of the framework of Ridge Regression, we are now well-equipped to delve deeper into some of its nuances. A key aspect of this exploration involves examining the parameters of the distribution over the Ridge weights, denoted as $\hat{w}_{\text{ridge}}$. Through this, we will uncover a crucial property: while Ridge Regression helps mitigate high variance in our OLS model, it also potentially introduces a significant amount of bias. This emphasizes the necessity to develop methods that can optimize the trade-off between bias and variance, which we will investigate later. For now, it is important to review some established preliminary results concerning our response variable $y$.
Understanding y
Before we proceed to determine the parameters of the distribution over $\hat{w}_{\text{ridge}}$, we will take a moment to reexamine and clarify the nature of $y$. This reexamination will ensure that we have a comprehensive understanding of $y$ as a fundamental element in our analysis, setting the stage for a more informed exploration of the distribution parameters of the Ridge weights. Let’s revisit the first two fundamental assumptions of Linear Regression:
Linearity Assumption: This implies that the relationship between the dependent variable $y$ and the independent variables $X$ is linear, represented as:

$$y = Xw + \epsilon$$
Homoscedasticity: This term refers to the assumption that the residuals (or errors) have a constant variance; together with the usual normality assumption on the errors, we write:

$$\epsilon \sim \mathcal{N}(0, \sigma^2 I)$$
By combining these assumptions, we arrive at an important inference that has been commonly stated: the distribution of $y$ can be modeled as a normal distribution with mean $Xw$ and variance $\sigma^2 I$, expressed as:

$$y \sim \mathcal{N}(Xw, \sigma^2 I)$$
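As a concrete illustration, here is a minimal NumPy sketch of this data-generating model; the dimensions, true weights, and noise level are arbitrary choices for demonstration, not values from the original discussion:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 3                        # observations, features (arbitrary)
X = rng.normal(size=(n, d))          # design matrix
w_true = np.array([2.0, -1.0, 0.5])  # hypothetical true weights
sigma = 0.8                          # noise standard deviation

# y ~ N(Xw, sigma^2 I): linear mean plus homoscedastic Gaussian noise
y = X @ w_true + rng.normal(scale=sigma, size=n)
```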
It’s also pertinent to reintroduce an equation from my previous discussion on the Ordinary Least Squares (OLS) estimator:

$$\text{Var}(\hat{w}_{\text{OLS}}) = \sigma^2 (X^T X)^{-1}$$

This equation, derived from a detailed examination of the variance of $\hat{w}_{\text{OLS}}$, provides valuable insights into our evaluation of the variance of $\hat{w}_{\text{ridge}}$.
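Continuing the sketch above, a quick Monte Carlo experiment can verify this formula: repeatedly redraw $y$, refit OLS, and compare the empirical covariance of the estimates with $\sigma^2 (X^T X)^{-1}$:

```python
# Monte Carlo check of Var(w_OLS) = sigma^2 (X^T X)^{-1}
trials = 5000
XtX_inv = np.linalg.inv(X.T @ X)
estimates = np.empty((trials, d))
for t in range(trials):
    y_t = X @ w_true + rng.normal(scale=sigma, size=n)
    estimates[t] = XtX_inv @ X.T @ y_t   # OLS fit on this draw

print(np.cov(estimates.T))      # empirical covariance of the OLS weights
print(sigma**2 * XtX_inv)       # theoretical covariance; should match closely
```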
Evaluating the Mean
Following the same approach we used to determine the expected value of $\hat{w}_{\text{OLS}}$, we will set up the calculation for $E[\hat{w}_{\text{ridge}}]$ in a similar manner:

$$E[\hat{w}_{\text{ridge}}] = E\left[(X^T X + \lambda I)^{-1} X^T y\right] = (X^T X + \lambda I)^{-1} X^T E[y] = (X^T X + \lambda I)^{-1} X^T X w$$
So we observe that as $\lambda$ approaches zero, $E[\hat{w}_{\text{ridge}}]$ converges towards the true weights $w$. On the other hand, as $\lambda$ increases towards infinity, $E[\hat{w}_{\text{ridge}}]$ tends toward zero. It’s important to note here that for any positive value of $\lambda$, we have $E[\hat{w}_{\text{ridge}}] \neq w$: the ridge term inherently adds bias to the Ridge weights.
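To make the bias concrete, here is a small continuation of the earlier sketch that evaluates the closed-form mean $(X^T X + \lambda I)^{-1} X^T X w$ for a few penalty values (the values of $\lambda$ are arbitrary choices for illustration):

```python
# Expected ridge weights E[w_ridge] = (X^T X + lam I)^{-1} X^T X w for growing lam
for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    shrink = np.linalg.inv(X.T @ X + lam * np.eye(d)) @ (X.T @ X)
    print(lam, shrink @ w_true)  # lam = 0 recovers w_true; large lam shrinks toward 0
```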
Evaluating the Variance
We’ve seen how incorporating the ridge term introduces additional bias into the weights of our model. While increasing bias is generally undesirable, understanding the full impact of this modification is essential. Specifically, we need to examine how the variance of $\hat{w}_{\text{ridge}}$ is affected by the addition of the ridge term.
Recalling our earlier analysis, we deduced that $\text{Var}(\hat{w}_{\text{OLS}}) = \sigma^2 (X^T X)^{-1}$, which highlighted how multicollinearity in $X$ could lead to instability in the $(X^T X)^{-1}$ component, inflating the variance of our estimates. We’ll approach the variance of $\hat{w}_{\text{ridge}}$ using a similar method:

$$\text{Var}(\hat{w}_{\text{ridge}}) = E\left[\hat{w}_{\text{ridge}}\hat{w}_{\text{ridge}}^T\right] - E\left[\hat{w}_{\text{ridge}}\right]E\left[\hat{w}_{\text{ridge}}\right]^T$$
Let’s separate this problem by terms. For term one, substituting $\hat{w}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$, we have

$$E\left[\hat{w}_{\text{ridge}}\hat{w}_{\text{ridge}}^T\right] = (X^T X + \lambda I)^{-1} X^T E\left[y y^T\right] X (X^T X + \lambda I)^{-1}$$

by linearity of expectation. Substituting $y = Xw + \epsilon$, with $E[\epsilon] = 0$ and $E[\epsilon\epsilon^T] = \sigma^2 I$, gives $E[yy^T] = Xww^TX^T + \sigma^2 I$, and therefore

$$E\left[\hat{w}_{\text{ridge}}\hat{w}_{\text{ridge}}^T\right] = (X^T X + \lambda I)^{-1} X^T \left(Xww^TX^T + \sigma^2 I\right) X (X^T X + \lambda I)^{-1}$$
This may at first seem intricate, but calculating the second term will help in simplifying the overall equation:

$$E\left[\hat{w}_{\text{ridge}}\right]E\left[\hat{w}_{\text{ridge}}\right]^T = (X^T X + \lambda I)^{-1} X^T X w w^T X^T X (X^T X + \lambda I)^{-1}$$
Subtracting the second term from the first cancels the $Xww^TX^T$ component, which allows us to refine and clarify our formula for the variance of $\hat{w}_{\text{ridge}}$:

$$\text{Var}(\hat{w}_{\text{ridge}}) = \sigma^2 (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1}$$
As $\lambda$ approaches zero, the variance of the Ridge estimator converges to the variance of the Ordinary Least Squares (OLS) estimator, $\sigma^2 (X^T X)^{-1}$. Moreover, as $\lambda$ increases towards infinity, the variance of the ridge regression coefficients approaches zero. This trend reflects the inverse relationship between $\lambda$ and $\text{Var}(\hat{w}_{\text{ridge}})$: the larger the penalty, the more aggressively the estimator is shrunk, and the less room it has to vary.
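Continuing the simulation from above, here is a minimal Monte Carlo check of this sandwich formula for one (arbitrary) penalty value:

```python
# Monte Carlo check of Var(w_ridge) = sigma^2 (X^T X + lam I)^{-1} X^T X (X^T X + lam I)^{-1}
lam = 5.0                                            # arbitrary penalty for the check
ridge_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
estimates = np.empty((trials, d))
for t in range(trials):
    y_t = X @ w_true + rng.normal(scale=sigma, size=n)
    estimates[t] = ridge_inv @ X.T @ y_t             # ridge fit on this draw

print(np.cov(estimates.T))                           # empirical covariance
print(sigma**2 * ridge_inv @ X.T @ X @ ridge_inv)    # sandwich formula
```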
Propagation of Uncertainty
A fascinating alternative way to examine the variance of the ridge estimator, $\hat{w}_{\text{ridge}}$, is by employing the concept of uncertainty propagation. This method is particularly effective in the context of linear equations. For some matrix $A$, consider the system:

$$z = Ay$$

In such a system, the uncertainty propagation principle allows us to determine the covariance matrix of $z$, denoted as $\Sigma_z$, using the formula:

$$\Sigma_z = A \Sigma_y A^T$$
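Here is a quick, generic illustration of this rule, with an arbitrary linear map and covariance chosen purely for demonstration:

```python
# Generic propagation: if z = A y with Cov(y) = S, then Cov(z) = A S A^T
A_demo = rng.normal(size=(2, 4))        # arbitrary linear map
S = np.diag([1.0, 2.0, 0.5, 1.5])       # arbitrary covariance of y
ys = rng.multivariate_normal(np.zeros(4), S, size=200_000)
zs = ys @ A_demo.T                      # apply z = A y to every sample
print(np.cov(zs.T))                     # empirical covariance of z
print(A_demo @ S @ A_demo.T)            # A S A^T; should match closely
```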
This principle can be applied to our ridge estimator, which is expressed as:

$$\hat{w}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$
We can rewrite it as $\hat{w}_{\text{ridge}} = Ay$, where $A$ represents the transformation matrix $(X^T X + \lambda I)^{-1} X^T$. This transformation takes the response vector $y$ into the space of the weights $\hat{w}_{\text{ridge}}$.
Assuming the covariance of the response vector is given by $\Sigma_y = \sigma^2 I$, which indicates that the observations in $y$ have uniform variance $\sigma^2$, we can then apply uncertainty propagation. It leads us to the variance of $\hat{w}_{\text{ridge}}$ as follows:

$$\text{Var}(\hat{w}_{\text{ridge}}) = A \Sigma_y A^T = \sigma^2 A A^T$$
By substituting the expression for $A$ (and noting that $(X^T X + \lambda I)^{-1}$ is symmetric, so $A^T = X (X^T X + \lambda I)^{-1}$), we get

$$\text{Var}(\hat{w}_{\text{ridge}}) = \sigma^2 (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1}$$

matching the result we derived above.
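As a final sanity check, reusing the variables from the earlier sketches, we can confirm numerically that the propagation route produces exactly the sandwich formula:

```python
# Uncertainty propagation: Var(w_ridge) = A (sigma^2 I) A^T with A = (X^T X + lam I)^{-1} X^T
A = ridge_inv @ X.T
prop = A @ (sigma**2 * np.eye(n)) @ A.T
sandwich = sigma**2 * ridge_inv @ X.T @ X @ ridge_inv
print(np.allclose(prop, sandwich))  # True: both routes give the same covariance
```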
This result beautifully summarizes the influence of the response variable’s intrinsic variance, $\sigma^2$, on the variance of the ridge estimator. By understanding these crucial aspects of $\hat{w}_{\text{ridge}}$’s distribution, we’ve laid a solid foundation for what’s next. In Module 3 of Ridge Regression, we’ll delve into an intriguing technique known as Maximum a Posteriori (MAP) estimation, offering a fresh perspective on our ongoing analysis.