Advanced – Ridge Regression Notes (Module 2): A Closer Look at the Ridge Estimator

With a clear understanding of the framework of Ridge Regression, we are now well-equipped to delve deeper into some of its nuances. A key part of this exploration is examining the parameters of the distribution over the Ridge weights, denoted as w_{Ridge}. Through this, we will uncover a crucial property: while Ridge Regression helps mitigate the high variance of our OLS model, it also introduces bias into the estimates. This underscores the need for methods that balance the bias-variance trade-off, which we will investigate later. For now, it is important to review some established preliminary results concerning our response variable y.

Before we determine the parameters of the distribution over w_{Ridge}, let's take a moment to reexamine the nature of y. This will ensure we have a solid understanding of y as a fundamental element of our analysis and set the stage for a more informed look at the distribution of the Ridge weights. Let's revisit the first two fundamental assumptions of Linear Regression:

Linearity Assumption: This implies that the relationship between the dependent variable y and the independent variables X is linear, represented as:

y = Xw + \xi

Homoscedasticity: This term refers to the assumption that the residuals (or errors) \xi have a constant variance, denoted as:

\xi \sim N(0, \sigma^{2} I)

By combining these assumptions, we arrive at an inference we have stated before: the distribution of y can be modeled as a normal distribution with mean Xw and covariance \sigma^{2} I, expressed as y \sim N(Xw, \sigma^{2} I).
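To make these assumptions concrete, here is a minimal NumPy sketch that generates a response vector according to y = Xw + \xi with \xi \sim N(0, \sigma^{2} I). The dimensions, the noise level, and the weight vector are all made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3                          # illustrative sample size and feature count
sigma = 0.5                            # assumed noise standard deviation
w_true = np.array([2.0, -1.0, 0.5])    # hypothetical "true" weights

X = rng.normal(size=(n, d))            # synthetic design matrix
xi = rng.normal(scale=sigma, size=n)   # xi ~ N(0, sigma^2 I)
y = X @ w_true + xi                    # so y ~ N(Xw, sigma^2 I)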

It’s also pertinent to reintroduce an equation from my previous discussion on the Ordinary Least Squares (OLS) estimator:

\mathbb{E}[yy^{T}] = \sigma^{2} I + Xww^{T}X^{T}

This equation, which follows from a closer examination of the variance of y, will prove valuable when we evaluate the variance of w_{Ridge}.
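For completeness, the identity follows directly from the definition of the covariance of y:

\mathbb{E}[yy^{T}] = \text{Var}[y] + \mathbb{E}[y]\mathbb{E}[y]^{T} = \sigma^{2} I + (Xw)(Xw)^{T} = \sigma^{2} I + Xww^{T}X^{T}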

Following the same approach we used to determine the expected value of w_{LS}, we will set up the calculation for w_{Ridge} in a similar manner:

\mathbb{E}[w_{Ridge}] = \mathbb{E}[(X^{T}X + \lambda I)^{-1}X^{T}y]

= (X^{T}X + \lambda I)^{-1}X^{T}\mathbb{E}[y]

= (X^{T}X + \lambda I)^{-1}X^{T}Xw

So we observe that as \lambda approaches zero, \mathbb{E}[w_{Ridge}] converges to \mathbb{E}[w_{LS}] = w, since the OLS estimator is unbiased. On the other hand, as \lambda increases towards infinity, \mathbb{E}[w_{Ridge}] tends toward zero. It's important to note that for any positive value of \lambda, the Ridge weights carry some amount of bias.
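This shrinkage is easy to see numerically. The sketch below evaluates \mathbb{E}[w_{Ridge}] = (X^{T}X + \lambda I)^{-1}X^{T}Xw for several values of \lambda; the design matrix and the true weight vector are synthetic and chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
w_true = np.array([2.0, -1.0, 0.5])    # hypothetical true weights
X = rng.normal(size=(n, d))            # synthetic design matrix

XtX = X.T @ X
for lam in [0.0, 1.0, 100.0, 1e6]:
    # E[w_Ridge] = (X^T X + lambda I)^{-1} X^T X w
    expected_w = np.linalg.solve(XtX + lam * np.eye(d), XtX @ w_true)
    print(f"lambda = {lam:>9}: E[w_Ridge] = {expected_w}")

At \lambda = 0 we recover w exactly (no bias), and as \lambda grows the expected weights are pulled toward zero.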

We’ve seen how incorporating the ridge term introduces additional bias into the weights of our model. While increasing bias is generally undesirable, understanding the full impact of this modification is essential. Specifically, we need to examine how the variance of w_{Ridge} is affected by the addition of the ridge term.

Recall from our earlier analysis that \text{Var}[w_{LS}] = \sigma^{2}(X^{T}X)^{-1}, which highlighted how multicollinearity in X destabilizes the (X^{T}X)^{-1} component and inflates the variance of the weight estimates. We'll approach \text{Var}[w_{Ridge}] using a similar method:

\text{Var}[w_{Ridge}] = \mathbb{E}[w_{Ridge}w_{Ridge}^{T}] - \mathbb{E}[w_{Ridge}] \mathbb{E}[w_{Ridge}]^{T}

Let's tackle this expression term by term. For the first term, we have

\mathbb{E}[w_{Ridge}w_{Ridge}^{T}]

= \mathbb{E}[(X^{T}X + \lambda I)^{-1}X^{T}yy^{T}X(X^{T}X + \lambda I)^{-1}]

= (X^{T}X + \lambda I)^{-1}X^{T} \mathbb{E}[yy^{T}]X(X^{T}X + \lambda I)^{-1}

since (X^{T}X + \lambda I)^{-1}X^{T} is deterministic and expectation is linear. Substituting for \mathbb{E}[yy^{T}],

= (X^{T}X + \lambda I)^{-1}X^{T} \{ \sigma^{2} I + Xww^{T}X^{T}\}X(X^{T}X + \lambda I)^{-1}

= \sigma^{2} \{(X^{T}X +\lambda I)^{-1}X^{T}X(X^{T}X + \lambda I)^{-1}\} \newline + (X^{T}X + \lambda I)^{-1} X^{T}Xww^{T}X^{T}X(X^{T}X + \lambda I)^{-1}

This may at first seem intricate, but calculating the second term will help in simplifying the overall equation.

\mathbb{E}[w_{Ridge}] \mathbb{E}[w_{Ridge}]^{T} = (X^{T}X + \lambda I)^{-1} X^{T}Xww^{T}X^{T}X(X^{T}X + \lambda I)^{-1}

This calculation allows us to refine and clarify our formula for the variance of w_{Ridge}:

\text{Var}[w_{Ridge}] = \sigma^{2} \{(X^{T}X +\lambda I)^{-1}X^{T}X(X^{T}X + \lambda I)^{-1}\}

As \lambda approaches zero, the variance of the Ridge estimator converges to the variance of the Ordinary Least Squares (OLS) estimator. Moreover, as \lambda increases towards infinity, the variance of the ridge regression coefficients approaches zero. This is because \text{Var}[w_{Ridge}] depends on the inverse of (X^{T}X + \lambda I), and that inverse shrinks toward zero as \lambda grows.
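A small numerical check illustrates both limits. The sketch below computes \text{Var}[w_{Ridge}] = \sigma^{2}(X^{T}X + \lambda I)^{-1}X^{T}X(X^{T}X + \lambda I)^{-1} on synthetic data (the design matrix and \sigma are made up for illustration) and tracks its trace as \lambda varies.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
sigma = 0.5                            # assumed noise standard deviation
X = rng.normal(size=(n, d))            # synthetic design matrix

XtX = X.T @ X
var_ols = sigma**2 * np.linalg.inv(XtX)            # Var[w_LS] = sigma^2 (X^T X)^{-1}
print("trace of Var[w_LS]:", np.trace(var_ols))

for lam in [0.0, 1.0, 10.0, 1000.0]:
    A = np.linalg.inv(XtX + lam * np.eye(d))
    var_ridge = sigma**2 * (A @ XtX @ A)           # Var[w_Ridge]
    print(f"lambda = {lam:>6}: trace of Var[w_Ridge] = {np.trace(var_ridge):.6f}")

At \lambda = 0 the trace matches the OLS variance, and it decreases steadily as \lambda increases.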

A fascinating alternative way to examine the variance of the ridge estimator, w_{Ridge}, is to employ the concept of uncertainty propagation. This method is particularly effective in the context of linear transformations. For some n \times m matrix M, consider the system:

u = Mv

In such a system, the uncertainty propagation principle allows us to determine the covariance matrix of u, denoted as \Sigma^{u}, using the formula:

\Sigma^{u} = M \Sigma^{v} M^{T}

This principle can be applied to our ridge estimator, which is expressed as:

w_{Ridge} = (X^{T}X + \lambda I)^{-1}X^{T}y

We can rewrite it as w_{Ridge} = Qy, where Q represents the transformation matrix (X^{T}X + \lambda I)^{-1}X^{T}. This transformation takes the vector y into the space of w_{Ridge}.

Assuming the covariance of the response vector y is \Sigma^{y} = \sigma^{2} I, which says the observations in y have uniform variance \sigma^{2} and are uncorrelated, we can apply uncertainty propagation to obtain the variance of w_{Ridge}:

\text{Var}[w_{Ridge}] = Q \Sigma^{y} Q^{T}

By substituting the expressions for Q and \Sigma^{y}, we get

\text{Var}[w_{Ridge}] = \sigma^{2} \{(X^{T}X +\lambda I)^{-1}X^{T}X(X^{T}X + \lambda I)^{-1}\}
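To verify that the two routes agree, the sketch below forms Q explicitly, propagates \Sigma^{y} = \sigma^{2} I through it, and compares the result with the closed-form expression derived earlier. The data, \sigma, and \lambda are again synthetic and purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
sigma = 0.5                            # assumed noise standard deviation
lam = 2.0                              # illustrative ridge penalty
X = rng.normal(size=(n, d))            # synthetic design matrix

XtX = X.T @ X
Q = np.linalg.solve(XtX + lam * np.eye(d), X.T)    # Q = (X^T X + lambda I)^{-1} X^T
Sigma_y = sigma**2 * np.eye(n)                     # Sigma^y = sigma^2 I

var_propagated = Q @ Sigma_y @ Q.T                 # Q Sigma^y Q^T
A = np.linalg.inv(XtX + lam * np.eye(d))
var_closed_form = sigma**2 * (A @ XtX @ A)         # sigma^2 (X^T X + lambda I)^{-1} X^T X (X^T X + lambda I)^{-1}

print(np.allclose(var_propagated, var_closed_form))   # expect True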

This result beautifully summarizes the influence of the response variable’s intrinsic variance, denoted as \sigma^{2}, on the variance of the ridge estimator. By understanding these crucial aspects of the distribution of w_{Ridge}, we’ve laid a solid foundation for what’s next. Moving to Module 3 of Ridge Regression, we’ll delve into an intriguing technique known as Maximum a Posteriori (MAP) estimation, offering a fresh perspective on our ongoing analysis.
