Welcome to the final module of our comprehensive study of Ridge Regression!
In Module 1, we uncovered various facets of Ridge Regression, starting with the SVD (Singular Value Decomposition) approach. We carefully dissected the formula for the Ridge estimator, $\hat{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$, unraveling its intricacies through calculus.
Our previous discussions in Module 2 illuminated the mean and variance of $\hat{w}_{\text{ridge}}$, leading to an interesting revelation: there exists an optimal value of $\lambda$ that intricately balances the increase in model bias with a corresponding decrease in variance. This balance is crucial for building robust predictive models.
Now, we are set to cover another method of determining the Ridge estimator. We’ll pivot from the previously discussed methods and delve into the realm of Bayesian statistics, particularly focusing on MAP (Maximum A Posteriori) Estimation. This approach refines the MLE (Maximum Likelihood Estimation) methodology by integrating a prior distribution into our prediction strategy, offering a richer understanding of Ridge Regression.
To set the stage, let’s briefly revisit the MLE approach. The MLE method for deducing the weights is framed as:

$$\hat{w}_{\text{MLE}} = \arg\max_{w} \; p(y \mid X, w)$$

This equation represents the cornerstone of likelihood-based estimation, seeking the weight vector $w$ that maximizes the likelihood of observing our responses $y$ given our inputs $X$. However, MLE has its limitations, particularly when the dataset is small or when multicollinearity is present among the predictors.
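To make this limitation concrete, here is a minimal numerical sketch. It assumes Gaussian noise, under which the MLE coincides with the ordinary least squares solution $(X^\top X)^{-1} X^\top y$, and uses synthetic, nearly collinear predictors purely for illustration:

```python
import numpy as np

# Minimal sketch (illustrative assumption: Gaussian noise, so the MLE for the
# weights coincides with ordinary least squares) on synthetic data with two
# nearly collinear predictors.
rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)          # almost a copy of x1 (multicollinearity)
X = np.column_stack([x1, x2, rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.normal(size=n)

w_mle = np.linalg.solve(X.T @ X, X.T @ y)    # closed-form MLE / OLS estimate
print("condition number of X^T X:", np.linalg.cond(X.T @ X))   # very large: ill-conditioned
print("estimated weights:", w_mle)           # the x1/x2 coefficients have very high variance
```

The near-singular $X^\top X$ seen here is exactly what the ridge penalty will stabilize.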
Using MAP Estimation
The MAP approach incorporates prior beliefs about the distribution of the weights through a prior $p(w)$. Namely, we are finding the weight vector $w$ that maximizes the posterior probability $p(w \mid y, X)$. Let’s lay down some mathematical groundwork.
Our goal is to determine $\hat{w}_{\text{MAP}}$, defined as:

$$\hat{w}_{\text{MAP}} = \arg\max_{w} \; p(w \mid y, X)$$

This equation might seem daunting, but we can rewrite it using Bayes’ Theorem:

$$\hat{w}_{\text{MAP}} = \arg\max_{w} \; \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)}$$
where $p(y \mid X, w)$ is the likelihood, representing how probable our observed response $y$ is, given the weights $w$ and input data $X$, and $p(w)$ is the prior over the weights. In this equation, $p(y \mid X)$ is the marginal probability of $y$, given $X$, and does not contribute to determining $\hat{w}_{\text{MAP}}$: since it does not depend on $w$, it acts only as a normalizing constant. Therefore, we simplify our equation to:

$$\hat{w}_{\text{MAP}} = \arg\max_{w} \; p(y \mid X, w)\, p(w)$$
Let’s assume our response variable $y$ follows a normal distribution with mean $Xw$ and variance $\sigma^2$, denoted as $y \mid X, w \sim \mathcal{N}(Xw, \sigma^2 I)$. Similarly, we assume our weights $w$ also follow a normal distribution, centered around zero with variance $\tau^2$, expressed as $w \sim \mathcal{N}(0, \tau^2 I)$. With these assumptions in place, we can now expand our expression for MAP estimation.
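Spelled out (under the standard assumptions implicit here: the noise terms are independent across the $n$ observations, and the weight coordinates are independent under the prior), both densities factorize:

$$p(y \mid X, w) = \prod_{i=1}^{n} p(y_i \mid x_i, w), \qquad p(w) = \prod_{j=1}^{d} p(w_j)$$

where $x_i^\top$ denotes the $i$-th row of $X$ and $d$ is the number of predictors.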
Our expanded expression for the MAP estimator is given by:

$$\hat{w}_{\text{MAP}} = \arg\max_{w} \; \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - x_i^\top w)^2}{2\sigma^2}\right) \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\tau^2}} \exp\!\left(-\frac{w_j^2}{2\tau^2}\right)$$

In this expression, the terms $\frac{1}{\sqrt{2\pi\sigma^2}}$ and $\frac{1}{\sqrt{2\pi\tau^2}}$ are constants with respect to the optimization process and thus can be omitted for simplicity. This simplification leads us to:

$$\hat{w}_{\text{MAP}} = \arg\max_{w} \; \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - x_i^\top w\right)^2\right) \exp\!\left(-\frac{1}{2\tau^2}\sum_{j=1}^{d} w_j^2\right)$$
Applying the logarithm trick for simplification (the logarithm is monotonically increasing, so it does not change the maximizer) and negating, our expression becomes a problem of minimization:

$$\hat{w}_{\text{MAP}} = \arg\min_{w} \; \frac{1}{2\sigma^2}\,\lVert y - Xw\rVert^2 + \frac{1}{2\tau^2}\,\lVert w\rVert^2$$
Rearranging the terms, multiplying through by $2\sigma^2$ (which leaves the minimizer unchanged), and writing $\lambda = \sigma^2/\tau^2$, we have:

$$\hat{w}_{\text{MAP}} = \arg\min_{w} \; \lVert y - Xw\rVert^2 + \lambda\,\lVert w\rVert^2$$

which is exactly the Ridge Regression objective: a noisier likelihood (larger $\sigma^2$) or a tighter prior (smaller $\tau^2$) corresponds to a larger penalty $\lambda$.
As this function is convex in $w$, we will obtain a unique solution by setting the gradient to zero:

$$\nabla_w \left(\lVert y - Xw\rVert^2 + \lambda\,\lVert w\rVert^2\right) = 0$$

After computing the gradient and simplifying, we arrive at:

$$-2X^\top\!\left(y - Xw\right) + 2\lambda w = 0 \;\;\Longrightarrow\;\; \left(X^\top X + \lambda I\right) w = X^\top y$$

Therefore, our Ridge estimator, obtained by rearranging and solving for $w$, is finally given by:

$$\hat{w}_{\text{ridge}} = \hat{w}_{\text{MAP}} = \left(X^\top X + \lambda I\right)^{-1} X^\top y, \qquad \lambda = \frac{\sigma^2}{\tau^2}$$
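As a quick sanity check on the derivation, here is a small numerical sketch (synthetic data; the particular values of $\sigma$ and $\tau$ are illustrative choices, not part of the derivation). It compares the closed-form estimator $(X^\top X + \lambda I)^{-1} X^\top y$ with a direct numerical minimization of the penalized objective:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: the closed-form Ridge/MAP estimator should agree with a numerical
# minimization of ||y - Xw||^2 + lambda * ||w||^2, with lambda = sigma^2 / tau^2.
# Synthetic data; sigma and tau are illustrative choices.
rng = np.random.default_rng(42)
n, d = 50, 4
sigma, tau = 0.5, 1.0
X = rng.normal(size=(n, d))
w_true = rng.normal(scale=tau, size=d)               # a draw from the assumed prior
y = X @ w_true + sigma * rng.normal(size=n)          # responses under the likelihood model

lam = sigma**2 / tau**2
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

objective = lambda w: np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)
w_numeric = minimize(objective, np.zeros(d)).x       # numerical MAP estimate

print(np.allclose(w_closed, w_numeric, atol=1e-4))   # True: both give the same weights
```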
This derivation not only reinforces our understanding of the Ridge Estimator but also highlights the seamless blend of statistical theory and mathematical elegance inherent in Ridge Regression.