Ridge Regression, Lagrangian Duality and PyTorch, oh my!

Dec 02, 2023

In my previous post, I showed how to find the optimal coefficients for ordinary least-squared regression using simple analytical methods. For this post, I want to show how we can do the same for Ridge Regression.

As an additional bit of fun, instead of using numpy, we will prove our methods using PyTorch. Note that while there are many more exciting ways to build regression models in pytorch, the goal here is simply to demonstrate the math.

Regularization and Regression

What is regularization, and how can we utilize it in linear regression?

Regularization is a technique used in machine learning and statistics to prevent overfitting and improve the generalization of a model. It helps strike a balance between fitting the training data well and preventing the model from becoming too complex. Typically, overfitting occurs when a model learns the training data too well by capturing noise or random fluctuations rather than the underlying patterns. Regularization introduces a penalty term to the model's objective function, discouraging overly complex models.

In the case of ridge regression, we add an L2 penalty term to the linear regression equation. This penalty is proportional to the square of the magnitude of coefficients, which effectively shrinks the coefficients towards zero. Note that ridge regression doesn't completely shrink the coefficients to zero; it only shrinks them. If we were to use an L1 penalty (LASSO), the coefficients would shrink to zero.

In ridge regression, this L2 penalty helps in reducing the variance of the estimates, making the model less sensitive to fluctuations in the data and improving its generalization to new, unseen data. This L2 penalty helps with the problem of multicollinearity in multiple regression data. In multiple regression, when two or more independent variables are highly correlated, the standard least squares estimates tend to have high variance, which can lead to overfitting.

Estimation of the Ridge Regression Coefficients

To obtain the coefficients for ridge regression, we can form the problem as a convex optimization problem where the regularization parameter serves as the constraint:

\(argmin_{\beta} \frac{1}{2} ||X\beta-y||_{2}^{2} \)

\(s.t. ||\beta||_{2}^{2} \le s\)

To help solve our optimization problem, we can form it as a Lagrangian dual problem.

Lagrangian duality

Lagrangian duality is a concept in optimization theory that involves transforming an optimization problem into a dual form to facilitate analysis and solution. The Lagrangian dual problem provides an upper bound on the optimal value of the original (primal) problem. This duality relationship is established through the Lagrangian function, which includes Lagrange multipliers to account for constraints in the optimization problem.

The Lagrangian dual problem is obtained by forming the Lagrangian of a minimization problem by using nonnegative Lagrange multipliers to add the constraints to the objective function and then solving for the primal variable values that minimize the original objective function.

For ridge regression, the Lagrangian would be:

\(L(\beta, \lambda) = ||X\beta-y||_{2}^{2} + \lambda (||\beta||_{2}^{2} - s)\)

Now that we have our Lagrangian, we can solve for our Beta coefficients. Since ridge regression has a closed-form solution, we can simply take the derivative with respect to Beta.

First, let's expand our function to make the derivation easier:

\(L(\boldsymbol{\beta}, \lambda) = (y_{true} - \textbf{X} \boldsymbol{\beta})^{\intercal}(y_{true} - \textbf{X} \boldsymbol{\beta}) + \lambda (\boldsymbol{\beta}^{\intercal}\boldsymbol{\beta} - s)\)

\(L(\boldsymbol{\beta}, \lambda) = \left( y^{\intercal}y - y^{\intercal}X\beta - \beta^{\intercal}X^{\intercal}y + \beta^{\intercal}X^{\intercal}X\beta \right) + \lambda (\boldsymbol{\beta}^{\intercal}\boldsymbol{\beta} - s)\)

Having the following identity:

\(\beta^{\intercal}X^{\intercal}y = y^{\intercal}X\beta\)

means we can substitute in to make things a bit easier:

\(L(\boldsymbol{\beta}, \lambda) = \left( y^{\intercal}y - 2y^{\intercal}X\beta + \beta^{\intercal}X^{\intercal}X\beta \right) + \lambda (\boldsymbol{\beta}^{\intercal}\boldsymbol{\beta} - s)\)

Taking the derivative with respect to Beta:

\(\frac{\partial L(\beta, \lambda)}{\partial \beta} = -2y^{\intercal}X + 2\beta X^{\intercal}X + 2\lambda\beta\)

\(\frac{\partial L(\beta, \lambda)}{\partial \beta} = -2 \left[ y^{\intercal}X + \beta X^{\intercal}X + \lambda\beta \right]\)

Set our derivative to zero and solve for Beta to get our Beta coefficients:

\(-y^{\intercal}X + \beta X^{\intercal}X + \lambda\beta = 0\)

\(\hat\beta \left(X^{\intercal}X + \lambda I \right) = X^{\intercal}y\)

Isolating for Beta hat, we get:

\(\hat\beta = \left(X^{\intercal}X + \lambda I \right)^{-1} X^{\intercal}y\)

Now that we have Beta hat, let’s see how we can estimate our Beta coefficients using pytorch.

PyTorch!

Before estimating our Beta coefficients for ridge regressions, let’s take a step back and estimate our OLS regression coefficients in pytorch. Previously, we showed how to estimate our coefficients using numpy.

Since we’re using pytorch to test our math, we can use a GPU if you have one. In my case, I have an older NVIDIA GeForce GTX 1080.

The code block below shows how to configure your device settings (GPU vs CPU) and will show some memory usage information on your GPU.

# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

#Additional Info when using cuda
dev = 0
if device.type == 'cuda':
    print(torch.cuda.get_device_name(dev))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(dev)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(dev)/1024**3,1), 'GB')

Using device: cuda

NVIDIA GeForce GTX 1080
Memory Usage:
Allocated: 0.0 GB
Cached:    0.0 GB

Next let’s set a random seed so that our results will be consistent:

seed = 42

torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)

OLS Regression in PyTorch

Starting off with our example from the last post, let’s go ahead and define a linear function for our multiple linear regression model. This time, the type hints have been modified for pytorch.

def multiple_linear_func(x: torch.Tensor, 
                         B0: float, 
                         B1: float,
                         B2: float,
                         B3: float,
                         B4: float,
                         B5: float,) -> torch.FloatTensor:
    # y = B1*x + B0
    assert type(x) == torch.Tensor
    return x[:,1]*B1 + x[:,2]*B2 + x[:,3]*B3 + x[:,4]*B4 + x[:,5]*B5 + B0

Similar to the last post, let’s create a matrix/tensor for our 5 features (5 features, 5 beta coefficients). Next, we augment the X tensor with a column of ones. This column of one will represent Beta0 or the intercept. To get our target, we pass our X tensor into multiple_linear_func function then add some random, normal noise.

Note how we pass the device argument for X, ones, and y1. By doing this, we are storing and operating on the tensors within the GPU.

Getting back to our Beta coefficients, we can simply write our code like the equation for our estimator:

\(\hat\beta = \left( X^{\intercal}X \right)^{-1} X^{\intercal}y\)

Finally, when we print the contents our betas variable, we see that the values in the returned tensor are roughly the same as our true Beta coefficients of 1, 2, 3, 4, 5, 6.

seed = 133742
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)

n  = 1000
features = 5

X = torch.normal(0, 4, size=(n, features), requires_grad=True, device=device)
ones = torch.ones((n,1), requires_grad=True, device=device)
X = torch.cat((ones, X), 1)
y = multiple_linear_func(X, 1, 2, 3, 4, 5, 6)
y1 = y + torch.normal(0, 4, size=(1, n), requires_grad=True, device=device)
y1 = y1.T
betas = torch.linalg.inv(X.T@X)@X.T@y1
print(betas)

tensor([[1.0109],
        [1.9911],
        [2.9868],
        [4.0013],
        [4.9976],
        [6.0433]], device='cuda:0', grad_fn=<MmBackward0>)

Ridge Regression in PyTorch

Now that we have the estimator for our ridge regression Beta coefficients, we can quickly implement this with pytorch.

\(\hat\beta = \left(X^{\intercal}X + \lambda I \right)^{-1} X^{\intercal}y\)

As before, let’s build out our design matrix X with some random samples and augment it with a column of ones. Next, we run it through the multiple_linear_func function and add some random, normal noise. Note that our ridge regression estimator required the identity matrix, so we define that and then set the first 1 on the diagonal to zero. This prevents the bias term from being applied to the intercept. Finally, we can code up our estimator in a similar fashion as below.

Notice when we print the contents betas variable, our Beta estimates are pretty close to our true Betas (1, 2, 3, 4, 5, 6).

Also, note that we set the lambda variable, this controls how much penalty to apply to the coefficients in our regression model. Try to play around with this number. Note that it doesn’t change the coefficients very much. What happens if a couple of the features are highly correlated?

seed = 133742
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)

n  = 1000
features = 5

lmbda = 10

X = torch.normal(0, 4, size=(n, features), requires_grad=True, device=device)
ones = torch.ones((n,1), requires_grad=True, device=device)
X = torch.cat((ones, X), 1)
y = multiple_linear_func(X, 1, 2, 3, 4, 5, 6)
y1 = y + torch.normal(0, 4, size=(1, n), requires_grad=True, device=device)
y1 = y1.T
I = torch.eye(6, device=device)
# Prevent lambda from being applied to the intercept
I[0, 0] = 0
betas = torch.linalg.inv(X.T@X + lmbda*I)@X.T@y1
betas

tensor([[1.0119],
        [1.9901],
        [2.9848],
        [3.9990],
        [4.9941],
        [6.0388]], device='cuda:0', grad_fn=<MmBackward0>)

The Fish Market Dataset

Let’s try this on a real dataset with highly correlated features:

df = pd.read_csv("https://raw.githubusercontent.com/jgoodie/datasets/main/Fish.csv")
df = df.drop('Species', axis=1)
df.head()

   Weight  Body Height  Total Length  Diagonal Length   Height   Width
0   300.0         34.8          37.3             39.8   6.2884  4.0198
1   242.0         23.2          25.4             30.0  11.5200  4.0200
2   500.0         29.1          31.5             36.4  13.7592  4.3680
3   600.0         29.4          32.0             37.2  15.4380  5.5800
4   345.0         36.0          38.5             41.0   6.3960  3.9770

Let’s check Pearson’s R on a couple of the features. Notice that the correlation coefficient is 0.99 with a very small p-value indicating Total Length is highly correlated with Diagonal Length and Body Height.

r = scipy.stats.pearsonr(df['Total Length'], df['Diagonal Length'])
print(r)

r = scipy.stats.pearsonr(df['Total Length'], df['Body Height'])
print(r)

PearsonRResult(statistic=0.9941380221187125, pvalue=1.3429869427935273e-153)
PearsonRResult(statistic=0.9995199423023245, pvalue=8.085559276750992e-239)

Using the same methods above, we can find the Beta coefficients for our ridge regressor. Also, note that we are using a lambda of 10.

seed = 133742
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)


lmbda = 10.0

X = df.drop('Weight', axis=1)
X = X.to_numpy()

n = X.shape[0]

X = torch.tensor(X, requires_grad=True, device=device)
ones = torch.ones((n,1), requires_grad=True, device=device)
X = torch.cat((ones, X), 1)
m = X.shape[1]

y = df['Weight']
y = y.to_numpy()
y = torch.tensor(y, requires_grad=True, device=device)

I = torch.eye(m, device=device)
# Prevent lambda from being applied to the intercept
I[0, 0] = 0
betas = torch.linalg.inv(X.T@X + lmbda*I)@X.T@y
print(betas)

tensor([-480.7065,   18.0032,   24.6529,  -20.8236,   21.4641,   39.3959],
       device='cuda:0', dtype=torch.float64, grad_fn=<MvBackward0>)

To verify our Beta coefficients, let’s compare them against sklearn’s implementation of ridge regression. As you can see below, we have the same intercept and Beta coefficients as sklearn’s implementation.

seed = 133742
np.random.seed(seed)

X = df.drop('Weight', axis=1)
X = X.to_numpy()

y = df['Weight']
y = y.to_numpy()

X.shape

ridge_clf = Ridge(alpha=10.0, fit_intercept=True)
ridge_clf.fit(X, y)
print(ridge_clf.coef_)
print(ridge_clf.intercept_)

[ 18.00324057  24.65294869 -20.82355925  21.46414094  39.39587727]
-480.70651674177867

Conclusion

As we have seen, like OLS regression, we can find form our ridge regression problem as an optimization problem where the L2 regularization parameter serves as the constraint. We then showed we can take our optimization problem and form it as a Lagrangian dual problem and then solve for the Beta estimators. For fun we showed how to implement the closed-form solution to our Beta estimates in PyTorch.

John’s Substack

Discussion about this post

Ready for more?