Solving the LASSO
In my last two posts, we saw how to use analytical optimization methods to find the beta coefficients for ordinary least squares and ridge regression. In this post, I want to tackle LASSO regression.
LASSO regression, or L1 regularization, is a popular method in statistical modeling and machine learning where the primary goal is to balance model simplicity and accuracy by adding an L1 penalty term to the linear regression model. The L1 penalty encourages sparse solutions, where some coefficients become precisely zero, making LASSO useful for feature selection.
In the case of ridge regression, we have the following objective function, with the constraint being the L2-norm:

minimize ‖y − Xβ‖² subject to ‖β‖₂² ≤ t

Rewriting the objective function and constraint into its Lagrangian form, we have the following:

L(β, λ) = ‖y − Xβ‖² + λ‖β‖₂²
In the case of LASSO regression, we have the following objective function, with the constraint being the L1-norm:

minimize ‖y − Xβ‖² subject to ‖β‖₁ ≤ t

Rewriting the objective function and constraint into its Lagrangian form, we have the following:

L(β, λ) = ‖y − Xβ‖² + λ‖β‖₁
Unlike ordinary least squares and ridge regression, LASSO has no closed-form solution. The reason for this is the presence of the absolute value term in the L1 regularization, which results in a non-differentiable objective function. The lack of differentiability makes it challenging to find an analytical solution using traditional calculus-based methods.
To help solve our LASSO problem, we need to employ iterative optimization techniques to find the optimal coefficients. One approach to help solve this problem is the coordinate descent algorithm. Somewhat similar to gradient descent, coordinate descent is an optimization algorithm that minimizes a multivariate objective function by iteratively updating the parameters along coordinate axes. As we will see below, it optimizes one parameter at a time while keeping the others fixed. This process is repeated until convergence or a specified number of iterations.
In general, the steps for the Coordinate Descent algorithm are as follows:
1. Initialize Parameters: Start with an initial guess for the parameters (coefficients), often initialized to zeros or some other value.

2. Choose a Coordinate: Select a coordinate (variable) to update. This can be done cyclically, randomly, or based on some criterion.

3. Update the Selected Coordinate: Update the selected coordinate by minimizing the objective function with respect to that specific coordinate while keeping all other coordinates fixed. The update rule depends on the specific problem and objective function.

4. Repeat: Repeat steps 2 and 3 for a predefined number of iterations or until convergence criteria are met. Convergence can be determined by checking whether the change in the objective function is below a certain threshold or if the parameters have stabilized.

5. Output the Optimized Parameters: Once the algorithm converges, output the optimized parameters.
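As a minimal sketch of these steps, here is coordinate descent applied to a plain (unpenalized) least-squares objective, where the one-dimensional minimizer has a closed form. The synthetic data and variable names here are illustrative only; the LASSO version we build later adds soft thresholding to the coordinate update.

```python
import numpy as np

# Illustrative synthetic data (not from the post): noiseless linear model.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true

beta = np.zeros(3)                                   # step 1: initialize parameters
for iteration in range(1000):                        # step 4: repeat
    beta_old = beta.copy()
    for j in range(3):                               # step 2: cycle over coordinates
        # step 3: exact minimizer in beta_j with all other coordinates held fixed
        r_j = y - X @ beta + X[:, j] * beta[j]       # partial residual excluding feature j
        beta[j] = (X[:, j] @ r_j) / (X[:, j] @ X[:, j])
    if np.max(np.abs(beta - beta_old)) < 1e-12:      # step 4: convergence check
        break

print(beta)                                          # step 5: output the parameters
```

Because the data are noiseless and X has full column rank, the iterates converge to beta_true.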
Even with the basic outline for our coordinate descent algorithm, we still need to deal with the fact that the L1 penalty introduces absolute values, which means we are no longer dealing with differentiable functions. To handle this, we must introduce semi-differentiable functions, subdifferentials, and soft thresholding.
Definition 1: A function f is semi-differentiable at a point x if its one-sided (left and right) derivatives exist at x, even if the two disagree. For example, f(x) = |x| is semi-differentiable at 0.

Definition 2: For a convex function f, a subgradient at x is any value g satisfying f(y) ≥ f(x) + g(y − x) for all y; the set of all subgradients of f at x is called the subdifferential, denoted ∂f(x).

Definition 3: The soft-thresholding operator with threshold λ ≥ 0 is S_λ(x) = sign(x) · max(|x| − λ, 0).

Keep this in mind! For our purposes, we need to understand that the subdifferential of f(x) = |x| is:

∂|x| = {−1} if x < 0, [−1, 1] if x = 0, {+1} if x > 0
Karush-Kuhn-Tucker conditions
Within the field of optimization, Karush-Kuhn-Tucker (KKT) conditions are a set of necessary conditions for optimality in mathematical optimization problems. These conditions help characterize points that could be optimal solutions to an optimization problem.
Because LASSO regression is a constrained optimization problem, typically formulated as an objective function with a constraint on the L1 norm of the coefficients, it can be analyzed using the KKT conditions.
The KKT conditions for Lasso regression are crucial for understanding the behavior of the optimization problem and determining whether a specific set of coefficients and Lagrange multiplier values corresponds to an optimal solution. These conditions help characterize the trade-off between the fit to the data and the sparsity of the solution, a key aspect of Lasso regression. Solving the KKT system is often done iteratively (via our coordinate descent algorithm), and the KKT conditions guide the convergence criteria for optimization algorithms used in Lasso regression. For convex optimization problems such as the lasso, the KKT conditions are both necessary and sufficient to characterize the solution.
Without diving into a deep, rigorous proof using KKT conditions, we can solve for the 𝛽 estimates by simply replacing the derivative with the subderivative and the likelihood with the penalized likelihood.
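Concretely, this is a sketch of the standard coordinate-wise derivation: setting the subgradient of the penalized least-squares objective with respect to a single coordinate to zero yields a soft-thresholding update (here r_j denotes the partial residual that excludes feature j).

```latex
\text{Minimizing } \tfrac{1}{2}\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
\text{ over a single coordinate } \beta_j \text{ requires}
\]
\[
0 \in -x_j^\top\Bigl(y - \textstyle\sum_{k \neq j} x_k \beta_k - x_j \beta_j\Bigr)
      + \lambda\, \partial |\beta_j|,
\]
\[
\text{which gives}\quad
\beta_j = \frac{S_\lambda\!\left(x_j^\top r_j\right)}{\lVert x_j \rVert_2^2},
\qquad r_j = y - \textstyle\sum_{k \neq j} x_k \beta_k .
```

This is exactly the update our coordinate descent implementation performs below, with λ scaled by the sample size to match sklearn's per-sample penalty convention.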
In the case where the design matrix X is orthonormal, the lasso estimate is:

β̂ⱼ(lasso) = sign(β̂ⱼ(OLS)) · (|β̂ⱼ(OLS)| − λ)₊

This can be rewritten as:

β̂ⱼ(lasso) = S_λ(β̂ⱼ(OLS))

where

S_λ(x) = sign(x) · max(|x| − λ, 0)

is known as the soft-thresholding operator.
With soft thresholding, the lasso has a positive probability of yielding an estimate of exactly 0, which means it can produce sparse solutions.
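We can see this numerically with a quick numpy sketch of the operator (the array values below are arbitrary illustrations): any input whose magnitude is at most λ is mapped to exactly 0, while the rest are shrunk toward 0 by λ.

```python
import numpy as np

def soft_threshold(x: np.ndarray, lam: float) -> np.ndarray:
    # S_lam(x) = sign(x) * max(|x| - lam, 0)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

vals = np.array([-2.0, -0.5, 0.0, 0.3, 1.5])
# every entry with |x| <= lam is mapped to exactly 0
print(soft_threshold(vals, lam=1.0))
```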
To see how this works, let's go ahead and implement coordinate descent and then compare the results to the output from sklearn.
As in previous posts, let's implement coordinate descent in pytorch and try to take advantage of any GPUs we might have access to. If you don't have a GPU, pytorch will use your CPU.
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# Additional Info when using cuda
dev = 0
if device.type == 'cuda':
    print(torch.cuda.get_device_name(dev))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(dev)/1024**3, 1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(dev)/1024**3, 1), 'GB')

Using device: cuda

NVIDIA GeForce GTX 1080
Memory Usage:
Allocated: 0.0 GB
Cached: 0.0 GB

Before implementing the coordinate descent algorithm, let's implement the soft-thresholding function:
def soft_thresh(x: torch.Tensor, lam: float) -> torch.Tensor:
    # S_lam(x) = sign(x) * max(|x| - lam, 0)
    if x > 0 and lam < abs(x):
        return x - lam
    elif x < 0 and lam < abs(x):
        return x + lam
    else:
        return torch.zeros_like(x)

Following our general coordinate descent outline from above and utilizing our soft-thresholding function:
def lasso_coordinate_descent(X: torch.Tensor, y: torch.Tensor, max_iter: int = 1000,
                             alpha: float = 0.1, tol: float = 1e-8) -> Tuple:
    # X is expected to already include a leading column of ones for the intercept
    n, p = X.shape
    history = []
    # Initialize Parameters:
    # Start with an initial guess for the coefficients (random here; zeros also work).
    beta = torch.rand(p, dtype=torch.float64, device=device)
    # update the intercept given the current coefficients
    beta[0] = torch.sum(y - X[:, 1:] @ beta[1:]) / n
    for iteration in range(max_iter):
        for j in range(1, p):
            # Choose a Coordinate:
            # Coordinates are visited cyclically here; random selection also works.
            tmp_beta = beta.detach().clone()
            tmp_beta[j] = 0.0
            # Update the Selected Coordinate:
            # Minimize the objective with respect to beta_j, holding the others fixed.
            r_j = y - X @ tmp_beta      # partial residual excluding feature j
            x = X[:, j] @ r_j
            lam = alpha * n
            # soft thresholding function
            beta[j] = soft_thresh(x, lam) / (X[:, j] ** 2).sum()
        # update intercept
        beta[0] = torch.sum(y - X[:, 1:] @ beta[1:]) / n
        # Calculate the cost function
        cost = torch.sum((y - X @ beta) ** 2) + alpha * torch.sum(torch.abs(beta))
        history.append(cost.item())
        # Check for convergence
        if iteration > 0 and abs(history[-2] - history[-1]) < tol:
            print(f"iteration: {iteration} | convergence: True")
            break
    intercept = beta[0].cpu()
    coef = beta[1:].cpu()    # move back to the CPU so .numpy() works after a GPU run
    return intercept, coef

Next, to test this out, let's use sklearn's make_regression function to generate 100 samples with 3 features and set the intercept to 1.0.
X, y = make_regression(n_samples=100, n_features=3, bias=1.0, random_state=42)
# Add a bias term to X
X_bias = np.c_[np.ones((X.shape[0], 1)), X]

To sanity-check our results, let's first see what sklearn generates for the beta coefficients and intercept.
lasso = Lasso(alpha=0.1, max_iter=1000, tol=1e-4, fit_intercept=True)
lasso.fit(X, y)
print(lasso.coef_)
print(lasso.intercept_)

[74.95038302 28.02948162 17.61582968]
1.0008004301347908

To use our lasso_coordinate_descent function, we need to convert our numpy arrays to pytorch tensors.
X = torch.tensor(X, device=device)
y = torch.tensor(y, device=device)
X_bias = torch.tensor(X_bias, device=device)

i, c = lasso_coordinate_descent(X_bias, y)
print(f"coefficients: {c.numpy()}")
print(f"intercept: {i}")

iteration: 11 | convergence: True
coefficients: [74.95040821 28.02949042 17.61583359]
intercept: 1.0007984531607874

While we set the max number of iterations to 1000, note how quickly we reached convergence! Using our coordinate descent algorithm, we reached convergence in 11 iterations and were able to get the same result for our 𝛽 estimates as sklearn. (This makes sense since sklearn uses coordinate descent to solve for the 𝛽 estimates.)

Conclusion:
As there is no closed-form solution for the 𝛽 estimates of LASSO regression, we have to resort to iterative optimization methods. In this case, we’ve seen how coordinate descent can solve this optimization problem.


