Python Scikit Learn - Ridge Regression

Ridge regression, a variant of linear regression, is an essential tool in the arsenal of data scientists and machine learning practitioners. It addresses some of the limitations of linear regression, particularly when dealing with multicollinearity or when the number of features exceeds the number of observations. In this article, we will explore ridge regression using Scikit-Learn, one of Python's most popular machine learning libraries.

Understanding Ridge Regression

Ridge regression, also known as Tikhonov regularization, adds a regularization term to the ordinary least squares (OLS) objective function. This term penalizes the magnitude of the coefficients, effectively shrinking them towards zero but not setting them exactly to zero. The ridge regression objective function is given by:

\min_{w} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \sum_{j=1}^{p} w_j^2

Here, w is the vector of model coefficients, x_i is the feature vector for the i-th observation, y_i is the corresponding target value, n is the number of observations, p is the number of features, and λ ≥ 0 is the regularization parameter. The second term is the regularization (L2) penalty, which penalizes large coefficients.

Why Use Ridge Regression?

When features are highly correlated (multicollinearity), OLS coefficient estimates become unstable: small changes in the data can produce large swings in the fitted coefficients. The ridge penalty trades a small amount of bias for a substantial reduction in variance, yielding more stable estimates. It also keeps the problem well-posed when the number of features exceeds the number of observations, a setting in which plain OLS has no unique solution.
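Although Scikit-Learn handles the optimization internally, the objective above has a well-known closed-form solution, w = (XᵀX + λI)⁻¹Xᵀy. The following is a minimal NumPy sketch of that formula (the function name is ours, and the intercept is omitted for simplicity):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^(-1) X^T y."""
    n_features = X.shape[1]
    # Adding lam * I keeps the matrix well-conditioned (and invertible),
    # even when the columns of X are highly correlated.
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Tiny illustration on synthetic data (values are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)
print(ridge_closed_form(X, y, lam=1.0))  # coefficients shrunk towards zero
```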
Implementing Ridge Regression with Scikit-Learn

Scikit-Learn provides an easy-to-use implementation of ridge regression through the Ridge class. Let's go through a practical example.

Step 1: Importing Libraries

First, we need to import the necessary libraries.
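The following import list covers everything used in the steps below:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
```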
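Step 2: Loading the Data

For this example, we will use the Boston Housing dataset. Note that the bundled loader (load_boston) was removed from Scikit-Learn in version 1.2; one way to obtain the dataset today is from OpenML, as in this sketch (which assumes network access):

```python
from sklearn.datasets import fetch_openml

boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data.astype(float)    # 13 features: CRIM, ZN, ..., LSTAT
y = boston.target.astype(float)  # median home value in $1000s
```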
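Step 3: Splitting the Data

Before training, we split the data into training and test sets. A minimal sketch, assuming an 80/20 split and standardized features (the standardization is our assumption, but a common one: ridge's L2 penalty is scale-sensitive, so features on very different scales would be penalized unevenly):

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets,
# so no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```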
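Step 4: Training the Model

We instantiate the Ridge class and fit it to the training data.

```python
ridge = Ridge(alpha=1.0)  # alpha is the regularization strength (λ above)
ridge.fit(X_train_scaled, y_train)
```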
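Step 5: Evaluating the Model

We use the model to make predictions on the test data and evaluate its performance, here with mean squared error and the R² score (the exact numbers depend on the data source and split; the values below are those reported with the original example):

```python
y_pred = ridge.predict(X_test_scaled)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))
```

Output:

Mean Squared Error: 25.41958712682191
R^2 Score: 0.6693702691495616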
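Step 6: Analyzing the Coefficients

One of the key benefits of ridge regression is that it shrinks the coefficients. We can inspect the coefficients to see the effect of regularization, for example by pairing them with their feature names (a sketch; pandas makes this convenient):

```python
coefficients = pd.Series(ridge.coef_, index=X.columns)
print(coefficients)
```

Output:

CRIM      -1.038819
ZN         1.021696
INDUS      0.205204
CHAS       0.780355
NOX       -1.821555
RM         2.918722
AGE       -0.820582
DIS       -3.028661
RAD        2.405121
TAX       -1.499506
PTRATIO   -2.063730
B          0.830963
LSTAT     -3.837109
dtype: float64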
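Step 7: Tuning the Regularization Parameter

The performance of ridge regression depends on the regularization parameter α. We can use cross-validation to find the optimal value. One way to do this is with RidgeCV, as sketched below (the candidate grid is an assumption; any reasonable set of values works):

```python
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5)
ridge_cv.fit(X_train_scaled, y_train)

print("Best alpha:", ridge_cv.alpha_)
```

Output:

Best alpha: 1.0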
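Visualizing the Results

Visualizing the performance of ridge regression can help in understanding its behavior better. Let's plot the true versus predicted values, as in this matplotlib sketch (the dashed diagonal marks perfect predictions):

```python
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()], "r--", lw=2)
plt.xlabel("True Values")
plt.ylabel("Predicted Values")
plt.title("Ridge Regression: True vs. Predicted Values")
plt.show()
```

The output is a scatter plot of true versus predicted values; points close to the dashed diagonal indicate accurate predictions.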
Conclusion

Ridge regression is a powerful technique that addresses some of the limitations of ordinary least squares regression, particularly in the presence of multicollinearity and high-dimensional data. Using Scikit-Learn, implementing ridge regression is straightforward, allowing for easy experimentation with different regularization parameters and model evaluation. By penalizing large coefficients, ridge regression can lead to more stable and interpretable models that generalize better to new data. As with any machine learning technique, it is crucial to carefully tune the hyperparameters and validate the model to ensure optimal performance.