This project implements a variety of regression models, both linear and ensemble-based, for predicting numerical outcomes. Hyperparameters for each model are tuned with Grid Search and Randomized Grid Search.
- Linear Models
- Ensemble Models
- Hyperparameter Optimization
  - Grid Search
  - Randomized Grid Search
- How to Use
- Dependencies
## Linear Models

The following linear regression models are included (a sketch of the `regressors` dictionary that collects them follows the list):
- **LinearRegression**: A basic linear regression model that minimizes the mean squared error.
- **Ridge**: Adds L2 regularization to the loss function to prevent overfitting.
- **SGDRegressor**: Minimizes the loss function iteratively using stochastic gradient descent.
- **Lasso**: Adds L1 regularization to the loss function, which performs feature selection by driving some coefficients to zero.
- **ElasticNetCV**: Combines L1 and L2 regularization with cross-validation to find the best combination of hyperparameters.
- **HuberRegressor**: A robust regression method that minimizes the influence of outliers on the model.
- **QuantileRegressor**: Predicts a specific quantile (e.g., the median) instead of the mean of the response variable.
- **RANSACRegressor**: A robust regression model that iteratively fits the model to subsets of the data and identifies inliers.
- **PoissonRegressor**: Used for count data regression, assuming the response variable follows a Poisson distribution.
- **TweedieRegressor**: Handles distributions from the Tweedie family, including compound Poisson-gamma.
- **GammaRegressor**: Models data that follow a gamma distribution, useful for positively skewed data.
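The keys used in "How to Use" below suggest the models live in a dictionary keyed by class name. A minimal sketch of what that `regressors` dictionary might look like (the hyperparameter values here are illustrative defaults, not the project's exact configuration):

```python
from sklearn.linear_model import (
    LinearRegression, Ridge, SGDRegressor, Lasso, ElasticNetCV,
    HuberRegressor, QuantileRegressor, RANSACRegressor,
    PoissonRegressor, TweedieRegressor, GammaRegressor,
)

# Illustrative sketch: keys follow the class names used elsewhere in this README.
regressors = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'SGDRegressor': SGDRegressor(max_iter=1000),
    'Lasso': Lasso(alpha=0.1),
    'ElasticNetCV': ElasticNetCV(cv=5),
    'HuberRegressor': HuberRegressor(),
    'QuantileRegressor': QuantileRegressor(quantile=0.5),
    'RANSACRegressor': RANSACRegressor(),
    'PoissonRegressor': PoissonRegressor(),
    'TweedieRegressor': TweedieRegressor(power=1.5),  # compound Poisson-gamma
    'GammaRegressor': GammaRegressor(),
}
```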
## Ensemble Models

The following ensemble-based regression models are included (an `ensemble_models` dictionary sketch follows the list):
- **DecisionTreeRegressor**: A non-linear model that uses a tree structure to split data based on feature values.
- **RandomForestRegressor**: An ensemble of decision trees where each tree is trained on a random subset of the data.
- **GradientBoostingRegressor**: An iterative method that combines weak learners (trees) to minimize the loss function.
- **XGBoost**: An optimized gradient boosting implementation designed for speed and performance.
- **LightGBM**: A gradient boosting framework optimized for efficiency, capable of handling large datasets.
- **CatBoost**: A gradient boosting algorithm that handles categorical features natively.
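As with the linear models, a minimal sketch of how the `ensemble_models` dictionary might be assembled (keys and settings are assumptions for illustration):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# Illustrative sketch: default constructors, with CatBoost training logs silenced.
ensemble_models = {
    'DecisionTreeRegressor': DecisionTreeRegressor(),
    'RandomForestRegressor': RandomForestRegressor(n_estimators=100),
    'GradientBoostingRegressor': GradientBoostingRegressor(),
    'XGBRegressor': XGBRegressor(),
    'LGBMRegressor': LGBMRegressor(),
    'CatBoostRegressor': CatBoostRegressor(verbose=0),
}
```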
## Hyperparameter Optimization

### Grid Search

This exhaustive search method evaluates every combination of the hyperparameter values specified in the search space.
```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# Exhaustive search: all 3 x 3 = 9 combinations are evaluated across 5 CV folds.
param_grid = {
    'alpha': [0.1, 0.5, 1.0],
    'max_iter': [1000, 2000, 3000]
}

# X_train and y_train are assumed to be defined already (see "How to Use" below).
grid_search = GridSearchCV(estimator=Ridge(), param_grid=param_grid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
```
### Randomized Grid Search

This method evaluates a random subset of the hyperparameter search space, making it much faster when the search space is large or the model is expensive to fit.
```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Candidate values: RandomizedSearchCV samples n_iter=10 of the 27 combinations.
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'learning_rate': [0.01, 0.1, 0.2]
}

random_search = RandomizedSearchCV(estimator=GradientBoostingRegressor(), param_distributions=param_dist, scoring='r2', n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)

print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)
## How to Use

- **Import Models**: The models are defined in dictionaries (`regressors` and `ensemble_models`). Use the key to access a specific model.

  ```python
  model = regressors['LinearRegression']
  model.fit(X_train, y_train)
  predictions = model.predict(X_test)
  ```

- **Perform Optimization**: Use either `GridSearchCV` or `RandomizedSearchCV` to optimize hyperparameters, as shown above.

- **Evaluate Models**: Evaluate the model using metrics such as Mean Squared Error (MSE), R2 Score, or Mean Absolute Error (MAE). An end-to-end example follows this list.

  ```python
  from sklearn.metrics import mean_squared_error, r2_score

  mse = mean_squared_error(y_test, predictions)
  r2 = r2_score(y_test, predictions)
  print(f"MSE: {mse}, R2: {r2}")
  ```
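Putting the three steps together, a minimal end-to-end sketch (it assumes the `regressors` dictionary above and uses synthetic data purely for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data for illustration; substitute your own dataset.
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = regressors['Ridge']  # any key from regressors or ensemble_models works
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(f"MSE: {mean_squared_error(y_test, predictions):.4f}")
print(f"R2: {r2_score(y_test, predictions):.4f}")
```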
## Dependencies

- Python 3.7+
- scikit-learn
- xgboost
- lightgbm
- catboost
Install the required libraries:

```bash
pip install scikit-learn xgboost lightgbm catboost
```