Regression
Regression is used when the label is a continuous variable.
OLS regression is covered first, followed by three regressors with regularization: LASSO, Ridge, and Elastic Net.
OLS Regression
Ordinary Least Squares Regression, or OLS Regression, is the most basic and fundamental form of regression. A best-fit line ŷ = a + bx
is drawn using the ordinary least squares method, i.e., by minimizing the sum of squared vertical distances (residuals) from each (x, y) point to the regression line.
OLS can be conducted using the statsmodels package.
import statsmodels.formula.api as smf

model = smf.ols(formula='diameter ~ depth', data=df3).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: diameter R-squared: 0.512
Model: OLS Adj. R-squared: 0.512
Method: Least Squares F-statistic: 1.895e+04
Date: Tue, 02 Aug 2016 Prob (F-statistic): 0.00
Time: 17:10:34 Log-Likelihood: -51812.
No. Observations: 18067 AIC: 1.036e+05
Df Residuals: 18065 BIC: 1.036e+05
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 2.2523 0.054 41.656 0.000 2.146 2.358
depth 11.5836 0.084 137.675 0.000 11.419 11.749
==============================================================================
Omnibus: 12117.030 Durbin-Watson: 0.673
Prob(Omnibus): 0.000 Jarque-Bera (JB): 391356.565
Skew: 2.771 Prob(JB): 0.00
Kurtosis: 25.117 Cond. No. 3.46
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Or the scikit-learn package.
from sklearn import linear_model
reg = linear_model.LinearRegression()
model = reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
print(model)
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
print(reg.coef_)
# array([ 0.5, 0.5])
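Predictions then follow from the fitted coefficients (0.5 × 3 + 0.5 × 3 = 3 for a new point [3, 3]):
print(reg.predict([[3, 3]]))
# array([ 3.])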
# R-squared on train/test splits (X_train, y_train, X_test, y_test assumed defined)
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)
LASSO Regression
LASSO stands for Least Absolute Shrinkage and Selection Operator. It adds an L1 penalty (the sum of absolute coefficient values) to the OLS loss; when alpha = 0 it reduces to normal OLS regression, while larger alphas shrink some coefficients exactly to zero, performing feature selection.
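A minimal sketch of the effect of alpha on toy data (the alpha value here is illustrative). Since scikit-learn recommends LinearRegression over Lasso(alpha=0), the OLS case is fitted directly:

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# toy data, purely for illustration
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.1, 1.9, 3.2])

ols = LinearRegression().fit(X, y)    # equivalent to alpha = 0
lasso = Lasso(alpha=0.5).fit(X, y)    # the L1 penalty shrinks the slope
print(ols.coef_)      # ~1.04
print(lasso.coef_)    # ~0.64, pulled toward zero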
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
# build a DataFrame from the Boston housing data (assumed to be the source of df)
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

# standardize each feature column to mean 0, sd 1
for i in df.columns[:-1]:
    df[i] = preprocessing.scale(df[i].astype('float64'))
df.describe()

feature, target = df[boston.feature_names], df['MEDV']
train_feature, test_feature, train_target, test_target = \
    train_test_split(feature, target, random_state=123, test_size=0.2)
model = LassoLarsCV(cv=10, precompute=False).fit(train_feature, train_target)
# Compare the regression coefficients, and see which one LASSO removed.
# LSTAT is the most important predictor,
# followed by RM, DIS, and RAD. AGE is removed by LASSO
df2 = pd.DataFrame(model.coef_, index=feature.columns)
df2.sort_values(by=0, ascending=False)
# RM 3.050843
# RAD 2.040252
# ZN 1.004318
# B 0.629933
# CHAS 0.317948
# INDUS 0.225688
# AGE 0.000000
# CRIM -0.770291
# NOX -1.617137
# TAX -1.731576
# PTRATIO -1.923485
# DIS -2.733660
# LSTAT -3.878356
train_error = mean_squared_error(train_target, model.predict(train_feature))
test_error = mean_squared_error(test_target, model.predict(test_feature))

# MSE
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)

# R-square
rsquared_train = model.score(train_feature, train_target)
rsquared_test = model.score(test_feature, test_target)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
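LassoLarsCV selects the penalty strength by cross-validation; the chosen value can be inspected on the fitted model above:

print('alpha chosen by 10-fold CV')
print(model.alpha_)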
Ridge Regression
Ridge regression adds an L2 penalty (the sum of squared coefficient values) to the OLS loss; it shrinks coefficients toward zero but, unlike LASSO, does not set them exactly to zero.
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# X_crime, y_crime: features and target of a crime dataset, assumed already loaded
X_train, X_test, y_train, y_test = train_test_split(X_crime, y_crime,
                                                    random_state=0)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
linridge = Ridge(alpha=20.0).fit(X_train_scaled, y_train)
print('ridge regression linear model intercept: {}'
.format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
.format(linridge.coef_))
print('R-squared score (train): {:.3f}'
.format(linridge.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}'
.format(linridge.score(X_test_scaled, y_test)))
print('Number of non-zero features: {}'
.format(np.sum(linridge.coef_ != 0)))
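The alpha of 20.0 above is fixed by hand. As an alternative sketch, scikit-learn's RidgeCV can select it by cross-validation over a candidate grid (the grid here is illustrative; the scaled splits from above are reused):

from sklearn.linear_model import RidgeCV

# candidate alphas are illustrative; RidgeCV picks the best by cross-validation
linridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0, 20.0, 50.0]).fit(X_train_scaled, y_train)
print('alpha chosen by CV: {}'.format(linridge_cv.alpha_))
print('R-squared score (test): {:.3f}'
      .format(linridge_cv.score(X_test_scaled, y_test)))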
Elastic Net
Elastic Net combines the L1 penalty of LASSO and the L2 penalty of ridge regression to get the best of both worlds: it can zero out coefficients like LASSO while retaining ridge's stability when features are correlated.
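A minimal sketch with scikit-learn's ElasticNet on synthetic data (the alpha and l1_ratio values are illustrative). l1_ratio mixes the two penalties: 1.0 is pure LASSO, 0.0 is pure ridge:

from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression

# synthetic regression data, purely for illustration
X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# alpha sets overall penalty strength; l1_ratio=0.5 is an even L1/L2 mix
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print(enet.coef_)    # coefficients are shrunk; larger alphas push more of them to exactly zero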
Tree Regressors
For each of the tree classifiers, there exists a regressor counterpart with the same interface.
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
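A minimal usage sketch on toy data (the hyperparameters are illustrative); the interface mirrors the classifiers, with fit and predict:

# toy data, purely for illustration
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0.0, 1.0, 2.0, 3.0]

dtr = DecisionTreeRegressor(max_depth=3).fit(X, y)
rfr = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(dtr.predict([[1.5, 1.5]]))
print(rfr.predict([[1.5, 1.5]]))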