When you are faced with an unknown data source for regression analysis, you should always try linear regression first. The main reason is simple:
Linear regression produces a stable and accurate model, one that is unlikely to fail miserably on erratic data.
If the data fits a linear model well, it is very likely that you have derived the true equation underlying the problem. Let's see a very basic example with a purely linear equation:
y = x1 + x2
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
# generating data
x1 = np.random.randint(100,size=100)
x2 = np.random.randint(100,size=100)
X = np.stack((x1,x2),axis=1)
# purely linear equation
y = x1+x2
# splitting data: hold out 80% for testing, so each model trains on only 20 samples
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.8)
# linear regression
linearRegressor = LinearRegression()
linearRegressor.fit(X_train,y_train)
y_pred_linear = linearRegressor.predict(X_test)
print(linearRegressor.intercept_) # prints a value near zero
print(linearRegressor.coef_) # prints coefficients near [1. 1.]
# random forest regression
randomForestRegressor = RandomForestRegressor(max_depth=100,n_estimators=100)
randomForestRegressor.fit(X_train,y_train)
y_pred_randomForest = randomForestRegressor.predict(X_test)
# plotting predicted vs. actual values for both models
plt.scatter(y_test,y_pred_linear,label='Linear')
plt.scatter(y_test,y_pred_randomForest,label='Random Forest')
# identity line y = x: perfect predictions fall exactly on it
plt.plot([np.min(y_test),np.max(y_test)],[np.min(y_test),np.max(y_test)],'k-',label='y = x')
plt.xlabel('actual')
plt.ylabel('predicted')
plt.legend()
plt.show()
The resulting plot shows the degree of fit for both models. Additionally, from printing the linear regression intercept and coefficients, we can see that it derived the following equation:
y = 1*x1 + 1*x2 + 0
which is spot on.
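As a quick numeric check on that degree of fit (a minimal sketch reusing the variables from the script above), we can compare the two models' R² scores on the test set:
from sklearn.metrics import r2_score
# an R^2 of 1.0 is a perfect fit: linear regression should score at
# (or numerically indistinguishable from) 1.0, while the random forest,
# which approximates the plane with piecewise-constant splits,
# should land slightly below it
print('Linear:       ', r2_score(y_test,y_pred_linear))
print('Random Forest:', r2_score(y_test,y_pred_randomForest))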
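And because the linear model has recovered the true equation, it keeps working far outside the training range, something a random forest by construction cannot do (an extra check, not part of the original experiment; the inputs below are arbitrary):
# inputs far outside the training range of [0, 100)
X_far = np.array([[500,500],[1000,2000]])
print(linearRegressor.predict(X_far))       # close to [1000. 3000.]
print(randomForestRegressor.predict(X_far)) # capped near the largest y seen in training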