Machine Learning - Multiple Linear Regression



In multiple linear regression, the target depends linearly on more than one variable:
              Y = a0*X0 + a1*X1 + a2*X2 + ... + an*Xn   (with X0 = 1, so a0 is the intercept)
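
As a quick illustration (not from the original post), a prediction is just a weighted sum of the features. A minimal sketch with made-up coefficients:

import numpy as np

# hypothetical coefficients a0..a3 (a0 multiplies X0 = 1, so it acts as the intercept)
a = np.array([5.0, 2.0, -1.5, 0.5])

# one observation: X0 = 1 (intercept term), then the features X1, X2, X3
x = np.array([1.0, 3.0, 2.0, 4.0])

y = a.dot(x)   # Y = a0*1 + a1*X1 + a2*X2 + a3*X3
print(y)       # 5 + 6 - 3 + 2 = 10.0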

First, we follow the same steps as in simple linear regression, up to the prediction step.
Click here to get the dataset.
 
#Machine Learning series: Multiple Linear Regression
#created by @the ai datascience

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

#import the dataset
dataset = pd.read_csv('Startups.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

#encode the categorical column (index 3) as dummy variables
#(LabelEncoder/OneHotEncoder with categorical_features was removed
# from scikit-learn; ColumnTransformer is the current way to do this)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('encoder', OneHotEncoder(), [3])],
                       remainder = 'passthrough')
x = np.array(ct.fit_transform(x), dtype = float)

#avoid the dummy variable trap by dropping one dummy column
x = x[:, 1:]

#split the dataset into training and test sets (80/20)
#(sklearn.cross_validation was renamed to sklearn.model_selection)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

#fit multiple linear regression to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

#predict the test set results
y_pred = regressor.predict(x_test)


Here we create dummy variables for the City column, and we hold out 20% of the data as the test set.
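
To sanity-check the fit (this step is not in the original snippet), you can compare the predictions against the held-out test values, for example with scikit-learn's r2_score:

from sklearn.metrics import r2_score

#predicted vs. actual values, side by side
print(np.concatenate((y_pred.reshape(-1, 1), y_test.reshape(-1, 1)), axis = 1))

#R^2 closer to 1 means the model explains more of the variance
print(r2_score(y_test, y_pred))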

Now we have to build our optimal model, and we choose backward elimination for this. In backward elimination, we first take all the columns into consideration and check each one's p-value, which tells us how significant that feature is for the prediction: the higher the p-value, the lower the significance. After fitting with all the features, we find the feature with the highest p-value and remove it from the model. We repeat this process until the highest remaining p-value is below our significance level.

#Build the optimal model using backward elimination
import statsmodels.api as sm   #OLS lives in statsmodels.api, not statsmodels.formula.api

#add a column of ones to x for the intercept term (X0 = 1); the dataset has 50 rows
x = np.append(arr = np.ones((50, 1)).astype(int), values = x, axis = 1)

#round 1: start with all the columns
x_opt = x[:, [0, 1, 2, 3, 4, 5]]
regressor_ols = sm.OLS(endog = y, exog = x_opt).fit()
print(regressor_ols.summary())

#round 2: the column at index 2 had the highest p-value, so drop it
x_opt = x[:, [0, 1, 3, 4, 5]]
regressor_ols = sm.OLS(endog = y, exog = x_opt).fit()
print(regressor_ols.summary())

#round 3: drop index 1
x_opt = x[:, [0, 3, 4, 5]]
regressor_ols = sm.OLS(endog = y, exog = x_opt).fit()
print(regressor_ols.summary())

#round 4: drop index 4
x_opt = x[:, [0, 3, 5]]
regressor_ols = sm.OLS(endog = y, exog = x_opt).fit()
print(regressor_ols.summary())

#round 5: drop index 5, leaving the intercept and the column at index 3
x_opt = x[:, [0, 3]]
regressor_ols = sm.OLS(endog = y, exog = x_opt).fit()
print(regressor_ols.summary())



Here we take all the columns first. We choose our significance level as 5% (0.05), so we will delete every column whose p-value is above 0.05. Fitting the full model, we notice that the 3rd column (index 2) has the highest p-value, so we remove it. We then remove the 2nd column (index 1), the 5th column (index 4), and the 6th column (index 5), and we get our final model.
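
The same elimination can be automated. Below is a minimal sketch (not from the original post) that refits the model in a loop and drops the highest-p-value column until every remaining p-value is at or below the significance level; note that, as written, it may also drop the intercept column:

import numpy as np
import statsmodels.api as sm

def backward_elimination(x, y, sl = 0.05):
    x_opt = x.copy()
    while True:
        regressor_ols = sm.OLS(endog = y, exog = x_opt).fit()
        pvals = np.asarray(regressor_ols.pvalues)
        if pvals.max() <= sl:
            return regressor_ols, x_opt
        #drop the column with the highest p-value and refit
        x_opt = np.delete(x_opt, pvals.argmax(), axis = 1)

final_model, x_final = backward_elimination(x, y)
print(final_model.summary())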


That's all for Multiple Linear Regression. In our next post we will share the snippets for Polynomial Regression.

Stay Tuned :)
