
Machine Learning - Data Preprocessing




In this post I will share code snippets for data preprocessing, a step that comes before training any machine learning model. The first question is why we should preprocess our data at all. The answer is simple: preprocessing makes the data cleaner, more consistent, and usable by our machine learning model.

Data preprocessing generally consists of several steps: import the data, check whether any values are missing and handle them if so, encode categorical data where the dataset contains it, and, most importantly, split the data into training data and test data. The last step is feature scaling. We don't always need this step, because some models are not sensitive to the scale of the features and some library implementations scale the data themselves.



  • Importing Libraries 
First, we import some common libraries for machine learning.

#Created_By@the_ai_datascience
#Data_preprocessing

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


  • Import Dataset

In this step, we will load the dataset into our program.


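#import_dataset

dataset = pd.read_csv('Data.csv')
# every column except the last holds the independent variables
x = dataset.iloc[:, :-1].values
# the last column (index 3 in this dataset) holds the dependent variable
y = dataset.iloc[:, -1].values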


Here we import the dataset and divide it into two arrays, where x holds the independent variables and y holds the dependent variable.
Using the values of the independent variables, our model will predict the dependent variable.
Click here to get the dataset.
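
As a quick check of the slicing above, here is a minimal sketch that reads a tiny CSV from an in-memory string instead of Data.csv. The column names and values are made up for illustration and are only an assumption about what the real file looks like; the part taken from the post is the iloc pattern.

```python
import io
import pandas as pd

# Hypothetical sample standing in for Data.csv (all values are made up)
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
Spain,,61000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# All columns except the last are the independent variables
x = dataset.iloc[:, :-1].values
# The last column is the dependent variable
y = dataset.iloc[:, -1].values

print(x.shape)                 # (4, 3)
print(y.shape)                 # (4,)
print(dataset.isnull().sum())  # number of missing values per column
```

The last line is a simple way to find out whether any values are missing before deciding how to handle them.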





  • Handle Missing Data
        Here we will see how to handle the missing data.

#handle_missing_data

from sklearn.impute import SimpleImputer
# SimpleImputer replaces the old Imputer class, which is no longer part of scikit-learn
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

Here I use the SimpleImputer class from the scikit-learn library to handle the missing data.
Every NaN value in columns 1 and 2 of x is replaced with the mean of its own column.
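
To see what mean imputation does to actual numbers, here is a small sketch on a made-up array. It assumes a recent scikit-learn release, where the imputer class is SimpleImputer in sklearn.impute; the Age and Salary values are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up Age and Salary columns, each with one missing value
values = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, np.nan],
    [np.nan, 61000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(values)

# Column means, computed while ignoring the NaN entries
print(imputer.statistics_)
# Same array with each NaN replaced by the mean of its column
print(filled)
```

Other strategies such as 'median' or 'most_frequent' can be passed in the same way when the mean is not a good fit for the column.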



  • Encoding Categorical Data
Here we will see how to encode object (text) data as numeric data, because our ML model can only learn from numeric data.

#encoding_categorical_data

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#dummy_variable
# OneHotEncoder no longer accepts categorical_features, so the column is selected with ColumnTransformer
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x = np.array(ct.fit_transform(x))
#encoding_dependent_variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)


We one-hot encode the categorical column of x into dummy variables with OneHotEncoder, and we use LabelEncoder for the dependent variable.
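
The sketch below shows what the encoded arrays look like on a few made-up rows. It assumes a recent scikit-learn release and selects the categorical column through ColumnTransformer; the names x_encoded and y_encoded and the sample values are only for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up rows: one categorical column (country) and one numeric column (age)
x = np.array([
    ['France', 44.0],
    ['Spain', 27.0],
    ['Germany', 30.0],
    ['Spain', 38.0],
], dtype=object)
y = np.array(['No', 'Yes', 'No', 'Yes'])

# One-hot encode column 0 and pass the remaining column through unchanged;
# sparse_threshold=0 asks for a dense array as the result
ct = ColumnTransformer(
    transformers=[('country', OneHotEncoder(), [0])],
    remainder='passthrough',
    sparse_threshold=0,
)
x_encoded = ct.fit_transform(x)

# Map the class labels to integers
y_encoded = LabelEncoder().fit_transform(y)

print(x_encoded)  # three dummy columns (France, Germany, Spain) followed by age
print(y_encoded)  # [0 1 0 1]
```

Dummy variables are used for the country column because plain integer labels would suggest an order between countries that does not exist.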


  • Splitting Dataset 
       Here we will split our dataset into test data and train data 
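#splitting_dataset_into_test_and_training_data

# train_test_split lives in sklearn.model_selection; the old sklearn.cross_validation module no longer exists
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)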

Here we split the dataset in a 20:80 ratio: 20% of the data is used for testing and 80% for training. We can change this ratio easily through the test_size argument.
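
Feature scaling is listed among the preprocessing steps but has no snippet of its own, so here is a minimal sketch of one common way to do it, standardization with scikit-learn's StandardScaler. The data is randomly generated for illustration, and the choice of StandardScaler is an assumption rather than something the steps above prescribe.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up Age and Salary features on very different scales
x = np.column_stack([
    rng.uniform(20, 60, size=50),
    rng.uniform(30000, 90000, size=50),
])
y = rng.integers(0, 2, size=50)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
x_train = scaler.fit_transform(x_train)
# Reuse the training statistics on the test data
x_test = scaler.transform(x_test)

print(x_train.shape, x_test.shape)  # (40, 2) (10, 2)
print(x_train.mean(axis=0))         # close to 0 for each column
print(x_train.std(axis=0))          # close to 1 for each column
```

The scaler is fitted after the split and only on the training data, so that no information from the test data leaks into training.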


That's all for data preprocessing.
In the next blog post I will share snippets for Simple Linear Regression.

Stay Tuned :)
