Machine Learning - Data Preprocessing

In this post I will share code snippets for data preprocessing, the step that comes before training any machine learning model. Why should we preprocess our data at all? The answer is simple: preprocessing makes the data cleaner, more efficient to work with, and usable by our machine learning model.

Data preprocessing generally consists of several steps: importing the data, checking whether any values are missing and handling them accordingly, encoding categorical data where the dataset requires it, and, most importantly, splitting the data into training data and test data. A final step is feature scaling, which we often do not need to do by hand because many models take care of it themselves.

Importing Libraries

First, we import some common libraries for machine learning.

#Created_By@the_ai_datascience
#Data_preprocessing

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Import Dataset

In this step, we load the dataset that our model will use.

#import_dataset

dataset = pd.read_csv('Data.csv')
# every column except the last one
x = dataset.iloc[:, :-1].values
# the last column
y = dataset.iloc[:, -1].values

Here we load the dataset and divide it into two arrays: x holds the independent variables and y holds the dependent variable.
Using the values of the independent variables, we will predict the dependent variable in our model.
Click here to get the data set.
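
Since Data.csv itself is not included in this post, here is a minimal sketch of the same slicing on a small stand-in table. The column names and values are assumptions made up for illustration, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv: three feature columns and one target column.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
    "Purchased": ["No", "Yes", "No", "No"],
})

# Every column except the last one holds the independent variables.
x = dataset.iloc[:, :-1].values
# The last column holds the dependent variable.
y = dataset.iloc[:, -1].values

print(x.shape)  # (4, 3)
print(y.shape)  # (4,)
```

Checking the shapes like this is a quick way to confirm that the target column did not end up inside x.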

Handle Missing Data

Here we will see how to handle the missing data.

#handle_missing_data

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])

Here I use the SimpleImputer class from the scikit-learn library to handle the missing data (older releases offered Imputer in sklearn.preprocessing, which was removed in scikit-learn 0.22).
We replace every NaN value in columns 1 and 2 of x with the mean of its column.
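
Before imputing, it helps to check whether anything is missing in the first place. The sketch below counts missing values per column and then applies mean imputation with SimpleImputer from sklearn.impute. The small DataFrame is an assumption for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns with one missing value each.
df = pd.DataFrame({
    "Age": [40.0, np.nan, 30.0, 50.0],
    "Salary": [60000.0, 48000.0, np.nan, 72000.0],
})

# Count the missing values in each column before imputing.
missing_counts = df.isna().sum()
print(missing_counts)

# Replace each NaN with the mean of its own column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(df)
print(filled)
```

In this toy table the missing Age becomes 40.0 and the missing Salary becomes 60000.0, the means of the remaining values in each column.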

Encoding Categorical Data

Here we will see how to encode object (categorical) data as numeric data, since our ML model can learn only from numeric data.

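#encoding_categorical_data

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#dummy_variable
# one-hot encode column 0 and pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x = ct.fit_transform(x)
#encoding_dependent_variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

We use OneHotEncoder inside a ColumnTransformer to turn the categorical column of x into dummy variables, and LabelEncoder to encode the dependent variable. The categorical_features argument of OneHotEncoder is no longer available in current scikit-learn releases, so the column to encode is selected through ColumnTransformer instead.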
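
To see what the two encoders actually produce, here is a small sketch on made-up values. The arrays are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical feature column and target column.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])
purchased = np.array(["No", "Yes", "No", "Yes"])

# One-hot encoding creates one 0/1 dummy column per category.
onehot = OneHotEncoder()
dummies = onehot.fit_transform(countries).toarray()
print(onehot.categories_[0])  # ['France' 'Germany' 'Spain']
print(dummies)

# Label encoding maps each class of the target to an integer.
labels = LabelEncoder().fit_transform(purchased)
print(labels)  # [0 1 0 1]
```

Dummy variables suit the feature column because the categories have no natural order, while a plain integer label is enough for a two-class target.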

Splitting Dataset

Here we will split our dataset into test data and train data

#splitting_dataset_into_test_and_training_data

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

Here we split the dataset in a 20:80 ratio: 20% of the data is used for testing and 80% for training. We can change this ratio easily through the test_size argument.
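
The introduction lists feature scaling as a final step without showing it. For the cases where it is needed, the sketch below shows one common way to do it with StandardScaler after the split. The toy arrays are assumptions for illustration, and StandardScaler is only one of several scalers scikit-learn provides.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 10 rows, 2 numeric features, and a binary target.
x = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)

# Fit the scaler on the training data only, then reuse it on the test data,
# so that no information from the test set leaks into training.
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
```

After scaling, each training column has mean 0 and standard deviation 1. The test columns are scaled with the training statistics, so their mean and deviation will usually differ slightly.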

That's all for data preprocessing.
In the next blog post I will share snippets for Simple Linear Regression.

Stay Tuned :)

The AI Data Science
