Machine Learning - Data Preprocessing
In this post I will share code snippets for data preprocessing, a step needed in almost any machine learning model. Why should we preprocess our data at all? Simply because it makes the data cleaner, more consistent, and more usable for our machine learning model.
Data preprocessing generally consists of several steps: import the data; check whether any values are missing and, if so, handle them; encode categorical data where the dataset requires it; and, most importantly, split the data into a training set and a test set. There is also feature scaling, which we often don't have to do ourselves because many models or library implementations handle it internally (a short scaling sketch appears after the splitting step below).
- Importing Libraries
First, we import some common libraries used in machine learning.
#Created_By@the_ai_datascience
#Data_preprocessing
import pandas as pd               # data loading and manipulation
import matplotlib.pyplot as plt   # plotting and visualisation
import numpy as np                # numerical arrays
- Import Dataset
In this step we load the dataset into our model.
#import_dataset
dataset = pd.read_csv('Data.csv')
x = dataset.iloc[:, :-1].values   # independent variables: every column except the last
y = dataset.iloc[:, 3].values     # dependent variable: the fourth column
Here we import the dataset and divide it into two arrays, where x holds the independent variables and y holds the dependent variable. Using the values of the independent variables, our model will predict the dependent variable.
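A quick sanity check on what the slicing produced can save debugging later. The exact columns depend on your copy of Data.csv, so treat this as illustrative only:
#inspect_the_arrays (illustrative only)
print(dataset.head())     # preview of the raw DataFrame
print(x.shape, y.shape)   # x: feature rows/columns, y: one target value per row
print(x[:3])              # first three feature rows
print(y[:3])              # first three target values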
- Handle Missing Data
Here we will see how to handle the missing data.
#handle_missing_data
# Note: sklearn.preprocessing.Imputer was removed in newer scikit-learn releases;
# SimpleImputer from sklearn.impute is the current replacement.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])   # the two numeric columns
Here I use the SimpleImputer class from the scikit-learn library to handle the missing data. Every NaN value in the selected columns is replaced with the mean of its respective column.
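To see what mean imputation actually does, here is a tiny illustrative example with made-up numbers (not taken from Data.csv):
#mean_imputation_example (illustrative only; uses np and SimpleImputer imported above)
demo = np.array([[25.0, 50000.0],
                 [np.nan, 60000.0],
                 [35.0, np.nan]])
print(SimpleImputer(strategy='mean').fit_transform(demo))
# Column means are 30.0 and 55000.0, so the two NaNs become 30.0 and 55000.0:
# [[   25. 50000.]
#  [   30. 60000.]
#  [   35. 55000.]]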
- Encoding Categorical Data
Here we will see how to encode object (text) data into numeric data, since our ML model can only learn from numeric values.
#encoding_categorical_data
# Note: OneHotEncoder's categorical_features argument was removed in newer
# scikit-learn releases; a ColumnTransformer now selects the column to encode,
# so the separate LabelEncoder pass over x is no longer needed.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#dummy_variable
ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))
#encoding_dependent_variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
We use OneHotEncoder (via ColumnTransformer) to turn the categorical feature column into dummy variables, and LabelEncoder to encode the dependent variable.
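To make the dummy-variable idea concrete, here is a small illustrative example; the country column is made up for this sketch and is not taken from Data.csv:
#one_hot_encoding_example (illustrative only; uses np and OneHotEncoder imported above)
countries = np.array([['France'], ['Spain'], ['Germany'], ['Spain']])
encoder = OneHotEncoder(sparse_output=False)   # use sparse=False on older scikit-learn
print(encoder.fit_transform(countries))
# Each category gets its own 0/1 column (ordered alphabetically):
# France  -> [1. 0. 0.]
# Germany -> [0. 1. 0.]
# Spain   -> [0. 0. 1.]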
- Splitting Dataset
Here we will split our dataset into training data and test data.
#splitting_dataset_into_test_and_training_data
# Note: sklearn.cross_validation was removed in newer scikit-learn releases;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
Here we split the dataset in an 80:20 ratio: 80% of the data is used for training and 20% for testing. This ratio is easy to change through the test_size argument.
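The intro mentioned feature scaling as the final, often optional, step. The original snippets don't include it, but a minimal sketch using scikit-learn's StandardScaler (my choice of scaler, an assumption rather than part of the original post) would look like this; the scaler is fit on the training data only and then reused on the test data:
#feature_scaling (optional; a minimal sketch, not from the original snippets)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)   # learn mean and std from the training set only
x_test = sc.transform(x_test)         # apply the same scaling to the test set
Whether to scale the dummy variable columns as well is a judgement call; some people scale only the numeric columns.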
That's all for Data-Preprocessing.
In the next blog I will share the snippets for Simple Linear Regression.
Stay Tuned :)