The 100-Day Machine Learning Guide (100-Days-Of-ML-Code)

100 Days of ML Code - Day 1: Data Preprocessing

2018-09-13  TJH_KYC

Preface

Preview: the Chinese infographic

(Figure: Day 1.jpg)

Step by Step

Step 1: Importing the libraries

# 1.Importing the required libraries
import numpy as np
import pandas as pd

Step 2: Importing the dataset

# 2.Importing the Dataset
dataset = pd.read_csv("https://raw.githubusercontent.com/Avik-Jain/100-Days-Of-ML-Code/master/datasets/Data.csv")
dataset.head()
# dataset.iloc[:,:-1] is a DataFrame; .values converts it to a NumPy array
X = dataset.iloc[:,:-1].values   # all columns but the last -> features
y = dataset.iloc[:,3].values     # column 3 (Purchased, the last) -> labels
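To see what the iloc slicing does, here is a tiny self-contained sketch; the toy DataFrame below is made up for illustration and is not the actual Data.csv:

```python
import pandas as pd

# A toy frame mimicking the shape of Data.csv: 3 feature columns + 1 label column
toy = pd.DataFrame({
    "Country": ["France", "Spain"],
    "Age": [44.0, 27.0],
    "Salary": [72000.0, 48000.0],
    "Purchased": ["No", "Yes"],
})

# iloc[:, :-1] keeps every column except the last; .values yields a NumPy array
X_toy = toy.iloc[:, :-1].values
y_toy = toy.iloc[:, -1].values

print(X_toy.shape)  # (2, 3)
print(y_toy)
```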

Step 3: Handling the missing data

# 3.Handling the missing data
# the old sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
# SimpleImputer is its replacement
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
Z = imputer.fit(X[:,1:3])          # learn the column means of Age and Salary
Z.statistics_                      # the learned means
X[:,1:3] = Z.transform(X[:,1:3])   # fill the NaN entries with those means
# X[:,1:3] = imputer.fit_transform(X[:,1:3])   # equivalent one-liner
# print(X[:,1:3])
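A minimal, self-contained sketch of mean imputation with SimpleImputer on made-up data: the column mean of 1 and 3 is 2, so the NaN should be filled with 2.0.

```python
import numpy as np
from sklearn.impute import SimpleImputer

A = np.array([[1.0], [np.nan], [3.0]])
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imp.fit_transform(A)   # fit learns the mean, transform fills the NaN
print(imp.statistics_)          # [2.]
print(filled.ravel())           # [1. 2. 3.]
```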

Step 4: Encoding categorical data and creating dummy variables

# 4.Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])   # country names -> integer codes
print(X[:,0])
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)             # "No"/"Yes" -> 0/1
print(y)

# Creating a dummy variable
# OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer now selects which column(s) to encode
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("onehot", OneHotEncoder(), [0])], remainder="passthrough")
X = ct.fit_transform(X)   # one indicator column per country, other columns passed through
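To make the dummy-variable step concrete, here is a minimal sketch of one-hot encoding on a made-up country column (calling .toarray() because the encoder returns a sparse matrix by default): two distinct categories become two indicator columns, in sorted order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([["France"], ["Spain"], ["France"]])
enc = OneHotEncoder()
onehot = enc.fit_transform(countries).toarray()  # sparse by default -> densify
print(enc.categories_)   # categories are sorted: France first, Spain second
print(onehot)            # France -> [1, 0], Spain -> [0, 1]
```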

Step 5: Splitting the dataset into training and test sets

# 5.Splitting the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
# 80/20 split; random_state=0 makes the shuffle reproducible
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
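A quick sketch of how test_size controls the split, on made-up data: with 10 samples and test_size=0.2, 8 samples go to training and 2 to test.

```python
import numpy as np
from sklearn.model_selection import train_test_split

Xd = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
yd = np.arange(10)
Xtr, Xte, ytr, yte = train_test_split(Xd, yd, test_size=0.2, random_state=0)
print(Xtr.shape, Xte.shape)  # (8, 2) (2, 2)
```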

Step 6: Feature scaling

# 6.Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # learn mean/std on the training set only
X_test = sc_X.transform(X_test)        # apply the same scaling to the test set
print(X_train);print(X_test)
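Why fit on the training set only: the scaler memorizes the training mean and standard deviation and reuses them on the test set, so the test set never influences the statistics. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [2.0], [4.0]])  # mean 2
test = np.array([[2.0]])

sc = StandardScaler()
train_s = sc.fit_transform(train)  # learns mean/std from the training data only
test_s = sc.transform(test)        # reuses the training statistics

print(sc.mean_)   # [2.]
print(test_s)     # [[0.]] -- the test value equals the training mean
```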

Summary

Review

Finally, here is the complete code and the English infographic for review:

# -*- coding: utf-8 -*-
"""
Created on Thu Sep 13 20:33:51 2018

@author: wongz
"""
# 1.Importing the required libraries
import numpy as np
import pandas as pd

# 2.Importing the Dataset
dataset = pd.read_csv("https://raw.githubusercontent.com/Avik-Jain/100-Days-Of-ML-Code/master/datasets/Data.csv")
dataset.head()
# type(dataset.iloc[:,:-1])
# type(dataset.iloc[:,:-1].values)
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,3].values

# 3.Handling the missing data
# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; use SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
Z = imputer.fit(X[:,1:3])
Z.statistics_
X[:,1:3] = Z.transform(X[:,1:3])
# X[:,1:3] = imputer.fit_transform(X[:,1:3])
# print(X[:,1:3])

# 4.Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
print(X[:,0])
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
print(y)

# Creating a dummy variable
# OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer now selects which column(s) to encode
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("onehot", OneHotEncoder(), [0])], remainder="passthrough")
X = ct.fit_transform(X)

# 5.Splitting the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)

# 6.Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print(X_train);print(X_test)
(Figure: Day 1 EN.jpg)