100 Days of ML Code Guide - Day 1: Data Preprocessing
2018-09-13
TJH_KYC
Preface
- The story goes: Avik Jain launched a hundred-day machine-learning coding project on GitHub (100-Days-Of-ML-Code), and thanks to being simple, approachable, and systematic, it blew up almost overnight.
- The struggling student got hugely excited: "This was practically made for me!" He rushed off to study it, only to find that, short as the code is, plenty of it still made no sense.
- The star student glanced at the poor struggler and sighed: "You can't even follow this? Fine, let me walk you through it, bit by bit."
Preview
(Day 1.jpg: the Day 1 infographic, Chinese version)
The Step-by-Step Walkthrough
Step 1: Import the libraries
# 1.Importing the required libraries
import numpy as np
import pandas as pd
- Import the two standard libraries; the as clauses define the short aliases used throughout the rest of the code;
- NumPy is the extension package for numerical computing; pandas handles data manipulation;
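As a quick sanity check of what each alias provides (made-up values, just for illustration):

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3])        # NumPy: fast n-dimensional numerical arrays
s = pd.Series(a, name="ages")  # pandas: labeled data structures built on top of NumPy
print(a.mean())  # 2.0
print(s.sum())   # 6
```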
Step 2: Import the dataset
# 2.Importing the Dataset
dataset = pd.read_csv("https://raw.githubusercontent.com/Avik-Jain/100-Days-Of-ML-Code/master/datasets/Data.csv")
dataset.head()
# type(dataset.iloc[:,:-1])
# type(dataset.iloc[:,:-1].values)
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,3].values
- pd.read_csv() imports the dataset, from either a local path or a URL;
- dataset.head() shows the first 5 rows of the dataset;
- The two commented-out type checks show that dataset.iloc[:,:-1] is a DataFrame, while dataset.iloc[:,:-1].values is an ndarray;
- The values of all rows and every column except the last go into the matrix X; the values of all rows of the last column go into the vector y;
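The DataFrame-versus-ndarray distinction can be verified on a small stand-in for Data.csv; the column names below mirror the real dataset, but the values are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like Data.csv (Country, Age, Salary, Purchased).
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_df = dataset.iloc[:, :-1]      # still a DataFrame
X = dataset.iloc[:, :-1].values  # now an ndarray
y = dataset.iloc[:, 3].values    # ndarray of the last column

print(type(X_df).__name__)  # DataFrame
print(type(X).__name__)     # ndarray
print(X.shape, y.shape)     # (3, 3) (3,)
```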
Step 3: Handle the missing data
# 3.Handling the missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values="NaN",strategy="mean",axis=0)
Z = imputer.fit(X[:,1:3])
Z.statistics_
X[:,1:3] = Z.transform(X[:,1:3])
# X[:,1:3] = imputer.fit_transform(X[:,1:3])
# print(X[:,1:3])
- sklearn.preprocessing follows a four-step pattern, IIFT; in this example:
I for Importing: import the class, here Imputer;
I for Instantiate: instantiate the class, here Imputer becomes the instance imputer;
F for Fitting: fit the instance on data, which computes the needed statistics (here, column means);
T for Transforming: apply those statistics to transform the data;
- F and T can be combined into a single step; see the commented-out fit_transform line;
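One caveat for readers on current scikit-learn: Imputer was deprecated in 0.20 and removed in 0.22. The same IIFT pattern now goes through SimpleImputer in sklearn.impute; a minimal sketch with toy values (assuming scikit-learn >= 0.20):

```python
import numpy as np
from sklearn.impute import SimpleImputer  # successor to sklearn.preprocessing.Imputer

# Toy Age/Salary columns with one missing value each (invented numbers).
X = np.array([[44.0, 72000.0],
              [27.0, np.nan],
              [np.nan, 54000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")  # I + I
imputer.fit(X)              # F: compute the column means (35.5 and 63000.0 here)
print(imputer.statistics_)  # the fitted per-column means
X = imputer.transform(X)    # T: fill each NaN with its column mean
```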
Step 4: Encode categorical data and create dummy variables
# 4.Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
print(X[:,0])
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
print(y)
# Creating a dummy variable
onehotencoder = OneHotEncoder(categorical_features = [0])
type(X)
type(onehotencoder.fit_transform(X))
X = onehotencoder.fit_transform(X).toarray()
type(onehotencoder.fit_transform(X).toarray())
- LabelEncoder converts string categories into integer codes; OneHotEncoder then one-hot-encodes them into dummy variables;
- OneHotEncoder cannot encode string values directly, so the strings must first be mapped to integers with LabelEncoder;
- toarray() converts the sparse coo_matrix into an ndarray; the three type checks confirm this, returning ndarray, coo_matrix, and ndarray respectively;
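A second caveat: the categorical_features argument was later removed from OneHotEncoder. When the goal is simply dummy columns from one string column, pandas' get_dummies is a one-step alternative to the LabelEncoder + OneHotEncoder pair (toy values resembling the real Country column):

```python
import pandas as pd

countries = pd.Series(["France", "Spain", "Germany", "Spain"])

# One dummy column per category, no integer-encoding detour required.
dummies = pd.get_dummies(countries)
print(list(dummies.columns))  # ['France', 'Germany', 'Spain']
print(dummies.shape)          # (4, 3)
```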
Step 5: Split the dataset into training and test sets
# 5.Splitting the dataset into test set and training set
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
- Note the arguments inside train_test_split() and the order of the variables unpacked on its left: each training set comes before its test set;
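With synthetic data, the split and the resulting order and shapes look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features (synthetic)
y = np.arange(10)

# test_size=0.2 holds out 2 of the 10 samples; random_state pins the shuffle.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```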
Step 6: Feature scaling
# 6.Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print(X_train);print(X_test)
- This is the sklearn.preprocessing IIFT pattern once again;
- Why fit_transform on the training set but only transform on the test set: the scaler must learn its mean and standard deviation from the training data alone and then reuse those same statistics on the test data, otherwise information from the test set would leak into preprocessing;
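A minimal illustration of that rule, with invented one-feature data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.0], [5.0]])

sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)  # F+T: learn mean/std from the training data, then scale it
X_test_s = sc.transform(X_test)        # T only: reuse the training mean/std on the test data

print(sc.mean_)  # [2.5] -- computed from X_train alone, untouched by X_test
```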
Summary
- numpy, pandas, and sklearn are libraries (Library);
- sklearn.preprocessing and sklearn.model_selection are modules (Module) inside the sklearn library;
- Imputer(), LabelEncoder(), OneHotEncoder(), and StandardScaler() are classes (Class), while train_test_split() is a plain function;
Review
Finally, here are the complete code and the English infographic for review:
# -*- coding: utf-8 -*-
"""
Created on Thu Sep 13 20:33:51 2018
@author: wongz
"""
# 1.Importing the required libraries
import numpy as np
import pandas as pd
# 2.Importing the Dataset
dataset = pd.read_csv("https://raw.githubusercontent.com/Avik-Jain/100-Days-Of-ML-Code/master/datasets/Data.csv")
dataset.head()
# type(dataset.iloc[:,:-1])
# type(dataset.iloc[:,:-1].values)
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,3].values
# 3.Handling the missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values="NaN",strategy="mean",axis=0)
Z = imputer.fit(X[:,1:3])
Z.statistics_
X[:,1:3] = Z.transform(X[:,1:3])
# X[:,1:3] = imputer.fit_transform(X[:,1:3])
# print(X[:,1:3])
# 4.Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])
print(X[:,0])
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
print(y)
# Creating a dummy variable
onehotencoder = OneHotEncoder(categorical_features = [0])
type(X)
type(onehotencoder.fit_transform(X))
X = onehotencoder.fit_transform(X).toarray()
type(onehotencoder.fit_transform(X).toarray())
# 5.Splitting the dataset into test set and training set
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
# 6.Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print(X_train);print(X_test)
(Day 1 EN.jpg: the Day 1 infographic, English version)