Machine Learning Series (7): Evaluating Model Performance with Accuracy

2019-06-09 Ice_spring

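In this machine learning series I write my own version of each algorithm, modeled on how sklearn does it, to better understand both the principles behind the algorithm and sklearn's internal logic.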

1. Testing model performance

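In kNN, the entire training set is effectively the model: every training point takes part in predicting a new, unseen sample, and the resulting model goes straight into the real environment. But what if the model performs poorly? In many real settings the true labels are hard to obtain, so once the model is deployed it is difficult to measure how it is doing, let alone improve it. Do we just leave it to chance? Using all of the data for training and deploying the result directly is therefore not a sound approach.
A simple and effective improvement is to split the data into a training set and a test set: train the model on the training data, then use the test data to check how well it performs. This approach has problems of its own, which will come up later in the series.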

Splitting the iris dataset

We take 80% of the samples as the training set and 20% as the test set.

# Load the iris dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
iris = datasets.load_iris()
x = iris.data
y = iris.target

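The labels in y are stored in sorted order, so we cannot simply take the first 120 samples as the training set. The data has to be shuffled first.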

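# Train test split: divide the data into a training set and a test set
# Shuffle first
shuffle_indexes = np.random.permutation(len(x))  # random permutation of the 150 indexes

# Indexes for the test set and the training set
test_ratio = 0.2
test_size = int(len(x) * test_ratio)  # 30 samples
test_indexes = shuffle_indexes[:test_size]  # first 30 shuffled indexes
train_indexes = shuffle_indexes[test_size:]

# Fancy indexing to pull out the training set and the test set
x_train = x[train_indexes]
y_train = y[train_indexes]

x_test = x[test_indexes]
y_test = y[test_indexes]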
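
To see why the shuffle matters, here is a small sketch. It does not load the real dataset; the label array is a stand-in built with np.repeat, on the assumption that it mirrors how iris stores 50 samples per class in class order.

```python
import numpy as np

# Stand-in for iris.target: 50 samples of each class, stored in class order
y = np.repeat([0, 1, 2], 50)

# Naive split without shuffling: first 120 for training, last 30 for testing
naive_train, naive_test = y[:120], y[120:]
print(np.unique(naive_train, return_counts=True))
print(np.unique(naive_test))  # only class 2 is left for testing

# Shuffled split: the test indexes are drawn from the whole dataset
rng = np.random.default_rng(0)
shuffled = rng.permutation(len(y))
shuffled_test = y[shuffled[:30]]
print(np.unique(shuffled_test))
```

With the naive split, the test set says nothing about classes 0 and 1, and the training set sees only 20 samples of class 2, so the measured accuracy would not reflect the model's behaviour on the full problem.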

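Using our own implementation

To get more familiar with how sklearn organizes machine learning code, we write a function that imitates sklearn's train_test_split.
Create a new file, model_selection.py: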

# model_selection.py
import numpy as np

def train_test_split(x, y, test_ratio=0.2, seed=None):
    '''Split x and y into a training set and a test set.'''
    assert x.shape[0] == y.shape[0], "x and y must have the same number of samples"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be between 0 and 1"

    # "if seed:" would silently ignore seed=0, so compare against None
    if seed is not None:
        np.random.seed(seed)

    shuffle_indexes = np.random.permutation(len(x))  # random permutation of all indexes

    test_size = int(len(x) * test_ratio)

    test_indexes = shuffle_indexes[:test_size]  # first test_size shuffled indexes
    train_indexes = shuffle_indexes[test_size:]

    # Fancy indexing to pull out the training set and the test set
    x_train = x[train_indexes]
    y_train = y[train_indexes]

    x_test = x[test_indexes]
    y_test = y[test_indexes]

    return x_train, x_test, y_train, y_test
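
One thing to be aware of: np.random.seed changes NumPy's global random state, so it also affects any later random calls in the same session. A possible alternative, shown here only as a sketch and not as what sklearn does internally, is to draw the permutation from a local generator created with np.random.default_rng. The function name split_with_rng and the toy data are made up for this illustration.

```python
import numpy as np

def split_with_rng(x, y, test_ratio=0.2, seed=None):
    """Hypothetical variant of train_test_split that uses a local generator."""
    assert x.shape[0] == y.shape[0], "x and y must have the same number of samples"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be between 0 and 1"

    rng = np.random.default_rng(seed)  # seed=None gives a fresh random state
    shuffle_indexes = rng.permutation(len(x))
    test_size = int(len(x) * test_ratio)

    test_indexes = shuffle_indexes[:test_size]
    train_indexes = shuffle_indexes[test_size:]
    return x[train_indexes], x[test_indexes], y[train_indexes], y[test_indexes]

# Toy data: 10 samples with 2 features each, sample i has label i
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

a = split_with_rng(x, y, seed=42)
b = split_with_rng(x, y, seed=42)
print(a[0].shape, a[1].shape)       # (8, 2) (2, 2)
print(np.array_equal(a[1], b[1]))   # True: same seed, same split
```

Because the generator is local to the function, two calls with the same seed give the same split without touching the global state used elsewhere.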

Next, run this function to split the iris dataset into a training set and a test set:

# Use our own model_selection.py, which imitates the sklearn interface
from play_Ml.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y)

Then package the kNN code from part (6) of this series into the play_Ml module as well, and use our own code to test kNN's performance:

from play_Ml.kNN_sklearn import kNNClassifier
# Create a kNN classifier
knn_clf = kNNClassifier(k=3)
# Fit
knn_clf.fit(x_train, y_train)
# Predict
y_predict = knn_clf.predict(x_test)
sum(y_predict == y_test)  # out: 30
# Prediction accuracy
sum(y_predict == y_test) / len(y_test)  # out: 1.0

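On the 30 test samples, kNN predicted every label correctly, so it performs very well on this dataset. Because the split is shuffled randomly, not every run will reach 100% accuracy (pass a seed to make the result reproducible), but the accuracy stays close to 100%.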
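
The run-to-run variation can be made visible by repeating the split with several seeds and looking at the spread of the accuracies. The sketch below is self-contained and makes two assumptions: the data is a synthetic stand-in (three Gaussian blobs of 50 points each, not iris), and knn_predict is a compact helper written for this example rather than the kNNClassifier from this series.

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_test, k=3):
    """Plain kNN with Euclidean distance and majority vote."""
    predictions = []
    for point in x_test:
        distances = np.sqrt(np.sum((x_train - point) ** 2, axis=1))
        nearest = np.argsort(distances)[:k]
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Synthetic stand-in data: three Gaussian blobs of 50 points each
data_rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 3.0], [6.0, 0.0]])
x = np.vstack([data_rng.normal(c, 1.0, size=(50, 2)) for c in centers])
y = np.repeat([0, 1, 2], 50)

accuracies = []
for seed in range(10):
    # A different seed gives a different 120/30 split of the same data
    split_rng = np.random.default_rng(seed)
    indexes = split_rng.permutation(len(x))
    test_idx, train_idx = indexes[:30], indexes[30:]
    y_predict = knn_predict(x[train_idx], y[train_idx], x[test_idx], k=3)
    accuracies.append(np.mean(y_predict == y[test_idx]))

print("per-split accuracy:", np.round(accuracies, 3))
print("mean: %.3f, std: %.3f" % (np.mean(accuracies), np.std(accuracies)))
```

With only 30 test samples, a single misclassification moves the accuracy by about 3.3 percentage points, so some spread between splits is expected.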

Doing the same with sklearn

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)
# test_size plays the role of test_ratio in our own function.
# Our test_ratio defaults to 0.2; sklearn uses 0.25 when neither test_size nor train_size is given.
knn_clf = KNeighborsClassifier(n_neighbors=6)  # create a KNeighborsClassifier instance
knn_clf.fit(x_train, y_train)  # fit on the training set to get the model
y_predict = knn_clf.predict(x_test)  # predict
sum(y_predict == y_test) / len(y_test)  # out: 1.0

Our own code follows almost the same steps as the sklearn version, which also gives us a better feel for how sklearn structures its algorithms.

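2. The handwritten digits dataset

The full handwritten digits dataset has 5620 samples, but the copy shipped in sklearn's datasets contains only 1797. Each sample has 64 features, which are the pixel values of an 8×8 image.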

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Load the handwritten digits dataset
digits = datasets.load_digits()

The features of a sample determine the image it represents. We can draw one with matplotlib:

x = digits.data  # features
y = digits.target  # label for each sample, one of the digits 0-9
some_digit = x[666]
# Visualize some_digit
import matplotlib
some_digit_image = some_digit.reshape(8, 8)
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
plt.show()

(Figure: some_digit drawn as an 8×8 image)
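
The reshape step is what turns the 64 features back into a picture. As a small sketch, the stand-in vector below (the numbers 0 to 63, not a real digit) shows which feature ends up at which pixel under NumPy's default row-major order.

```python
import numpy as np

# Stand-in feature vector: values 0..63, so each pixel shows its own feature index
flat = np.arange(64)
image = flat.reshape(8, 8)

print(image[0])      # first row of the image: features 0 to 7
print(image[1, 0])   # 8: the second row starts at feature 8

# Feature i lands at row i // 8, column i % 8
i = 29
print(image[i // 8, i % 8] == flat[i])   # True

# ravel() goes back to the 64-feature layout
print(np.array_equal(image.ravel(), flat))   # True
```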

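Training and measuring accuracy on the digits dataset with our own kNN

Accuracy is one of the most commonly used evaluation metrics, so we add it to our own module: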

# metrics.py
import numpy as np

def accuracy_score(y_true, y_predict):
    '''Fraction of predictions in y_predict that match y_true.'''
    assert y_true.shape[0] == y_predict.shape[0], "y_true and y_predict must have the same size"
    return np.sum(y_true == y_predict) / len(y_true)
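
A small worked example makes the calculation concrete. The label arrays below are invented for illustration, and the function is restated here with np.mean, which gives the same value as dividing the number of matches by the number of samples.

```python
import numpy as np

def accuracy_score(y_true, y_predict):
    """Fraction of predictions that match the true labels."""
    assert y_true.shape[0] == y_predict.shape[0], "sizes must match"
    return np.mean(y_true == y_predict)

# Made-up labels: positions 2 and 6 are predicted wrongly
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_predict = np.array([0, 1, 1, 2, 1, 0, 2, 2])

matches = y_true == y_predict
print(matches.sum(), "of", len(y_true), "correct")   # 6 of 8 correct
print(accuracy_score(y_true, y_predict))             # 0.75
```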

Compute the accuracy with our own accuracy_score:

from play_Ml.model_selection import train_test_split
from play_Ml.kNN import kNNClassifier
x_train, x_test, y_train, y_test = train_test_split(x, y, test_ratio=0.2)
knn_clf = kNNClassifier(k=3)
knn_clf.fit(x_train, y_train)
y_predict = knn_clf.predict(x_test)  # predict on the new test set before scoring
from play_Ml.metrics import accuracy_score
accuracy_score(y_test, y_predict)  # out: 0.988
# From here on, logic that already lives in our own module is used directly and not redefined

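kNN reaches an accuracy of about 98.8% on the handwritten digits dataset as well.
Sometimes we do not care what the individual predictions are and only want the model's current accuracy, so that we can improve the model later. For that purpose, we add another method, score, to the classifier in kNN.py (internally, score simply calls accuracy_score):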

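# Inside kNNClassifier in kNN.py; accuracy_score must be imported there,
# e.g. "from .metrics import accuracy_score" (assuming metrics.py sits next to kNN.py in play_Ml)
def score(self, x_test, y_test):
    '''Return the model's accuracy on x_test and y_test.'''
    y_predict = self.predict(x_test)
    return accuracy_score(y_test, y_predict)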

Now the score method of the kNN classifier can be called directly:
knn_clf.score(x_test, y_test)  # out: 0.988
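
The same pattern (predict inside the method, then compare with the true labels) works for any classifier. The sketch below uses a made-up 1-nearest-neighbour class, NearestNeighborClassifier, and toy one-dimensional data, both invented for this illustration; it is not the kNNClassifier from this series.

```python
import numpy as np

class NearestNeighborClassifier:
    """Hypothetical 1-nearest-neighbour classifier, only to show the score() pattern."""

    def fit(self, x_train, y_train):
        self._x_train = x_train
        self._y_train = y_train
        return self

    def predict(self, x_test):
        # For each test point, copy the label of the closest training point
        diffs = x_test[:, None, :] - self._x_train[None, :, :]
        distances = np.linalg.norm(diffs, axis=2)
        return self._y_train[np.argmin(distances, axis=1)]

    def score(self, x_test, y_test):
        # Predict internally, then report the fraction of correct labels
        return np.mean(self.predict(x_test) == y_test)

x_train = np.array([[0.0], [1.0], [10.0], [11.0]])
y_train = np.array([0, 0, 1, 1])
x_test = np.array([[0.5], [10.5], [2.0]])
y_test = np.array([0, 1, 1])

clf = NearestNeighborClassifier().fit(x_train, y_train)
print(clf.predict(x_test))         # [0 1 0]
print(clf.score(x_test, y_test))   # 2 of 3 correct
```

The caller never has to hold y_predict; it passes the test features and labels and gets a single number back.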

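Training and measuring accuracy with sklearn

sklearn also provides an accuracy_score function. The accuracy_score written in the previous subsection imitates it, and the sklearn version is used like this: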

# Using accuracy_score from sklearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(x_train, y_train)
y_predict = knn_clf.predict(x_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)  # out: 0.988

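sklearn's kNN classifier likewise provides a score method for when we only care about the accuracy and not about y_predict itself:
knn_clf.score(x_test, y_test)
The accuracy it returns is 0.988.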