Section 2-5 k-Nearest Neighbors: Testing and Using the Algorithm | Improving a Dating Site

2018-08-10, by 努力奋斗的durian

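Original article. Last updated: 2018-08-10

This section covers:
the testing and usage steps of project case 1 (improving matches on a dating site), ending with the dating-site prediction function.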

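1. KNN project case introduction:

Project case 1:

Improving matches on a dating site

Project overview:

1) Helen uses a dating site to look for dates. After a while she finds that the people she has dated fall into three types: people she did not like, people of average charm, and people of great charm.
2) She wants to: 1. date people of average charm on weekdays; 2. date people of great charm on weekends; 3. rule out people she did not like right away. She has now collected some data that the dating site does not record, which helps sort potential matches into these classes.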

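Development workflow:

Collect data: a text file is provided
Prepare data: parse the text file with Python
Analyze data: draw 2D scatter plots with Matplotlib
Train the algorithm: this step does not apply to k-nearest neighbors
Test the algorithm: use part of the data Helen provided as test samples.
Test samples are held out from the labeled data: each one is classified against the remaining samples, and if the predicted class differs from its known label, it is counted as an error.
Use the algorithm: build a simple command-line program in which Helen enters feature values to judge whether a person is a type she likes.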

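About the data set

Helen stores the data on these dates in the text file datingTestSet2.txt (the data comes from chapter 2, k-nearest neighbors, of Machine Learning in Action), 1000 rows in total.

The data has three features: frequent flyer miles earned per year, percentage of time spent playing video games, and liters of ice cream consumed per week. The class label is the fourth column and takes only the values 3, 2 and 1. The format of datingTestSet2.txt is as follows: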

Flyer miles	Game time (%)	Ice cream (liters)	Class
40920	8.326976	0.953952	3
14488	7.153469	1.673904	2
26052	1.441871	0.805124	1
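
The rows above can be read into a feature matrix and a label list. The article's code uses file2matrix from the kNN module for this, which is not shown in this section. The following is only a minimal sketch of such a parser, assuming tab-separated fields with the class label in the fourth column; the name parse_dating_lines and the inline sample rows are hypothetical and used for illustration.

```python
import numpy as np


def parse_dating_lines(lines):
    """Split tab-separated records into an (n, 3) feature matrix and a label list.

    Hypothetical helper for illustration; it is not the kNN module's file2matrix.
    """
    features = []
    labels = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        features.append([float(value) for value in fields[:3]])
        labels.append(int(fields[3]))
    return np.array(features), labels


# The three sample rows shown in the table above.
sample = [
    "40920\t8.326976\t0.953952\t3",
    "14488\t7.153469\t1.673904\t2",
    "26052\t1.441871\t0.805124\t1",
]
sample_features, sample_labels = parse_dating_lines(sample)
print(sample_features.shape)  # (3, 3)
print(sample_labels)          # [3, 2, 1]
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly instead of the sample list.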


2. Testing the algorithm: code

Testing the classification accuracy of kNN

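def datingClassTest():
    """
    Hold-out test of the classifier on the dating-site data.
    Prints each prediction, then the error rate and the error count.
    """
    # Fraction of the data held out for testing (the remaining 1 - hoRatio serves as training samples)
    hoRatio = 0.1
    # Load the data from the file; features and labels are returned separately
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    # Normalize the features; returns the normalized matrix, the range and the minimum of each feature
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # m is the number of rows, i.e. the first dimension of the matrix
    m = normMat.shape[0]
    # Number of test samples; rows numTestVecs to m-1 are the training samples
    numTestVecs = int(m * hoRatio)
    print("numTestVecs=", numTestVecs)
    errorCount = 0.0
    # Classify every test sample and count the misclassifications
    for i in range(numTestVecs):
        # Classify test row i against the training rows
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m], datingLabels[numTestVecs:m], 3)
        # Show the predicted class and the actual label
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        # Count a mismatch as an error
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    # Show the error rate and the number of errors
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)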
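
datingClassTest relies on two helpers from the kNN module that this section does not show: autoNorm and classify0. Judging from the call sites, autoNorm returns the normalized matrix together with each feature's range and minimum, and classify0 returns the majority label among the k nearest training rows. The following is a minimal sketch under those assumptions, not the module's actual code; auto_norm, the Euclidean distance and the toy data are illustrative choices.

```python
from collections import Counter

import numpy as np


def auto_norm(data):
    """Rescale every column to [0, 1]; return (normalized, ranges, min_vals)."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    return (data - min_vals) / ranges, ranges, min_vals


def classify0(in_x, data, labels, k):
    """Return the majority label among the k rows of data closest to in_x."""
    distances = np.sqrt(((data - in_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Tiny made-up data set: two features on very different scales.
toy_data = np.array([
    [10000.0, 1.0],
    [12000.0, 1.2],
    [11000.0, 0.9],
    [40000.0, 9.0],
    [42000.0, 8.5],
    [41000.0, 9.5],
])
toy_labels = [1, 1, 1, 3, 3, 3]

norm_data, ranges, min_vals = auto_norm(toy_data)
query = (np.array([39000.0, 8.0]) - min_vals) / ranges
result = classify0(query, norm_data, toy_labels, 3)
print(result)  # 3
```

Without normalization the flyer-miles column would dominate the distance, because its values are thousands of times larger than the other features. Note that this sketch divides by zero if a column is constant.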

The test code and its output:

>>> import kNN
>>> kNN.datingClassTest()

numTestVecs= 100
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
...(output omitted)...
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the total error rate is: 0.080000
8.0
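
The reported rate is the number of errors divided by the number of test samples: 8 / 100 = 0.08. datingClassTest takes the first 10% of the rows as the test set, which gives a fair estimate only if the rows are in no particular order. The following sketch shuffles the row indices before splitting. The name holdout_error_rate, the seed, the stand-in classifier and the toy data are all assumptions for illustration; the classifier is passed in as a callable with the same argument order as classify0.

```python
import numpy as np


def holdout_error_rate(features, labels, classify, ho_ratio=0.1, k=3, seed=0):
    """Estimate the error rate on a randomly chosen hold-out fraction of the rows."""
    labels = np.asarray(labels)
    order = np.random.default_rng(seed).permutation(len(labels))
    num_test = int(len(labels) * ho_ratio)
    test_idx, train_idx = order[:num_test], order[num_test:]
    errors = 0
    for i in test_idx:
        predicted = classify(features[i], features[train_idx], labels[train_idx], k)
        if predicted != labels[i]:
            errors += 1
    return errors / num_test


def nearest_vote(in_x, data, labels, k):
    """Stand-in classifier: majority label among the k nearest rows."""
    nearest = np.linalg.norm(data - in_x, axis=1).argsort()[:k]
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[counts.argmax()]


# Made-up, well-separated clusters for illustration.
rng = np.random.default_rng(42)
cluster_a = rng.normal(loc=0.0, scale=0.05, size=(20, 2))
cluster_b = rng.normal(loc=1.0, scale=0.05, size=(20, 2))
toy_features = np.vstack([cluster_a, cluster_b])
toy_labels = [1] * 20 + [2] * 20

toy_rate = holdout_error_rate(toy_features, toy_labels, nearest_vote, ho_ratio=0.25)
print(toy_rate)
```

Changing the seed gives a different split, so running the test with several seeds shows how much the error estimate depends on which rows happen to be held out.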

3. Using the algorithm: dating-site prediction function code

# Complete dating-site prediction function
from numpy import array

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    # The original code uses raw_input (Python 2); in Python 3 use input instead
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Feature order must match the file's columns: miles, game time, ice cream
    inArr = array([ffMiles, percentTats, iceCream])
    # Normalize the input with the training minimum and range before classifying
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    # classifierResult - 1 because the labels are 1, 2, 3 while resultList is indexed 0, 1, 2
    print("You will probably like this person: ", resultList[classifierResult - 1])
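
classifyPerson reads its input from the keyboard, which makes it awkward to reuse or to test. The following sketch performs the same steps in a plain function that takes the three values as arguments. The step to note is that the new sample is scaled with the training data's minimum and range, exactly like the training rows. The name predict_person and the toy rows are hypothetical; the article's function loads datingTestSet2.txt through file2matrix instead.

```python
from collections import Counter

import numpy as np

RESULT_LIST = ["not at all", "in small doses", "in large doses"]


def predict_person(ff_miles, percent_games, ice_cream, data, labels, k=3):
    """Return the description for the majority label among the k nearest rows."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    norm_data = (data - min_vals) / ranges
    # Scale the new sample with the training minimum and range.
    # Feature order must match the columns: miles, game time, ice cream.
    in_arr = (np.array([ff_miles, percent_games, ice_cream]) - min_vals) / ranges
    distances = np.sqrt(((norm_data - in_arr) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    label = Counter(labels[i] for i in nearest).most_common(1)[0][0]
    # Labels are 1, 2, 3; list indices are 0, 1, 2.
    return RESULT_LIST[label - 1]


# Made-up rows for illustration, not taken from datingTestSet2.txt.
toy_data = np.array([
    [1000.0, 0.5, 0.2],
    [2000.0, 1.0, 0.4],
    [1500.0, 0.8, 0.1],
    [40000.0, 9.0, 1.0],
    [42000.0, 8.0, 0.9],
    [45000.0, 10.0, 1.2],
])
toy_labels = [1, 1, 1, 3, 3, 3]

answer = predict_person(41000, 9.0, 1.0, toy_data, toy_labels)
print(answer)  # in large doses
```

Keeping the input prompts in a thin wrapper around a function like this leaves the classification logic free of keyboard input.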