KNN (k-Nearest Neighbors)

2020-02-15  ggr

Idea: if two samples are sufficiently similar, they probably belong to the same class.

How It Works

Let's start with a simple example:
Suppose we have 10 samples. The first array, raw_data_x, records each patient's blood pressure and blood test readings, and the second array, raw_data_y, records whether the patient's disease has reached a late stage (0 means not late stage, 1 means late stage):

raw_data_x = [
    [3.393953321, 2.331273301],
    [3.110073482, 1.781519638],
    [1.343800831, 3.368369547],
    [3.543243008, 4.686943958],
    [2.847390282, 2.867942324],
    [7.432434368, 4.683435453],
    [5.745051997, 3.533989803],
    [9.172168622, 2.511101045],
    [7.792783481, 3.424088941],
    [7.939820817, 0.791637231]
]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

Plotted, the data looks like this (blue points are not late stage, red points are late stage):

(figure: scatter plot of the 10 training samples)

Now suppose a new patient with the same disease arrives and we know their readings: [3.390000766, 2.1829382999]. To predict whether this patient is likely late stage, we can plot the new point on the same chart:

(figure: the same scatter plot with the new patient's point added)

Looking at the chart, blue points are in the majority among the few points closest to the new patient, so our first guess is that this patient is not late stage.

The Math

To quantify this with math, we need to find the few points closest to the new sample and use their labels to predict its label. The standard way to measure the distance between two points is the Euclidean distance.
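For two points a = (a_1, a_2, ..., a_n) and b = (b_1, b_2, ..., b_n) in n-dimensional space, the Euclidean distance is:

d(a, b) = \sqrt{\sum_{i=1}^{n}(a_i - b_i)^2}

In our two-dimensional example this is just \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}.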


Implementation

Now that we know how to compute the Euclidean distance in n-dimensional space, we can code up a small demo:

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from math import sqrt

"""KNN算法实现过程, 目前取距离最近的前3个点进行预测"""
if __name__ == "__main__":
    raw_data_x = [
        [3.393953321, 2.331273301],
        [3.110073482, 1.781519638],
        [1.343800831, 3.368369547],
        [3.543243008, 4.686943958],
        [2.847390282, 2.867942324],
        [7.432434368, 4.683435453],
        [5.745051997, 3.533989803],
        [9.172168622, 2.511101045],
        [7.792783481, 3.424088941],
        [7.939820817, 0.791637231]
    ]
    raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

    x_train = np.array(raw_data_x)
    y_train = np.array(raw_data_y)
    A = np.array([3.390000766, 2.1829382999])
    distances = [sqrt(np.sum((x - A)**2)) for x in x_train]  # Euclidean distance from A to every training sample
    nearest = np.argsort(distances)                           # indices of training samples, sorted by distance
    k = 3
    topK_y = [y_train[i] for i in nearest[:k]]                # labels of the k nearest neighbors

    votes = Counter(topK_y)              # count votes per label
    result = votes.most_common(1)[0][0]  # label with the most votes
    result = "late stage" if result == 1 else "not late stage"
    print("New patient: {}".format(result))

Going Further

In practice, Python already has several excellent libraries that implement KNN for us. For stability and accuracy we generally rely on well-established machine learning libraries such as scikit-learn.
Official site: scikit-learn (sklearn): http://scikit-learn.org
If you would rather not read English, a Chinese translation of the scikit-learn API documentation is easy to find online.
Below is how to implement KNN with scikit-learn's built-in datasets and estimators:

import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

if __name__ == "__main__":
    # load dataset
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # train_test_split
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=666)

    # normalization
    from sklearn.preprocessing import StandardScaler
    standardScaler = StandardScaler()
    standardScaler.fit(X_train)
    X_train = standardScaler.transform(X_train)
    X_test_standard = standardScaler.transform(X_test)

    # use KNeighborsClassifier && GridSearch
    from sklearn.model_selection import GridSearchCV
    param_grid = [
        {
            "weights": ["uniform"],
            "n_neighbors": [i for i in range(1, 11)]
        },
        {
            "weights": ["distance"],
            "n_neighbors": [i for i in range(1, 11)],
            "p": [i for i in range(1, 6)]
        },
    ]
    knn_clf = KNeighborsClassifier()

    """
    n_jobs : Number of jobs to run in parallel
    verbose : Controls the verbosity: the higher, the more messages
    iid: If True, return the average score across folds, weighted by the number of samples in each test set.  
    cv:Determines the cross-validation splitting strategy
    """
    grid_search = GridSearchCV(estimator=knn_clf, param_grid=param_grid, n_jobs=-1, verbose=2, iid=True, cv=5)

    grid_search.fit(X_train, y_train)
    knn_clf = grid_search.best_estimator_
    print(knn_clf.score(X_test_standard, y_test))

Explanation: we load scikit-learn's built-in iris dataset and split it, 70% of the data for training the model and 30% for checking its accuracy (train_test_split shuffles the data before splitting). Next, so that every feature counts equally when distances are computed, we apply mean-variance standardization to the data. Finally, we use grid search to let the model pick the best hyperparameters automatically, and evaluate the best model on the test set.
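For comparison, here is the plainest possible scikit-learn usage without the grid search, a minimal sketch that fixes k by hand on the same iris split:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=666)

# scale using statistics learned from the training set only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# k is fixed by hand here; the grid search above picks it automatically
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_std, y_train)
print(knn.score(X_test_std, y_test))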

Don't worry if the jargon above does not all make sense yet; each of these terms is explained below.

Setting the code aside for a moment, let's think through a few of these ideas.

Min-max normalization (normalization): maps every value into the range [0, 1]. Each feature X is mapped to X_{scale}, where:

X_{scale} = (X - X_{min}) / (X_{max} - X_{min})
Here is the effect of min-max normalization on the iris dataset used above:

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

if __name__ == "__main__":
    # load dataset
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # train_test_split
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=666)

    from sklearn.preprocessing import MinMaxScaler
    minMaxScaler = MinMaxScaler()
    minMaxScaler.fit(X_train)
    X_train = minMaxScaler.transform(X_train)
    print(X_train.shape)
    plt.scatter(X_train[:, 0], X_train[:, 1])  # plot the first two (scaled) features
    plt.show()
(figure: scatter plot of the min-max scaled features, all values falling inside [0, 1])

Every value has been mapped into the [0, 1] range.
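MinMaxScaler does nothing more than apply the formula above to each feature column; here is a quick sketch verifying that by hand with numpy:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X, _ = load_iris(return_X_y=True)

# X_scale = (X - X_min) / (X_max - X_min), computed per feature column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# MinMaxScaler fitted and applied to the same data gives the same result
scaled = MinMaxScaler().fit_transform(X)
print(np.allclose(manual, scaled))  # True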

Mean-variance standardization (standardization): maps every value into a distribution with mean 0 and variance 1. Each feature X is mapped to X_{scale}, where:

X_{scale} = (X - mean(X)) / std(X)

that is, subtract the mean of the feature and divide by its standard deviation.

import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

if __name__ == "__main__":
    # load dataset
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    # train_test_split
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=666)

    # normalization
    from sklearn.preprocessing import StandardScaler
    standardScaler = StandardScaler()
    standardScaler.fit(X_train)
    X_train = standardScaler.transform(X_train)
    plt.scatter(X_train[:, 0], X_train[:, 1])
    plt.show()

(figure: scatter plot of the standardized features, centered around 0)
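Likewise, StandardScaler is just the formula above applied per feature; a quick numpy check:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# X_scale = (X - mean(X)) / std(X), computed per feature column
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler fitted and applied to the same data gives the same result
scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))  # True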

In day-to-day work we can sometimes pick a reasonable hyperparameter by hand based on the problem, but in most cases we rely on grid search.
Grid search: grid search is exactly the tool for the hyperparameter problem. The basic idea is to take every combination of the candidate values for each parameter, train one model per combination, and keep the model with the highest accuracy. Here is how it looks in practice:

    # use KNeighborsClassifier && GridSearch
    from sklearn.model_selection import GridSearchCV
    param_grid = [
        {
            "weights": ["uniform"],
            "n_neighbors": [i for i in range(1, 11)]
        },
        {
            "weights": ["distance"],
            "n_neighbors": [i for i in range(1, 11)],
            "p": [i for i in range(1, 6)]
        },
    ]
    knn_clf = KNeighborsClassifier()

    """
    n_jobs : Number of jobs to run in parallel
    verbose : Controls the verbosity: the higher, the more messages
    iid: If True, return the average score across folds, weighted by the number of samples in each test set.  
    cv:Determines the cross-validation splitting strategy
    """
    grid_search = GridSearchCV(estimator=knn_clf, param_grid=param_grid, n_jobs=-1, verbose=2, iid=True, cv=5)

    grid_search.fit(X_train, y_train)
    print(grid_search.best_estimator_)
    knn_clf = grid_search.best_estimator_

The console output looks like this:

...
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:    1.7s finished
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=9, p=2,
                     weights='uniform')
0.9777777777777777

As you can see, the combination n_neighbors=9, p=2, weights='uniform' produced the best model, reaching an accuracy of roughly 97.78% on the test set.
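Besides best_estimator_, GridSearchCV also exposes the winning parameter combination and its cross-validation score directly, which is handy for logging:

# the parameter combination that won the search
print(grid_search.best_params_)   # e.g. {'n_neighbors': 9, 'p': 2, 'weights': 'uniform'}
# the best mean cross-validation accuracy found during the search
print(grid_search.best_score_)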


Wrapping Up

Let's close by summarizing the strengths and weaknesses of KNN:

Pros: the idea is simple and easy to implement, there is no explicit training phase, and it naturally handles multi-class problems.
Cons: prediction is slow because distances to every training sample must be computed, the result is sensitive to feature scaling (hence the normalization above), and accuracy degrades in high-dimensional spaces.
