机器学习（五）：聚类

2019-07-11 本文已影响4人 fromeast

一、基本原理

聚类(clustering)是一种无监督学习(unsupervised learning)，即没有标记信息，通过对无标记训练样本的学习发现数据的内在性质和规律。
对于样本集 $D=\left\{\boldsymbol{x}_{1}, \boldsymbol{x}_{2}, \ldots, \boldsymbol{x}_{m}\right\}$ ，聚类算法需要将样本划分为 $k$ 个互不相交的簇 $\mathcal{C}=\left\{C_{1},C_{2}, \ldots, C_{k} \right\}$ . $k-means$ 算法针对聚类所得簇最小化平方误差： $E=\sum_{i=1}^{k} \sum_{\boldsymbol{x} \in C_{i}}\left\|\boldsymbol{x}-\boldsymbol{\mu}_{i}\right\|_{2}^{2}$ 其中 $\boldsymbol{\mu}_{i}=\frac{1}{\left|C_{i}\right|} \sum_{\boldsymbol{x} \in C_{i}} \boldsymbol{x}$ 为簇均值向量。E在一定程度上刻画了簇内样本围绕簇均值向量的紧密程度，E越小相似度越高。
简单来说， $k-means$ 算法主要分为两个步骤：
第一步：簇分配，随机选 $k$ 个点作为中心，计算到这 $k$ 个点的距离，分为 $k$ 个簇。
第二步：移动聚类中心：重新计算每个簇的中心，移动中心，重复以上步骤。

k-means算法

二、算法实现

$k-means$ 算法过程比较简单明晰，以下为其实现过程：

所需要的库

import numpy as np
import scipy.io as spio
import imageio
import matplotlib.pyplot as plt

主体过程，返回聚类中心的位置和各点的归属

def KMeans(X,init_centroids,max_iters,plot_process):
    m,n = X.shape
    K = init_centroids.shape[0]
    centriods = init_centroids
    pre_centroids = centriods
    
    for i in range(max_iters):
        if plot_process:
            ax = plotProcess(X,centriods,pre_centroids)
            pre_centroids = centriods
        idx = calcDistance(X,centriods)
        centriods = calcCentroids(X,idx,K)
    if plot_process:
        ax.show()
    return centriods,idx

计算各点到聚类中心的位置，并据此将其分属不同类

def calcDistance(X,init_centroids):
    m = X.shape[0]
    K = init_centroids.shape[0]
    dis = np.zeros((m,K))
    idx = np.zeros((m,1))
    
    for i in range(m):
        for j in range(K):
            dis[i,j] = np.dot((X[i,:]-init_centroids[j,:]).reshape(1,-1),(X[i,:]-init_centroids[j,:]).reshape(-1,1))
    
    dummy, idx = np.where(dis==np.min(dis,axis=1).reshape(-1,1))
    return idx[0:m]

计算聚类中心，即均值向量

def calcCentroids(X,idx,K):
    n = X.shape[1]
    centroids = np.zeros((K,n))
    for i in range(K):
        centroids[i,:] = np.mean(X[np.ravel(idx==i),:],axis=0).reshape(1,-1)
        
    return centroids

辅助函数，绘图及初始化聚类中心位置

def plotProcess(X,centriods,pre_centroids):
    plt.scatter(X[:,0],X[:,1])
    plt.plot(pre_centroids[:,0],pre_centroids[:,1],'r*',markersize=10,linewidth=5)
    plt.plot(centriods[:,0],centriods[:,1],'r*',markersize=10,linewidth=5)
    for j in range(centriods.shape[0]):
        p1 = centriods[j,:]
        p2 = pre_centroids[j,:]
        plt.plot([p1[0],p2[0]],[p1[1],p2[1]],'->',linewidth=2)
    return plt

def initCentroids(X,K):
    m = X.shape[0]
    m_arr = np.arange(0,m)
    centroids = np.zeros((K,X.shape[1]))
    np.random.shuffle(m_arr)
    rand_indices = m_arr[:K]
    centroids = X[rand_indices,:]
    return centroids

主函数

def main():
    data = spio.loadmat('data.mat')
    X = data['X']
    K = 3 
   # init_centroids = np.array([[3,3],[0,2],[7,5]])
    init_centroids = initCentroids(X,K)
    max_iters = 10
    KMeans(X,init_centroids,max_iters,True)
if __name__=='__main__':
    main()

下图为聚类的结果，形象展示了聚类中心及其移动过程。

聚类中心及其移动过程

聚类算法可用于图片的压缩，将图片的像素分为若干类，然后用这个类代替原来的像素值，以下为其实现过程及结果：

def imageCompression():
    img_data = imageio.imread('bird.png')
    img_data = img_data/255.0
    img_size =img_data.shape
    X = img_data.reshape(img_size[0]*img_size[1],3)
    
    K = 10
    max_iters = 10
    init_centroids = initCentroids(X,K)
    centroids,idx = KMeans(X,init_centroids,max_iters,False)
    idx = calcDistance(X,centroids)
    X_recovered = centroids[idx,:]
    X_recovered = X_recovered.reshape(img_size[0],img_size[1],3)
    
    plt.subplot(1,2,1)
    plt.imshow(img_data)
    plt.subplot(1,2,2)
    plt.imshow(X_recovered)
    plt.show()
    
if __name__=='__main__':
    imageCompression()

原图及压缩后的结果

另外可以通过mglearn库展示聚类过程和分类边界。

import mglearn

mglearn.plots.plot_kmeans_algorithm()
mglearn.plots.plot_kmeans_boundaries()

聚类过程及分类边界

三、问题探讨

3.1、性能度量

聚类性能度量大致有两类：一类是外部指标，就是将聚类结果与某个“参考模型”比较；另一类是内部指标，即直接考察聚类结果而不利用任何参考模型。
外部指标：

Jaccard系数（Jaccard Coefficient, JC）
$J C=\frac{a}{a+b+c}$
FM指数（Fowlkes and Mallows Index, FMI）
$\mathrm{FMI}=\sqrt{\frac{a}{a+b} \cdot \frac{a}{a+c}}$
Rand指数（Rand Index, RI）
$\mathrm{RI}=\frac{2(a+d)}{m(m-1)}$
上述指标的结果值均在[0,1]之间，值越大越好。
其中
$a=|S S|, \quad S S=\left\{\left(x_{i}, x_{j}\right) | \lambda_{i}=\lambda_{j}, \lambda_{i}^{*}=\lambda_{j}^{*}, i<j\right) \}$
$b=|S D|, \quad S D=\left\{\left(x_{i}, x_{j}\right) | \lambda_{i}=\lambda_{j}, \lambda_{i}^{*} \neq \lambda_{j}^{*}, i<j\right) \}$
$c=|D S|, \quad D S=\left\{\left(x_{i}, x_{j}\right) | \lambda_{i} \neq \lambda_{j}, \lambda_{i}^{*}=\lambda_{j}^{*}, i<j\right) \}$
$d=|D D|, \quad D D=\left\{\left(x_{i}, x_{j}\right) | \lambda_{i} \neq \lambda_{j}, \lambda_{i}^{*} \neq \lambda_{j}^{*}, i<j\right) \}$
$\mathcal{C}^{*}=\left\{C_{1}^{*}, C_{2}^{*}, \ldots, C_{s}^{*}\right\}$ 表示参考模型。

内部指标：

DB指数（Davies-Bouldin Index, DBI）
$\mathrm{DBI}=\frac{1}{k} \sum_{i=1}^{k} \max _{j \neq i}\left(\frac{\operatorname{avg}\left(C_{i}\right)+\operatorname{avg}\left(C_{j}\right)}{d_{\operatorname{cen}}\left(\boldsymbol{\mu}_{i}, \boldsymbol{\mu}_{j}\right)}\right)$
Dunn指数（Dunn Index, DI）
$\mathrm{DI}=\min _{1 \leqslant i \leqslant k}\left\{\min _{j \neq i}\left(\frac{d_{\min }\left(C_{i}, C_{j}\right)}{\max _{1 \leqslant l \leqslant k} \operatorname{diam}\left(C_{l}\right)}\right)\right\}$
DBI的值越小越好，DI值越大越好。
其中
$\operatorname{avg}(C)=\frac{2}{|C|(|C|-1)} \sum_{1 \leqslant i<j \leqslant|C|} \operatorname{dist}\left(x_{i}, x_{j}\right)$
$\operatorname{diam}(C)=\max _{1 \leqslant i<j \leq|C|} \operatorname{dist}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)$
$d_{\min }\left(C_{i}, C_{j}\right)=\min _{\boldsymbol{x}_{i} \in C_{i, \boldsymbol{x}} \in C_{j}} \operatorname{dist}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)$
$d_{\mathrm{cen}}\left(C_{i}, C_{j}\right)=\operatorname{dist}\left(\boldsymbol{\mu}_{i}, \boldsymbol{\mu}_{j}\right)$

3.2、距离度量

距离度量 $\operatorname{dist}(\cdot, \cdot)$ 是机器学习中一个非常常用的概念，反映了两点之间的相似程度，具有非负性、同一性、对称性和三角不等式性质。最常用的表示方法是闵可夫斯基距离(Minkowski distance)
$\operatorname{dist}_{\mathrm{mk}}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)=\left(\sum_{u=1}^{n}\left|x_{i u}-x_{j u}\right|^{p}\right)^{\frac{1}{p}}$

当 $p=2$ 时，即欧氏距离(Euclidean distance)
$\operatorname{dist}_{\mathrm{ed}}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)=\left\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\right\|_{2}=\sqrt{\sum_{u=1}^{n}\left|x_{i u}-x_{j u}\right|^{2}}$

当 $p=1$ 时，即曼哈顿距离(Manhattan distance)
$\operatorname{dist}_{\operatorname{man}}\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)=\left\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\right\|_{1}=\sum_{u=1}^{n}\left|x_{i u}-x_{j u}\right|$

另外，聚类算法还包括密度聚类（如DBSCAN）、层次聚类（如AGNES）等，在此不予详述。

参考资料

[1] https://github.com/lawlite19/MachineLearning_Python
[2] 周志华著. 机器学习. 北京:清华大学出版社,2016
[3] 李航著. 统计学习方法. 北京:清华大学出版社,2012
[4] 史春奇等著. 机器学习算法背后的理论与优化. 北京:清华大学出版社,2019
[5] Peter Harrington 著. 李锐等译. 机器学习实战. 北京:人民邮电出版社,2013

人生如逆旅，我亦是行人 ——苏轼《临江仙·送钱穆父》