K-Means: Summary and Python Implementation

2017-04-13, by 榴莲酥君

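In clustering, we are given a training set $\{x^{(1)}, \ldots, x^{(m)}\}$ (sadly, Jianshu's Markdown does not render math formulas o(≧口≦)o; see my other blog for a rendered version) and want to group these inputs into a number of clusters. Each $x^{(i)} \in \mathbb{R}^n$, but no sample comes with a label $y^{(i)}$, so this is an unsupervised learning problem.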

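The main idea of K-Means is to find k centroids and assign each sample to the cluster of its nearest centroid, so that all samples end up partitioned into k clusters (see Wikipedia for a detailed introduction to K-Means). The basic algorithm is as follows: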

Pseudocode of K-Means

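Step one: randomly initialize the positions of the k centroids.

Step two is an iterative loop:

1. For each sample $x^{(i)}$, find the centroid closest to it and assign the sample to that centroid's cluster (the $j$-th cluster in the pseudocode).
2. Once every sample has been assigned, recompute each centroid as the mean of all samples currently in its cluster.

Repeat sub-steps 1 and 2 until convergence, i.e. until the centroids no longer change between iterations.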
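
In symbols, the two sub-steps of the loop are commonly written as below. The notation is an assumption that follows the usual textbook presentation rather than anything stated in this post: $c^{(i)}$ denotes the index of the cluster assigned to sample $x^{(i)}$, $\mu_j$ denotes the $j$-th centroid, and $1\{\cdot\}$ is the indicator function.

$$
c^{(i)} := \arg\min_{j} \left\| x^{(i)} - \mu_j \right\|^2
$$

$$
\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}
$$

Neither update can increase the within-cluster sum of squared distances $J(c, \mu) = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$, which is why the loop eventually stops changing the centroids.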

A simple Python implementation of the K-Means algorithm:

__author__ = 'bin'
# reference: https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

import numpy as np
import random
import matplotlib.pyplot as plt


# Lloyd's algorithm
# inner loop step 1
def cluster_points(X, mu):
    clusters = {}  # maps center index -> list of points assigned to that center

    for x in X:
        # bestmukey (an int) is the index of the center nearest to x:
        # enumerate(mu) yields (index, center) pairs, we pair each index with
        # its distance to x, and min(..., key=lambda t: t[1]) picks the pair
        # with the smallest distance (it selects the minimum, it does not sort).
        bestmukey = min([(i, np.linalg.norm(x - m)) for i, m in enumerate(mu)],
                        key=lambda t: t[1])[0]

        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters


# inner loop step 2, (update the mu)
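def reevaluate_centers(mu, clusters):
    newmu = []
    for k in range(len(mu)):
        if k in clusters:
            print(len(clusters[k]))  # debug output: size of cluster k
            # new center = mean of all points assigned to cluster k
            newmu.append(np.mean(clusters[k], axis=0))
        else:
            # empty cluster: keep the old center so the number of centers stays K
            newmu.append(mu[k])

    return newmu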


def has_converged(mu, oldmu):
    # NumPy arrays are not hashable, so convert each center to a tuple;
    # comparing the two sets ignores the order of the centers.
    return set(tuple(a) for a in mu) == set(tuple(a) for a in oldmu)


def find_centers(X, K):
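    # Initialize to K random centers drawn from the samples.
    # random.sample() expects a sequence, so convert the NumPy array to a list of rows.
    # oldmu is a second, independent draw that only serves to enter the loop.
    oldmu = random.sample(list(X), K)
    mu = random.sample(list(X), K)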

    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Reevaluate centers (update the centers)
        mu = reevaluate_centers(oldmu, clusters)
    return (mu, clusters)


# The initial configuration of points for the algorithm is created as follows:
def init_board(N):
    # random.uniform:
    # Draw samples from a uniform distribution
    X = np.array([(random.uniform(-1, 1), random.uniform(-1, 1)) for i in range(N)])

    return X


# The following routine constructs a specified number of Gaussian distributed clusters with random variances:
def init_board_gauss(N, k):
    n = float(N) / k
    X = []
    for i in range(k):
        c = (random.uniform(-1, 1), random.uniform(-1, 1))
        s = random.uniform(0.05, 0.5)
        x = []
        while len(x) < n:
            a, b = np.array([np.random.normal(c[0], s), np.random.normal(c[1], s)])
            # Continue drawing points from the distribution in the range [-1,1]
            if abs(a) < 1 and abs(b) < 1:
                x.append([a, b])
        X.extend(x)
    X = np.array(X)[:N]
    return X


if __name__ == "__main__":
    X = init_board(100)
    K = 4
    mu, clusters = find_centers(X, K)

    x = []
    y = []
    for i in range(K):
        lx = []
        ly = []
        for l0 in clusters.get(i, []):  # an empty cluster has no key in the dict
            lx.append(l0[0])
            ly.append(l0[1])
        x.append(lx)
        y.append(ly)

    for i in range(K):
        plt.plot(x[i], y[i], 'o')
        plt.plot(mu[i][0], mu[i][1], 's', markersize=10)

    plt.show()

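The program assumes $k=4$. As the plot shows, the samples drawn from a uniform distribution are split into four clusters once the algorithm converges.

Result of the run

Two major issues with K-Means are readily apparent:

Choosing $k$: how to decide how many clusters to use
How to initialize the $k$ centroids

Both topics will be covered in a follow-up summary; whenever K-Means comes up in an interview, these two questions are practically unavoidable.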
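
Until then, here is a rough illustration of the second question. One widely used seeding scheme is k-means++, which is not what the listing above does (it draws the starting centroids uniformly at random). The sketch below is a minimal version under stated assumptions: the function name `kmeans_pp_init` and its `rng` parameter are hypothetical, and `X` is assumed to be an `(m, n)` NumPy array of samples.

```python
import numpy as np


def kmeans_pp_init(X, k, rng=None):
    """Pick k initial centroids from the rows of X using k-means++ seeding.

    Hypothetical helper for illustration; rng may be None, an int seed,
    or a numpy.random.Generator.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    m = X.shape[0]

    # First centroid: one sample chosen uniformly at random.
    centers = [X[rng.integers(m)]]
    # d2[i] = squared distance from sample i to its nearest chosen centroid.
    d2 = np.sum((X - centers[0]) ** 2, axis=1)

    for _ in range(1, k):
        total = d2.sum()
        if total == 0:
            # Every sample coincides with a chosen centroid: fall back to uniform.
            probs = np.full(m, 1.0 / m)
        else:
            # Samples far from all current centroids are more likely to be picked.
            probs = d2 / total
        idx = rng.choice(m, p=probs)
        centers.append(X[idx])
        # Update nearest-centroid distances with the newly added centroid.
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))

    return np.array(centers)
```

The returned array could be used in place of the `random.sample` draw for `mu` in `find_centers`. Because a sample that is already a centroid has zero selection probability, the seeds are distinct whenever the data contains at least k distinct points.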

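In addition, the pros and cons of K-Means can be summarized briefly as follows:

Pros:

Converges quickly

Cons:

$k$ has to be tuned to a suitable value
Sensitive to outliers; not robust
Requires that a mean be defined for the samples
Only guaranteed to reach a local optimum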
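
The last point can be softened in practice by running the algorithm from several random starts and keeping the run with the smallest within-cluster sum of squared distances (SSE). The sketch below illustrates that idea only and is not the implementation from this post: `kmeans_once`, `kmeans_restarts`, `n_init` and `max_iter` are hypothetical names, and the code is a compact vectorized variant of Lloyd's algorithm.

```python
import numpy as np


def kmeans_once(X, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from a random start.

    Returns (centers, labels, sse). Hypothetical helper for illustration.
    """
    # Start from k distinct samples chosen at random.
    centers = X[rng.choice(X.shape[0], size=k, replace=False)]
    for _ in range(max_iter):
        # Squared distance from every sample to every centroid, shape (m, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    # Final assignment and objective value for the returned centroids.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = d2[np.arange(X.shape[0]), labels].sum()
    return centers, labels, sse


def kmeans_restarts(X, k, n_init=10, seed=None):
    """Run kmeans_once n_init times and return the run with the lowest SSE."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])
```

A small case shows why restarts matter: for the four corners of a 10 by 1 rectangle with k = 2, a start that picks both points on the same short side converges to a split with SSE 100, while the best split has SSE 1. Restarts only raise the chance of finding the better optimum; they do not guarantee the global one.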

Rant: Jianshu's Markdown does not support LaTeX, so how am I supposed to typeset math formulas? o(≧口≦)o

Reference

Clustering With K-Means in Python, https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/

k-means clustering (Wikipedia)
