Implementing the bisecting k-means algorithm with PySpark

2021-07-27  米斯特芳

bisecting k-means

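Bisecting k-means is a variant of KMeans built on repeated bisection. It starts with a single cluster that holds every point and splits it into 2 clusters (minimizing the sum of squared errors), then keeps splitting every divisible cluster into 2. If splitting all divisible clusters in one iteration would produce more than K clusters, the clusters with more samples are split first, so that exactly K clusters remain.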
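The splitting procedure described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not Spark's implementation: the helper names (`sq_dist`, `mean`, `two_means`, `bisecting_kmeans`), the deterministic seeding of each 2-means split, and the toy points are all made up for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def two_means(points, max_iter=20):
    """Split one cluster into two with plain 2-means (Lloyd iterations)."""
    # Assumed seeding (hypothetical choice): the point farthest from the
    # cluster mean, then the point farthest from that one.
    center = mean(points)
    c0 = max(points, key=lambda p: sq_dist(p, center))
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    left, right = [], []
    for _ in range(max_iter):
        new_left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        new_right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        # Stop on an empty side or when the assignment no longer changes.
        if not new_left or not new_right or (new_left, new_right) == (left, right):
            break
        left, right = new_left, new_right
        c0, c1 = mean(left), mean(right)
    return left, right


def bisecting_kmeans(points, k):
    """Return at most k clusters obtained by repeated bisection."""
    leaves = [list(points)]  # start with one cluster holding all points
    while len(leaves) < k:
        # A cluster is divisible if it contains at least two distinct points.
        divisible = [c for c in leaves if len(set(c)) > 1]
        if not divisible:
            break
        # Larger clusters get priority when not every cluster may be split.
        divisible.sort(key=len, reverse=True)
        chosen = divisible[: k - len(leaves)]
        chosen_ids = {id(c) for c in chosen}
        next_leaves = [c for c in leaves if id(c) not in chosen_ids]
        for cluster in chosen:
            next_leaves.extend(two_means(cluster))
        leaves = next_leaves
    return leaves


# Toy data: three well-separated groups of sizes 4, 3 and 2.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 0), (11, 0), (10, 1),
          (30, 0), (31, 0)]

for cluster in bisecting_kmeans(points, 3):
    print(sorted(cluster))
```

On this toy data the first split separates the two rightmost points from the rest, and the second split goes to the larger remaining cluster, which is the "larger clusters first" rule in action.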

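Differences from KMeans

KMeans is very sensitive to the choice of initial centroids and may converge to a local optimum, whereas bisecting KMeans is far less affected by this. Neither algorithm is suited to non-spherical clusters. Bisecting KMeans is a poorer fit when K is large, because every split is then made only inside its own sub-cluster.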
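To make the sensitivity to initial centroids concrete, here is a small plain-Python sketch of ordinary k-means (Lloyd iterations) run twice on the same four points with two different hand-picked starting centroids. The function name `lloyd`, the points, and the starting centroids are assumptions chosen for illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def lloyd(points, centers, max_iter=100):
    """Plain k-means from the given starting centers; returns (centers, SSE)."""
    centers = list(centers)
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


# Four corners of a 4 x 1 rectangle: the natural split is left vs. right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Starting between the two sides gets stuck on a bottom/top split.
_, bad_sse = lloyd(points, [(2, 0), (2, 1)])
# Starting near each side finds the left/right split.
_, good_sse = lloyd(points, [(0, 0.5), (4, 0.5)])

print("SSE from the poor start:", bad_sse)
print("SSE from the good start:", good_sse)
```

Both runs converge, but the first one stops at a local optimum with a much larger sum of squared errors, which is the behaviour the paragraph above warns about.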

Other clustering algorithms

Gaussian mixture/Power iteration clustering (PIC)/Latent Dirichlet allocation (LDA)/Streaming k-means

from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("BisectingKMeansExample")\
    .getOrCreate()
# libsvm format: each line starts with the label, followed by space-separated index:value pairs,
# e.g. label 1:first_feature 2:second_feature ...
dataset = spark.read.format("libsvm").load("sample_kmeans_data.txt")  # read with the libsvm data source

# Trains a bisecting k-means model.
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dataset)

# Make predictions
predictions = model.transform(dataset)

# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()

silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))

# Shows the result.
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)
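
For intuition about the Silhouette score printed by the example, the following plain-Python sketch computes the textbook silhouette with squared Euclidean distance as the dissimilarity. It is an assumption-laden illustration, not Spark's distributed implementation: the function name `silhouette`, the convention of scoring singleton clusters as 0, and the toy data are all chosen for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def silhouette(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean dissimilarity to the other points of the same cluster
        # (the point's distance to itself contributes 0 to the sum).
        a = sum(sq_dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean dissimilarity to the points of any other cluster.
        b = min(
            sum(sq_dist(p, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # left pair vs. right pair
poor = silhouette(points, [0, 1, 0, 1])  # bottom pair vs. top pair

print("silhouette, sensible labels:", good)
print("silhouette, poor labels:", poor)
```

Values close to 1 mean points sit much nearer to their own cluster than to any other, while values near or below 0 suggest the assignment is questionable.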
