Implementing the bisecting k-means algorithm with PySpark
2021-07-27
米斯特芳
bisecting k-means
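A variant of KMeans built on repeated bisection. All points start in a single cluster, which is split into 2 clusters (each split minimizes the sum of squared errors). Every divisible cluster is then split into 2 again, level by level, until there are K clusters or nothing is left to split. If splitting every divisible cluster in one iteration would produce more than K clusters, the clusters with more samples get priority, which guarantees that only K clusters remain.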
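The splitting procedure described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Spark's implementation: it splits one cluster at a time (always the largest one that still contains at least two distinct points) instead of bisecting a whole level at once, and the helper names `two_means` and `bisecting_kmeans` are made up for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def two_means(points, rng, iters=20):
    """Split one cluster into two with plain Lloyd iterations (k = 2)."""
    # Start from two distinct points so that both halves are non-empty.
    c0, c1 = rng.sample(sorted(set(points)), 2)
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        new_c0, new_c1 = mean(left), mean(right)
        if (new_c0, new_c1) == (c0, c1):
            break  # assignments are stable
        c0, c1 = new_c0, new_c1
    return left, right


def bisecting_kmeans(points, k, seed=1):
    """Hypothetical helper: bisect clusters until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]  # everything starts in one cluster
    while len(clusters) < k:
        # A cluster is divisible if it has at least two distinct points.
        divisible = [c for c in clusters if len(set(c)) > 1]
        if not divisible:
            break
        # Simplification: larger clusters are split first.
        target = max(divisible, key=len)
        clusters.remove(target)
        clusters.extend(two_means(target, rng))
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
for cluster in bisecting_kmeans(points, k=3):
    print(len(cluster), mean(cluster))
```

Points are passed as tuples so that duplicates can be detected with `set`. The loop stops early when no cluster can be split any further, in which case fewer than K clusters are returned.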
Differences from KMeans
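KMeans is very sensitive to the choice of initial centers and may converge to a local optimum. Bisecting KMeans is much less affected by this, although each two-way split is itself a small k-means run, so it is not entirely immune. Neither method suits non-spherical clusters. Bisecting KMeans is also a poor fit when K is large: every later split happens inside an existing sub-cluster, so points are never reassigned across sub-clusters and an early bad split cannot be repaired.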
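The sensitivity of plain KMeans to its starting centers can be seen on a tiny example. The sketch below is plain Python, not Spark code, and `lloyd` is a hypothetical helper written for this illustration. It runs standard Lloyd iterations on the four corners of a 4 x 1 rectangle from two different initializations.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centers, iters=10):
    """Plain k-means (Lloyd) iterations from the given initial centers."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center.
            idx = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[idx].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


# Four corners of a wide rectangle: the natural clusters are left and right.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good_centers, good_sse = lloyd(points, [(0.0, 0.0), (4.0, 0.0)])
bad_centers, bad_sse = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])

print(good_sse)  # 1.0  (left pair vs. right pair)
print(bad_sse)   # 16.0 (bottom pair vs. top pair, a stable local optimum)
```

Both runs have converged, yet the second one is stuck with a sum of squared errors 16 times larger, purely because of where the centers started.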
Other clustering algorithms
Gaussian mixture/Power iteration clustering (PIC)/Latent Dirichlet allocation (LDA)/Streaming k-means
from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("BisectingKMeansExample")\
    .getOrCreate()
# libsvm format: each line starts with the label, followed by space-separated index:value pairs, e.g. label 1:first_feature 2:second_feature ...
dataset = spark.read.format("libsvm").load("sample_kmeans_data.txt")  # read with the libsvm data source
# Trains a bisecting k-means model.
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dataset)
# Make predictions
predictions = model.transform(dataset)
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
# Shows the result.
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)
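To make the libsvm layout mentioned in the loading comment more concrete, here is a minimal hand-written parser for a single line. It is only an illustration: `parse_libsvm_line` is a hypothetical helper, not part of Spark, and the sample line is made up rather than taken from `sample_kmeans_data.txt`.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:value idx:value ...' into (label, {idx: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # Indices in libsvm files are 1-based.
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("0 1:0.1 2:0.1 3:0.1")
print(label, features)  # 0.0 {1: 0.1, 2: 0.1, 3: 0.1}
```

Features that are absent from a line are implicitly zero, which is why the format is convenient for sparse data.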