Evaluating Clustering Algorithms
2019-01-16
dreampai
1. Evaluating clustering with ground truth
Adjusted Rand index (ARI)


from sklearn.metrics.cluster import adjusted_rand_score
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans,AgglomerativeClustering,DBSCAN
import numpy as np
import matplotlib.pyplot as plt
import mglearn
X,y=make_moons(n_samples=200,noise=0.05,random_state=0)
# Rescale the data to zero mean and unit variance
scaler=StandardScaler()
scaler.fit(X)
X_scaled=scaler.transform(X)
fig,axes=plt.subplots(1,4,figsize=(15,3),subplot_kw={'xticks':(),'yticks':()})
# Clustering algorithms to compare
algorithms=[KMeans(n_clusters=2),AgglomerativeClustering(n_clusters=2),DBSCAN()]
# Create a random cluster assignment as a reference
random_state=np.random.RandomState(seed=0)
random_clusters=random_state.randint(low=0,high=2,size=len(X))
axes[0].scatter(X_scaled[:,0],X_scaled[:,1],c=random_clusters,cmap=mglearn.cm3,s=60)
axes[0].set_title('Random assignment - ARI:{:.2f}'.format(adjusted_rand_score(y,random_clusters)))
for ax,algorithm in zip(axes[1:],algorithms):
    # Fit each algorithm, plot its cluster assignment, and show its ARI in the title
    clusters=algorithm.fit_predict(X_scaled)
    ax.scatter(X_scaled[:,0],X_scaled[:,1],c=clusters,cmap=mglearn.cm3,s=60)
    ax.set_title('{} - ARI:{:.2f}'.format(algorithm.__class__.__name__,adjusted_rand_score(y,clusters)))
plt.show()
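
One reason to score clusterings with ARI rather than plain accuracy is that cluster labels are arbitrary: only the grouping of points matters, not which integer each group gets. The toy label lists below are made up for illustration and are not part of the moons example.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Two labelings that describe the same partition with the label names swapped
# (illustrative values, not taken from the example above)
truth = [0, 0, 1, 1]
renamed = [1, 1, 0, 0]

# Accuracy compares labels position by position, so it reports total disagreement
acc = accuracy_score(truth, renamed)

# ARI compares which points are grouped together, so it reports perfect agreement
ari = adjusted_rand_score(truth, renamed)

print('Accuracy: {:.2f}'.format(acc))
print('ARI: {:.2f}'.format(ari))
```

ARI is also adjusted for chance, so a random assignment such as `random_clusters` above should land close to 0.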

2. Evaluating clustering without ground truth
Silhouette coefficient
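The silhouette coefficient measures how compact and how well separated the clusters are. It increases as the quality of the clustering improves: clusters that are dense and far from one another score high, while large, overlapping clusters score low. The coefficient is computed per sample, and the score for a dataset is the mean over all samples. For a single sample it is s = (b - a) / max(a, b), where a is the mean distance from the sample to the other samples in its own cluster and b is the mean distance from the sample to the samples in the nearest other cluster.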
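
To make the per-sample definition concrete, here is a minimal sketch that computes s = (b - a) / max(a, b) by hand and compares it with scikit-learn's `silhouette_samples` and `silhouette_score`. The helper name `silhouette_by_hand` and the four toy points are assumptions made for illustration. The sketch uses Euclidean distance and assumes at least two clusters.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

def silhouette_by_hand(X, labels):
    """Per-sample silhouette values (hypothetical helper, Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = pairwise_distances(X)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # do not count the sample's distance to itself
        if not same.any():
            continue  # a cluster with a single sample keeps the score 0
        # a: mean distance to the other samples in the same cluster
        a = dist[i, same].mean()
        # b: smallest mean distance to the samples of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy data (illustrative): two tight pairs of points that are far apart
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels_toy = np.array([0, 0, 1, 1])

by_hand = silhouette_by_hand(X_toy, labels_toy)
print(by_hand)
print(silhouette_samples(X_toy, labels_toy))
print(by_hand.mean(), silhouette_score(X_toy, labels_toy))
```

The mean of the per-sample values is what `silhouette_score` reports, which is the number shown in the plot titles of the script that follows.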

from sklearn.metrics.cluster import silhouette_score
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans,AgglomerativeClustering,DBSCAN
import numpy as np
import matplotlib.pyplot as plt
import mglearn
X,y=make_moons(n_samples=200,noise=0.05,random_state=0)
# Rescale the data to zero mean and unit variance
scaler=StandardScaler()
scaler.fit(X)
X_scaled=scaler.transform(X)
fig,axes=plt.subplots(1,4,figsize=(15,3),subplot_kw={'xticks':(),'yticks':()})
# Clustering algorithms to compare
algorithms=[KMeans(n_clusters=2),AgglomerativeClustering(n_clusters=2),DBSCAN()]
# Create a random cluster assignment as a reference
random_state=np.random.RandomState(seed=0)
random_clusters=random_state.randint(low=0,high=2,size=len(X))
axes[0].scatter(X_scaled[:,0],X_scaled[:,1],c=random_clusters,cmap=mglearn.cm3,s=60)
axes[0].set_title('Random assignment:{:.2f}'.format(silhouette_score(X_scaled,random_clusters)))
for ax,algorithm in zip(axes[1:],algorithms):
    # Fit each algorithm, plot its cluster assignment, and show its silhouette score in the title
    clusters=algorithm.fit_predict(X_scaled)
    ax.scatter(X_scaled[:,0],X_scaled[:,1],c=clusters,cmap=mglearn.cm3,s=60)
    ax.set_title('{}:{:.2f}'.format(algorithm.__class__.__name__,silhouette_score(X_scaled,clusters)))
plt.show()

This post is kept as personal study notes. If it infringes on any rights, contact me and I will remove it.