
Unsupervised Learning: Cluster Analysis ①

2018-01-11  柳叶刀与小鼠标

Cluster Analysis

Introduction

Steps

If a variable chosen for the analysis has a much larger range than the others, it will also have the largest influence on the results, which is usually undesirable. The most common way to rescale the data is to standardize each variable to a mean of 0 and a standard deviation of 1. Alternatives include dividing each variable by its maximum value, or subtracting the variable's mean and dividing by its median absolute deviation (which is what R's mad() function computes). The three approaches are shown below:
df1 <- apply(mydata, 2, function(x) {(x - mean(x)) / sd(x)})   # standardize to mean 0, sd 1
df2 <- apply(mydata, 2, function(x) {x / max(x)})              # divide by the column maximum
df3 <- apply(mydata, 2, function(x) {(x - mean(x)) / mad(x)})  # center, then scale by the MAD
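For the first method, base R's scale() function produces the same standardization (a quick equivalence check, assuming mydata is an all-numeric data frame):

df1_alt <- scale(mydata)                       # centers each column to mean 0, scales to sd 1
all.equal(as.vector(df1), as.vector(df1_alt))  # TRUE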

Computing Distances


> setwd("E:\\Rwork")
> data(nutrient, package = "flexclust")
> head(nutrient, 4)
             energy protein fat calcium iron
BEEF BRAISED    340      20  28       9  2.6
HAMBURGER       245      21  17       9  2.7
BEEF ROAST      420      15  39       7  2.0
BEEF STEAK      375      19  32       9  2.6
> d <- dist(nutrient)
> as.matrix(d)[1:4,1:4]
             BEEF BRAISED HAMBURGER BEEF ROAST BEEF STEAK
BEEF BRAISED      0.00000   95.6400   80.93429   35.24202
HAMBURGER        95.64000    0.0000  176.49218  130.87784
BEEF ROAST       80.93429  176.4922    0.00000   45.76418
BEEF STEAK       35.24202  130.8778   45.76418    0.00000
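dist() computes Euclidean distances by default, so the first off-diagonal entry above can be checked by hand (a minimal sanity check using the rows printed earlier):

> sqrt(sum((nutrient[1, ] - nutrient[2, ])^2))
[1] 95.64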

Hierarchical Cluster Analysis

As noted earlier, in hierarchical clustering each instance or observation starts out as its own cluster. The two closest clusters are then merged into a new cluster, repeatedly, until all observations belong to a single cluster. The algorithm is as follows (a minimal sketch in R follows the list):
(1) Define each observation (row, or case) as a cluster;

(2) Compute the distances between every cluster and every other cluster;

(3) Merge the two clusters with the smallest distance into a single cluster, reducing the number of clusters by one;

(4) Repeat steps (2) and (3) until a single cluster containing all observations remains.
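As a sketch of these four steps, the loop below implements naive average-linkage agglomeration; naive_agglom is a name made up for illustration, and in practice you would use hclust(), shown next:

naive_agglom <- function(x) {
  d <- as.matrix(dist(x))               # pairwise distances between rows
  clusters <- as.list(seq_len(nrow(d))) # step (1): each observation is its own cluster
  merges <- list()
  while (length(clusters) > 1) {        # step (4): repeat until one cluster remains
    k <- length(clusters); best <- c(1, 2); best_d <- Inf
    for (i in 1:(k - 1)) {
      for (j in (i + 1):k) {            # step (2): distance between every pair of clusters
        dij <- mean(d[clusters[[i]], clusters[[j]]])   # average linkage
        if (dij < best_d) { best_d <- dij; best <- c(i, j) }
      }
    }
    merged <- c(clusters[[best[1]]], clusters[[best[2]]])
    merges[[length(merges) + 1]] <- merged             # step (3): merge the closest pair
    clusters <- c(clusters[-best], list(merged))
  }
  merges  # the sequence of merges, from the first (closest pair) to the last
}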

Hierarchical clustering can be done with the hclust() function. The format is hclust(d, method=), where d is a distance matrix produced by the dist() function, and the available methods include "single", "complete", "average", "centroid", and "ward" (called "ward.D" or "ward.D2" in current versions of R).


> setwd("E:\\Rwork")
> data(nutrient, package = "flexclust")
> row.names(nutrient) <- tolower(row.names(nutrient))
> nutrient.scaled <- scale(nutrient)
> d <- dist(nutrient.scaled)
> fit.average <- hclust(d, method = "average")
> plot(fit.average)
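For comparison, dendrograms from the other linkage methods can be fitted the same way (a brief sketch; fit.single and fit.complete are illustrative names):

> fit.single <- hclust(d, method = "single")
> fit.complete <- hclust(d, method = "complete")
> par(mfrow = c(1, 2))
> plot(fit.single, hang = -1, cex = .6, main = "Single Linkage")
> plot(fit.complete, hang = -1, cex = .6, main = "Complete Linkage")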

library(NbClust)
devAskNewPage(ask = TRUE)
nc <- NbClust(nutrient.scaled, distance = "euclidean", 
              min.nc = 2, max.nc = 15, method = "average")
barplot(table(nc$Best.n[1,]),
        xlab = "number of cluster", ylab = "number of criteria",
        main = "number of cluster chosen by 26 cruteria")
table(nc$Best.n[1,])

 0  1  2  3  4  5  9 10 13 14 15 
 2  1  4  4  2  4  1  1  2  1  4 
******************************************************************* 
* Among all indices:                                                
* 4 proposed 2 as the best number of clusters 
* 4 proposed 3 as the best number of clusters 
* 2 proposed 4 as the best number of clusters 
* 4 proposed 5 as the best number of clusters 
* 1 proposed 9 as the best number of clusters 
* 1 proposed 10 as the best number of clusters 
* 2 proposed 13 as the best number of clusters 
* 1 proposed 14 as the best number of clusters 
* 4 proposed 15 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
Although the majority rule favors two clusters, the 3-, 5-, and 15-cluster solutions each also received four votes, so a five-cluster solution is examined here. The cutree() function cuts the dendrogram into the requested number of clusters:

clusters <- cutree(fit.average, k = 5)
table(clusters)
plot(fit.average, hang = -1, cex = .8,
     main = "Average Linkage Clustering\n5 Cluster Solution")
rect.hclust(fit.average, k = 5)
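To interpret the clusters, a common next step is to summarize each one on the original (unscaled) variables, for example with median profiles:

aggregate(nutrient, by = list(cluster = clusters), median)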