【论文理解】OPTICS: Ordering Points to
4. Identifying The Clustering Structure
确认聚类结构
The OPTICS algorithm generates the augmented cluster-ordering consisting of the ordering of the points, the reachability-values and the core-values. However, for the following interactive and automatic analysis techniques only the ordering and the reachability-values are needed. To simplify the notation, we specify them formally
augmented [ɔːɡˈmentɪd]:增加的
notation:符号
OPTICS算法生成一个增广簇序,包括点的排序,可达距离与核心距离。然而,接下来的交互式自动分析技术(interactive and automatic analysis techniques),仅需用到点排序与可达距离。因此,为了简化符号,我们正式地指定他们。
Definition 7: (results of the OPTICS algorithm)
Let DB be a database containing n points. The OPTICS algorithm generates an ordering of the points o:{1..n} → DB and corresponding reachability-values r:{1..n} → R≥0.
定义7:(OPTICS算法的输出结果)
OPTICS算法按序生成:点o:{1..n} → DB 与对应的可达距离r:{1..n} → R≥0
4.3 Automatic Techniques
In this section, we will look at how to analyze the cluster-ordering automatically and generate a hierarchical clustering structure from it. The general idea of our algorithm is to identify potential start-of-cluster and end-of-cluster regions first, and then to combine matching regions into (nested) clusters.
在本节中,我们将了解如何自动分析cluster-ordering,并从中生成分层的簇结构。该算法的基本思想是,先识别潜在的簇首和簇尾区域,然后将匹配区域组合成(嵌套)cluster。
4.3.1 Concepts And Formal Definition Of A Cluster
Cluster的概念和形式定义
In order to identify the clusters contained in the database, we need a notion of “clusters” based on the results of the OPTICS algorithm. As we have seen above, the reachability value of a point corresponds to the distance of this point to the set of its predecessors. From this (and the way OPTICS chooses the order) we know that clusters are dents in the reachability-plot.
dents[dents]:凹坑
为了识别数据集中的簇,我们需要根据OPTICS的结果定义“簇”。根据上文我们可以得到,可达距离表示,该点与前面点集合的距离。据此(再加上OPTICS选择顺序的方式),我们可以知道clusters是可达图的凹部。
For example, in figure 16 we see a cluster starting at point #3 and ending at point #16. Note that point #3, which is the last point with a high reachability value, is part of the cluster, because its high reachability indicates that it is far away from the points #1 and #2. It has to be close to point #4, however, because point #4 has a low reachability value, indicating that it is close to one of the points #1, #2 or #3. Because of the way OPTICS chooses the next point in the cluster-ordering, it has to be close to #3 (if it were close to point #1 or point #2 it would have been assigned index 3 and not index 4). A similar argument holds for point #16, which is the last point with a low reachability value, and therefore is also part of the cluster.
Figure16举个例子,Figure16中一个簇的簇首是点3,簇尾是点16。点3有很高的可达距离,是高可达距离点的最后一个,这说明它离点1与点2很远。点4有较低的可达距离,说明它离点1或2或3很近。根据OPTICS选择下一个点进入cluster-ordering的规则,可说明,点4的可达距离是距离点3的。(如果点4离点1或点2更近,点4应该在第三位而非第四位)。同样可得,点16是最后一个可达距离较低的点,因此也是簇的一部分。
Clusters in real-world data sets do not always start and end with extremely steep points. For example, in figure 17 we see three clusters that are very different. The first one starts with a very steep point A and ends with a very steep point B, whereas the second one ends with a number of not-quite-so-steep steps in point D.
真实数据集中的簇并不总是以非常陡峭的点开始和结束。举个例子,figure17有三个不同的簇。第一个簇首、簇尾分别是是非常陡的A点、B点;而第二个的簇尾是不太陡的D点。
Figure17To capture these different degrees of steepness, we need to introduce a parameter ξ:
Definition 9: (ξ-steep points)
A point p ∈ {1...n – 1} is called a ξ-steep upward point iff it is ξ% lower than its successor:
UpPointξ(p) ⇔ r(p) ≤ r(p + 1) × (1 – ξ)
A point p ∈ {1...n – 1} is called a ξ-steep downward point iff p’s successor is ξ% lower:
DownPoint (p)⇔r(p)×(1–ξ)≥r(p+1)
iff:当且仅当
successor:后继者
为了捕捉这些不同的陡度,我们需要引入一个参数ξ:
定义9:(ξ-陡峭点)
ξ-向上陡峭点:当且仅当它比它的后继点至少低ξ%。 UpPointξ(p) ⇔ r(p) ≤ r(p + 1) × (1 – ξ)
ξ-向下陡峭点:当且仅当它的后继点比他至少低ξ%。DownPoint (p)⇔r(p)×(1–ξ)≥r(p+1)
Looking closely at figure 17, we see that all clusters start and end in a number of upward points, most of, but not all are ξ- steep. These ‘regions’ start with ξ-steep points, followed by a few points where the reachability values level off, followed by more ξ-steep points. We will call such a region a steep area. More precisely, a ξ-steep upward area starts at a ξ-steep upward point, ends at a ξ-steep upward point and always keeps on going upward or level. Also, it must not contain too many consecutive non-steep points. Because of the core condition used in OP- TICS, a natural choice for “not too many” is “fewer than MinPts”, because MinPts consecutive non-steep points can be considered as a separate cluster and should not be part of a steep area. Our last requirement is that ξ-steep areas be as large as possible, leading to the following definition:
consecutive [kən'sekjʊtɪv]:连贯的
观察图17,我们发现所有的簇都开始且结束于一群向上的点,大部分是ξ-陡峭点,但并非全部。这些区域,始于ξ-陡峭点,接下来是一些趋于平稳的点,接着又是一些ξ-陡峭点。我们将这种区域称为陡峭地区。更精确的说,一个向上陡峭区域始于ξ-向上陡峭点,结束于ξ-向上陡峭点,并且始终保持向上或平缓趋势。同时,它不能包含太多连续的非陡峭点。在OPTICS算法中, “not too many” 也就意味着“fewer than MinPts”。因为,MinPts个连续的非陡峭点可被认为是一个独立的簇,那他们就不应该成为陡峭区域。我们希望ξ-陡峭区域可以尽可能的大,于是有了下面的定义:
Definition 10: (ξ-steep areas)
An interval I = [s,e] is called a ξ-steep upward area UpAreaξ(I) iff it satisfies the following conditions:
- s is a ξ-steep upward point: UpPointξ(s) - e is a ξ-steep upward point: UpPointξ(e)
- each point between s and e is at least as high as its predecessor:
∀x, s < x ≤ e : r(x) ≥ r(x – 1)
- I does not contain more than MinPts consecutive points that are not ξ-steep upward:
∀[s, e] ⊆ [s, e] : ((∀x ∈ [s, e] :
¬UpPointξ(x)) ⇒ e – s < MinPts) -Iismaximal:∀J:(I⊆J,UpArea (J)⇒I = J)
A ξ-steep downward area is defined analogously (DownAreaξ(I) ).
interval ['ɪntəv(ə)l]:n.间隔
analogously [ə'næləgəsli] :adv. 类似地
区间 I = [s,e] 被认为是ξ-陡峭上升区域UpAreaξ(I),当且仅当它满足下面的条件:
1. s是一个ξ-向上陡峭点:UpPointξ(s);e是一个ξ-向上陡峭点:UpPointξ(s)
2. [s,e]间的每个点,至少高于或等于它的前一个点。∀x, s < x ≤ e : r(x) ≥ r(x – 1)
3. I中不存在超过MinPts个平缓的非ξ-向上陡峭点。
4. I是最大区间。
类似可定义ξ-陡峭下降区域DownAreaξ(I)。
The first and the second cluster in figure 17 start at the beginning of steep areas (points A and C) and end at the end of other steep areas (points B and D), while the third cluster ends at point F in the middle of a steep area extending up to point G. Given these definitions, we can formalize the notion of a cluster. The following definition of a ξ-cluster consists of 4 conditions, each will be explained in detail. Recall that the first point of a cluster (called the start of the cluster) is the last point with a high reachability value, and the last point of a cluster (called the end of the cluster) is the last point with a low reachability value.
formalize['fɔːm(ə)laɪz] :使正式
图17中,第一个与第二个簇开始于陡峭区域的开始(点A与点C),结束于另一个陡峭区域(点B与点D)。第三个簇结束于点F,它是一个陡峭区域的中间,该区域结束于G。考虑到这些定义,我们可以将簇的概念形式化。下面对 ξ-cluster 的定义包含四个条件,每个都将仔细解释。第一个簇的首点(称之为簇的开始)是高可达距离点中的最后一个点,簇的末点(称之为簇的结束)是低可达距离中的最后一个点。
Definition 11定义11:(ξ-cluster)
区间 C = [s, e] 被称为 ξ-cluster,当且仅当满足下面四个条件:
1. DownAreaξ(D) ∧ s ∈ D
2. UpAreaξ(U) ∧ e ∈ U
3. a) e – s ≥ MinPts
3. b) ∀x, sD < x < eU : (r(x) ≤ min(r(sD), r(eU)) × (1 – ξ))
Conditions 1) and 2) simply state that the start of the cluster is contained in a ξ-steep downward area D and the end of the cluster is contained in a ξ-steep upward area U. Condition 3a) states that the cluster has to consist of at least MinPts points because of the core condition used in OPTICS. Condition 3b) states that the reachability values of all points in the cluster have to be at least ξ% lower than the reachability value of the first point of D and the reachability value of the first point after the end of U.
条件1与条件2简单描述了,簇的首点被包含在ξ-向下陡峭区域D,簇的末点被包含在ξ-向上陡峭区域U。条件3a说明一个簇中至少包含MinPts个点。条件3b说明,簇中所有点的可达距离,至少比D中第一个点与U中第一个点可达距离的 ξ%低。
Condition 4) determines the start and the end point. Depending on the reachability values of the first point of D (called Reach-Start) and the reachability of the first point after the end of U (called ReachEnd), we distinguish three cases (c.f. figure 18). First, if these two values are at most ξ% apart, the cluster starts at the beginning of D and ends at the end of U (figure 18a). Second, if ReachStart is more than ξ% higher than ReachEnd, the cluster ends at the end of U, but starts at that point in D, that has approximately the same reachability value as ReachEnd (figure 18b, cluster starts at SoC). Otherwise (i.e. if ReachEnd is more than ξ% higher than ReachStart), the cluster starts at the first point of D and ends at that point in U, that has approximately the same reachability value as ReachStart (figure 18c, cluster ends at EoC).
Figure 18条件4决定了起始点与结束点。取决于D的第一个点(Reach-Start)的可达距离与U结束后的第一个点(ReachEnd)。我们将区别下面三种情况(figure 18)。首先,如果这两个值分别与后继值相差ξ%,则簇开始于D的第一个点,簇结束于U的最后一个点(figure 18a)。第二种,如果ReachStart比ReachEnd高了ξ%,则簇结束于U的末点,开始于D中一点,该点与ReachEnd大概率有相等的可达距离。(figure 18b, 簇起始于SoC点)。否则(如果ReachEnd超过ReachStart超过ξ% ),则簇开始于D的第一个点,结束于U中某点,该点与ReachStart大概率有相等的可达距离。(figure 18c, 簇结束于EoC点)
4.3.2 An Efficient Algorithm To Compute All ξ-Clusters
计算ξ-Clusters的有效算法
We will now present an efficient algorithm to compute all ξ- clusters using only one pass over the reachability values of all points. The basic idea is to identify all ξ-steep down and ξ-steep up areas (we will drop the ξ for the remainder of this section and just talk about steep up areas, steep down areas and clusters), and combine them into clusters if a combination satisfies all of the cluster conditions.
我们将提出一个有效的算法,通过遍历一次所有点的可达距离,去计算得到所有的ξ- clusters。基本想法是去识别所有的ξ-向下陡峭区域与ξ-向上陡峭区域(我们将会在后文中去掉ξ,只讲向上陡峭区域,向下陡峭区域与簇),接着如果形成簇的条件符合的话,将这些区域结合成簇。
We will start out with a naïve implementation of definition 11. The resulting algorithm, however, is inefficient, so in a second step we will show how to transform it into a more efficient version while still finding all clusters. The naïve algorithm works as follows: We make one pass over all reachability values in the order computed by OPTICS. Our main data structure is a set of steep down regions SDASet which contains for each steep down region encountered so far its start-index and end-index. We start at index=0 with an empty SDASet.
我们将对定义11进行延审。结果算法是不有效的,所以在第二步,我们将更有效地去找到所有的簇。其按下述原理工作:遍历OPTICS生成的可达距离。我们的主要数据结构SDASet,存储了每个陡峭向下的区域,包括其开始值与结束值。
While index < n do
1 If a new steep down region starts at index, add it to SDASet and continue to the right of it.
2 If a new steep up region starts at index, combine it with every steep down region in SDASet, check each combination for fulfilling the cluster conditions and save it if it is a cluster (computing s and e according to condition 4). Continue to the right of this steep up region.
3 Otherwise, increment index.
当index<n时,
1. 如果最新的向下陡峭区域从index开始,将其加入SDASet ,并向右遍历。
2. 如果最新的向上陡峭区域从index开始,将其与SDASet中的每个陡降区域相结合,检查每个组合是否满足集群条件,如果是集群,则将其保存。(根据条件4计算s与e)继续向右遍历。
3. 否则,增加index。
Obviously, this algorithm finds all clusters. It makes one pass over the reachability values and additionally one loop over each potential cluster to check for condition 3b in step 2. The number of potential clusters is quadratic in the number of steep up and steep down regions. Most combinations do not actually result in clusters, however, so we employ two techniques to overcome this inefficiency: First, we filter out most potential clusters which will not result in real clusters, and second, we get rid of the loop over all points in the cluster.
显然,这个算法可以找到所有的簇。它遍历所有的可达距离,并遍历每一个潜在的簇检查其是否符合步骤2中的条件3b。潜在簇的数量,是陡峭上升与陡峭下降区域总数的二次方。大部分合并不会产生簇,于是我们采用了两个技术去克服这种低效:首先,我们筛选掉大多数不会产生真正簇的潜在簇。第二,我们防止它循环簇中的所有点。
Condition 3b
∀x, sD < x < eU : (r(x) ≤ min(r(sD), r(eU)) × (1 – ξ)) , is equivalent to
(sc1) ∀x,sD<x<eU : (r(x)≤r(sD)×(1–ξ)) ∧
(sc2) ∀x, sD < x < eU : (r(x) ≤ r(eU) × (1 – ξ)) ,
so we can split it and check the sub-conditions (sc1) and (sc2) separately. We can further transform (sc1) and (sc2) into the equivalent condition (sc1*) and (sc2*), respectively:
所以我们可以把它分开,分别检查子条件(sc1)和(sc2)。我们可以将(sc1)和(sc2)进一步转化为等价条件(sc1*)和(sc2*),分别为:
(sc1*)max{x|sD<x<eU}≤r(sD)×(1–ξ)
(sc2*) max{x | sD < x < eU} ≤ r(eU) × (1 – ξ)
In order to make use of conditions (sc1*) and (sc2*), we need to introduce the concept of maximum-in-between values, or mib-values, containing the maximum value between a certain point and the current index. We will keep track of one mib-value for each steep down region in SDASet, containing the maximum value between the end of the steep down region and the current index, and one global mib-value containing the maximum between the end of the last steep (up or down) region found and the current index.
为了使用条件(sc1*)与条件(sc2*),我们需要去介绍概念“介于两者间的最大值”(maximum-in-between/mib-values),包含某个点和当前索引之间的最大值。我们将会追踪SDASet中每个陡峭下降区域和当前index的mib-value,其中包含陡峭下降区域的末点与当前索引间,以及一个全局mib-value,该值包含找到的最后一个陡峭(向上或向下)区域的结尾与当前索引之间的最大值。
We can now modify the algorithm to keep track of all the mib-values and use them to save the expensive operations identified above. The complete algorithm is shown in figure 19. Whenever we encounter a new steep (up or down) area, we filter all steep down areas from SDASet whose start multiplied by (1-ξ) is smaller than the global mib-value, thus reducing the number of potential clusters significantly and satisfying (sc1*) at the same time (c.f. line (*) in figure 19). In step 2 (line (**) in figure 19), we compare the end of the steep up area U multi- plied by (1-ξ) to the mib-value of the steep down area D, thus satisfying (sc2*).
我们现在可以修改算法来追踪所有的mib-value,并使用他们来保存上面的昂贵操作。完整算法展示于图19。当我们遇到一个新的陡峭区域(向上或向下),我们从SDASet中筛选掉比global mib-value小(1-ξ) 的陡峭向下区域,从而减少掉潜在簇的数量。在step2中(图19),我们将陡升区u的末端乘以(1-ζ)与陡降区d的mib值进行比较,从而满足(sc2*)。
Figure 19What we have gained by using the mib-values to satisfy condition (sc1*) and (sc2*) (implying that condition 3b is satisfied) is that we reduced the number of potential clusters significantly and saved the loop over all points in each remaining potential cluster.
通过使用mib值来满足条件(sc1*)和(sc2*)(意味着条件3b已满足)所得到的结果是,我们显著减少了潜在簇的数量,并在每个剩余潜在簇的所有点上保存了循环。
4.3.3 Experimental Evaluation
Figure 20 depicts the runtime of the cluster extraction algorithm for a data set containing 64-dimensional color histograms ex- tracted from TV-snapshots. It shows the scale-up between 10,000 and 100,000 data objects for different ξ values, proving that the algorithm is indeed very fast and linear in the number of data objects. For higher ξ-values the number of clusters increas- es, resulting in a small (constant) increase in runtime. The pa- rameter ξ can be used to control the steepness of the points a cluster starts with and ends with. Higher ξ-values can be used to find only the most significant clusters, lower ξ-values to find less significant clusters which means that the choice depends on the intended granularity of the analysis. All experiments were performed on a 180 MHz PentiumPro with 96 MB RAM under Windows NT 4.0. The clustering algorithm was implemented in Java by using Sun’s JDK version 1.1.6. Note that Java is in- terpreted byte-code, not compiled native code.
图20描述了从电视快照中提取的包含64维彩色直方图的数据集的集群提取算法的运行时。结果表明,对于不同的ζ值,该算法在10000到100000个数据对象之间进行了放大,证明了该算法在数据对象数量上确实是非常快速和线性的。对于较高的ζ-值,集群的数量会增加,从而导致运行时的少量(常量)增加。参数ζ可用于控制簇开始和结束点的陡度。较高的ζ-值可用于仅找到最重要的簇,较低的ζ-值可用于找到较不重要的簇,这意味着选择取决于分析的预期粒度。所有实验均在windows nt 4.0环境下,在一台180mhz、96mb ram的奔腾处理器上进行。使用Sun的JDK版本1.1.6在Java中实现了聚类算法。请注意,Java是字节代码,而不是编译的本地代码。
In section 4.1, we have identified clusters as “dents” in the reachability-plot. Here, we demonstrate that what we call a dent is, in fact, a ξ-cluster by showing synthetic, two-dimensional points as well as high-dimensional real-world examples.
In figure 21, we see an example of three equal size clusters, two of which are very close together, and some noise points. We see that the algorithm successfully identifies this hierarchical struc- ture, i.e. it finds the two clusters and the higher-level cluster containing both of them. It also finds the third cluster, and even identifies an especially dense region within it.
在第4.1节中,我们在可达性图中将簇识别为“凹痕”。在这里,我们通过显示合成的二维点和高维的真实世界示例来证明我们所称的凹痕实际上是一个ζ-簇。
在图21中,我们看到了三个大小相等的集群(其中两个非常接近)和一些噪声点的示例。我们看到,该算法成功地识别了这种层次结构,即发现了两个簇和包含这两个簇的高层簇。它还发现了第三个集群,甚至确定了其中一个特别密集的区域。
In figure 22, a reachability-plot for 64-dimensional color histo- grams is shown. The cluster identified in region I contains screen shots from one talk show. The cluster in region II con- tains stock market data. It is very interesting to see that this clus- ter actually consists of 2 smaller sub-clusters (both of which the algorithm identifies; the first one, however, is filtered out be- cause it does not contain enough points). These sub-clusters contain the same stock-market-data shown on two different TV-stations which (during certain times of the day) broadcast the same program. The only difference between these clusters is the TV-station symbol, which each station overlays in the top-left hand corner of the image. Region III are pictures from a tennis match, the subclusters are different camera angles. Note that the notion of ξ-clusters nicely captures this hierachi- cal structure.
在图22中,显示了64维彩色直方图的可达性图。区域I中标识的集群包含一个脱口秀的屏幕截图。第二区域的集群包含股市数据。很有意思的是,这个聚类实际上由两个较小的子聚类组成(算法识别这两个聚类;但是,第一个被过滤掉了,因为它没有包含足够的点)。这些子集群包含两个不同电视台(在一天中的某些时间)播放相同节目的相同股市数据。这些簇之间的唯一区别是电视台符号,每个电视台都覆盖在图像的左上角。第三区域是一场网球比赛的照片,子星团是不同的拍摄角度。注意,ζ-簇的概念很好地捕获了这个层次结构。
We have seen that it is possible to extract the hierarchical cluster structure from the augmented cluster-ordering generated by OPTICS, both visually and automatically. The algorithm for the automatic extraction is highly efficient and of a very high qual- ity. Once we have the set of points belonging to a cluster, we can easily compute traditional clustering information like represen- tative points (e.g. medoid or mean) or shape descriptions.
我们已经看到,从光学产生的增强簇序中,可以从视觉上和自动地提取层次簇结构。自动提取算法效率高、质量好。一旦我们得到了属于一个聚类的点集,我们就可以很容易地计算传统的聚类信息,如表示点(如medoid或mean)或形状描述。
Figure 21