elasticsearch-carrot2 Explained
[This article is translated from the elasticsearch-carrot2 examples.]
Introduction
Carrot2 - Open Source Search Results Clustering Engine is an open-source engine for clustering search results. It automatically organizes search results into smaller topical groups based on their content. This article introduces the carrot2 plugin for Elasticsearch.
Basic Concepts
carrot2 is a clustering plugin: it automatically groups similar documents together and assigns a human-readable label to each group. Such clusters can be thought of as dynamic facets computed for every search and its set of hits. You can try the engine out on the Carrot2 demo page.
Each document to be clustered consists of several logical fields: a document identifier, the original URL, a title, the main content, and a language code. Only the identifier field is mandatory; all the others are optional, but at least one of them must be provided for the operation to make sense.
Documents indexed in Elasticsearch do not have to follow any predefined schema, so the actual fields of a JSON document need to be mapped onto the logical fields required by the clustering plugin. An example mapping is illustrated below.
Note that two fields of the document are mapped to TITLE. This is not an error: any number of fields can be mapped to TITLE or CONTENT, and the contents of these fields are concatenated for clustering.
Logical fields can also be filled with generated content, for example the output of highlighting applied to a document's fields. This can greatly reduce the amount of text fed to the clustering algorithm (better performance) and makes the clustered content more relevant to the query (better clusters). The REST API section below shows the details of field mapping.
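To make the mapping more concrete, here is a minimal sketch of how a hypothetical indexed document could be mapped onto the logical fields. The field names (subject, headline, abstract, urlSource, lang) are purely illustrative, and the exact field_mapping syntax is covered in the REST API section below.
// A hypothetical indexed document; the field names are made up for illustration.
var indexedDocument = {
    "subject":   "Carrot2 clustering plugin",                  // mapped to the logical TITLE
    "headline":  "Clustering search results in Elasticsearch", // a second field, also mapped to TITLE
    "abstract":  "Organizes search hits into labeled topics.", // the logical CONTENT
    "urlSource": "http://example.com/docs/1",                  // the logical URL
    "lang":      "en"                                          // the logical LANGUAGE (ISO 639-1)
};
// The corresponding field_mapping fragment of a clustering request.
// The two source fields mapped to "title" are concatenated before clustering.
var fieldMapping = {
    "title":    ["_source.subject", "_source.headline"],
    "content":  ["_source.abstract"],
    "url":      ["_source.urlSource"],
    "language": ["_source.lang"]
};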
Java API
The Java API for clustering search results is fully functional and is what powers the REST requests described below. See the plugin's source code on GitHub, in particular the unit and integration tests.
HTTP (REST) API
The HTTP REST API exposes several methods that mirror the Java API functionality. They are described in detail below.
Listing Available Algorithms
/_algorithms (GET or POST)
This operation lists all the available clustering algorithms. The returned identifiers can be used as the algorithm parameter of clustering requests.
Request
A simple GET or POST request to the /_algorithms URL.
Response
The response is a JSON object with an algorithms property holding a list of algorithm identifiers. The example below shows the algorithms available for this plugin's examples. The default algorithm is the first one on the returned list.
$.get("/_algorithms", function(response) {
$("#list-of-algorithms").text(
response.algorithms.join("\n"));
});
lingo
stc
kmeans
byurl
Search and Cluster Results
/_search_with_clusters (POST, GET)
/{index}/_search_with_clusters (POST, GET)
/{index}/{type}/_search_with_clusters (POST, GET)
This operation runs a search query, fetches the matching hits and clusters them.
The index and type parts of the URI implicitly restrict the search request to a given index and document type, just as in the search request API.
A clustering request is an HTTP REST request; the full set of parameters is passed as the JSON body of an HTTP POST request. A subset of the clustering functionality is also available through HTTP GET.
Request (HTTP POST)
The HTTP POST request should contain a JSON object with the following properties:
- search_request (required): the search request that fetches the documents to be clustered. This section follows the search DSL exactly and supports all of its features, such as sorting, filtering, the query DSL, the highlighter and so on.
- query_hint (required): the query terms that were used to fetch the matching documents. The query_hint helps the clustering algorithm avoid trivial or meaningless clusters. Typically it will be whatever the user typed in the search box, if possible stripped of any boolean operators or engine-specific syntax that would otherwise affect the clustering. This property is mandatory, but it may be an empty string.
- field_mapping (required): defines how the actual fields of documents matching the search_request are mapped onto the logical fields of the documents to be clustered. This property is a hash whose keys are the logical fields and whose values are arrays of field source specifications (the contents of the fields selected by these specifications are concatenated). For example, the following is a valid mapping specification:
{
"url": ["_source.urlSource"],
"title": ["fields.subject"],
"content": ["_source.abstract", "highlight.main"],
"language": ["fields.lang"]
}
- url is the document's URL
- title is the document's title
- content is the document's main content
- language is an optional language code for the title and content. The language code is a two-letter ISO 639-1 code, with the exception of Simplified Chinese (the zh_cn code). Whether a given language is supported by the clustering engine depends on the algorithm used; the languages supported by the Carrot2 algorithms are defined in the LanguageCode class.
A field source specification defines where a value is taken from: a field of the search hit, the document source, or the highlighter output. The syntax of field source specifications is the following:
- fields.{fieldname} refers to a field of the search hit (a stored field, or one re-parsed from the source document, returned with the search request)
- highlight.{fieldname} refers to a highlighted field of the search hit. Highlighting must also be configured appropriately in the search request (see the examples)
- _source.{fieldname} refers to a field of the source document (a top-level property of the JSON document). The source document is re-parsed to extract the appropriate value.
- algorithm (optional): the clustering algorithm to use. All built-in algorithms are loaded at startup and appear in the algorithm list shown above. If not specified, the first algorithm on that list is used by default.
- include_hits (optional): if set to false, the clustering response will not contain the search hits, only the cluster labels and document references. This option is useful for reducing the size of the clustering response when only the cluster labels are needed.
- max_hits (optional): if set to a non-negative value, the clustering response will contain at most that many search hits. Clustering still runs on the full window of the original search results. This option can reduce the size of the clustering response when the cluster labels are used as facets. Note that clusters may reference documents that are not present in the returned hits.
- attributes (optional): a map of key-value pairs that override the default algorithm settings for this query. The defaults typically come from the algorithm's initial XML configuration file.
Note
Clustering requires at least a certain number of documents to be sensible. The plugin clusters only the results of the query (it does not look into the index or fetch any additional documents), so make sure the size of the fetch window is at least 100. If the response does not need that many hits, they can be trimmed from the clustering response with the max_hits parameter. A request body combining the options above is sketched below.
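For reference, here is a sketch of a complete HTTP POST body that combines the properties described above. The test/test index and type and the attribute name are taken from the examples later on this page; whether you use include_hits, max_hits, or neither depends on how much of the hit list you want back.
// A sketch of a full clustering request combining the options described above.
var request = {
    "search_request": {
        "query": { "match": { "_all": "data mining" } },
        "size": 100                  // fetch window used for clustering
    },
    "query_hint": "data mining",     // what the user typed in the search box
    "field_mapping": {
        "title":   ["_source.title"],
        "content": ["_source.content"]
    },
    "algorithm": "lingo",            // one of the identifiers returned by /_algorithms
    "include_hits": false,           // return only clusters and document references
    // "max_hits": 10,               // alternatively, trim the returned hit list to 10 hits
    "attributes": {                  // per-request override of one algorithm attribute
        "LingoClusteringAlgorithm.desiredClusterCountBase": 10
    }
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
    console.log(response.clusters);  // cluster labels and document references
});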
Request (HTTP GET)
An HTTP GET clustering request supports a superset of the HTTP URI parameters defined for Elasticsearch URI search requests. The additional parameters correspond to the ones normally defined in the body of a clustering POST request. HTTP GET supports the following parameters:
- field_mapping_* (required): a set of parameters, each of which defines one logical field mapping, analogous to field_mapping in the HTTP POST request. field_mapping_title specifies the mapping of the logical title, field_mapping_url the mapping of the logical URL, and so on. The value of a mapping parameter is a comma-separated list of mapping specifications, as described for the POST request.
- algorithm (optional): same as algorithm in the HTTP POST request.
- query_hint (optional): same as query_hint in the HTTP POST request. For GET requests query_hint is optional; if it is not specified, the q parameter is used as the default.
Important
HTTP GET requests provide only a subset of the functionality of full HTTP POST requests. For example, you cannot map a logical field to highlighter output or define custom algorithm attributes. Using HTTP POST is recommended.
Below is an example of a clustering request issued via HTTP GET.
var getUrl = "/test/test/_search_with_clusters?"
+ "q=data+mining&"
+ "size=100&"
+ "field_mapping_title=_source.title&"
+ "field_mapping_content=_source.content";
// Run HTTP GET via jquery and render cluster labels.
$.get(getUrl,
function(response) {
$("#cluster-httpget-result").text(
dumpClusters([], response.clusters).join("\n"));
});
Knowledge Discovery [13 documents]
Data Mining Process [12 documents]
Data Mining Applications [11 documents]
Data Mining Tools [9 documents]
Data Mining Conference [8 documents]
Data Mining Solutions [8 documents]
Data Mining Research [7 documents]
Data Mining Technology [7 documents]
Text Mining [7 documents]
Book on Data Mining [5 documents]
Predictive Modeling [5 documents]
Introduction to Data Mining [4 documents]
Machine Learning [4 documents]
Oracle Data Mining [4 documents]
Analysis Techniques [3 documents]
Association [3 documents]
Data Mining Consulting [3 documents]
Data Warehousing [3 documents]
People [3 documents]
Assist Management [2 documents]
Case Studies [2 documents]
Data Mining Institute [2 documents]
Data Mining Project [2 documents]
Data-mining Software [2 documents]
Downloads [2 documents]
Encyclopedia [2 documents]
Microsoft SQL Server [2 documents]
SIAM International Conference on Data Mining [2 documents]
Other Topics [8 documents]
Response
The response format is essentially the same as that of a regular search request, with a few additional properties.
{
/* Typical search response fields. */
"hits": { /* ... */ },
/* Clustering response fields. */
"clusters": [
/* Each cluster is defined by the following. */
{
"id": /* identifier */,
"score": /* numeric score */,
"label": /* primary cluster label */,
"other_topics": /* if present, and true, this cluster groups
unrelated documents (no related topics) */,
"phrases": [
/* cluster label array, will include primary. */
],
"documents": [
/* This cluster's document ID references.
May be undefined if this cluster holds sub-clusters only. */
],
"clusters": [
/* This cluster's subclusters (recursive objects of the same
structure). May be undefined if this cluster holds documents only. */
],
},
/* ...more clusters */
],
"info": {
/* Additional information about the clustering: execution times,
the algorithm used, etc. */
}
}
Given the following helper function, which recursively extracts cluster labels:
window.dumpClusters = function(arr, clusters, indent) {
indent = indent ? indent : "";
clusters.forEach(function(cluster) {
arr.push(
indent + cluster.label
+ (cluster.documents ? " [" + cluster.documents.length + " documents]" : "")
+ (cluster.clusters ? " [" + cluster.clusters.length + " subclusters]" : ""));
if (cluster.clusters) {
dumpClusters(arr, cluster.clusters, indent + " ");
}
});
return arr;
}
The following JavaScript then retrieves all cluster labels recursively:
var request = {
"search_request": {
"query": {"match" : { "_all": "data mining" }},
"size": 100
},
"max_hits": 0,
"query_hint": "data mining",
"field_mapping": {
"title": ["_source.title"],
"content": ["_source.content"]
}
};
$.post("/test/test/_search_with_clusters",
JSON.stringify(request),
function(response) {
$("#cluster-list-result").text(
dumpClusters([], response.clusters).join("\n"));
});
Output
Knowledge Discovery [13 documents]
Data Mining Process [12 documents]
Data Mining Applications [11 documents]
Data Mining Tools [9 documents]
Data Mining Conference [8 documents]
Data Mining Solutions [8 documents]
Data Mining Research [7 documents]
Data Mining Technology [7 documents]
Text Mining [7 documents]
Book on Data Mining [5 documents]
Predictive Modeling [5 documents]
Introduction to Data Mining [4 documents]
Machine Learning [4 documents]
Oracle Data Mining [4 documents]
Analysis Techniques [3 documents]
Association [3 documents]
Data Mining Consulting [3 documents]
Data Warehousing [3 documents]
People [3 documents]
Assist Management [2 documents]
Case Studies [2 documents]
Data Mining Institute [2 documents]
Data Mining Project [2 documents]
Data-mining Software [2 documents]
Downloads [2 documents]
Encyclopedia [2 documents]
Microsoft SQL Server [2 documents]
SIAM International Conference on Data Mining [2 documents]
Other Topics [8 documents]
The output depends on the clustering algorithm used. The example below produces clusters from the logical url field (the byurl algorithm). We do not need the individual search hits here, so they are omitted from the response.
var request = {
"search_request": {
"query": {"match" : { "_all": "data mining" }},
"size": 100
},
"max_hits": 0,
"query_hint": "data mining",
"field_mapping": {
"url": ["_source.url"]
},
"algorithm": "byurl"
};
$.post("/test/test/_search_with_clusters",
JSON.stringify(request), function(response) {
$("#cluster-list-result2").text(
dumpClusters([], response.clusters).join("\n"));
});
Output
com [13 subclusters]
microsoft.com [2 subclusters]
research.microsoft.com [2 documents]
Other Sites [2 documents]
yahoo.com [2 subclusters]
answers.yahoo.com [2 documents]
Other Sites [2 documents]
databases.about.com [2 documents]
datamining.typepad.com [2 documents]
dataminingconsultant.com [2 documents]
dmreview.com [2 documents]
oracle.com [2 documents]
spss.com [2 documents]
statsoft.com [2 documents]
the-data-mine.com [2 documents]
thearling.com [2 documents]
twocrows.com [2 documents]
Other Sites [32 documents]
org [3 subclusters]
en.wikipedia.org [2 documents]
siam.org [2 documents]
Other Sites [9 documents]
edu [2 subclusters]
ccsu.edu [2 documents]
Other Sites [10 documents]
ca [2 documents]
gov [2 documents]
net [2 documents]
Other Sites [2 documents]
Below is a request that returns the complete response, so the differences can be compared.
var request = {
"search_request": {
"fields": [ "title", "content" ],
"query": {"match" : { "_all": "data mining" }},
"size": 100
},
"query_hint": "data mining",
"field_mapping": {
"title": ["fields.title"],
"content": ["fields.content"]
}
};
$.post("/test/test/_search_with_clusters",
JSON.stringify(request),
function(response) {
$("#simple-request-result").text(
JSON.stringify(response, false, " "));
});
Output
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 93,
"max_score": 1.1545734,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "6",
"_score": 1.1545734,
"fields": {
"content": [
"... complete data mining customer ... Data mining applications, on the other hand, embed ... it, our daily lives are influenced by data mining applications. ..."
],
"title": [
"Data Mining Software, Data Mining Applications and Data Mining Solutions"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "44",
"_score": 1.1312462,
"fields": {
"content": [
"Data mining terms concisely defined. ... Accuracy is an important factor in assessing the success of data mining. ... data mining ..."
],
"title": [
"Two Crows: Data mining glossary"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "55",
"_score": 1.1312462,
"fields": {
"content": [
""
],
"title": [
"data mining institute"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "84",
"_score": 1.0323554,
"fields": {
"content": [
"... Walmart, Fundraising Data Mining, Data Mining Activities, Web-based Data Mining, ... in many industries makes us the best choice for your data mining needs. ..."
],
"title": [
"Data Mining, Data Mining Process, Data Mining Techniques, Outsourcing Mining Data Services"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "35",
"_score": 1.0323384,
"fields": {
"content": [
"... Sapphire-a semiautomated, flexible data-mining software infrastructure. ... Data mining is not a new field. ... scale, scientific data-mining efforts such ..."
],
"title": [
"Data Mining"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "18",
"_score": 0.9796879,
"fields": {
"content": [
"... high performance networking, internet computing, data mining and related areas. ... Peter Stengard, Oracle Data Mining Technologies. prudsys AG, Chemnitz, ..."
],
"title": [
"Data Mining Group - DMG"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "22",
"_score": 0.96633387,
"fields": {
"content": [
"Using data mining functionality embedded in ... Oracle Data Mining JDeveloper and SQL Developer ... Oracle Magazine: Using the Oracle Data Mining API ..."
],
"title": [
"Oracle Data Mining"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "39",
"_score": 0.96633387,
"fields": {
"content": [
"Some example application areas are listed under Applications Of Data Mining ... Crows Introduction - \"Introduction to Data Mining and Knowledge Discovery\"- http: ..."
],
"title": [
"Data Mining - Introduction To Data Mining (Misc)"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "66",
"_score": 0.96318483,
"fields": {
"content": [
"... business intelligence, data warehousing, data mining, CRM, analytics, ... M2007 Data Mining Conference Hitting 10th Year and Going Strong ..."
],
"title": [
"Data Mining"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "2",
"_score": 0.96179265,
"fields": {
"content": [
"Newsletter on the data mining and knowledge industries, offering information on data mining, knowledge discovery, text mining, and web mining software, courses, jobs, publications, and meetings."
],
"title": [
"KDnuggets: Data Mining, Web Mining, and Knowledge Discovery"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "7",
"_score": 0.96179265,
"fields": {
"content": [
"Commentary on text mining, data mining, social media and data visualization. ... Opinion Mining Startups ... in sentiment mining, deriving tuples of ..."
],
"title": [
"Data Mining: Text Mining, Visualization and Social Media"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "71",
"_score": 0.9561301,
"fields": {
"content": [
"Data Mining is the automated extraction of hidden predictive information from databases. ... The data mining tools can make this leap. ..."
],
"title": [
"Data Mining | NetworkDictionary"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "65",
"_score": 0.948279,
"fields": {
"content": [
"... Website for Data Mining Methods and ... data mining at Central Connecticut State University, he ... also provides data mining consulting and statistical ..."
],
"title": [
"DataMiningConsultant.com"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "36",
"_score": 0.9427052,
"fields": {
"content": [
"SQL Server Data Mining Portal ... information about our exciting data mining features. ... CTP of Microsoft SQL Server 2008 Data Mining Add-Ins for Office 2007 ..."
],
"title": [
"SQL Server Data Mining > Home"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "14",
"_score": 0.9127037,
"fields": {
"content": [
"From data mining tutorials to data warehousing techniques, you will find it all! ... Administration Design Development Data Mining Database Training Careers Reviews ..."
],
"title": [
"Data Mining and Data Warehousing"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "3",
"_score": 0.9124819,
"fields": {
"content": [
"Data mining is considered a subfield within the Computer Science field of knowledge discovery. ... claim to perform \"data mining\" by automating the creation ..."
],
"title": [
"Data mining - Wikipedia, the free encyclopedia"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "15",
"_score": 0.9124819,
"fields": {
"content": [
"Oracle Data Mining Product Center ... Using data mining functionality embedded in Oracle Database 10g, you can find ... Mining High-Dimensional Data for ..."
],
"title": [
"Oracle Data Mining"
]
}
}
...
]
},
"clusters": [
{
"id": 0,
"score": 71.07595656014654,
"label": "Knowledge Discovery",
"phrases": [
"Knowledge Discovery"
],
"documents": [
"39",
"2",
"3",
"5",
"61",
"25",
"17",
"34",
"4",
"9",
"43",
"62",
"74"
]
},
{
"id": 1,
"score": 66.46874714157775,
"label": "Data Mining Process",
"phrases": [
"Data Mining Process"
],
"documents": [
"84",
"13",
"63",
"67",
"86",
"34",
"77",
"83",
"8",
"54",
"4",
"87"
]
},
{
"id": 2,
"score": 71.44252901633597,
"label": "Data Mining Applications",
"phrases": [
"Data Mining Applications"
],
"documents": [
"6",
"39",
"85",
"82",
"33",
"76",
"41",
"60",
"43",
"16",
"87"
]
},
{
"id": 3,
"score": 81.34385135697781,
"label": "Data Mining Tools",
"phrases": [
"Data Mining Tools"
],
"documents": [
"71",
"23",
"32",
"56",
"86",
"52",
"77",
"74",
"79"
]
},
{
"id": 4,
"score": 49.66400793807237,
"label": "Data Mining Conference",
"phrases": [
"Data Mining Conference"
],
"documents": [
"66",
"85",
"50",
"33",
"60",
"46",
"29",
"57"
]
},
{
"id": 5,
"score": 64.44592124795795,
"label": "Data Mining Solutions",
"phrases": [
"Data Mining Solutions"
],
"documents": [
"6",
"28",
"37",
"77",
"42",
"54",
"89",
"53"
]
},
...
],
"info": {
"algorithm": "lingo",
"search-millis": "12",
"clustering-millis": "296",
"total-millis": "309",
"include-hits": "true",
"max-hits": ""
}
}
Field Mapping in Depth
Field mappings link the actual document data to the logical data used for clustering. The different field mapping sources (_source.*, highlight.* and fields.*) can be used to tune both the amount of data returned with the request and the amount of text passed to the clustering engine (which ultimately translates into processing cost).
- _source.* mappings fetch data directly from the source document, if _source is available as part of the search hit. Content referenced by this mapping is not returned as part of the response; it is used only internally for clustering. Warning: _source may not be exposed by Elasticsearch's internal search infrastructure; in particular, when only selected fields are requested, the source is not available. This should be resolved in a future version.
- fields.* mappings must correspond to an appropriate fields declaration in the search request. The content of these fields is returned with the response and can be used for display (for example, showing only each document's title).
- highlight.* mappings must likewise correspond to an appropriate highlight declaration in the search request. The highlighter configuration can be used to tune the amount of content passed to the clustering engine (the number of fragments, their size, boundaries and so on). This is particularly important for very long documents whose full content is stored: clustering algorithms typically work much better on fragments surrounding the query terms than on the entire content of every document. Any highlighted content is also returned as part of the response.
Compare the outputs of the two requests below to see the difference.
Code 1
var request = {
"search_request": {
"fields": ["url", "title", "content"],
"query": {"match" : { "_all": "computer" }},
"size": 100
},
"query_hint": "computer",
"field_mapping": {
"url": ["fields.url"],
"title": ["fields.title"],
"content": ["fields.content"]
}
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
$("#fields-request").text(JSON.stringify(response, false, " "));
});
Code 2
var request = {
"search_request": {
"fields": ["url", "title"],
"query": {"match" : { "_all": "computer" }},
"size": 100,
"highlight" : {
"pre_tags" : ["", ""],
"post_tags" : ["", ""],
"fields" : {
"content" : { "fragment_size" : 100, "number_of_fragments" : 2 }
}
},
},
"query_hint": "computer",
"field_mapping": {
"url": ["fields.url"],
"title": ["fields.title"],
"content": ["highlight.content"]
}
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
$("#highlight-request").text(JSON.stringify(response, false, " "));
});
Output 1
{
"took": 17,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.685061,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "62",
"_score": 0.685061,
"fields": {
"content": [
"Technical journal focused on the theory, techniques, and practice for extracting information from large databases."
],
"title": [
"Data Mining and Knowledge Discovery - Data Mining and Knowledge Discovery Journals, Books & Online Media | Springer"
],
"url": [
"http://www.springer.com/computer/database+management+&+information+retrieval/journal/10618"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "3",
"_score": 0.68239,
"fields": {
"content": [
"Data mining is considered a subfield within the Computer Science field of knowledge discovery. ... claim to perform \"data mining\" by automating the creation ..."
],
"title": [
"Data mining - Wikipedia, the free encyclopedia"
],
"url": [
"http://en.wikipedia.org/wiki/Data-mining"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "51",
"_score": 0.5480488,
"fields": {
"content": [
"This page describes the term data mining and lists other pages on the Web where you can find additional information. ... Data Mining and Analytic Technologies ..."
],
"title": [
"What is data mining? - A Word Definition From the Webopedia Computer Dictionary"
],
"url": [
"http://www.webopedia.com/TERM/D/data_mining.html"
]
}
}
]
},
"clusters": [
{
"id": 0,
"score": 0.18077730227849886,
"label": "Data Mining and Knowledge Discovery",
"phrases": [
"Data Mining and Knowledge Discovery"
],
"documents": [
"62",
"3"
]
},
{
"id": 1,
"score": 0,
"label": "Other Topics",
"phrases": [
"Other Topics"
],
"other_topics": true,
"documents": [
"51"
]
}
],
"info": {
"algorithm": "lingo",
"search-millis": "22",
"clustering-millis": "25",
"total-millis": "47",
"include-hits": "true",
"max-hits": ""
}
}
Output 2
{
"took": 305,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.685061,
"hits": [
{
"_index": "test",
"_type": "test",
"_id": "62",
"_score": 0.685061,
"fields": {
"title": [
"Data Mining and Knowledge Discovery - Data Mining and Knowledge Discovery Journals, Books & Online Media | Springer"
],
"url": [
"http://www.springer.com/computer/database+management+&+information+retrieval/journal/10618"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "3",
"_score": 0.68239,
"fields": {
"title": [
"Data mining - Wikipedia, the free encyclopedia"
],
"url": [
"http://en.wikipedia.org/wiki/Data-mining"
]
},
"highlight": {
"content": [
"Data mining is considered a subfield within the Computer Science field of knowledge discovery"
]
}
},
{
"_index": "test",
"_type": "test",
"_id": "51",
"_score": 0.5480488,
"fields": {
"title": [
"What is data mining? - A Word Definition From the Webopedia Computer Dictionary"
],
"url": [
"http://www.webopedia.com/TERM/D/data_mining.html"
]
}
}
]
},
"clusters": [
{
"id": 0,
"score": 0.1807764758253202,
"label": "Data Mining and Knowledge Discovery",
"phrases": [
"Data Mining and Knowledge Discovery"
],
"documents": [
"62",
"3"
]
},
{
"id": 1,
"score": 0,
"label": "Other Topics",
"phrases": [
"Other Topics"
],
"other_topics": true,
"documents": [
"51"
]
}
],
"info": {
"algorithm": "lingo",
"search-millis": "305",
"clustering-millis": "11",
"total-millis": "317",
"include-hits": "true",
"max-hits": ""
}
}
Choosing the Algorithm
The clustering plugin includes several open-source algorithms from the Carrot2 project and can also use the commercial Lingo3G clustering algorithm.
Which algorithm to choose depends on the required throughput (STC is faster than Lingo but tends to produce lower-quality clusters; Lingo3G is an even faster algorithm, but it is not open source or free), on the desired result (Lingo3G produces hierarchical clusters, while Lingo and STC produce flat ones), and on the input data (each algorithm clusters slightly differently). There is no single definitive answer.
The examples below show the effect of choosing different algorithms.
The lingo algorithm
var request = {
"search_request": {
"query": {"match" : { "_all": "data mining" }},
"size": 100
},
"query_hint": "data mining",
"field_mapping": {
"title": ["_source.title"],
"content": ["_source.content"]
},
"algorithm": "lingo"
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
$("#request-algorithm1").text(dumpClusters([], response.clusters).join("\n"));
});
The STC algorithm
var request = {
"search_request": {
"query": {"match" : { "_all": "data mining" }},
"size": 100
},
"query_hint": "data mining",
"field_mapping": {
"title": ["_source.title"],
"content": ["_source.content"]
},
"algorithm": "stc"
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
$("#request-algorithm2").text(dumpClusters([], response.clusters).join("\n"));
});
lingo algorithm output
Knowledge Discovery [13 documents]
Data Mining Process [12 documents]
Data Mining Applications [11 documents]
Data Mining Tools [9 documents]
Data Mining Conference [8 documents]
Data Mining Solutions [8 documents]
Data Mining Research [7 documents]
Data Mining Technology [7 documents]
Text Mining [7 documents]
Book on Data Mining [5 documents]
Predictive Modeling [5 documents]
Introduction to Data Mining [4 documents]
Machine Learning [4 documents]
Oracle Data Mining [4 documents]
Analysis Techniques [3 documents]
Association [3 documents]
Data Mining Consulting [3 documents]
Data Warehousing [3 documents]
People [3 documents]
Assist Management [2 documents]
Case Studies [2 documents]
Data Mining Institute [2 documents]
Data Mining Project [2 documents]
Data-mining Software [2 documents]
Downloads [2 documents]
Encyclopedia [2 documents]
Microsoft SQL Server [2 documents]
SIAM International Conference on Data Mining [2 documents]
Other Topics [8 documents]
STC algorithm output
Knowledge Discovery [19 documents]
Data Mining Tools [9 documents]
Data Mining Solutions [7 documents]
Data Mining and Knowledge [5 documents]
Machine Learning [5 documents]
Text Mining [7 documents]
SQL, Microsoft SQL Server [4 documents]
Software [13 documents]
Process [11 documents]
Applications [10 documents]
Modeling [10 documents]
Predictive [9 documents]
Techniques [9 documents]
Databases [8 documents]
Developing [8 documents]
Other Topics [26 documents]
Overriding Algorithm Attributes
The default algorithm suite includes an empty stub with the initial attributes of each algorithm. These files are named {algorithm-name}-attributes.xml and are resolved against the current resources configuration setting (see the plugin configuration section).
For example, to override the default attributes of the lingo algorithm for all requests, create a {es.home}/config/lingo-attributes.xml file and put any overridden attributes there, like this:
<attribute-sets default="overridden-attributes">
<attribute-set id="overridden-attributes">
<value-set>
<label>overridden-attributes</label>
<attribute key="LingoClusteringAlgorithm.desiredClusterCountBase">
<value type="java.lang.Integer" value="5"/>
</attribute>
</value-set>
</attribute-set>
</attribute-sets>
Perhaps the most convenient way to produce this XML is to export the configuration directly from the Carrot2 Workbench.
Overriding Algorithm Attributes at Runtime
Every clustering algorithm has a number of attributes that change its behavior (the Carrot2 Workbench can be used to tune them). Attributes can also be customized for each query request. The example below randomly changes the desired number of clusters; run it a few times to see how the results differ.
var request = {
"search_request": {
"query": {"match" : { "_all": "data mining" }},
"size": 100
},
"query_hint": "data mining",
"field_mapping": {
"title": ["_source.title"],
"content": ["_source.content"]
},
"algorithm": "lingo",
"attributes": {
"LingoClusteringAlgorithm.desiredClusterCountBase": Math.round(5 + Math.random() * 5)
}
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
$("#request-attributes").text(dumpClusters([], response.clusters).join("\n"));
});
Output 1
Knowledge Discovery [13 documents]
Data Mining Process [12 documents]
Data Mining Applications [11 documents]
Data Mining Technology [7 documents]
Microsoft SQL Server [2 documents]
Other Topics [54 documents]
Output 2
Knowledge Discovery [13 documents]
Data Mining Tools [9 documents]
Data Mining Conference [8 documents]
Data Mining Research [7 documents]
Text Mining [7 documents]
Predictive Modeling [5 documents]
Other Topics [50 documents]
Output 3
Knowledge Discovery [13 documents]
Data Mining Applications [11 documents]
Data Mining Techniques [9 documents]
Data Mining Tools [9 documents]
Text Mining [7 documents]
Association [3 documents]
Assist Management [2 documents]
Case Studies [2 documents]
Microsoft SQL Server [2 documents]
Other Topics [43 documents]
Multilingual Clustering
The field mapping specification can include a language element that defines the ISO 639-1 code of the language of the title and content. This information can be stored in the index based on prior knowledge (the source of the documents) or by running a language detection filter at indexing time. The algorithms in the Carrot2 framework accept the ISO language codes defined in the LanguageCode enum.
The language hint lets the clustering algorithm analyze the documents' content properly and pick the right lexical resources for clustering. If your query results are multilingual (or in a language other than English), it is strongly recommended to set this field appropriately.
The example below applies a clustering algorithm to all documents. Some of the documents are in German (they carry a de language code) and some are in English (they use the en code). We additionally set the language aggregation strategy to FLATTEN_NONE, so that the top-level clusters represent the language of the documents in their subclusters. Note the names of the top-level clusters in the output below.
var request = {
"search_request": {
"query": {"match_all" : {}},
"size": 100
},
"query_hint": "bundestag",
"field_mapping": {
"title": ["_source.title"],
"content": ["_source.content"],
"language": ["_source.lang"]
},
"attributes": {
"MultilingualClustering.languageAggregationStrategy": "FLATTEN_NONE"
}
};
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
$("#language-fieldmapping").text(dumpClusters([], response.clusters).join("\n"));
});
Output
German [23 subclusters]
Parlament [8 documents]
Käfer im Bundestag in Berlin [5 documents]
Mitglieder [5 documents]
Seite [5 documents]
MdB [4 documents]
Bundestag Nachrichten [3 documents]
Restaurant [3 documents]
Tag [3 documents]
Abs.1 [2 documents]
Bundestagsfraktion Bündnis 90 die Grünen [2 documents]
Informationssystem für Parlamentarische Vorgänge [2 documents]
LINKE [2 documents]
Mehr [2 documents]
Nebeneinkünfte der Abgeordneten im Deutschen Bundestag [2 documents]
Petitionen Unterstützen Facebook [2 documents]
Reichstag Bundestag [2 documents]
Schule [2 documents]
Seite Lässt Dies Jedoch [2 documents]
Susanne Wiest [2 documents]
Tiergarten Telefon 030 22629933 Gerne weiter Empfehlen [2 documents]
Virtuelle Wählergedächtnis [2 documents]
Zentrale [2 documents]
Other Topics [15 documents]
English [19 subclusters]
Software [6 documents]
Data Mining Process [5 documents]
Conference [4 documents]
Data Mining Techniques [4 documents]
Knowledge Discovery [4 documents]
Web Mining [3 documents]
Analytic [2 documents]
Association [2 documents]
Business [2 documents]
Data Mining Technology [2 documents]
Data Warehousing [2 documents]
Downloads [2 documents]
Extraction of Hidden Predictive [2 documents]
Oracle Data Mining [2 documents]
Papers [2 documents]
SIAM International Conference on Data Mining [2 documents]
Visualization and Social Media [2 documents]
Website for Data Mining Methods [2 documents]
Other Topics [6 documents]
Plugin Configuration
The plugin ships with defaults that can be used out of the box; changing them is recommended only when really necessary.
The following configuration files and properties can be used to modify the default plugin configuration.
{path.conf}/elasticsearch.yml, {path.conf}/elasticsearch.json, {path.conf}/elasticsearch.properties
The main ES configuration file can be used to enable or disable the plugin and to fine-tune the resources assigned to clustering requests.
- carrot2.enable: if set to false, disables the plugin even if it is installed.
- threadpool.search.*: clustering requests are executed inside ES's internal search thread pool. It may make sense to tune the thread pool configuration to limit the number of concurrent clustering requests on a node (clustering is CPU-intensive). See the thread pool section of the ES documentation.
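As an illustration only, these settings could be expressed in {path.conf}/elasticsearch.json roughly as follows; the flat dotted-key form is assumed here, and the thread pool values are arbitrary examples rather than recommendations:
{
    "carrot2.enable": true,
    "threadpool.search.size": 16,
    "threadpool.search.queue_size": 1000
}
Setting carrot2.enable to false would disable the plugin without uninstalling it.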
{path.conf}/carrot2.yml, {path.conf}/carrot2.json, {path.conf}/carrot2.properties
An optional file with plugin-specific configuration.
- suite: the algorithm suite XML. The resource is looked up in path.conf and on the classpath. The default suite resource name is carrot2.suite.xml; it contains defaults for all the open-source algorithms and attempts to load Lingo3G.
- resources: the resource lookup path for Carrot2 lexical resources, Lingo3G's lexical resources and algorithm descriptor files (containing any initial attributes). Relative paths are resolved against ES's path.conf location (typically the config folder); the value may also be an absolute path. Any resource not found at this location is loaded from the classpath.
- controller.pool-size: the size of the internal pool of algorithm instances. The pool size is adjusted automatically based on the ES search thread pool configuration. If it consumes too many resources, the pool can be set to a fixed size with this option.
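Similarly, a minimal {path.conf}/carrot2.json overriding these properties could look like the sketch below; the clustering-resources directory name is hypothetical (it would have to exist under path.conf), and the pool size is only an example value:
{
    "suite": "carrot2.suite.xml",
    "resources": "clustering-resources",
    "controller.pool-size": 4
}
Anything not found under the resources directory falls back to the classpath, as described above.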