
elasticsearch-carrot2 in Detail

2015-05-07  朱小虎 (Xiaohu Zhu)

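Neil Zhu (Jianshu ID Not_GOD) is the founder and Chief Scientist of University AI, committed to advancing the adoption of artificial intelligence worldwide. He sets and carries out UAI's medium- and long-term growth strategy and goals, and has led the team in growing quickly into a highly professional force in the AI field.
As an industry leader, he and UAI founded TASA (China's earliest AI society), DL Center (a global value network for deep learning knowledge) and AI growth (industry think-tank training) in 2014, contributing to the development of AI talent in China. He has also taken part in or organized various international AI summits and events, has written 600,000 characters of AI technical content, and produced the translation of the introductory deep learning book 《神经网络与深度学习》 (Neural Networks and Deep Learning). His content has been widely republished and serialized by specialized public accounts and media. He has been invited to design AI study plans for leading Chinese universities and to teach cutting-edge AI courses, which were well received by students and teachers.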

[This article is translated from the elasticsearch-carrot2 usage examples]

Introduction


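Carrot2 (Open Source Search Results Clustering Engine) is an open source engine for clustering search results. It automatically organizes search results into smaller thematic categories based on their content. This article introduces the carrot2 plugin for elasticsearch.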

Basic Concepts


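carrot2 is a clustering plugin: it automatically groups similar documents together and gives each group a label that a human can reasonably understand. Such clustering can also be seen as a kind of dynamic facet computed for each search and its set of hits. You can try the tool on the Carrot2 demo page.
Each document to be clustered has several logical units: a document identifier, the original URL, a title, the main content and a language code. Only the identifier is mandatory. The other units are optional, but at least one of them must be specified for clustering to make sense.
Documents indexed in Elasticsearch do not have to follow any predefined schema, so the actual fields of a JSON document must be mapped onto the logical units the clustering plugin expects. The figure below shows an example:

[Figure: mapping.png]
Note that two fields of the document are mapped to TITLE. This is not a mistake: any number of fields can be mapped to TITLE or CONTENT, and their contents are concatenated for clustering.
Logical units can also be filled with generated content, for example by applying highlighting to a document field. This can greatly reduce the amount of text fed to the clustering algorithm (better performance) and also makes the clustered content more relevant to the query (better clusters). The REST API described below shows the details of field mapping.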
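To make the mapping idea concrete, here is a minimal sketch of how field source specifications could be resolved against a single hit and joined per logical field. It is not the plugin's implementation: the function name resolveLogicalFields, the sample hit and the single-space separator are assumptions made purely for illustration.

```javascript
// Illustrative only: resolves "fields.x", "_source.x" and "highlight.x"
// specifications against one search hit and joins the values per logical field.
function resolveLogicalFields(hit, fieldMapping) {
  const logical = {};
  for (const [logicalName, sources] of Object.entries(fieldMapping)) {
    const parts = [];
    for (const spec of sources) {
      const dot = spec.indexOf(".");
      if (dot < 0) continue;                // not a "section.name" specification
      const section = spec.slice(0, dot);   // "fields", "_source" or "highlight"
      const name = spec.slice(dot + 1);     // field name inside that section
      const value = (hit[section] || {})[name];
      if (value === undefined || value === null) continue;
      // Stored fields and highlights arrive as arrays, _source values may be scalars.
      parts.push(...(Array.isArray(value) ? value : [value]));
    }
    // Separator is an assumption; the plugin may join values differently.
    logical[logicalName] = parts.join(" ");
  }
  return logical;
}

// Hypothetical hit with two fields that both feed the logical title.
const hit = {
  _id: "1",
  _source: { subject: "Data mining", heading: "An overview", abstract: "Short text." },
  highlight: { body: ["... mining of data ..."] }
};

const logical = resolveLogicalFields(hit, {
  title: ["_source.subject", "_source.heading"],
  content: ["_source.abstract", "highlight.body"]
});
console.log(logical);
```

Mapping two sources to title yields one concatenated title string, which mirrors the situation in the figure.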

Java API


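The Java API for clustering search results is fully functional and is what the REST requests described below are built on. See the plugin's source code on GitHub, in particular the unit tests and integration tests.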

HTTP (REST) API


The HTTP REST API consists of several methods that mirror the functionality of the Java API. They are described in detail below.

Listing Available Algorithms

  • /_algorithms (GET or POST)

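This operation lists all available clustering algorithms. The returned identifiers can be used as a parameter of a clustering request.
Request
A simple GET or POST request to the /_algorithms URL.
Response
The response is a JSON object with an algorithms property that holds a list of algorithm identifiers. The example below shows the algorithms available in the plugin installation used for these examples. The default algorithm is the first one in the returned list.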

$.get("/_algorithms", function(response) {
    $("#list-of-algorithms").text(
      response.algorithms.join("\n"));
});

Output

lingo
stc
kmeans
byurl
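Since the first identifier in the list is the default, client code may want to fall back to it when a preferred algorithm is not installed. The helper below is a small sketch of that idea. The name chooseAlgorithm is hypothetical, and the response object is simply the list shown above written out by hand.

```javascript
// Returns `preferred` when the server offers it, otherwise the first
// (default) identifier from an /_algorithms response.
function chooseAlgorithm(response, preferred) {
  const available = response.algorithms || [];
  if (available.length === 0) {
    throw new Error("no clustering algorithms reported");
  }
  return available.includes(preferred) ? preferred : available[0];
}

// Response shape written out from the example output above.
const algorithmsResponse = { algorithms: ["lingo", "stc", "kmeans", "byurl"] };

console.log(chooseAlgorithm(algorithmsResponse, "stc"));
console.log(chooseAlgorithm(algorithmsResponse, "lingo3g"));
```

With this list, asking for stc returns stc, while asking for an identifier that is not installed falls back to lingo.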

Searching and Clustering Results

  • /_search_with_clusters (POST, GET)
  • /{index}/_search_with_clusters (POST, GET)
  • /{index}/{type}/_search_with_clusters (POST, GET)

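This operation runs a search query, fetches the matching hits and clusters them.
The index and type URI segments implicitly bind the search request to a given index and document type, as in the search request API.
A clustering request is an HTTP REST request in which the full set of parameters is passed in the JSON body of an HTTP POST request. A subset of the clustering functionality is also available through HTTP GET.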

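Request (HTTP POST)
The HTTP POST request should contain a JSON object. The examples in this article use the properties search_request, query_hint, field_mapping, algorithm, max_hits and attributes. The field_mapping property maps each logical field to an array of field source specifications, for example: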

{
  "url":      ["_source.urlSource"],
  "title":    ["fields.subject"],
  "content":  ["_source.abstract", "highlight.main"],
  "language": ["fields.lang"]
}

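A field source specification defines where a value is taken from: the fields of a search hit, the stored document source, or the highlighter output. The corresponding prefixes, as used throughout the examples, are fields.{name}, _source.{name} and highlight.{name}.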

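Note
Clustering needs a reasonable number of documents to produce meaningful results. The plugin clusters only the hits returned by the query (it does not look into the index, nor does it fetch additional documents). Make sure the fetch window size is at least 100. If the response does not need that many hits, they can be trimmed with the max_hits parameter of the clustering request.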
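One way to follow this advice consistently is to build request bodies through a small helper that never lets the fetch window drop below 100. The sketch below only constructs the JSON body and sends nothing. The helper name buildClusteringRequest and its options object are assumptions, while the property names are those used in the examples in this article.

```javascript
const MIN_WINDOW = 100; // fetch window suggested in the note above

// Builds a clustering request body; `options.size` is raised to MIN_WINDOW
// if smaller, and `options.maxHits` (if given) becomes max_hits.
function buildClusteringRequest(queryText, fieldMapping, options = {}) {
  const request = {
    search_request: {
      query: { match: { _all: queryText } },
      size: Math.max(options.size || MIN_WINDOW, MIN_WINDOW)
    },
    query_hint: queryText,
    field_mapping: fieldMapping
  };
  if (options.maxHits !== undefined) {
    request.max_hits = options.maxHits;
  }
  return request;
}

// Cluster 100 hits but return none of them in the response.
const labelsOnly = buildClusteringRequest(
  "data mining",
  { title: ["_source.title"], content: ["_source.content"] },
  { size: 10, maxHits: 0 }
);
console.log(JSON.stringify(labelsOnly, null, "  "));
```

Here a requested size of 10 is silently raised to 100, and max_hits of 0 keeps the individual hits out of the response.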

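Request (HTTP GET)
The HTTP GET clustering request supports a superset of the HTTP URI parameters defined for Elasticsearch's URI search request. The additional parameters correspond to the properties normally defined in the body of a clustering POST request. The example below uses field_mapping_title and field_mapping_content.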

Important
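An HTTP GET request offers only a subset of the functionality of the full HTTP POST request. For example, you cannot map a logical field to highlighted field values or define custom algorithm attributes. Using HTTP POST is recommended.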

Below is an example of a clustering request made with HTTP GET.

var getUrl = "/test/test/_search_with_clusters?"
  + "q=data+mining&"
  + "size=100&"
  + "field_mapping_title=_source.title&"
  + "field_mapping_content=_source.content";
 
// Run HTTP GET via jquery and render cluster labels.
$.get(getUrl,
  function(response) {
    $("#cluster-httpget-result").text(
      dumpClusters([], response.clusters).join("\n"));
});

Output

Knowledge Discovery [13 documents]
Data Mining Process [12 documents]
Data Mining Applications [11 documents]
Data Mining Tools [9 documents]
Data Mining Conference [8 documents]
Data Mining Solutions [8 documents]
Data Mining Research [7 documents]
Data Mining Technology [7 documents]
Text Mining [7 documents]
Book on Data Mining [5 documents]
Predictive Modeling [5 documents]
Introduction to Data Mining [4 documents]
Machine Learning [4 documents]
Oracle Data Mining [4 documents]
Analysis Techniques [3 documents]
Association [3 documents]
Data Mining Consulting [3 documents]
Data Warehousing [3 documents]
People [3 documents]
Assist Management [2 documents]
Case Studies [2 documents]
Data Mining Institute [2 documents]
Data Mining Project [2 documents]
Data-mining Software [2 documents]
Downloads [2 documents]
Encyclopedia [2 documents]
Microsoft SQL Server [2 documents]
SIAM International Conference on Data Mining [2 documents]
Other Topics [8 documents]
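The GET URL in the example above is assembled by string concatenation, which is easy to get wrong once query text needs escaping. As a hedged alternative, the sketch below builds the same URL with the standard URLSearchParams class. The function name buildClusterGetUrl is hypothetical, and it assumes one field source per logical field, as in the example.

```javascript
// Builds a /{index}/{type}/_search_with_clusters GET URL.
// `mapping` maps a logical field name to a single field source string.
function buildClusterGetUrl(index, type, query, size, mapping) {
  const params = new URLSearchParams();
  params.set("q", query);
  params.set("size", String(size));
  for (const [logicalField, source] of Object.entries(mapping)) {
    params.set("field_mapping_" + logicalField, source);
  }
  return "/" + encodeURIComponent(index)
    + "/" + encodeURIComponent(type)
    + "/_search_with_clusters?" + params.toString();
}

const builtUrl = buildClusterGetUrl("test", "test", "data mining", 100, {
  title: "_source.title",
  content: "_source.content"
});
console.log(builtUrl);
```

URLSearchParams encodes the space in the query as a plus sign, so the result matches the hand-written URL in the example.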

Response
The response has essentially the same format as the response to a regular search request, with a few additional properties.

{
  /* Typical search response fields. */
  "hits": { /* ... */ },
 
  /* Clustering response fields. */
  "clusters": [
    /* Each cluster is defined by the following. */
    {
      "id":    /* identifier */,
      "score": /* numeric score */,
      "label": /* primary cluster label */,
      "other_topics": /* if present, and true, this cluster groups
                         unrelated documents (no related topics) */,
      "phrases": [
        /* cluster label array, will include primary. */
      ],
      "documents": [
        /* This cluster's document ID references.
           May be undefined if this cluster holds sub-clusters only. */
      ],
      "clusters": [
        /* This cluster's subclusters (recursive objects of the same
           structure). May be undefined if this cluster holds documents only. */
      ],
    },
    /* ...more clusters */
  ],
  "info": {
    /* Additional information about the clustering: execution times,
       the algorithm used, etc. */
  }
}
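Given this structure, two things client code often needs are the set of distinct document IDs covered by the clusters and a way to separate the catch-all group from the topical ones. The sketch below shows one possible way to do both. The function names are hypothetical, and the sample clusters are copied from one of the responses shown later in this article.

```javascript
// Recursively gathers the distinct document IDs referenced by clusters.
function collectDocumentIds(clusters, seen = new Set()) {
  for (const cluster of clusters || []) {
    for (const id of cluster.documents || []) {
      seen.add(id);
    }
    collectDocumentIds(cluster.clusters, seen); // subclusters may be undefined
  }
  return seen;
}

// Splits top-level clusters into topical groups and "other topics" groups.
function splitOtherTopics(clusters) {
  return {
    topical: clusters.filter(c => c.other_topics !== true),
    other: clusters.filter(c => c.other_topics === true)
  };
}

const sampleClusters = [
  { id: 0, label: "Data Mining and Knowledge Discovery", documents: ["62", "3"] },
  { id: 1, label: "Other Topics", other_topics: true, documents: ["51"] }
];

const ids = collectDocumentIds(sampleClusters);
const split = splitOtherTopics(sampleClusters);
console.log(Array.from(ids), split.topical.length, split.other.length);
```

Because a document may appear in more than one cluster, a Set is used so that each ID is counted once.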

The following function extracts cluster labels recursively:

window.dumpClusters = function(arr, clusters, indent) {
  indent = indent ? indent : "";
  clusters.forEach(function(cluster) {
    arr.push(
        indent + cluster.label
        + (cluster.documents ? " [" + cluster.documents.length + " documents]"   : "")
        + (cluster.clusters  ? " [" + cluster.clusters.length  + " subclusters]" : ""));
    if (cluster.clusters) {
      dumpClusters(arr, cluster.clusters, indent + "  ");
    }
  });
  return arr;
}

The following JavaScript uses it to list all cluster labels recursively:

var request = {
  "search_request": {
    "query": {"match" : { "_all": "data mining" }},
    "size": 100
  },
 
  "max_hits": 0,
  "query_hint": "data mining",
  "field_mapping": {
    "title": ["_source.title"],
    "content": ["_source.content"]
  }
};
 
$.post("/test/test/_search_with_clusters",
  JSON.stringify(request),
  function(response) {
    $("#cluster-list-result").text(
      dumpClusters([], response.clusters).join("\n"));
});

Output

Knowledge Discovery [13 documents]
Data Mining Process [12 documents]
Data Mining Applications [11 documents]
Data Mining Tools [9 documents]
Data Mining Conference [8 documents]
Data Mining Solutions [8 documents]
Data Mining Research [7 documents]
Data Mining Technology [7 documents]
Text Mining [7 documents]
Book on Data Mining [5 documents]
Predictive Modeling [5 documents]
Introduction to Data Mining [4 documents]
Machine Learning [4 documents]
Oracle Data Mining [4 documents]
Analysis Techniques [3 documents]
Association [3 documents]
Data Mining Consulting [3 documents]
Data Warehousing [3 documents]
People [3 documents]
Assist Management [2 documents]
Case Studies [2 documents]
Data Mining Institute [2 documents]
Data Mining Project [2 documents]
Data-mining Software [2 documents]
Downloads [2 documents]
Encyclopedia [2 documents]
Microsoft SQL Server [2 documents]
SIAM International Conference on Data Mining [2 documents]
Other Topics [8 documents]

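The output depends on the clustering algorithm used. The next example clusters on the logical url field with the byurl algorithm. We do not need the individual search hits here, so they are left out of the response (max_hits is 0).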

var request = {
  "search_request": {
    "query": {"match" : { "_all": "data mining" }},
    "size": 100
  },
 
  "max_hits": 0,
  "query_hint": "data mining",
  "field_mapping": {
    "url": ["_source.url"]
  },
  "algorithm": "byurl"
};
 
$.post("/test/test/_search_with_clusters",
  JSON.stringify(request), function(response) {
    $("#cluster-list-result2").text(
      dumpClusters([], response.clusters).join("\n"));
});

Output

com [13 subclusters]
  microsoft.com [2 subclusters]
    research.microsoft.com [2 documents]
    Other Sites [2 documents]
  yahoo.com [2 subclusters]
    answers.yahoo.com [2 documents]
    Other Sites [2 documents]
  databases.about.com [2 documents]
  datamining.typepad.com [2 documents]
  dataminingconsultant.com [2 documents]
  dmreview.com [2 documents]
  oracle.com [2 documents]
  spss.com [2 documents]
  statsoft.com [2 documents]
  the-data-mine.com [2 documents]
  thearling.com [2 documents]
  twocrows.com [2 documents]
  Other Sites [32 documents]
org [3 subclusters]
  en.wikipedia.org [2 documents]
  siam.org [2 documents]
  Other Sites [9 documents]
edu [2 subclusters]
  ccsu.edu [2 documents]
  Other Sites [10 documents]
ca [2 documents]
gov [2 documents]
net [2 documents]
Other Sites [2 documents]
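The shape of this output (top-level domain first, then site) can be approximated in a few lines. The sketch below is a deliberately simplified stand-in and not Carrot2's byurl algorithm: it groups URLs by the last host label and then by the last two labels. The function name groupByHost is hypothetical, and the URLs are taken from search hits shown elsewhere in this article.

```javascript
// Simplified illustration: Map(tld -> Map(domain -> [urls])).
// Real URL clustering handles many more cases (multi-part TLDs, "Other Sites").
function groupByHost(urls) {
  const tree = new Map();
  for (const url of urls) {
    const labels = new URL(url).hostname.split(".");
    const tld = labels[labels.length - 1];
    const domain = labels.slice(-2).join(".");
    if (!tree.has(tld)) tree.set(tld, new Map());
    const domains = tree.get(tld);
    if (!domains.has(domain)) domains.set(domain, []);
    domains.get(domain).push(url);
  }
  return tree;
}

const grouped = groupByHost([
  "http://www.springer.com/computer/database+management+&+information+retrieval/journal/10618",
  "http://en.wikipedia.org/wiki/Data-mining",
  "http://www.webopedia.com/TERM/D/data_mining.html"
]);

for (const [tld, domains] of grouped) {
  console.log(tld + " [" + domains.size + " sites]");
  for (const [domain, members] of domains) {
    console.log("  " + domain + " [" + members.length + " documents]");
  }
}
```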

Below is a request that returns the full response, so that the differences can be compared:

var request = {
  "search_request": {
    "fields": [ "title", "content" ],
    "query": {"match" : { "_all": "data mining" }},
    "size": 100
  },
 
  "query_hint": "data mining",
  "field_mapping": {
    "title": ["fields.title"],
    "content": ["fields.content"]
  }
};
 
$.post("/test/test/_search_with_clusters",
  JSON.stringify(request),
  function(response) {
    $("#simple-request-result").text(
      JSON.stringify(response, null, "  "));
});

Output

{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 93,
    "max_score": 1.1545734,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "6",
        "_score": 1.1545734,
        "fields": {
          "content": [
            "... complete data mining customer ... Data mining applications, on the other hand, embed ... it, our daily lives are influenced by data mining applications. ..."
          ],
          "title": [
            "Data Mining Software, Data Mining Applications and Data Mining Solutions"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "44",
        "_score": 1.1312462,
        "fields": {
          "content": [
            "Data mining terms concisely defined. ... Accuracy is an important factor in assessing the success of data mining. ... data mining ..."
          ],
          "title": [
            "Two Crows: Data mining glossary"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "55",
        "_score": 1.1312462,
        "fields": {
          "content": [
            ""
          ],
          "title": [
            "data mining institute"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "84",
        "_score": 1.0323554,
        "fields": {
          "content": [
            "... Walmart, Fundraising Data Mining, Data Mining Activities, Web-based Data Mining, ... in many industries makes us the best choice for your data mining needs. ..."
          ],
          "title": [
            "Data Mining, Data Mining Process, Data Mining Techniques, Outsourcing Mining Data Services"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "35",
        "_score": 1.0323384,
        "fields": {
          "content": [
            "... Sapphire-a semiautomated, flexible data-mining software infrastructure. ... Data mining is not a new field. ... scale, scientific data-mining efforts such ..."
          ],
          "title": [
            "Data Mining"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "18",
        "_score": 0.9796879,
        "fields": {
          "content": [
            "... high performance networking, internet computing, data mining and related areas. ... Peter Stengard, Oracle Data Mining Technologies. prudsys AG, Chemnitz, ..."
          ],
          "title": [
            "Data Mining Group - DMG"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "22",
        "_score": 0.96633387,
        "fields": {
          "content": [
            "Using data mining functionality embedded in ... Oracle Data Mining JDeveloper and SQL Developer ... Oracle Magazine: Using the Oracle Data Mining API ..."
          ],
          "title": [
            "Oracle Data Mining"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "39",
        "_score": 0.96633387,
        "fields": {
          "content": [
            "Some example application areas are listed under Applications Of Data Mining ... Crows Introduction - \"Introduction to Data Mining and Knowledge Discovery\"- http: ..."
          ],
          "title": [
            "Data Mining - Introduction To Data Mining (Misc)"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "66",
        "_score": 0.96318483,
        "fields": {
          "content": [
            "... business intelligence, data warehousing, data mining, CRM, analytics, ... M2007 Data Mining Conference Hitting 10th Year and Going Strong ..."
          ],
          "title": [
            "Data Mining"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "2",
        "_score": 0.96179265,
        "fields": {
          "content": [
            "Newsletter on the data mining and knowledge industries, offering information on data mining, knowledge discovery, text mining, and web mining software, courses, jobs, publications, and meetings."
          ],
          "title": [
            "KDnuggets: Data Mining, Web Mining, and Knowledge Discovery"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "7",
        "_score": 0.96179265,
        "fields": {
          "content": [
            "Commentary on text mining, data mining, social media and data visualization. ... Opinion Mining Startups ... in sentiment mining, deriving tuples of ..."
          ],
          "title": [
            "Data Mining: Text Mining, Visualization and Social Media"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "71",
        "_score": 0.9561301,
        "fields": {
          "content": [
            "Data Mining is the automated extraction of hidden predictive information from databases. ... The data mining tools can make this leap. ..."
          ],
          "title": [
            "Data Mining | NetworkDictionary"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "65",
        "_score": 0.948279,
        "fields": {
          "content": [
            "... Website for Data Mining Methods and ... data mining at Central Connecticut State University, he ... also provides data mining consulting and statistical ..."
          ],
          "title": [
            "DataMiningConsultant.com"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "36",
        "_score": 0.9427052,
        "fields": {
          "content": [
            "SQL Server Data Mining Portal ... information about our exciting data mining features. ... CTP of Microsoft SQL Server 2008 Data Mining Add-Ins for Office 2007 ..."
          ],
          "title": [
            "SQL Server Data Mining > Home"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "14",
        "_score": 0.9127037,
        "fields": {
          "content": [
            "From data mining tutorials to data warehousing techniques, you will find it all! ... Administration Design Development Data Mining Database Training Careers Reviews ..."
          ],
          "title": [
            "Data Mining and Data Warehousing"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "3",
        "_score": 0.9124819,
        "fields": {
          "content": [
            "Data mining is considered a subfield within the Computer Science field of knowledge discovery. ... claim to perform \"data mining\" by automating the creation ..."
          ],
          "title": [
            "Data mining - Wikipedia, the free encyclopedia"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "15",
        "_score": 0.9124819,
        "fields": {
          "content": [
            "Oracle Data Mining Product Center ... Using data mining functionality embedded in Oracle Database 10g, you can find ... Mining High-Dimensional Data for ..."
          ],
          "title": [
            "Oracle Data Mining"
          ]
        }
      }
...
    ]
  },
  "clusters": [
    {
      "id": 0,
      "score": 71.07595656014654,
      "label": "Knowledge Discovery",
      "phrases": [
        "Knowledge Discovery"
      ],
      "documents": [
        "39",
        "2",
        "3",
        "5",
        "61",
        "25",
        "17",
        "34",
        "4",
        "9",
        "43",
        "62",
        "74"
      ]
    },
    {
      "id": 1,
      "score": 66.46874714157775,
      "label": "Data Mining Process",
      "phrases": [
        "Data Mining Process"
      ],
      "documents": [
        "84",
        "13",
        "63",
        "67",
        "86",
        "34",
        "77",
        "83",
        "8",
        "54",
        "4",
        "87"
      ]
    },
    {
      "id": 2,
      "score": 71.44252901633597,
      "label": "Data Mining Applications",
      "phrases": [
        "Data Mining Applications"
      ],
      "documents": [
        "6",
        "39",
        "85",
        "82",
        "33",
        "76",
        "41",
        "60",
        "43",
        "16",
        "87"
      ]
    },
    {
      "id": 3,
      "score": 81.34385135697781,
      "label": "Data Mining Tools",
      "phrases": [
        "Data Mining Tools"
      ],
      "documents": [
        "71",
        "23",
        "32",
        "56",
        "86",
        "52",
        "77",
        "74",
        "79"
      ]
    },
    {
      "id": 4,
      "score": 49.66400793807237,
      "label": "Data Mining Conference",
      "phrases": [
        "Data Mining Conference"
      ],
      "documents": [
        "66",
        "85",
        "50",
        "33",
        "60",
        "46",
        "29",
        "57"
      ]
    },
    {
      "id": 5,
      "score": 64.44592124795795,
      "label": "Data Mining Solutions",
      "phrases": [
        "Data Mining Solutions"
      ],
      "documents": [
        "6",
        "28",
        "37",
        "77",
        "42",
        "54",
        "89",
        "53"
      ]
    },
...
  ],
  "info": {
    "algorithm": "lingo",
    "search-millis": "12",
    "clustering-millis": "296",
    "total-millis": "309",
    "include-hits": "true",
    "max-hits": ""
  }
}
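Clusters refer to documents only by ID, so showing something readable usually means joining them back to the hits. The sketch below does this for titles. The function name titlesByCluster is hypothetical, and the sample object is a heavily trimmed copy of the response above. IDs that are not among the returned hits are skipped.

```javascript
// Maps each cluster label to the titles of its documents found in the hits.
function titlesByCluster(response) {
  const titleById = new Map();
  for (const hit of response.hits.hits) {
    const title = hit.fields && hit.fields.title ? hit.fields.title[0] : "(no title)";
    titleById.set(hit._id, title);
  }
  return response.clusters.map(cluster => ({
    label: cluster.label,
    titles: (cluster.documents || [])
      .filter(id => titleById.has(id))
      .map(id => titleById.get(id))
  }));
}

// Trimmed sample: three hits and two clusters from the output above.
const trimmedResponse = {
  hits: { hits: [
    { _id: "6", fields: { title: ["Data Mining Software, Data Mining Applications and Data Mining Solutions"] } },
    { _id: "39", fields: { title: ["Data Mining - Introduction To Data Mining (Misc)"] } },
    { _id: "2", fields: { title: ["KDnuggets: Data Mining, Web Mining, and Knowledge Discovery"] } }
  ] },
  clusters: [
    { id: 0, label: "Knowledge Discovery", documents: ["39", "2", "3"] },
    { id: 2, label: "Data Mining Applications", documents: ["6", "39", "85"] }
  ]
};

const joined = titlesByCluster(trimmedResponse);
joined.forEach(entry => console.log(entry.label + ": " + entry.titles.join(" | ")));
```

Document 39 appears under both labels, which illustrates that clusters may overlap.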

Field Mapping in Depth


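Field mapping connects the actual data to the logical data used for clustering. The different field mapping sources (_source.*, highlight.*, fields.*) can be used to tune the amount of data returned in the response and the amount of text passed to the clustering engine (which ultimately affects the processing cost).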

Comparing the output of the following two requests shows the difference.
Code 1

var request = {
  "search_request": {
    "fields": ["url", "title", "content"],
    "query": {"match" : { "_all": "computer" }},
    "size": 100
  },
 
  "query_hint": "computer",
  "field_mapping": {
    "url":     ["fields.url"],
    "title":   ["fields.title"],
    "content": ["fields.content"]
  }
};
 
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
  $("#fields-request").text(JSON.stringify(response, null, "  "));
});

Code 2

var request = {
  "search_request": {
    "fields": ["url", "title"],
    "query": {"match" : { "_all": "computer" }},
    "size": 100,
    "highlight" : {
      "pre_tags" :  ["", ""],
      "post_tags" : ["", ""],
      "fields" : {
        "content" : { "fragment_size" : 100, "number_of_fragments" : 2 }
      }
    },
  },
 
  "query_hint": "computer",
  "field_mapping": {
    "url":     ["fields.url"],
    "title":   ["fields.title"],
    "content": ["highlight.content"]
  }
};
 
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
  $("#highlight-request").text(JSON.stringify(response, null, "  "));
});

Output 1

{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.685061,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "62",
        "_score": 0.685061,
        "fields": {
          "content": [
            "Technical journal focused on the theory, techniques, and practice for extracting information from large databases."
          ],
          "title": [
            "Data Mining and Knowledge Discovery - Data Mining and Knowledge Discovery Journals, Books & Online Media | Springer"
          ],
          "url": [
            "http://www.springer.com/computer/database+management+&+information+retrieval/journal/10618"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "3",
        "_score": 0.68239,
        "fields": {
          "content": [
            "Data mining is considered a subfield within the Computer Science field of knowledge discovery. ... claim to perform \"data mining\" by automating the creation ..."
          ],
          "title": [
            "Data mining - Wikipedia, the free encyclopedia"
          ],
          "url": [
            "http://en.wikipedia.org/wiki/Data-mining"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "51",
        "_score": 0.5480488,
        "fields": {
          "content": [
            "This page describes the term data mining and lists other pages on the Web where you can find additional information. ... Data Mining and Analytic Technologies ..."
          ],
          "title": [
            "What is data mining? - A Word Definition From the Webopedia Computer Dictionary"
          ],
          "url": [
            "http://www.webopedia.com/TERM/D/data_mining.html"
          ]
        }
      }
    ]
  },
  "clusters": [
    {
      "id": 0,
      "score": 0.18077730227849886,
      "label": "Data Mining and Knowledge Discovery",
      "phrases": [
        "Data Mining and Knowledge Discovery"
      ],
      "documents": [
        "62",
        "3"
      ]
    },
    {
      "id": 1,
      "score": 0,
      "label": "Other Topics",
      "phrases": [
        "Other Topics"
      ],
      "other_topics": true,
      "documents": [
        "51"
      ]
    }
  ],
  "info": {
    "algorithm": "lingo",
    "search-millis": "22",
    "clustering-millis": "25",
    "total-millis": "47",
    "include-hits": "true",
    "max-hits": ""
  }
}

Output 2

{
  "took": 305,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.685061,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "62",
        "_score": 0.685061,
        "fields": {
          "title": [
            "Data Mining and Knowledge Discovery - Data Mining and Knowledge Discovery Journals, Books & Online Media | Springer"
          ],
          "url": [
            "http://www.springer.com/computer/database+management+&+information+retrieval/journal/10618"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "3",
        "_score": 0.68239,
        "fields": {
          "title": [
            "Data mining - Wikipedia, the free encyclopedia"
          ],
          "url": [
            "http://en.wikipedia.org/wiki/Data-mining"
          ]
        },
        "highlight": {
          "content": [
            "Data mining is considered a subfield within the Computer Science field of knowledge discovery"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "51",
        "_score": 0.5480488,
        "fields": {
          "title": [
            "What is data mining? - A Word Definition From the Webopedia Computer Dictionary"
          ],
          "url": [
            "http://www.webopedia.com/TERM/D/data_mining.html"
          ]
        }
      }
    ]
  },
  "clusters": [
    {
      "id": 0,
      "score": 0.1807764758253202,
      "label": "Data Mining and Knowledge Discovery",
      "phrases": [
        "Data Mining and Knowledge Discovery"
      ],
      "documents": [
        "62",
        "3"
      ]
    },
    {
      "id": 1,
      "score": 0,
      "label": "Other Topics",
      "phrases": [
        "Other Topics"
      ],
      "other_topics": true,
      "documents": [
        "51"
      ]
    }
  ],
  "info": {
    "algorithm": "lingo",
    "search-millis": "305",
    "clustering-millis": "11",
    "total-millis": "317",
    "include-hits": "true",
    "max-hits": ""
  }
}
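The only differences between the two requests are the highlight section, the removal of content from the fetched fields and the content mapping. A sketch of a helper that derives the second form from the first is shown below. The function name useHighlightedContent is hypothetical, and it only transforms the request object without sending it.

```javascript
// Returns a copy of `request` in which the logical content is taken from
// highlighter fragments of `field` instead of the full stored field.
function useHighlightedContent(request, field, fragmentSize, fragmentCount) {
  const copy = JSON.parse(JSON.stringify(request)); // deep copy, input untouched
  const search = copy.search_request;
  if (Array.isArray(search.fields)) {
    search.fields = search.fields.filter(name => name !== field);
  }
  search.highlight = {
    pre_tags: ["", ""],   // empty tags: no markup inside the clustered text
    post_tags: ["", ""],
    fields: {
      [field]: { fragment_size: fragmentSize, number_of_fragments: fragmentCount }
    }
  };
  copy.field_mapping.content = ["highlight." + field];
  return copy;
}

const fullContentRequest = {
  search_request: {
    fields: ["url", "title", "content"],
    query: { match: { _all: "computer" } },
    size: 100
  },
  query_hint: "computer",
  field_mapping: {
    url: ["fields.url"],
    title: ["fields.title"],
    content: ["fields.content"]
  }
};

const highlightRequest = useHighlightedContent(fullContentRequest, "content", 100, 2);
console.log(JSON.stringify(highlightRequest, null, "  "));
```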

Choosing an Algorithm


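The clustering plugin ships with several open source algorithms from the Carrot2 project and can also use the commercial Lingo3G clustering algorithm.
Which algorithm to choose depends on the expected traffic (STC is faster than Lingo, but its results tend to be worse; Lingo3G is faster still, but it is not free or open source), on the desired result (Lingo3G produces hierarchical clusters, while Lingo and STC produce flat clusters) and on the input data (each algorithm clusters slightly differently). There is no single correct answer to this question.
The following examples show the effect of choosing different algorithms.
Lingo algorithm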

var request = {
  "search_request": {
    "query": {"match" : { "_all": "data mining" }},
    "size": 100
  },
 
  "query_hint": "data mining",
  "field_mapping": {
    "title":   ["_source.title"],
    "content": ["_source.content"]
  },
  "algorithm": "lingo"
};
 
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
  $("#request-algorithm1").text(dumpClusters([], response.clusters).join("\n"));
});

STC algorithm

var request = {
  "search_request": {
    "query": {"match" : { "_all": "data mining" }},
    "size": 100
  },
 
  "query_hint": "data mining",
  "field_mapping": {
    "title":   ["_source.title"],
    "content": ["_source.content"]
  },
  "algorithm": "stc"
};
 
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
  $("#request-algorithm2").text(dumpClusters([], response.clusters).join("\n"));
});

Lingo algorithm output

Knowledge Discovery [13 documents]
Data Mining Process [12 documents]
Data Mining Applications [11 documents]
Data Mining Tools [9 documents]
Data Mining Conference [8 documents]
Data Mining Solutions [8 documents]
Data Mining Research [7 documents]
Data Mining Technology [7 documents]
Text Mining [7 documents]
Book on Data Mining [5 documents]
Predictive Modeling [5 documents]
Introduction to Data Mining [4 documents]
Machine Learning [4 documents]
Oracle Data Mining [4 documents]
Analysis Techniques [3 documents]
Association [3 documents]
Data Mining Consulting [3 documents]
Data Warehousing [3 documents]
People [3 documents]
Assist Management [2 documents]
Case Studies [2 documents]
Data Mining Institute [2 documents]
Data Mining Project [2 documents]
Data-mining Software [2 documents]
Downloads [2 documents]
Encyclopedia [2 documents]
Microsoft SQL Server [2 documents]
SIAM International Conference on Data Mining [2 documents]
Other Topics [8 documents]

STC algorithm output

Knowledge Discovery [19 documents]
Data Mining Tools [9 documents]
Data Mining Solutions [7 documents]
Data Mining and Knowledge [5 documents]
Machine Learning [5 documents]
Text Mining [7 documents]
SQL, Microsoft SQL Server [4 documents]
Software [13 documents]
Process [11 documents]
Applications [10 documents]
Modeling [10 documents]
Predictive [9 documents]
Techniques [9 documents]
Databases [8 documents]
Developing [8 documents]
Other Topics [26 documents]
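When comparing algorithms it can help to see which labels both of them produce and how the group sizes differ. The sketch below parses label lines in the format printed above and lists the shared labels. The function names are hypothetical, and each sample contains only five lines taken from the two outputs above.

```javascript
// Parses lines of the form "Label [N documents]" into a Map(label -> N).
function parseClusterLines(text) {
  const sizes = new Map();
  for (const line of text.split("\n")) {
    const match = /^(.*) \[(\d+) documents\]$/.exec(line.trim());
    if (match) sizes.set(match[1], Number(match[2]));
  }
  return sizes;
}

// Lists labels present in both maps together with their sizes.
function sharedLabels(first, second) {
  const shared = [];
  for (const [label, size] of first) {
    if (second.has(label)) {
      shared.push({ label: label, first: size, second: second.get(label) });
    }
  }
  return shared;
}

const lingoLabels = parseClusterLines([
  "Knowledge Discovery [13 documents]",
  "Data Mining Process [12 documents]",
  "Data Mining Tools [9 documents]",
  "Text Mining [7 documents]",
  "Other Topics [8 documents]"
].join("\n"));

const stcLabels = parseClusterLines([
  "Knowledge Discovery [19 documents]",
  "Data Mining Tools [9 documents]",
  "Text Mining [7 documents]",
  "Software [13 documents]",
  "Other Topics [26 documents]"
].join("\n"));

const common = sharedLabels(lingoLabels, stcLabels);
common.forEach(c => console.log(c.label + ": lingo " + c.first + ", stc " + c.second));
```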

Overriding Algorithm Attributes


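The default algorithm suite contains empty stubs for the initialization attributes of each algorithm. These files are named {algorithm-name}-attributes.xml and are resolved through the current resources configuration setting (see Plugin Configuration).
For example, to override the default attributes for all requests to the lingo algorithm, create the file {es.home}/config/lingo-attributes.xml and put the overridden attributes there, as follows: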

<attribute-sets default="overridden-attributes">
  <attribute-set id="overridden-attributes">
    <value-set>
      <label>overridden-attributes</label>
 
      <attribute key="LingoClusteringAlgorithm.desiredClusterCountBase">
        <value type="java.lang.Integer" value="5"/>
      </attribute>
    </value-set>
  </attribute-set>
</attribute-sets>

Perhaps the most convenient approach is to export the XML configuration file directly from Carrot2 Workbench.

Overriding Algorithm Attributes at Runtime


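Each clustering algorithm has a number of parameters that change its behaviour (Carrot2 Workbench can be used to tune them). An attribute can be customized per request. In the example below, the desired number of clusters is picked at random (run it several times to see different results).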

var request = {
  "search_request": {
    "query": {"match" : { "_all": "data mining" }},
    "size": 100
  },
 
  "query_hint": "data mining",
  "field_mapping": {
    "title":   ["_source.title"],
    "content": ["_source.content"]
  },
  "algorithm": "lingo",
  "attributes": {
     "LingoClusteringAlgorithm.desiredClusterCountBase": Math.round(5 + Math.random() * 5)
  }
};
 
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
  $("#request-attributes").text(dumpClusters([], response.clusters).join("\n"));
});

Output 1

Knowledge Discovery [13 documents]
Data Mining Process [12 documents]
Data Mining Applications [11 documents]
Data Mining Technology [7 documents]
Microsoft SQL Server [2 documents]
Other Topics [54 documents]

Output 2

Knowledge Discovery [13 documents]
Data Mining Tools [9 documents]
Data Mining Conference [8 documents]
Data Mining Research [7 documents]
Text Mining [7 documents]
Predictive Modeling [5 documents]
Other Topics [50 documents]

Output 3

Knowledge Discovery [13 documents]
Data Mining Applications [11 documents]
Data Mining Techniques [9 documents]
Data Mining Tools [9 documents]
Text Mining [7 documents]
Association [3 documents]
Assist Management [2 documents]
Case Studies [2 documents]
Microsoft SQL Server [2 documents]
Other Topics [43 documents]
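For reproducible results you would normally pass a fixed value rather than a random one. The sketch below shows one way to layer per-request attribute overrides onto a base request without mutating it. The helper name withAttributes is hypothetical, while the attribute key is the one used in the example above.

```javascript
// Returns a shallow copy of `request` whose attributes are the existing
// attributes (if any) merged with `overrides`.
function withAttributes(request, overrides) {
  return Object.assign({}, request, {
    attributes: Object.assign({}, request.attributes, overrides)
  });
}

const baseRequest = {
  search_request: {
    query: { match: { _all: "data mining" } },
    size: 100
  },
  query_hint: "data mining",
  field_mapping: {
    title: ["_source.title"],
    content: ["_source.content"]
  },
  algorithm: "lingo"
};

// Same request, but asking Lingo for a fixed cluster count base of 7.
const tunedRequest = withAttributes(baseRequest, {
  "LingoClusteringAlgorithm.desiredClusterCountBase": 7
});
console.log(JSON.stringify(tunedRequest.attributes));
```

Keeping the base request untouched makes it easy to send several variants and compare their label lists.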

Multilingual Clustering


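The field mapping specification can include a language element, which defines the ISO 639-1 language code of the title and content. This information can be stored in the index based on prior knowledge (the source of the document, or a language detection filter applied at indexing time).
Algorithms in the Carrot2 framework accept the ISO language codes defined in its Language enum.
The language hint lets the clustering algorithm analyze document content better and pick the right language resources for clustering. If your results are multilingual (or in a language other than English), setting it properly is strongly recommended.
The example below applies a clustering algorithm to all documents. Some are in German (language code de) and some are in English (language code en). We additionally set the language aggregation strategy to FLATTEN_NONE, so that the top-level clusters indicate the language of the documents in their subclusters. Note the names of the top-level clusters in the output.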

var request = {
  "search_request": {
    "query": {"match_all" : {}},
    "size": 100
  },
 
  "query_hint": "bundestag",
  "field_mapping": {
    "title":    ["_source.title"],
    "content":  ["_source.content"],
    "language": ["_source.lang"]
  },
  "attributes": {
    "MultilingualClustering.languageAggregationStrategy": "FLATTEN_NONE"
  }
};
 
$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
  $("#language-fieldmapping").text(dumpClusters([], response.clusters).join("\n"));
});

Output

German [23 subclusters]
  Parlament [8 documents]
  Käfer im Bundestag in Berlin [5 documents]
  Mitglieder [5 documents]
  Seite [5 documents]
  MdB [4 documents]
  Bundestag Nachrichten [3 documents]
  Restaurant [3 documents]
  Tag [3 documents]
  Abs.1 [2 documents]
  Bundestagsfraktion Bündnis 90 die Grünen [2 documents]
  Informationssystem für Parlamentarische Vorgänge [2 documents]
  LINKE [2 documents]
  Mehr [2 documents]
  Nebeneinkünfte der Abgeordneten im Deutschen Bundestag [2 documents]
  Petitionen Unterstützen Facebook [2 documents]
  Reichstag Bundestag [2 documents]
  Schule [2 documents]
  Seite Lässt Dies Jedoch [2 documents]
  Susanne Wiest [2 documents]
  Tiergarten Telefon 030 22629933 Gerne weiter Empfehlen [2 documents]
  Virtuelle Wählergedächtnis [2 documents]
  Zentrale [2 documents]
  Other Topics [15 documents]
English [19 subclusters]
  Software [6 documents]
  Data Mining Process [5 documents]
  Conference [4 documents]
  Data Mining Techniques [4 documents]
  Knowledge Discovery [4 documents]
  Web Mining [3 documents]
  Analytic [2 documents]
  Association [2 documents]
  Business [2 documents]
  Data Mining Technology [2 documents]
  Data Warehousing [2 documents]
  Downloads [2 documents]
  Extraction of Hidden Predictive [2 documents]
  Oracle Data Mining [2 documents]
  Papers [2 documents]
  SIAM International Conference on Data Mining [2 documents]
  Visualization and Social Media [2 documents]
  Website for Data Mining Methods [2 documents]
  Other Topics [6 documents]
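Since the language hint only helps when documents actually carry a code, a quick check of the hits can be useful before relying on it. The sketch below flags hits whose language field is missing or does not look like a two-letter ISO 639-1 code. The function name, the lang field and the sample hits are assumptions, and the check looks only at the shape of the value, not at whether Carrot2 supports that language.

```javascript
// Two lowercase letters: the shape of an ISO 639-1 code (not a validity check).
const ISO_639_1_SHAPE = /^[a-z]{2}$/;

// Returns the IDs of hits whose _source[languageField] is absent or malformed.
function hitsWithoutLanguage(hits, languageField) {
  return hits
    .filter(hit => {
      const value = (hit._source || {})[languageField];
      return !ISO_639_1_SHAPE.test(String(value || ""));
    })
    .map(hit => hit._id);
}

// Hypothetical hits: two well-formed codes, one missing, one spelled out.
const sampleHits = [
  { _id: "1", _source: { title: "Bundestag Nachrichten", lang: "de" } },
  { _id: "2", _source: { title: "Data Mining Process", lang: "en" } },
  { _id: "3", _source: { title: "Untitled" } },
  { _id: "4", _source: { title: "Parlament", lang: "German" } }
];

const missingLanguage = hitsWithoutLanguage(sampleHits, "lang");
console.log(missingLanguage);
```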

Plugin Configuration


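The plugin comes with default settings that work out of the box. Change them only when really necessary.
The following configuration files and properties can be used to modify the plugin's configuration.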
{path.conf}/elasticsearch.yml,
{path.conf}/elasticsearch.json,
{path.conf}/elasticsearch.properties
The main ES configuration files can be used to enable or disable the plugin and to fine-tune the resources assigned to clustering requests.

{path.conf}/carrot2.yml,
{path.conf}/carrot2.json,
{path.conf}/carrot2.properties
Optional plugin-specific configuration files.
