IK Analyzer

2021-07-25 zjxchase

Installation and Configuration

Download

https://github.com/medcl/elasticsearch-analysis-ik/releases

Note: the version of the IK analyzer plugin must match the version of Elasticsearch exactly.
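The version rule can be checked before installing. The following is a minimal sketch, assuming the release archive keeps the naming pattern used in this article (elasticsearch-analysis-ik-7.10.2.zip); the helper names `plugin_version` and `versions_match` are hypothetical. The Elasticsearch version itself is reported in the `version.number` field of the response to `GET /`.

```python
import re


def plugin_version(zip_name: str) -> str:
    """Extract the version from a name like elasticsearch-analysis-ik-7.10.2.zip."""
    match = re.fullmatch(r"elasticsearch-analysis-ik-(\d+(?:\.\d+)*)\.zip", zip_name)
    if match is None:
        raise ValueError(f"unexpected plugin file name: {zip_name}")
    return match.group(1)


def versions_match(zip_name: str, es_version: str) -> bool:
    """True when the plugin archive targets exactly this Elasticsearch version."""
    return plugin_version(zip_name) == es_version


print(versions_match("elasticsearch-analysis-ik-7.10.2.zip", "7.10.2"))
```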

Installation

Copy the downloaded package elasticsearch-analysis-ik-7.10.2.zip into the plugins folder under the Elasticsearch root directory and unzip it there. Once it is unpacked, delete the zip file, rename the analyzer folder to ik, and restart Elasticsearch.
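The manual steps above could also be scripted. This is a sketch rather than an official installer: it assumes the archive keeps its files at the top level, and it extracts straight into plugins/ik instead of unpacking and renaming, so no zip file is left inside plugins. The function name `install_ik` and the paths in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def install_ik(zip_path: Path, es_home: Path) -> Path:
    """Unpack the plugin archive into <es_home>/plugins/ik and return that folder."""
    target = es_home / "plugins" / "ik"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

A call might look like `install_ik(Path("elasticsearch-analysis-ik-7.10.2.zip"), Path(r"D:\elasticsearch\elasticsearch-7.10.2"))`; Elasticsearch still has to be restarted afterwards.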

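Features

The IK analyzer provides two analysis modes:

Analyzer name	Description
ik_smart	Performs the coarsest-grained split; for example, “中华人民共和国国歌” is split into “中华人民共和国” and “国歌”. Suited to phrase queries.
ik_max_word	Performs the finest-grained split and exhausts the possible combinations; for example, “中华人民共和国国歌” is split into “中华人民共和国”, “中华人民”, “中华”, “华人”, “人民共和国”, “人民”, “人”, “民”, “共和国”, “共和”, “和”, “国”, “国歌”. Suited to term queries.

Example: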

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n22" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">GET _analyze
{
"analyzer": "ik_smart",
"text": "中华人民共和国"
}

Result:

{
"tokens" : [
{
"token" : "中华人民共和国",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
}
]
}
</pre>

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n23" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">GET _analyze
{
"analyzer": "ik_max_word",
"text": "中华人民共和国"
}

Result:

{
"tokens" : [
{
"token" : "中华人民共和国",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "中华人民",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "中华",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "华人",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "人民共和国",
"start_offset" : 2,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "人民",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "共和国",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "共和",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "国",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 8
}
]
}
</pre>
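To compare the two modes side by side, the token strings can be pulled out of the `_analyze` responses shown above. The sketch below only builds the request body and reads a response that is already available as a dictionary; sending the request is left out, and the helper names `analyze_body` and `token_texts` are hypothetical.

```python
import json
from typing import List


def analyze_body(analyzer: str, text: str) -> str:
    """JSON body for a GET _analyze request."""
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)


def token_texts(response: dict) -> List[str]:
    """Token strings of an _analyze response, ordered by position."""
    tokens = sorted(response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in tokens]


# Trimmed copy of the ik_smart response shown above.
smart = {"tokens": [{"token": "中华人民共和国", "position": 0}]}
print(analyze_body("ik_smart", "中华人民共和国"))
print(token_texts(smart))
```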

Customizing the Analyzer

Before configuring a custom dictionary, look at an example:

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n26" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">GET _analyze
{
"analyzer": "ik_smart",
"text": ["十三届全国人大三次会议表决通过了“民法典”，自2021年1月1日起施行。"]
}

Result:

{
"tokens" : [
{
"token" : "十",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "三届",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "全国人大",
"start_offset" : 3,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "三次",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "会议",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "表决",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "通过了",
"start_offset" : 13,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "民法典",
"start_offset" : 16,
"end_offset" : 19,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "自",
"start_offset" : 20,
"end_offset" : 21,
"type" : "CN_CHAR",
"position" : 8
},
{
"token" : "2021年",
"start_offset" : 21,
"end_offset" : 26,
"type" : "TYPE_CQUAN",
"position" : 9
},
{
"token" : "1月",
"start_offset" : 26,
"end_offset" : 28,
"type" : "TYPE_CQUAN",
"position" : 10
},
{
"token" : "1日",
"start_offset" : 28,
"end_offset" : 30,
"type" : "TYPE_CQUAN",
"position" : 11
},
{
"token" : "起",
"start_offset" : 30,
"end_offset" : 31,
"type" : "CN_CHAR",
"position" : 12
},
{
"token" : "施行",
"start_offset" : 31,
"end_offset" : 33,
"type" : "CN_WORD",
"position" : 13
}
]
}</pre>

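Creating a Custom Dictionary

In the config directory of the installed IK analyzer, create a folder named custom: D:\elasticsearch\elasticsearch-7.10.2\plugins\ik\config\custom. Inside custom, create mydic.dic (the custom dictionary) and ext_stopwork.dic (the stopword dictionary).

Add the following to mydic.dic: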

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n32" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">十三届全国人大</pre>
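Dictionary files are plain text with one entry per line, and the IK project asks for them to be saved as UTF-8. A minimal sketch for writing such a file follows; the helper name `write_dictionary` is hypothetical, and the path in the usage note is the one used in this article.

```python
from pathlib import Path
from typing import List


def write_dictionary(path: Path, words: List[str]) -> None:
    """Write one dictionary entry per line, encoded as UTF-8 without a BOM."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # write_bytes avoids platform-specific newline translation
    path.write_bytes("".join(f"{word}\n" for word in words).encode("utf-8"))
```

For example: `write_dictionary(Path(r"D:\elasticsearch\elasticsearch-7.10.2\plugins\ik\config\custom\mydic.dic"), ["十三届全国人大"])`.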

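Add the stopwords to ext_stopwork.dic. Judging from the second result further down, where the single characters 自 and 起 no longer appear, the file presumably contains at least those two entries.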

Configuring the Custom Dictionary

In IKAnalyzer.cfg.xml, located in D:\elasticsearch\elasticsearch-7.10.2\plugins\ik\config, register the two files just created. The relevant part of the file is:

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n40" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;"><properties>
<comment>IK Analyzer 扩展配置</comment>

<entry key="ext_dict">custom/mydic.dic</entry>

<entry key="ext_stopwords">custom/ext_stopwork.dic</entry>

</properties></pre>
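A typo in one of the configured paths is easy to miss, so it may be worth checking that every file named in the configuration exists before restarting. The sketch below parses the `<properties>` fragment shown above and resolves each entry relative to the config directory. The helper names are hypothetical, and the handling of several `;`-separated paths in one entry is an assumption about the configuration format.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, List


def configured_dictionaries(cfg_xml: str) -> Dict[str, List[str]]:
    """Map each entry key (ext_dict, ext_stopwords) to its relative paths."""
    root = ET.fromstring(cfg_xml)
    result = {}
    for entry in root.findall("entry"):
        value = (entry.text or "").strip()
        # Assumption: several files may be listed in one entry, separated by ';'
        result[entry.get("key")] = [p.strip() for p in value.split(";") if p.strip()]
    return result


def missing_files(cfg_xml: str, config_dir: Path) -> List[str]:
    """Configured dictionary paths that do not exist under config_dir."""
    groups = configured_dictionaries(cfg_xml).values()
    return [p for group in groups for p in group if not (config_dir / p).is_file()]
```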

Restart the Elasticsearch service and run the earlier example again:

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n44" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET _analyze
{
"analyzer": "ik_smart",
"text": ["十三届全国人大三次会议表决通过了“民法典”，自2021年1月1日起施行。"]
}

Result:

{
"tokens" : [
{
"token" : "十三届全国人大",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "三次",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "会议",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "表决",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "通过了",
"start_offset" : 13,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "民法典",
"start_offset" : 17,
"end_offset" : 20,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "2021年",
"start_offset" : 23,
"end_offset" : 28,
"type" : "TYPE_CQUAN",
"position" : 6
},
{
"token" : "1月",
"start_offset" : 28,
"end_offset" : 30,
"type" : "TYPE_CQUAN",
"position" : 7
},
{
"token" : "1日",
"start_offset" : 30,
"end_offset" : 32,
"type" : "TYPE_CQUAN",
"position" : 8
},
{
"token" : "施行",
"start_offset" : 33,
"end_offset" : 35,
"type" : "CN_WORD",
"position" : 9
}
]
}
</pre>
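One way to sanity-check a response like the one above is to confirm that every token's offsets slice back to the token text in the request. The sketch below does this for three of the tokens; the helper name `offset_mismatches` is hypothetical. Elasticsearch reports offsets in UTF-16 code units while Python indexes code points, so the comparison only holds for text without characters outside the Basic Multilingual Plane, which is the case here.

```python
from typing import List


def offset_mismatches(text: str, tokens: List[dict]) -> List[dict]:
    """Tokens whose text differs from text[start_offset:end_offset]."""
    return [
        t for t in tokens
        if text[t["start_offset"]:t["end_offset"]] != t["token"]
    ]


text = "十三届全国人大三次会议表决通过了“民法典”，自2021年1月1日起施行。"
tokens = [
    {"token": "十三届全国人大", "start_offset": 0, "end_offset": 7},
    {"token": "民法典", "start_offset": 17, "end_offset": 20},
    {"token": "施行", "start_offset": 33, "end_offset": 35},
]
print(offset_mismatches(text, tokens))
```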

Building an Index with IK

Create an index that uses the IK analyzers:

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n50" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">PUT news
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"content": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
}
}
}
}

View the index mapping:

GET news/_mapping

Result:

{
"news" : {
"mappings" : {
"properties" : {
"content" : {
"type" : "text",
"analyzer" : "ik_max_word",
"search_analyzer" : "ik_smart"
},
"title" : {
"type" : "text",
"analyzer" : "ik_max_word",
"search_analyzer" : "ik_smart"
}
}
}
}
}</pre>
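When several text fields share the same pair of analyzers, the mapping body can be generated instead of written by hand. The following sketch reproduces the body of the `PUT news` request above; the function name `ik_text_mapping` is hypothetical, and sending the request is left to whichever client is in use.

```python
import json
from typing import List


def ik_text_mapping(fields: List[str]) -> dict:
    """Index body with ik_max_word at index time and ik_smart at search time."""
    field = {
        "type": "text",
        "analyzer": "ik_max_word",       # fine-grained tokens in the inverted index
        "search_analyzer": "ik_smart",   # coarse-grained tokens for the query text
    }
    return {"mappings": {"properties": {name: dict(field) for name in fields}}}


print(json.dumps(ik_text_mapping(["title", "content"]), indent=2))
```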

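Note that in the field definitions, title and content use ik_max_word as the analyzer: when the inverted index is built, the text is split as finely as possible so that it can satisfy as many searches as possible. The search_analyzer is ik_smart: at search time the query text is split as coarsely as possible, which keeps the matches precise.

Creating test data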

<pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n55" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">POST news/_bulk
{"index": {}}
{"title": "柳岩为何40岁也无人敢取？", "content": "娱乐圈里的女星那么多，但要说到性感女星就一定要提到柳岩，毕竟像刘岩这样有料又有身材的女星，参加活动还是很吃香的。"}
{"index": {}}
{"title": "刘德华首当音乐老师", "content": "刘德华表示，希望自己首度塑造的音乐老师形象能够得到大家的认可，尤其希望能的到全国老师，家长和同学们的认可，“如果真的有机会做老师，我也想做音乐老师，因为我觉得音乐课很重要，音乐的力量是可以改变人生的！”"}
{"index": {}}
{"title": "奥巴马怒怼特朗普抗疫不力", "content": "奥巴马现身费城的竞选集会并发表讲话，他对特朗普四年的执政工作进行了猛烈攻击，谴责特朗普政府抗疫不力，搞砸美国经济。"}
{"index": {}}
{"title": "韩星柳真怀孕4个月喜迎二胎", "content": "韩星柳真怀孕4个月喜迎二胎，柳真为什么选择奇太映女儿为啥姓金？说起韩星柳真有些人可能不认识，不过只要追过S.E.S组合的网友应该都知道她，她曾经在韩国也有“国民妖精”之称，据说他所在的S.E.S更是韩国乐坛的第一支女子组合。"}</pre>
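The `_bulk` body above is newline-delimited JSON: an action line followed by a document line for every document, with a final newline at the end of the body. A minimal sketch for building such a body from a list of documents follows; the helper name `bulk_body` is hypothetical.

```python
import json
from typing import List


def bulk_body(docs: List[dict]) -> str:
    """NDJSON body for POST <index>/_bulk that indexes every document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))            # action line, id auto-generated
        lines.append(json.dumps(doc, ensure_ascii=False))  # document source line
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


print(bulk_body([{"title": "刘德华首当音乐老师"}]), end="")
```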

Test ex1

<pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n57" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET news/_search
{
"query": {
"match": {
"title": "刘德华"
}
}
}

Result:

{
"took" : 26,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.547678,
"hits" : [
{
"_index" : "news",
"_type" : "_doc",
"_id" : "l_JQC3gB8u3smGzBUQjj",
"_score" : 1.547678,
"_source" : {
"title" : "刘德华首当音乐老师",
"content" : "刘德华表示，希望自己首度塑造的音乐老师形象能够得到大家的认可，尤其希望能的到全国老师，家长和同学们的认可，“如果真的有机会做老师，我也想做音乐老师，因为我觉得音乐课很重要，音乐的力量是可以改变人生的！”"
}
}
]
}
}</pre>

Test ex2

<pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n59" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET news/_search
{
"query": {
"match": {
"title": "柳岩"
}
}
}

Result:

{
"took" : 11,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.875202,
"hits" : [
{
"_index" : "news",
"_type" : "_doc",
"_id" : "lvJQC3gB8u3smGzBUQjj",
"_score" : 1.875202,
"_source" : {
"title" : "柳岩为何40岁也无人敢取？",
"content" : "娱乐圈里的女星那么多，但要说到性感女星就一定要提到柳岩，毕竟像刘岩这样有料又有身材的女星，参加活动还是很吃香的。"
}
},
{
"_index" : "news",
"_type" : "_doc",
"_id" : "mfJQC3gB8u3smGzBUQjj",
"_score" : 0.6017173,
"_source" : {
"title" : "韩星柳真怀孕4个月喜迎二胎",
"content" : "韩星柳真怀孕4个月喜迎二胎，柳真为什么选择奇太映女儿为啥姓金？说起韩星柳真有些人可能不认识，不过只要追过S.E.S组合的网友应该都知道她，她曾经在韩国也有“国民妖精”之称，据说他所在的S.E.S更是韩国乐坛的第一支女子组合。"
}
}
]
}
}
</pre>
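Search responses are verbose, so for comparing test runs it can help to reduce them to title and score. The sketch below builds the `match` query used in ex1 and ex2 and summarizes a response that is already available as a dictionary; the helper names `match_query` and `hit_titles` are hypothetical, and the sample is trimmed from the ex2 response above.

```python
from typing import List, Tuple


def match_query(field: str, text: str) -> dict:
    """Body for GET <index>/_search with a single match query."""
    return {"query": {"match": {field: text}}}


def hit_titles(response: dict) -> List[Tuple[str, float]]:
    """(title, score) pairs in the order Elasticsearch returned them."""
    return [(h["_source"]["title"], h["_score"]) for h in response["hits"]["hits"]]


# Trimmed copy of the ex2 response shown above.
sample = {"hits": {"hits": [
    {"_score": 1.875202, "_source": {"title": "柳岩为何40岁也无人敢取？"}},
    {"_score": 0.6017173, "_source": {"title": "韩星柳真怀孕4个月喜迎二胎"}},
]}}
for title, score in hit_titles(sample):
    print(score, title)
```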

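Analysis of the ex2 result: searching for “柳岩” returned two hits, the article about 柳岩 and the article about 柳真. Analyzing the query text shows why: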

<pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n62" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET _analyze
{
"analyzer": "ik_smart",
"text": ["柳岩"]
}

Result:

{
"tokens" : [
{
"token" : "柳",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "岩",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
}
]
}</pre>

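The analyzer split 柳岩 into the two single characters “柳” and “岩” and searched for each of them. Because the title about 柳真 also contains “柳”, both articles were returned. Adding 柳岩 to the custom dictionary described earlier and reindexing the documents should keep the name as a single token.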
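The behavior follows from how a `match` query combines its terms: by default they are joined with OR, so a document matching any one query token is a hit. The toy model below illustrates this with plain sets. The index-time token sets for the two titles are hypothetical, since the article does not show the ik_max_word output for them, and the function `matches` is only an illustration, not how Elasticsearch is implemented.

```python
from typing import Iterable, Set


def matches(query_tokens: Iterable[str], doc_tokens: Set[str], operator: str = "or") -> bool:
    """Toy model of a match query: OR needs any token, AND needs all of them."""
    found = [token in doc_tokens for token in query_tokens]
    return all(found) if operator == "and" else any(found)


query = ["柳", "岩"]  # ik_smart output for 柳岩, as shown above

# Hypothetical index-time tokens for the two titles (assumed for illustration).
titles = {
    "柳岩为何40岁也无人敢取？": {"柳", "岩", "为何", "40", "岁"},
    "韩星柳真怀孕4个月喜迎二胎": {"韩", "星", "柳", "真", "怀孕"},
}

for operator in ("or", "and"):
    print(operator, [t for t, toks in titles.items() if matches(query, toks, operator)])
```

In a real query the stricter behavior corresponds to the `operator` parameter of `match`, for example `{"match": {"title": {"query": "柳岩", "operator": "and"}}}`.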
