IK Analyzer
Installation and configuration
Download
https://github.com/medcl/elasticsearch-analysis-ik/releases
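Note: the IK analyzer version must match the Elasticsearch version exactly.
Installation
Copy the downloaded package elasticsearch-analysis-ik-7.10.2.zip into the plugins folder under the Elasticsearch root directory and unzip it there. After unzipping, delete the zip file, rename the extracted analyzer folder to ik, and restart Elasticsearch.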
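Since a version mismatch is the usual reason the plugin fails to load, a small check before copying the zip can save a restart. The sketch below only parses the release file name shown above; the helper names are made up for illustration, and the Elasticsearch version is something you supply yourself (it is reported as `version.number` by `GET /`).

```python
import re

# Hypothetical helpers; only the zip naming pattern comes from the text above.
ZIP_PATTERN = re.compile(r"^elasticsearch-analysis-ik-(\d+\.\d+\.\d+)\.zip$")


def plugin_version(zip_name: str) -> str:
    """Extract the version from a name like elasticsearch-analysis-ik-7.10.2.zip."""
    match = ZIP_PATTERN.match(zip_name)
    if match is None:
        raise ValueError(f"unexpected plugin file name: {zip_name!r}")
    return match.group(1)


def versions_match(zip_name: str, es_version: str) -> bool:
    """Return True when the plugin zip targets exactly this Elasticsearch version."""
    return plugin_version(zip_name) == es_version
```

For example, `versions_match("elasticsearch-analysis-ik-7.10.2.zip", "7.10.2")` returns `True`, while passing `"7.9.3"` as the second argument returns `False`.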
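Features
The IK plugin provides two analyzers:
Analyzer | Description |
---|---|
ik_smart | Performs the coarsest-grained split. For example, “中华人民共和国国歌” becomes “中华人民共和国” and “国歌”. Suited to phrase queries. |
ik_max_word | Performs the finest-grained split. For example, “中华人民共和国国歌” becomes “中华人民共和国”, “中华人民”, “中华”, “华人”, “人民共和国”, “人民”, “人”, “民”, “共和国”, “共和”, “和”, “国”, “国歌”, exhausting every possible combination. Suited to term queries. |
Example: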
<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n22" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">GET _analyze
{
"analyzer": "ik_smart",
"text": "中华人民共和国"
}
Result:
{
"tokens" : [
{
"token" : "中华人民共和国",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
}
]
}
</pre>
<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n23" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">GET _analyze
{
"analyzer": "ik_max_word",
"text": "中华人民共和国"
}
Result:
{
"tokens" : [
{
"token" : "中华人民共和国",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "中华人民",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "中华",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "华人",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "人民共和国",
"start_offset" : 2,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "人民",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "共和国",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "共和",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "国",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 8
}
]
}
</pre>
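The same two requests can be issued from a script, which makes it easier to compare analyzers side by side. This is a minimal sketch using only the Python standard library; it assumes an unsecured node at `http://localhost:9200`, and the function names are my own rather than anything from an Elasticsearch client.

```python
from __future__ import annotations

import json
import urllib.request

# Assumption: a local node without authentication; adjust for your setup.
ES_URL = "http://localhost:9200"


def analyze(text: str, analyzer: str, es_url: str = ES_URL) -> dict:
    """Send text to the _analyze API and return the parsed JSON response."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{es_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def token_texts(response: dict) -> list[str]:
    """Pull just the token strings out of an _analyze response."""
    return [entry["token"] for entry in response["tokens"]]
```

With a running node that has IK installed, `token_texts(analyze("中华人民共和国", "ik_max_word"))` should list the same tokens as the second console result above.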
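Customizing the analyzer
Before configuring a custom dictionary, look at the example below. Note that ik_smart splits “十三届全国人大” into three tokens (十, 三届, 全国人大) and emits 自 and 起 as tokens of their own: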
<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n26" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">GET _analyze
{
"analyzer": "ik_smart",
"text": ["十三届全国人大三次会议表决通过了“民法典”,自2021年1月1日起施行。"]
}
Result:
{
"tokens" : [
{
"token" : "十",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "三届",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "全国人大",
"start_offset" : 3,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "三次",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "会议",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "表决",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "通过了",
"start_offset" : 13,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "民法典",
"start_offset" : 16,
"end_offset" : 19,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "自",
"start_offset" : 20,
"end_offset" : 21,
"type" : "CN_CHAR",
"position" : 8
},
{
"token" : "2021年",
"start_offset" : 21,
"end_offset" : 26,
"type" : "TYPE_CQUAN",
"position" : 9
},
{
"token" : "1月",
"start_offset" : 26,
"end_offset" : 28,
"type" : "TYPE_CQUAN",
"position" : 10
},
{
"token" : "1日",
"start_offset" : 28,
"end_offset" : 30,
"type" : "TYPE_CQUAN",
"position" : 11
},
{
"token" : "起",
"start_offset" : 30,
"end_offset" : 31,
"type" : "CN_CHAR",
"position" : 12
},
{
"token" : "施行",
"start_offset" : 31,
"end_offset" : 33,
"type" : "CN_WORD",
"position" : 13
}
]
}</pre>
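- Create the custom dictionaries
In the config directory of the installed IK analyzer, create a folder named custom: D:\elasticsearch\elasticsearch-7.10.2\plugins\ik\config\custom
In custom, create mydic.dic (the custom dictionary) and ext_stopwork.dic (the stopword dictionary).
Add the following to mydic.dic: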
<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n32" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">十三届全国人大</pre>
Add the following to ext_stopwork.dic:
<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n34" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">自
起</pre>
- Configure the custom dictionaries
In D:\elasticsearch\elasticsearch-7.10.2\plugins\ik\config, edit IKAnalyzer.cfg.xml so that it references the two files just created. The relevant content is:
<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n40" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;"><properties>
<comment>IK Analyzer 扩展配置</comment>
<entry key="ext_dict">custom/mydic.dic</entry>
<entry key="ext_stopwords">custom/ext_stopwork.dic</entry>
</properties></pre>
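- Restart Elasticsearch and run the earlier example again. “十三届全国人大” is now a single token, and the stopwords 自 and 起 no longer appear: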
<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n44" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET _analyze
{
"analyzer": "ik_smart",
"text": ["十三届全国人大三次会议表决通过了“民法典”,自2021年1月1日起施行。"]
}
Result:
{
"tokens" : [
{
"token" : "十三届全国人大",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "三次",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "会议",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "表决",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "通过了",
"start_offset" : 13,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "民法典",
"start_offset" : 17,
"end_offset" : 20,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "2021年",
"start_offset" : 23,
"end_offset" : 28,
"type" : "TYPE_CQUAN",
"position" : 6
},
{
"token" : "1月",
"start_offset" : 28,
"end_offset" : 30,
"type" : "TYPE_CQUAN",
"position" : 7
},
{
"token" : "1日",
"start_offset" : 30,
"end_offset" : 32,
"type" : "TYPE_CQUAN",
"position" : 8
},
{
"token" : "施行",
"start_offset" : 33,
"end_offset" : 35,
"type" : "CN_WORD",
"position" : 9
}
]
}
</pre>
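If the new terms do not take effect after the restart, it is worth confirming that every path in IKAnalyzer.cfg.xml resolves to a readable file. The sketch below is one way to do that in Python. It assumes the dictionaries are UTF-8 text, that paths are relative to the config directory (as in the example above), and that several files may be listed in one entry separated by `;`. The function names are hypothetical.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from pathlib import Path


def configured_dictionaries(config_dir: Path) -> dict[str, list[Path]]:
    """Map each non-empty <entry> key in IKAnalyzer.cfg.xml to the files it lists."""
    root = ET.parse(config_dir / "IKAnalyzer.cfg.xml").getroot()
    result: dict[str, list[Path]] = {}
    for entry in root.iter("entry"):
        value = (entry.text or "").strip()
        if not value:
            continue
        # Assumption: multiple dictionaries may be separated by ';'.
        result[entry.get("key", "")] = [
            config_dir / part.strip() for part in value.split(";") if part.strip()
        ]
    return result


def unreadable_dictionaries(config_dir: Path) -> list[Path]:
    """Return configured dictionary files that are missing or not valid UTF-8."""
    problems = []
    for paths in configured_dictionaries(config_dir).values():
        for path in paths:
            try:
                path.read_text(encoding="utf-8")
            except (OSError, UnicodeDecodeError):
                problems.append(path)
    return problems
```

Calling `unreadable_dictionaries(Path(r"D:\elasticsearch\elasticsearch-7.10.2\plugins\ik\config"))` returns an empty list when both custom/mydic.dic and custom/ext_stopwork.dic are in place.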
Building an index with IK
- Create an index that uses the IK analyzers
<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n50" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">PUT news
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"content": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
}
}
}
}
View the index mapping:
GET news/_mapping
Result:
{
"news" : {
"mappings" : {
"properties" : {
"content" : {
"type" : "text",
"analyzer" : "ik_max_word",
"search_analyzer" : "ik_smart"
},
"title" : {
"type" : "text",
"analyzer" : "ik_max_word",
"search_analyzer" : "ik_smart"
}
}
}
}
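}</pre>Note the field definitions: title and content both use ik_max_word as the analyzer, because building the inverted index with the finest-grained split lets the index serve as many different searches as possible. The search_analyzer is ik_smart, so query text is split as coarsely as possible, which keeps search results precise.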
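To make the reasoning concrete, here is a small pure-Python sketch (no Elasticsearch involved) that compares what a query can hit under each index-time analyzer. The indexed token sets are copied from the two `_analyze` results for “中华人民共和国” earlier in this post. The query tokenization is an assumption: I am assuming ik_smart keeps “共和国” as a single token.

```python
# Index-time tokens for "中华人民共和国", copied from the _analyze results above.
INDEXED_WITH_MAX_WORD = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国",
}
INDEXED_WITH_SMART = {"中华人民共和国"}


def matches(query_tokens, indexed_tokens):
    """A match query (default OR operator) hits when any query token was indexed."""
    return any(token in indexed_tokens for token in query_tokens)


# Assumption: ik_smart leaves the query "共和国" as one token.
query_tokens = ["共和国"]
print(matches(query_tokens, INDEXED_WITH_MAX_WORD))  # True
print(matches(query_tokens, INDEXED_WITH_SMART))     # False
```

Indexing with ik_smart alone would store only the full seven-character token, so a search for “共和国” would miss the document. Indexing with ik_max_word keeps the shorter terms available.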
- Create test data
<pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n55" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">POST news/_bulk
{"index": {}}
{"title": "柳岩为何40岁也无人敢取?", "content": "娱乐圈里的女星那么多,但要说到性感女星就一定要提到柳岩,毕竟像刘岩这样有料又有身材的女星,参加活动还是很吃香的。"}
{"index": {}}
{"title": "刘德华首当音乐老师", "content": "刘德华表示,希望自己首度塑造的音乐老师形象能够得到大家的认可,尤其希望能的到全国老师,家长和同学们的认可,“如果真的有机会做老师,我也想做音乐老师,因为我觉得音乐课很重要,音乐的力量是可以改变人生的!”"}
{"index": {}}
{"title": "奥巴马怒怼特朗普抗疫不力", "content": "奥巴马现身费城的竞选集会并发表讲话,他对特朗普四年的执政工作进行了猛烈攻击,谴责特朗普政府抗疫不力,搞砸美国经济。"}
{"index": {}}
{"title": "韩星柳真怀孕4个月喜迎二胎", "content": "韩星柳真怀孕4个月喜迎二胎,柳真为什么选择奇太映女儿为啥姓金?说起韩星柳真有些人可能不认识,不过只要追过S.E.S组合的网友应该都知道她,她曾经在韩国也有“国民妖精”之称,据说他所在的S.E.S更是韩国乐坛的第一支女子组合。"}</pre>Test ex1:
<pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n57" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET news/_search
{
"query": {
"match": {
"title": "刘德华"
}
}
}
Result:
{
"took" : 26,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.547678,
"hits" : [
{
"_index" : "news",
"_type" : "_doc",
"_id" : "l_JQC3gB8u3smGzBUQjj",
"_score" : 1.547678,
"_source" : {
"title" : "刘德华首当音乐老师",
"content" : "刘德华表示,希望自己首度塑造的音乐老师形象能够得到大家的认可,尤其希望能的到全国老师,家长和同学们的认可,“如果真的有机会做老师,我也想做音乐老师,因为我觉得音乐课很重要,音乐的力量是可以改变人生的!”"
}
}
]
}
}</pre>Test ex2:
<pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n59" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET news/_search
{
"query": {
"match": {
"title": "柳岩"
}
}
}
Result:
{
"took" : 11,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.875202,
"hits" : [
{
"_index" : "news",
"_type" : "_doc",
"_id" : "lvJQC3gB8u3smGzBUQjj",
"_score" : 1.875202,
"_source" : {
"title" : "柳岩为何40岁也无人敢取?",
"content" : "娱乐圈里的女星那么多,但要说到性感女星就一定要提到柳岩,毕竟像刘岩这样有料又有身材的女星,参加活动还是很吃香的。"
}
},
{
"_index" : "news",
"_type" : "_doc",
"_id" : "mfJQC3gB8u3smGzBUQjj",
"_score" : 0.6017173,
"_source" : {
"title" : "韩星柳真怀孕4个月喜迎二胎",
"content" : "韩星柳真怀孕4个月喜迎二胎,柳真为什么选择奇太映女儿为啥姓金?说起韩星柳真有些人可能不认识,不过只要追过S.E.S组合的网友应该都知道她,她曾经在韩国也有“国民妖精”之称,据说他所在的S.E.S更是韩国乐坛的第一支女子组合。"
}
}
]
}
}
</pre>Analysis of test ex2: searching for “柳岩” returned two results, one for 柳岩 and one for 柳真. Running the query text through the analyzer shows why:
<pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n62" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET _analyze
{
"analyzer": "ik_smart",
"text": ["柳岩"]
}
Result:
{
"tokens" : [
{
"token" : "柳",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "岩",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
}
]
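}</pre>The analyzer split 柳岩 into the two single characters “柳” and “岩” and searched for each of them. Because the token “柳” occurs in both the 柳岩 title and the 柳真 title, both documents are returned.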
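The behavior is easy to reproduce without a cluster. The sketch below mimics how a `match` query combines query tokens: with the default OR operator, any shared token is enough for a hit. The title token lists are hand-written for illustration and are not real IK output. Only the query tokens 柳 and 岩 come from the result above.

```python
from __future__ import annotations

# Assumption: hand-written title tokens for illustration, not real IK output.
TITLE_TOKENS = {
    "柳岩为何40岁也无人敢取?": ["柳", "岩", "为何", "40", "岁", "也", "无人", "敢", "取"],
    "韩星柳真怀孕4个月喜迎二胎": ["韩星", "柳", "真", "怀孕", "4", "个", "月", "喜迎", "二胎"],
    "刘德华首当音乐老师": ["刘德华", "首当", "音乐", "老师"],
}


def match_titles(query_tokens: list[str], operator: str = "or") -> list[str]:
    """Return titles containing any ("or") or all ("and") of the query tokens."""
    combine = all if operator == "and" else any
    return [
        title
        for title, tokens in TITLE_TOKENS.items()
        if combine(token in tokens for token in query_tokens)
    ]


# ik_smart split the query "柳岩" into these two tokens (see the result above).
query_tokens = ["柳", "岩"]
print(match_titles(query_tokens))         # the 柳岩 title and the 柳真 title
print(match_titles(query_tokens, "and"))  # only the 柳岩 title
```

This suggests two possible remedies, neither of which was tested in this post: set `"operator": "and"` on the `match` query, or add 柳岩 to mydic.dic so that the name is kept as a single token at both index and search time.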