IK Analyzer

2021-07-25  zjxchase

Installation and Configuration

Download

https://github.com/medcl/elasticsearch-analysis-ik/releases

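Note: the version of the IK analyzer plugin must match the version of Elasticsearch.

Install

Copy the downloaded archive elasticsearch-analysis-ik-7.10.2.zip into the plugins folder under the Elasticsearch root directory and unzip it there. After unzipping, delete the archive, rename the extracted plugin folder to ik, and restart Elasticsearch.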
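
The manual steps above can also be scripted. The following is a minimal sketch, not an official installer: the function names and the `es_home` and `es_version` parameters are hypothetical, and it assumes the archive holds the plugin files at its top level so that unpacking straight into plugins/ik is equivalent to unzipping and renaming.

```python
import re
import zipfile
from pathlib import Path


def plugin_version(zip_name: str) -> str:
    """Extract the version from a name like elasticsearch-analysis-ik-7.10.2.zip."""
    match = re.fullmatch(r"elasticsearch-analysis-ik-(\d+(?:\.\d+)*)\.zip", zip_name)
    if match is None:
        raise ValueError(f"unexpected plugin archive name: {zip_name}")
    return match.group(1)


def install_ik(zip_path: Path, es_home: Path, es_version: str) -> Path:
    """Unpack the IK archive into <es_home>/plugins/ik and delete the archive."""
    # The plugin version has to match the Elasticsearch version.
    if plugin_version(zip_path.name) != es_version:
        raise ValueError("IK plugin version must match the Elasticsearch version")
    target = es_home / "plugins" / "ik"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    zip_path.unlink()  # the archive is removed after unpacking, as in the manual steps
    return target
```

Elasticsearch still has to be restarted afterwards; the sketch only prepares the plugin directory.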

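Features

The IK plugin provides two analyzers:

| Analyzer | Description |
| --- | --- |
| ik_smart | Performs the coarsest-grained split. For example, “中华人民共和国国歌” is split into “中华人民共和国” and “国歌”. Suitable for phrase queries. |
| ik_max_word | Performs the finest-grained split. For example, “中华人民共和国国歌” is split into “中华人民共和国”, “中华人民”, “中华”, “华人”, “人民共和国”, “人民”, “人”, “民”, “共和国”, “共和”, “和”, “国”, “国歌”, exhausting the possible combinations. Suitable for term queries. |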
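
To make the difference between the two granularities concrete, here is a toy sketch. It is not IK's actual algorithm: it uses a tiny hand-written dictionary (an assumption for illustration), forward maximum matching for the coarse split, and a plain enumeration of dictionary words for the fine split.

```python
# Toy dictionary; IK ships a far larger one and uses its own algorithm.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}
MAX_LEN = max(len(word) for word in DICTIONARY)


def coarse_split(text: str) -> list[str]:
    """Forward maximum matching: always take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted as a fallback token.
            if size == 1 or candidate in DICTIONARY:
                tokens.append(candidate)
                i += size
                break
    return tokens


def fine_split(text: str) -> list[str]:
    """Emit every multi-character dictionary word found at every start position."""
    tokens = []
    for i in range(len(text)):
        for size in range(min(MAX_LEN, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in DICTIONARY:
                tokens.append(candidate)
    return tokens
```

With this toy dictionary, `coarse_split("中华人民共和国国歌")` yields two tokens (“中华人民共和国”, “国歌”), while `fine_split` on the same text yields nine overlapping tokens, which mirrors the ik_smart / ik_max_word contrast without reproducing IK's exact output.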

Example:

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n22" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">GET _analyze
{
"analyzer": "ik_smart",
"text": "中华人民共和国"
}

Result:

{
"tokens" : [
{
"token" : "中华人民共和国",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
}
]
}
</pre>

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n23" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">GET _analyze
{
"analyzer": "ik_max_word",
"text": "中华人民共和国"
}

Result:

{
"tokens" : [
{
"token" : "中华人民共和国",
"start_offset" : 0,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "中华人民",
"start_offset" : 0,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "中华",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "华人",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "人民共和国",
"start_offset" : 2,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "人民",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "共和国",
"start_offset" : 4,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "共和",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "国",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 8
}
]
}
</pre>
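
The same `_analyze` calls can be issued from a script instead of the console. This is a minimal sketch using only the Python standard library; `ES_URL` and the helper names are assumptions, and it presumes a local node without authentication. The `_analyze` endpoint accepts POST as well as GET, which is what the sketch relies on.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured node


def build_analyze_request(analyzer: str, text: str) -> urllib.request.Request:
    """Build the HTTP request equivalent to the console's `GET _analyze`."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response: dict) -> list[str]:
    """Keep only the token text from an _analyze response."""
    return [item["token"] for item in response["tokens"]]


def analyze(analyzer: str, text: str) -> list[str]:
    """Send the request; needs a running node with the IK plugin installed."""
    with urllib.request.urlopen(build_analyze_request(analyzer, text)) as reply:
        return token_strings(json.load(reply))
```

Calling `analyze("ik_smart", "中华人民共和国")` and `analyze("ik_max_word", "中华人民共和国")` side by side makes it easy to compare the two analyzers on the same text.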

Customizing the Analyzer

Before configuring a custom dictionary, look at an example:

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n26" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">GET _analyze
{
"analyzer": "ik_smart",
"text": ["十三届全国人大三次会议表决通过了“民法典”,自2021年1月1日起施行。"]
}

Result:

{
"tokens" : [
{
"token" : "十",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "三届",
"start_offset" : 1,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "全国人大",
"start_offset" : 3,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "三次",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "会议",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "表决",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "通过了",
"start_offset" : 13,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "民法典",
"start_offset" : 16,
"end_offset" : 19,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "自",
"start_offset" : 20,
"end_offset" : 21,
"type" : "CN_CHAR",
"position" : 8
},
{
"token" : "2021年",
"start_offset" : 21,
"end_offset" : 26,
"type" : "TYPE_CQUAN",
"position" : 9
},
{
"token" : "1月",
"start_offset" : 26,
"end_offset" : 28,
"type" : "TYPE_CQUAN",
"position" : 10
},
{
"token" : "1日",
"start_offset" : 28,
"end_offset" : 30,
"type" : "TYPE_CQUAN",
"position" : 11
},
{
"token" : "起",
"start_offset" : 30,
"end_offset" : 31,
"type" : "CN_CHAR",
"position" : 12
},
{
"token" : "施行",
"start_offset" : 31,
"end_offset" : 33,
"type" : "CN_WORD",
"position" : 13
}
]
}</pre>

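  1. Create the custom dictionaries

In the config directory of the installed IK plugin, create a folder named custom (D:\elasticsearch\elasticsearch-7.10.2\plugins\ik\config\custom). Inside custom, create mydic.dic (the custom dictionary) and ext_stopwork.dic (the stopword dictionary).

Add the following to mydic.dic: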

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n32" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">十三届全国人大</pre>

Add the following to ext_stopwork.dic:

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n34" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;">自
起</pre>

  2. Configure the custom dictionaries

In IKAnalyzer.cfg.xml under D:\elasticsearch\elasticsearch-7.10.2\plugins\ik\config, register the two files just created. The relevant content is:

<pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n40" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit; color: rgb(184, 191, 198); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;"><properties>
<comment>IK Analyzer 扩展配置</comment>

<entry key="ext_dict">custom/mydic.dic</entry>

<entry key="ext_stopwords">custom/ext_stopwork.dic</entry>




</properties></pre>

  3. Restart the Elasticsearch service and run the earlier example again:

    <pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n44" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET _analyze
    {
    "analyzer": "ik_smart",
    "text": ["十三届全国人大三次会议表决通过了“民法典”,自2021年1月1日起施行。"]
    }

    Result:

    {
    "tokens" : [
    {
    "token" : "十三届全国人大",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 0
    },
    {
    "token" : "三次",
    "start_offset" : 7,
    "end_offset" : 9,
    "type" : "CN_WORD",
    "position" : 1
    },
    {
    "token" : "会议",
    "start_offset" : 9,
    "end_offset" : 11,
    "type" : "CN_WORD",
    "position" : 2
    },
    {
    "token" : "表决",
    "start_offset" : 11,
    "end_offset" : 13,
    "type" : "CN_WORD",
    "position" : 3
    },
    {
    "token" : "通过了",
    "start_offset" : 13,
    "end_offset" : 16,
    "type" : "CN_WORD",
    "position" : 4
    },
    {
    "token" : "民法典",
    "start_offset" : 17,
    "end_offset" : 20,
    "type" : "CN_WORD",
    "position" : 5
    },
    {
    "token" : "2021年",
    "start_offset" : 23,
    "end_offset" : 28,
    "type" : "TYPE_CQUAN",
    "position" : 6
    },
    {
    "token" : "1月",
    "start_offset" : 28,
    "end_offset" : 30,
    "type" : "TYPE_CQUAN",
    "position" : 7
    },
    {
    "token" : "1日",
    "start_offset" : 30,
    "end_offset" : 32,
    "type" : "TYPE_CQUAN",
    "position" : 8
    },
    {
    "token" : "施行",
    "start_offset" : 33,
    "end_offset" : 35,
    "type" : "CN_WORD",
    "position" : 9
    }
    ]
    }
    </pre>
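
A mistyped path in IKAnalyzer.cfg.xml or a dictionary saved in the wrong encoding is easy to miss, so it can help to create and check the files with a script before restarting. The sketch below is an illustration under assumptions: the helper names are hypothetical, the dictionaries are written as UTF-8 with one entry per line (the encoding the IK documentation asks for), and multiple files in one entry are assumed to be separated by `;`.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def write_dictionary(path: Path, words: list[str]) -> None:
    """Write one entry per line, UTF-8 encoded."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(words) + "\n", encoding="utf-8")


def configured_dictionaries(config_dir: Path) -> dict[str, list[Path]]:
    """Resolve the <entry> values of IKAnalyzer.cfg.xml relative to config_dir."""
    root = ET.parse(config_dir / "IKAnalyzer.cfg.xml").getroot()
    resolved = {}
    for entry in root.iter("entry"):
        # Several files may be listed in one entry, separated by ";".
        names = [n.strip() for n in (entry.text or "").split(";") if n.strip()]
        resolved[entry.get("key")] = [config_dir / name for name in names]
    return resolved


def missing_files(config_dir: Path) -> list[Path]:
    """List configured dictionary files that do not exist on disk."""
    groups = configured_dictionaries(config_dir).values()
    return [path for group in groups for path in group if not path.is_file()]
```

For the setup in this article, `write_dictionary(config_dir / "custom" / "mydic.dic", ["十三届全国人大"])` and `write_dictionary(config_dir / "custom" / "ext_stopwork.dic", ["自", "起"])` would create the two files, and an empty result from `missing_files(config_dir)` indicates that every configured path resolves.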

Building an Index with IK

  1. Create an index that uses the IK analyzers

    <pre class="md-fences md-end-block ty-contain-cm modeLoaded" spellcheck="false" lang="" cid="n50" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: normal; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">PUT news
    {
    "mappings": {
    "properties": {
    "title": {
    "type": "text",
    "analyzer": "ik_max_word",
    "search_analyzer": "ik_smart"
    },
    "content": {
    "type": "text",
    "analyzer": "ik_max_word",
    "search_analyzer": "ik_smart"
    }
    }
    }
    }

    View the index mapping:

    GET news/_mapping

    Result:

    {
    "news" : {
    "mappings" : {
    "properties" : {
    "content" : {
    "type" : "text",
    "analyzer" : "ik_max_word",
    "search_analyzer" : "ik_smart"
    },
    "title" : {
    "type" : "text",
    "analyzer" : "ik_max_word",
    "search_analyzer" : "ik_smart"
    }
    }
    }
    }
    }</pre>

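    Note the field mappings: title and content use ik_max_word as analyzer (the index-time analyzer), so the inverted index is built from the finest-grained split and can serve as many searches as possible. search_analyzer (the query-time analyzer) is ik_smart, so the query text is split as coarsely as possible and matching stays precise.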

  2. Create test data

    <pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n55" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">POST news/_bulk
    {"index": {}}
    {"title": "柳岩为何40岁也无人敢取?", "content": "娱乐圈里的女星那么多,但要说到性感女星就一定要提到柳岩,毕竟像刘岩这样有料又有身材的女星,参加活动还是很吃香的。"}
    {"index": {}}
    {"title": "刘德华首当音乐老师", "content": "刘德华表示,希望自己首度塑造的音乐老师形象能够得到大家的认可,尤其希望能的到全国老师,家长和同学们的认可,“如果真的有机会做老师,我也想做音乐老师,因为我觉得音乐课很重要,音乐的力量是可以改变人生的!”"}
    {"index": {}}
    {"title": "奥巴马怒怼特朗普抗疫不力", "content": "奥巴马现身费城的竞选集会并发表讲话,他对特朗普四年的执政工作进行了猛烈攻击,谴责特朗普政府抗疫不力,搞砸美国经济。"}
    {"index": {}}
    {"title": "韩星柳真怀孕4个月喜迎二胎", "content": "韩星柳真怀孕4个月喜迎二胎,柳真为什么选择奇太映女儿为啥姓金?说起韩星柳真有些人可能不认识,不过只要追过S.E.S组合的网友应该都知道她,她曾经在韩国也有“国民妖精”之称,据说他所在的S.E.S更是韩国乐坛的第一支女子组合。"}</pre>

    Test ex1

    <pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n57" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET news/_search
    {
    "query": {
    "match": {
    "title": "刘德华"
    }
    }
    }

    Result:

    {
    "took" : 26,
    "timed_out" : false,
    "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
    },
    "hits" : {
    "total" : {
    "value" : 1,
    "relation" : "eq"
    },
    "max_score" : 1.547678,
    "hits" : [
    {
    "_index" : "news",
    "_type" : "_doc",
    "_id" : "l_JQC3gB8u3smGzBUQjj",
    "_score" : 1.547678,
    "_source" : {
    "title" : "刘德华首当音乐老师",
    "content" : "刘德华表示,希望自己首度塑造的音乐老师形象能够得到大家的认可,尤其希望能的到全国老师,家长和同学们的认可,“如果真的有机会做老师,我也想做音乐老师,因为我觉得音乐课很重要,音乐的力量是可以改变人生的!”"
    }
    }
    ]
    }
    }</pre>

    Test ex2

    <pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n59" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET news/_search
    {
    "query": {
    "match": {
    "title": "柳岩"
    }
    }
    }

    Result:

    {
    "took" : 11,
    "timed_out" : false,
    "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
    },
    "hits" : {
    "total" : {
    "value" : 2,
    "relation" : "eq"
    },
    "max_score" : 1.875202,
    "hits" : [
    {
    "_index" : "news",
    "_type" : "_doc",
    "_id" : "lvJQC3gB8u3smGzBUQjj",
    "_score" : 1.875202,
    "_source" : {
    "title" : "柳岩为何40岁也无人敢取?",
    "content" : "娱乐圈里的女星那么多,但要说到性感女星就一定要提到柳岩,毕竟像刘岩这样有料又有身材的女星,参加活动还是很吃香的。"
    }
    },
    {
    "_index" : "news",
    "_type" : "_doc",
    "_id" : "mfJQC3gB8u3smGzBUQjj",
    "_score" : 0.6017173,
    "_source" : {
    "title" : "韩星柳真怀孕4个月喜迎二胎",
    "content" : "韩星柳真怀孕4个月喜迎二胎,柳真为什么选择奇太映女儿为啥姓金?说起韩星柳真有些人可能不认识,不过只要追过S.E.S组合的网友应该都知道她,她曾经在韩国也有“国民妖精”之称,据说他所在的S.E.S更是韩国乐坛的第一支女子组合。"
    }
    }
    ]
    }
    }
    </pre>

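    Analysis of the ex2 result: searching for “柳岩” returned two documents, one about 柳岩 and one about 柳真. Running the query text through the analyzer shows why: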

    <pre class="md-fences mock-cm md-end-block" spellcheck="false" lang="" cid="n62" mdtype="fences" style="box-sizing: border-box; overflow: visible; font-family: Monaco, Consolas, "Andale Mono", "DejaVu Sans Mono", monospace; margin-top: 0px; margin-bottom: 20px; font-size: 0.9rem; display: block; break-inside: avoid; text-align: left; white-space: pre-wrap; background: rgb(51, 51, 51); position: relative !important; padding: 10px 10px 10px 30px; width: inherit;">GET _analyze
    {
    "analyzer": "ik_smart",
    "text": ["柳岩"]
    }

    Result:

    {
    "tokens" : [
    {
    "token" : "柳",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_CHAR",
    "position" : 0
    },
    {
    "token" : "岩",
    "start_offset" : 1,
    "end_offset" : 2,
    "type" : "CN_CHAR",
    "position" : 1
    }
    ]
    }</pre>

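    The analyzer split 柳岩 into the two single characters “柳” and “岩” and searched for each of them. The search for “柳” matched both the 柳岩 document and the 柳真 document.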
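
The effect can be reproduced with a toy inverted index. This is a sketch for illustration only: the document ids and the token lists are hand-written assumptions, not real ik_max_word output, and `match` only mimics the idea that a match query by default returns documents containing any of the query tokens.

```python
from collections import defaultdict

# Hand-written title tokens (assumption; real ik_max_word output differs).
TITLE_TOKENS = {
    "doc_liuyan": ["柳", "岩", "为何", "40", "岁"],
    "doc_liuzhen": ["韩", "星", "柳", "真", "怀孕"],
    "doc_liudehua": ["刘德华", "首", "当", "音乐", "老师"],
}


def build_index(docs):
    """Map each token to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def match(index, query_tokens, operator="or"):
    """Return documents containing any ("or") or all ("and") query tokens."""
    postings = [index.get(token, set()) for token in query_tokens]
    if not postings:
        return set()
    if operator == "and":
        return set.intersection(*postings)
    return set.union(*postings)
```

With the query tokens `["柳", "岩"]`, the default `"or"` mode returns both `doc_liuyan` and `doc_liuzhen`, while `"and"` returns only `doc_liuyan`. In Elasticsearch the match query has an `operator` parameter that defaults to `or`; whether setting it to `and`, or adding 柳岩 to the custom dictionary so that it stays one token, is the better fix depends on the data.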

Dynamically Updating Index Data
