Learning ES 6.2.4: the IK analyzer, with an example

2018-06-01  轻易流逝

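Elasticsearch's built-in standard analyzer handles Chinese poorly: it splits Chinese text into single characters for full-text search, which does not give the results you want. Full-text search matters and new words appear quickly on the internet, so a better segmenter is needed. IK provides sensible word segmentation and supports custom words.
IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java.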

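IK ships with two analyzers:
ik_max_word: splits the text at the finest granularity, producing as many words as possible
ik_smart: splits the text at the coarsest granularity; a span already assigned to one word is not reused by another word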
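
To make the difference between the two granularities concrete, here is a toy sketch in Python. It is not IK's actual algorithm: `TOY_DICT`, `finest_grained` and `coarsest_grained` are hypothetical names, the dictionary holds only a handful of words, and the coarse mode is plain greedy longest match, which is only an approximation of what ik_smart does.

```python
# Toy dictionary: an assumption for illustration only.
# A real IK installation ships a far larger dictionary.
TOY_DICT = {"好好学习", "好好学", "好好", "好学", "学习", "天天向上", "天天", "向上"}


def finest_grained(text, dictionary=TOY_DICT):
    """Emit every dictionary word found at every position (overlaps allowed)."""
    tokens = []
    for start in range(len(text)):
        # Try the longest candidate first so longer words come out first.
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


def coarsest_grained(text, dictionary=TOY_DICT):
    """Greedy longest match: once a span is taken, no other word may reuse it."""
    tokens = []
    start = 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
                start = end  # jump past the consumed span
                break
        else:
            start += 1  # no dictionary word starts here (e.g. punctuation)
    return tokens


text = "好好学习,天天向上"
print(finest_grained(text))    # overlapping words
print(coarsest_grained(text))  # non-overlapping words
```

With this toy dictionary the first call returns eight overlapping words and the second returns only 好好学习 and 天天向上.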

Standard analyzer

GET _analyze
{
  "analyzer": "standard",
  "text":"好好学习,天天向上"
}

The result treats every character as a separate token:

{
  "tokens": [
    {
      "token": "好",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<IDEOGRAPHIC>",
      "position": 0
    },
    {
      "token": "好",
      "start_offset": 1,
      "end_offset": 2,
      "type": "<IDEOGRAPHIC>",
      "position": 1
    },
    {
      "token": "学",
      "start_offset": 2,
      "end_offset": 3,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "习",
      "start_offset": 3,
      "end_offset": 4,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "天",
      "start_offset": 5,
      "end_offset": 6,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "天",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "向",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "上",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    }
  ]
}
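
The same `_analyze` call can be issued outside the Kibana console. Below is a minimal sketch using only the Python standard library. It assumes a node listening on `http://localhost:9200` (an assumption, adjust as needed); `build_analyze_request` and `analyze` are hypothetical helper names. The request is sent as POST, which `_analyze` accepts alongside GET.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) an _analyze request for the given analyzer."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # _analyze accepts POST as well as GET
    )


def analyze(analyzer, text, base_url="http://localhost:9200"):
    """Send the request and return just the token strings."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        return [t["token"] for t in json.load(response)["tokens"]]


# Example (needs a running node; the ik_* analyzers need the IK plugin):
# for name in ("standard", "ik_smart", "ik_max_word"):
#     print(name, analyze(name, "好好学习,天天向上"))
```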

ik_smart analysis and result (coarsest-grained split; a span already assigned to one word is not reused by another word)

GET _analyze
{
  "analyzer": "ik_smart",
  "text":"好好学习,天天向上"
}

Result:

{
  "tokens": [
    {
      "token": "好好学习",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "天天向上",
      "start_offset": 5,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

ik_max_word analysis and result (finest-grained split; produces as many words as possible)

GET _analyze
{
  "analyzer": "ik_max_word",
  "text":"好好学习,天天向上"
}

Result:

{
  "tokens": [
    {
      "token": "好好学习",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "好好学",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "好好",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "好学",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "学习",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "天天向上",
      "start_offset": 5,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "天天",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "向上",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 7
    }
  ]
}
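
The offsets in these responses show the difference directly: ik_smart tokens never share characters, while ik_max_word tokens do. A small Python sketch that checks this is given below. The `spans`, `has_overlap` and `response` helpers are hypothetical, and the data is the token text and offsets from the two IK responses above, trimmed to the fields needed.

```python
def spans(analyze_response):
    """Reduce an _analyze response to (token, start_offset, end_offset) tuples."""
    return [(t["token"], t["start_offset"], t["end_offset"])
            for t in analyze_response["tokens"]]


def has_overlap(token_spans):
    """True if any token starts before an earlier token has ended."""
    furthest_end = 0
    for _, start, end in sorted(token_spans, key=lambda s: s[1]):
        if start < furthest_end:
            return True
        furthest_end = max(furthest_end, end)
    return False


def response(*triples):
    """Build a trimmed-down response from (token, start, end) triples."""
    return {"tokens": [{"token": t, "start_offset": s, "end_offset": e}
                       for t, s, e in triples]}


IK_SMART = response(("好好学习", 0, 4), ("天天向上", 5, 9))
IK_MAX_WORD = response(
    ("好好学习", 0, 4), ("好好学", 0, 3), ("好好", 0, 2), ("好学", 1, 3),
    ("学习", 2, 4), ("天天向上", 5, 9), ("天天", 5, 7), ("向上", 7, 9),
)

print(has_overlap(spans(IK_SMART)))     # False
print(has_overlap(spans(IK_MAX_WORD)))  # True
```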

Example: the IK analyzers in action

Create the index and set the mapping

PUT /ik_index


PUT /ik_index/text/_mapping
{
  "properties": {
    "context":{
      "type": "text",
      "fields": {
        "context_ik_smart":{
          "type": "text", 
          "analyzer": "ik_smart",
          "search_analyzer": "ik_smart"
        },
        "context_ik_max_word":{
          "type": "text", 
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      } 
    }
  }
}
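
The mapping above repeats one pattern per analyzer: a sub-field named `<field>_<analyzer>` whose `analyzer` and `search_analyzer` are both that analyzer. If more analyzers are to be compared, the body can be generated. A minimal Python sketch, where `multi_analyzer_mapping` is a hypothetical helper and the naming scheme simply copies the one used in this post:

```python
import json


def multi_analyzer_mapping(field, analyzers):
    """Build a mapping body with one text sub-field per analyzer."""
    return {
        "properties": {
            field: {
                "type": "text",  # the main field keeps the default (standard) analyzer
                "fields": {
                    f"{field}_{name}": {
                        "type": "text",
                        "analyzer": name,
                        "search_analyzer": name,
                    }
                    for name in analyzers
                },
            }
        }
    }


mapping = multi_analyzer_mapping("context", ["ik_smart", "ik_max_word"])
print(json.dumps(mapping, indent=2))
```

Setting `search_analyzer` explicitly is optional here: when it is omitted, Elasticsearch uses the field's `analyzer` at search time as well.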

Add several documents

POST /ik_index/text
{
  "context":"好好学习,天天向上"
}


POST /ik_index/text
{
  "context":"学和习,有什么区别" 
}

POST /ik_index/text
{
  "context":"es的分词该怎么学的"
}


POST /ik_index/text
{
  "context":"ik是怎么把句子分成词的"
}
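
The four requests above could also be sent as a single call to the bulk API (`POST /ik_index/text/_bulk`, with `Content-Type: application/x-ndjson`). The Python sketch below only builds the newline-delimited body. `bulk_index_body` is a hypothetical helper name, and sending the request is left out.

```python
import json

DOCS = [
    "好好学习,天天向上",
    "学和习,有什么区别",
    "es的分词该怎么学的",
    "ik是怎么把句子分成词的",
]


def bulk_index_body(docs, field="context"):
    """Return an NDJSON bulk body: one action line plus one source line per doc."""
    lines = []
    for text in docs:
        lines.append(json.dumps({"index": {}}))  # let ES generate the document id
        lines.append(json.dumps({field: text}, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


print(bulk_index_body(DOCS))
```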

Search for “学习”

// Search with the standard analyzer
GET /ik_index/text/_search?pretty
{
  "query": {
    "match": {
      "context": "学习"
    }
  }
}

// Search with ik_smart
GET /ik_index/text/_search?pretty
{
  "query": {
    "match": {
      "context.context_ik_smart": "学习"
    }
  }
}

// Search with ik_max_word
GET /ik_index/text/_search?pretty
{
  "query": {
    "match": {
      "context.context_ik_max_word": "学习"
    }
  }
}
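
Why the three fields can behave differently for the same query: a match query analyzes the query text and, by default, matches documents whose field contains any of the resulting terms. The Python sketch below simulates that for the first document only, using the tokens shown earlier in this post. It ignores scoring, `matches` is a hypothetical helper, and the IK tokens for the query “学习” are an assumption (a single term 学习), not output captured from a cluster.

```python
# Tokens produced for "好好学习,天天向上", taken from the _analyze results above.
DOC_TOKENS = {
    "standard": ["好", "好", "学", "习", "天", "天", "向", "上"],
    "ik_smart": ["好好学习", "天天向上"],
    "ik_max_word": ["好好学习", "好好学", "好好", "好学", "学习",
                    "天天向上", "天天", "向上"],
}

# Tokens for the query text "学习".
QUERY_TOKENS = {
    "standard": ["学", "习"],  # one token per character
    "ik_smart": ["学习"],      # assumption: IK keeps the word whole
    "ik_max_word": ["学习"],   # assumption: IK keeps the word whole
}


def matches(query_tokens, doc_tokens):
    """True if the document contains at least one of the query terms."""
    return bool(set(query_tokens) & set(doc_tokens))


for name in ("standard", "ik_smart", "ik_max_word"):
    print(name, matches(QUERY_TOKENS[name], DOC_TOKENS[name]))
```

Under these assumptions the document matches on the standard and ik_max_word fields but not on the ik_smart field, because ik_smart indexed 好好学习 as one indivisible word.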

Search results with the standard analyzer

[image: standard]

Search results with ik_smart

[image: ik_smart]

Search results with ik_max_word

[image: ik_max_word]