18. Phrase matching in Elasticsearch

2017-07-16, by 编程界的小学生

1. match query

{
    "match": {
        "content": "java spark"
    }
}

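match query: by default it matches documents that contain java, or spark, or both. It does not know whether java and spark are close together or directly adjacent, forming a phrase.

As a result, a match query returns documents that contain only java, only spark, or the phrase java spark alike.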
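To make the behaviour concrete, here is a minimal sketch in plain Python (no Elasticsearch involved) of what "contains any of the terms" means. The `tokenize` and `match_any` helpers and the four sample documents are hypothetical, and the tokenizer is only a rough stand-in for the standard analyzer. Scoring is ignored entirely.

```python
import re


def tokenize(text):
    # Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def match_any(text, query):
    # Default match semantics (operator "or"): one shared term is enough.
    doc_terms = set(tokenize(text))
    return any(term in doc_terms for term in tokenize(query))


# Hypothetical sample documents, not the forum index used in this article.
docs = {
    "a": "java spark",
    "b": "spark is written in scala",
    "c": "java is a programming language, and spark is a data engine",
    "d": "hello world",
}

hits = [doc_id for doc_id, text in docs.items() if match_any(text, "java spark")]
print(hits)  # ['a', 'b', 'c']
```

Documents b and c are returned even though java and spark never appear next to each other in them, which is exactly the limitation described above.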

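2. Requirements

(1) java spark: the two words must sit directly next to each other with nothing else in between, and only documents containing that exact phrase should be returned.
(2) java spark: the closer java and spark are to each other, the higher the doc's score and the higher it ranks.

A plain match query cannot meet either requirement.

Requirement 1 can be met with phrase match or proximity match. Requirement 2 needs proximity match.

This article covers phrase match.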

3. phrase matching

(1) Preparing the data

POST /forum/article/5/_update
{
  "doc": {
    "content": "spark is best big data solution based on scala ,an programming language similar to java spark"
  }
}

(2) Search

GET /forum/article/_search
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}

Result:

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.5753642,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith",
          "new_author_last_name": "Peter Smith",
          "new_author_first_name": "Tonny"
        }
      }
    ]
  }
}

Only the document that contains the phrase java spark is returned.

4. term position

For example:
doc1: hello world,java spark
doc2: spark java

The term positions are as follows:

term: a word, token, or keyword
position: where the term occurs in the doc, counted from 0

term	position
hello	doc1(0)
world	doc1(1)
java	doc1(2) doc2(1)
spark	doc1(3) doc2(0)
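A table like this can be rebuilt with a small sketch in plain Python. This is only an illustration of the idea of a positional index, not how Lucene stores it; `tokenize` and `build_positional_index` are hypothetical helpers, and the regex tokenizer is a rough stand-in for the standard analyzer. Positions are counted from 0 in every document.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    # Maps term -> {doc_id: [positions of the term in that doc]}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {"doc1": "hello world,java spark", "doc2": "spark java"}
index = build_positional_index(docs)

for term in ("hello", "world", "java", "spark"):
    print(term, index[term])
# hello {'doc1': [0]}
# world {'doc1': [1]}
# java {'doc1': [2], 'doc2': [1]}
# spark {'doc1': [3], 'doc2': [0]}
```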

We can also use the GET _analyze API to check that the positions are correct.

GET _analyze
{
  "text": "hello world, java spark",
  "analyzer": "standard"
}

Result:

{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "java",
      "start_offset": 13,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "spark",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

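5. How match_phrase works

We have just looked at term positions:

term	position
hello	doc1(0)
world	doc1(1)
java	doc1(2) doc2(1)
spark	doc1(3) doc2(0)

The query java spark only involves two of these terms:

java doc1(2) doc2(1)
spark doc1(3) doc2(0)

The first step is to find the documents that contain every term. A doc must contain each term of the query before it is considered any further.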

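doc1 --> java and spark --> the position of spark is 3 and the position of java is 2; 3-2=1, exactly 1, so the two terms are adjacent and in the right order. They form the phrase, so doc1 matches.

doc2 --> java and spark --> the position of spark is 0 and the position of java is 1; 0-1=-1, so the doc reads spark java rather than java spark, and it does not match.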
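The position check described above can be sketched in plain Python. This is a simplified model under stated assumptions, not Elasticsearch's or Lucene's actual implementation: `tokenize`, `positions_by_term` and `matches_phrase` are hypothetical helpers, and the tokenizer only approximates the standard analyzer. A doc matches when every query term is present and the i-th query term sits exactly i positions after the first one.

```python
import re


def tokenize(text):
    # Rough stand-in for the standard analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def positions_by_term(text):
    # Maps each term to the list of positions where it occurs in the text.
    positions = {}
    for position, term in enumerate(tokenize(text)):
        positions.setdefault(term, []).append(position)
    return positions


def matches_phrase(text, phrase):
    terms = tokenize(phrase)
    positions = positions_by_term(text)
    # Step 1: the doc must contain every term of the phrase.
    if not terms or any(term not in positions for term in terms):
        return False
    # Step 2: for some start position of the first term, the i-th term
    # must occur at start + i (difference of exactly 1 between neighbours).
    return any(
        all(start + offset in positions[term] for offset, term in enumerate(terms))
        for start in positions[terms[0]]
    )


docs = {"doc1": "hello world,java spark", "doc2": "spark java"}
matched = [doc_id for doc_id, text in docs.items() if matches_phrase(text, "java spark")]
print(matched)  # ['doc1']
```

doc2 contains both terms, so it passes step 1, but spark comes before java, so no start position satisfies step 2.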

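6. Make sure you really understand this principle. I have explained it in a lot of detail, because a later article will cover proximity match, and proximity match works on exactly the same principle as match phrase!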
