ES: Full-Text Queries

2020-02-28  一个菜鸟JAVA

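This article uses ES 7.4. Before reading it you should know what the query DSL is and what an analyzer is. The official documentation on analyzers is extensive; if you only want a rough idea of what an analyzer does, see my earlier post on ES analyzers, "ES之分析器Analyzer".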

1.Create test data

DELETE test_index

PUT test_index

PUT test_index/_mapping
{
  "properties":{
    "content":{
      "type":"text"
    }
  }
}

POST test_index/_bulk
{"index":{"_id":"1"}}
{"content":"java is my favorite language"}
{"index":{"_id":"2"}}
{"content":"Go is my favorite language"}
{"index":{"_id":"3"}}
{"content":"java language is very good"}

2.match query

Before searching, a match query analyzes the query text into terms. For example:

POST test_index/_search
{
  "query": {
    "match": {
      "content": "java language"
    }
  }
}

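The query text is analyzed first. Because we did not configure an analyzer, ES uses the standard analyzer, which splits the query string "java language" into two terms, java and language. The default operator is or, so every document in the index that contains at least one of these terms is returned. The result is: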

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.60353506,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.60353506,
        "_source" : {
          "content" : "java is my favorite language"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.60353506,
        "_source" : {
          "content" : "java language is very good"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.13353139,
        "_source" : {
          "content" : "Go is my favorite language"
        }
      }
    ]
  }
}

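As the result shows, all three documents are returned, each scored by relevance. The match query also supports several commonly used parameters. The first is operator: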

POST test_index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "java language",
        "operator": "and"
      }
    }
  }
}

The result is:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.60353506,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.60353506,
        "_source" : {
          "content" : "java is my favorite language"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.60353506,
        "_source" : {
          "content" : "java language is very good"
        }
      }
    ]
  }
}

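Compared with the previous result, "Go is my favorite language" no longer appears, because it contains only language, and operator and requires every term to match.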
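
The difference between the two operators can be sketched in plain Python. This is only a rough model: the analyze function below is a simplified stand-in for the standard analyzer (lowercase, split on non-alphanumerics), and scoring is ignored entirely. The names analyze, DOCS and match are made up for this illustration and are not part of any ES API.

```python
import re

def analyze(text):
    """Simplified stand-in for the standard analyzer (assumption):
    lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

# The three test documents from section 1.
DOCS = {
    "1": "java is my favorite language",
    "2": "Go is my favorite language",
    "3": "java language is very good",
}

def match(query, operator="or"):
    """Return the ids of documents that match the analyzed query terms.
    operator="or": at least one term must be present.
    operator="and": every term must be present."""
    terms = analyze(query)
    combine = all if operator == "and" else any
    hits = []
    for doc_id, content in DOCS.items():
        tokens = set(analyze(content))
        if combine(term in tokens for term in terms):
            hits.append(doc_id)
    return hits

print(match("java language"))         # ['1', '2', '3']
print(match("java language", "and"))  # ['1', '3']
```

The hit lists agree with the two responses shown above; the order of the ids here is just insertion order, not a relevance ranking.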

The minimum_should_match parameter sets how many of the query terms a document must contain:

POST test_index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "java language so good",
        "minimum_should_match": "75%"
      }
    }
  }
}

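The standard analyzer splits "java language so good" into four terms, so a minimum of 75% means a document must contain at least three of them. Among the test documents only "java language is very good" does (java, language and good). The four terms, as returned by the _analyze API, are: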

{
  "tokens" : [
    {
      "token" : "java",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "language",
      "start_offset" : 5,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "so",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "good",
      "start_offset" : 17,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
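
As a hedged sketch of the arithmetic, the percentage can be turned into a required term count and checked against each test document. The helper names are hypothetical, the analyzer is the same simplified stand-in (lowercase, split on non-alphanumerics), and clamping the requirement to at least one term is an assumption of this model.

```python
import re

def analyze(text):
    """Simplified stand-in for the standard analyzer (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

DOCS = {
    "1": "java is my favorite language",
    "2": "Go is my favorite language",
    "3": "java language is very good",
}

def required_matches(term_count, percent):
    """Number of terms that must match for a percentage value.
    The computed number is rounded down; requiring at least one
    term is an assumption made for this sketch."""
    return max(1, term_count * percent // 100)

def match_with_minimum(query, percent):
    """Return ids of documents containing enough of the query terms."""
    terms = analyze(query)
    needed = required_matches(len(terms), percent)
    hits = []
    for doc_id, content in DOCS.items():
        tokens = set(analyze(content))
        found = sum(1 for term in terms if term in tokens)
        if found >= needed:
            hits.append(doc_id)
    return hits

print(required_matches(4, 75))                          # 3
print(match_with_minimum("java language so good", 75))  # ['3']
```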
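
The fuzziness parameter lets terms within a given edit distance of the query term match. Compare the following two queries:

POST test_index/_search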
{
  "query": {
    "match": {
      "content": {
        "query": "jave"
      }
    }
  }
}

POST test_index/_search
{
  "query": {
    "match": {
      "content": {
        "query": "jave",
        "fuzziness": 1
      }
    }
  }
}

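The first query cannot find the documents containing java, because the term jave is not in the index. The second query sets fuzziness to 1, so jave matches java, which is one edit away.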
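
To make "one edit away" concrete, here is a small sketch that computes the classic Levenshtein distance. It is only an approximation of what ES does: ES by default also counts a transposition of two adjacent characters as a single edit, which this plain version does not. The function names are illustrative.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character
                            curr[j - 1] + 1,      # insert a character
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]

def fuzzy_match(query_term, indexed_term, fuzziness=0):
    """True if the indexed term is within `fuzziness` edits of the query term."""
    return edit_distance(query_term, indexed_term) <= fuzziness

print(edit_distance("jave", "java"))   # 1
print(fuzzy_match("jave", "java"))     # False
print(fuzzy_match("jave", "java", 1))  # True
```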

3.match_bool_prefix

This query analyzes the query text and builds a bool query from the resulting terms: every term except the last becomes a term query, and the last becomes a prefix query.

POST test_index/_search
{
  "query": {
    "match_bool_prefix":{
      "content":"java is m"
    }
  }
}

The query above can be read as the following query:

POST test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {"term":{"content": "java"}},
        {"term":{"content": "is"}},
        {"prefix":{"content": "m"}}
      ]
    }
  }
}

The two queries are equivalent. The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.603535,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.603535,
        "_source" : {
          "content" : "java is my favorite language"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.1335313,
        "_source" : {
          "content" : "Go is my favorite language"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.60353506,
        "_source" : {
          "content" : "java language is very good"
        }
      }
    ]
  }
}
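
The rewrite from match_bool_prefix to a bool query can be sketched as a small function that builds the equivalent request body. This only illustrates the rule described above and is not how ES implements it internally; the function name and the simplified analyzer are assumptions.

```python
import json
import re

def analyze(text):
    """Simplified stand-in for the standard analyzer (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rewrite_match_bool_prefix(field, text):
    """Build the bool query that a match_bool_prefix query stands for:
    a term clause for every term except the last, a prefix clause for the last."""
    tokens = analyze(text)
    if not tokens:
        raise ValueError("query text produced no terms")
    *leading, last = tokens
    clauses = [{"term": {field: token}} for token in leading]
    clauses.append({"prefix": {field: last}})
    return {"query": {"bool": {"should": clauses}}}

print(json.dumps(rewrite_match_bool_prefix("content", "java is m"), indent=2))
```

For "java is m" the printed body has the same clauses as the hand-written bool query above.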

4.match_phrase

The match_phrase query analyzes the text and builds a phrase query from the analyzed terms.

POST test_index/_search
{
  "query": {
    "match_phrase": {
      "content": "java is"
    }
  }
}

The query above searches for "java is" as a phrase. Its result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.60353506,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.60353506,
        "_source" : {
          "content" : "java is my favorite language"
        }
      }
    ]
  }
}

Only "java is my favorite language" is returned. It matches because java and is are directly adjacent and in the same order as in the query, so no term has to be moved.

Let's add one more test document:

POST test_index/_doc/4
{
  "content":"java and go is program language"
}

What if we want "java and go is program language" and "java language is very good" to be returned as well? We can use the slop parameter for that. For example:

POST test_index/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java is",
        "slop": 2
      }
    }
  }
}

In this query slop is set to 2. "java language is very good" needs only one move to match "java is", while "java and go is program language" needs two. The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.471215,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.471215,
        "_source" : {
          "content" : "java is my favorite language"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.30669597,
        "_source" : {
          "content" : "java language is very good"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.20387812,
        "_source" : {
          "content" : "java and go is program language"
        }
      }
    ]
  }
}

If slop is set to 1, "java and go is program language" no longer satisfies the query and is not returned. Let's add another document:

POST test_index/_doc/5
{
  "content":"my favorite language is java"
}

With the same query (slop 2), "my favorite language is java" is still returned. Note that swapping the positions of two adjacent terms costs a slop of 2, not 1.
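
The slop values quoted above can be reproduced with a small model. For each query term it takes the term's position in the document minus its position in the query; the slop needed is the spread between the largest and smallest of these shifts. This is a simplified sketch that assumes the query does not repeat a term, and the function names are made up for the illustration.

```python
import re
from itertools import product

def analyze(text):
    """Simplified stand-in for the standard analyzer (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def min_slop(query, content):
    """Smallest slop at which the phrase would match, or None if a
    query term is missing from the document."""
    query_terms = analyze(query)
    tokens = analyze(content)
    # All positions at which each query term occurs in the document.
    positions = [[i for i, tok in enumerate(tokens) if tok == term]
                 for term in query_terms]
    if any(not found for found in positions):
        return None
    best = None
    for combo in product(*positions):
        # How far each term sits from where the phrase expects it.
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        spread = max(shifts) - min(shifts)
        best = spread if best is None else min(best, spread)
    return best

print(min_slop("java is", "java is my favorite language"))     # 0
print(min_slop("java is", "java language is very good"))       # 1
print(min_slop("java is", "java and go is program language"))  # 2
print(min_slop("java is", "my favorite language is java"))     # 2
```

The last line shows the swap case: is sits one position before java instead of one after, which gives a spread of 2.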

5.match_phrase_prefix

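Returns documents that contain the words of the provided text, in the same order as provided. The last term of the provided text is treated as a prefix and matches any word that begins with it.
Its main difference from match_phrase is the additional prefix match on the last term.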

POST test_index/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "my favorite l"
    }
  }
}

In the example above the phrase to match is "my favorite" and the prefix is l. Now add one more document:

POST test_index/_doc/6
{
  "content":"My favorite food and language are potatoes and Chinese "
}

The query is:

POST test_index/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": {
        "query": "my favorite l",
        "slop":2
      }
    }
  }
}

The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 1.0172215,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0172215,
        "_source" : {
          "content" : "java is my favorite language"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0172215,
        "_source" : {
          "content" : "Go is my favorite language"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 1.0172215,
        "_source" : {
          "content" : "my favorite language is java"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 0.34737897,
        "_source" : {
          "content" : "My favorite food and language are potatoes and Chinese "
        }
      }
    ]
  }
}

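The newly added document is returned as well, because match_phrase_prefix also supports the slop parameter. A match_phrase_prefix query can be expensive, since a prefix such as l may expand to thousands of terms. ES therefore provides the max_expansions parameter, which limits the number of terms the prefix is expanded to; its default is 50. Be careful: the hits finally returned are not a reliable way to check whether max_expansions took effect.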
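
A rough way to picture max_expansions is prefix expansion over a sorted term dictionary that stops after a fixed number of terms. The sketch below assumes terms are visited in sorted order; the term list and the function name are invented for the illustration and this is not ES's actual implementation.

```python
from bisect import bisect_left

def expand_prefix(sorted_terms, prefix, max_expansions=50):
    """Collect terms starting with `prefix`, in sorted order,
    stopping once `max_expansions` terms have been collected."""
    start = bisect_left(sorted_terms, prefix)
    expansions = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(expansions) == max_expansions:
            break
        expansions.append(term)
    return expansions

# Hypothetical term dictionary for one field.
terms = sorted(["language", "like", "li", "lucene", "java", "my", "favorite", "long"])

print(expand_prefix(terms, "l"))     # ['language', 'li', 'like', 'long', 'lucene']
print(expand_prefix(terms, "l", 2))  # ['language', 'li']
```

In this model, with a limit of 2 a document that contains only lucene would be missed even though it starts with l, while a document containing language would still be found. Which hits survive depends on which terms fall inside the limit, not on how many hits come back.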

6.multi_match

multi_match builds on the match query and allows several fields to be searched at once. First insert some test data:

PUT test_index2

POST test_index2/_bulk
{"index":{"_id":"1"}}
{"title":"big data","content":"we can use java process big data"}
{"index":{"_id":"2"}}
{"title":"what is scala","content":"scala run in jvm and used in process big data"}
{"index":{"_id":"3"}}
{"title":"favorite language","content":"java and Go is my favorite language"}
{"index":{"_id":"4"}}
{"title":"About java","content":"java is language and run jvm"}
{"index":{"_id":"5"}}
{"title":"what is jvm ?","content":"Java Virtual Machine"}

Then search for documents whose title or content contains java.

POST test_index2/_search
{
  "query": {
    "multi_match": {
      "query": "java",
      "fields": ["title","content"]
    }
  }
}

The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 1.4877305,
    "hits" : [
      {
        "_index" : "test_index2",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.4877305,
        "_source" : {
          "title" : "About java",
          "content" : "java is language and run jvm"
        }
      },
      {
        "_index" : "test_index2",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.37031418,
        "_source" : {
          "title" : "what is jvm ?",
          "content" : "Java Virtual Machine"
        }
      },
      {
        "_index" : "test_index2",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.28072202,
        "_source" : {
          "title" : "big data",
          "content" : "we can use java process big data"
        }
      },
      {
        "_index" : "test_index2",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.28072202,
        "_source" : {
          "title" : "favorite language",
          "content" : "java and Go is my favorite language"
        }
      }
    ]
  }
}

Types of multi_match query

The type is an important concept for multi_match. The default type is best_fields, so written out in full the query above looks like this:

POST test_index2/_search
{
  "query": {
    "multi_match": {
      "query": "java",
      "fields": ["title","content"],
      "type": "best_fields"
    }
  }
}

Besides best_fields there are several other types: most_fields, cross_fields, phrase, phrase_prefix and bool_prefix.

best_fields

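best_fields means the search should favor documents in which a single field matches the most query terms. To understand best_fields you first need to understand dis_max, the disjunction max query.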

PUT test_index3

POST test_index3/_bulk
{"index":{"_id":"1"}}
{"title":"I like chinese food","content":"my favorite food is rice"}
{"index":{"_id":"2"}}
{"title":"food is very import","content":"to many people don't have food"}
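
Run the following three queries: the same match on title, then on content, then both combined in a dis_max query.

POST test_index3/_search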
{
  "query": {
    "match": {
      "title": "chinese food"
    }
  }
}

POST test_index3/_search
{
  "query": {
    "match": {
      "content": "chinese food"
    }
  }
}

POST test_index3/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match":{"title":"chinese food"}},
        {"match":{"content":"chinese food"}}
        ]
    }
  }
}

The results of the three queries, in order:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.87546873,
    "hits" : [
      {
        "_index" : "test_index3",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.87546873,
        "_source" : {
          "title" : "I like chinese food",
          "content" : "my favorite food is rice"
        }
      },
      {
        "_index" : "test_index3",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "title" : "food is very import",
          "content" : "to many people don't have food"
        }
      }
    ]
  }
}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.18936403,
    "hits" : [
      {
        "_index" : "test_index3",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.18936403,
        "_source" : {
          "title" : "I like chinese food",
          "content" : "my favorite food is rice"
        }
      },
      {
        "_index" : "test_index3",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.17578416,
        "_source" : {
          "title" : "food is very import",
          "content" : "to many people don't have food"
        }
      }
    ]
  }
}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.87546873,
    "hits" : [
      {
        "_index" : "test_index3",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.87546873,
        "_source" : {
          "title" : "I like chinese food",
          "content" : "my favorite food is rice"
        }
      },
      {
        "_index" : "test_index3",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "title" : "food is very import",
          "content" : "to many people don't have food"
        }
      }
    ]
  }
}

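Look at the scores. The final dis_max score is the highest score among the matching queries; the other scores are discarded and do not contribute to the final score. Sometimes, though, we want the other matching queries to contribute as well. Setting tie_breaker adds their scores to the final score in proportion. Its value ranges from 0.0 to 1.0. For example: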

POST test_index3/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match":{"title":"chinese food"}},
        {"match":{"content":"chinese food"}}
        ],
        "tie_breaker": 1
    }
  }
}

The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0648328,
    "hits" : [
      {
        "_index" : "test_index3",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0648328,
        "_source" : {
          "title" : "I like chinese food",
          "content" : "my favorite food is rice"
        }
      },
      {
        "_index" : "test_index3",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.35810572,
        "_source" : {
          "title" : "food is very import",
          "content" : "to many people don't have food"
        }
      }
    ]
  }
}

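We set tie_breaker to 1, which means the other scores are added to the final score at 100%. Comparing this result with the previous one shows that the parameter takes effect.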
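
The scoring rule can be written as a one-line formula: the best score plus tie_breaker times the sum of the other scores. The sketch below applies it to the per-field scores taken from the responses above; the function name is illustrative, and the rounding is only for display.

```python
def dis_max_score(scores, tie_breaker=0.0):
    """Combine per-query scores the way dis_max is described above:
    the best score, plus tie_breaker times the sum of the others."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)

# Per-field scores for "chinese food", copied from the responses above.
doc1 = [0.87546873, 0.18936403]  # title score, content score
doc2 = [0.18232156, 0.17578416]

print(round(dis_max_score(doc1), 4))       # 0.8755
print(round(dis_max_score(doc1, 1.0), 4))  # 1.0648
print(round(dis_max_score(doc2, 1.0), 4))  # 0.3581
```

With tie_breaker set to 1 the values agree with the scores 1.0648328 and 0.35810572 in the response, up to floating-point rounding.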

Once dis_max is clear, the best_fields strategy is easy to understand. The following two queries mean the same thing:

POST test_index2/_search
{
  "query": {
    "multi_match": {
      "query": "java",
      "fields": ["title","content"],
      "type": "best_fields"
    }
  }
}

POST test_index2/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match":{"title":"java"}},
        {"match":{"content":"java"}}
        ]
    }
  }
}

Both queries return the same result.

most_fields

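most_fields is the opposite of best_fields: it adds up the scores from all fields instead of taking only the highest. In other words, with most_fields the documents that match in more fields are ranked higher. For example: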

POST test_index2/_search
{
  "query": {
    "multi_match": {
      "query": "java",
      "fields": ["title","content"],
      "type": "most_fields"
    }
  }
}

It is equivalent to the following query:

POST test_index2/_search
{
  "query": {
    "bool": {
      "should": [
        {"match":{"content": "java"}},
        {"match":{"title": "java"}}
      ]
    }
  }
}

phrase and phrase_prefix

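These two types behave like the match_phrase and match_phrase_prefix queries, except that they match against several fields at once. Note that by default they score like best_fields, taking only the highest score. For example, the following two queries can be regarded as equivalent: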

POST test_index2/_search
{
  "query": {
    "multi_match": {
      "query": "java",
      "fields": ["title","content"],
      "type": "phrase"
    }
  }
}

POST test_index2/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match_phrase":{"title":"java"}},
        {"match_phrase":{"content":"java"}}
        ]
    }
  }
}
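
The equivalences described so far for best_fields, most_fields, phrase and phrase_prefix can be summarized in a small rewriting function. It is a sketch of the mapping described in this article, not ES's internal rewrite; the function and table names are made up, and cross_fields and bool_prefix are left out on purpose.

```python
import json

# Per-field query used by each multi_match type (as described above).
PER_FIELD_QUERY = {
    "best_fields": "match",
    "most_fields": "match",
    "phrase": "match_phrase",
    "phrase_prefix": "match_phrase_prefix",
}

def rewrite_multi_match(query, fields, match_type="best_fields"):
    """Build the query that a multi_match of the given type stands for."""
    if match_type not in PER_FIELD_QUERY:
        raise ValueError("type not covered by this sketch: " + match_type)
    per_field = [{PER_FIELD_QUERY[match_type]: {field: query}} for field in fields]
    if match_type == "most_fields":
        # Scores of all matching fields are added up.
        return {"bool": {"should": per_field}}
    # The other types keep only the best-scoring field.
    return {"dis_max": {"queries": per_field}}

print(json.dumps(rewrite_multi_match("java", ["title", "content"], "phrase"), indent=2))
```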

cross_fields

Insert sample data:

DELETE test_index4
PUT test_index4
POST test_index4/_bulk
{ "index": { "_id": "1"} }
{"first_name" : "Peter", "last_name" : "Smith"}
{ "index": { "_id": "2"} }
{"first_name" : "Smith", "last_name" : "Williams"}
{ "index": { "_id": "3"} }
{"first_name" : "Jack", "last_name" : "Ma"}
{ "index": { "_id": "4"} }
{"first_name" : "Robbin", "last_name" : "Li"}
{ "index": { "_id": "5"} }
{"first_name" : "Tonny", "last_name" : "Peter Smith"}

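Now we want to find people whose full name contains Peter Smith. Based on what has been covered so far, you might write either of these two queries: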

POST test_index4/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "fields": ["first_name", "last_name"],
      "type": "best_fields"
    }
  }
}

POST test_index4/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "fields": ["first_name", "last_name"],
      "type": "most_fields"
    }
  }
}

The results of both contain this document:

{
        "_index" : "test_index4",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.3862944,
        "_source" : {
          "first_name" : "Smith",
          "last_name" : "Williams"
        }
}

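This document does not really meet our requirement, because it does not contain Peter. We might then modify the two queries and add operator to make them more precise.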

POST test_index4/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "fields": ["first_name", "last_name"],
      "type": "best_fields",
      "operator": "and"
    }
  }
}

POST test_index4/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "fields": ["first_name", "last_name"],
      "type": "most_fields",
      "operator": "and"
    }
  }
}

But these two queries still have a problem: the following document cannot be found:

{
        "_index" : "test_index4",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.3862944,
        "_source" : {
          "first_name" : "Peter",
          "last_name" : "Smith"
        }
}

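This document does meet our requirement, yet it is not found, because with best_fields and most_fields the operator is applied to each field separately and neither field contains both terms. One way to solve this is to use copy_to when creating the index, copying first_name and last_name into a new field. For example: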

PUT test_index5

POST test_index5/_mapping
{
  "properties":{
    "first_name":{
      "type":"text",
      "copy_to":"full_name"
    },
    "last_name":{
      "type":"text",
      "copy_to":"full_name"
    },
    "full_name":{
      "type":"text"
    }
  }
}

POST test_index5/_bulk
{ "index": { "_id": "1"} }
{"first_name" : "Peter", "last_name" : "Smith"}
{ "index": { "_id": "2"} }
{"first_name" : "Smith", "last_name" : "Williams"}
{ "index": { "_id": "3"} }
{"first_name" : "Jack", "last_name" : "Ma"}
{ "index": { "_id": "4"} }
{"first_name" : "Robbin", "last_name" : "Li"}
{ "index": { "_id": "5"} }
{"first_name" : "Tonny", "last_name" : "Peter Smith"}

POST test_index5/_search
{
  "query": {
    "match": {
      "full_name": {
        "query": "Peter Smith",
        "operator": "and"
      }
    }
  }
}

This approach works well but is not very convenient. We can use cross_fields to get the same behavior.

POST test_index4/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "fields": ["first_name","last_name"],
      "operator": "and",
      "type": "cross_fields"
    }
  }
}

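cross_fields treats the fields as if they were merged into one field and searches that. This is very similar to storing several fields in one new field and searching the new field. Its advantage is that you do not need to change the mapping and reindex, and you can also set field weights dynamically. For example: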

POST test_index4/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "fields": ["first_name^3","last_name"],
      "operator": "and",
      "type": "cross_fields"
    }
  }
}

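With the ^number syntax we can change the weights at query time and thereby influence the score. In short, cross_fields searches across several fields but behaves as if it were searching a single field.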
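
The difference between applying operator and per field (best_fields, most_fields) and applying it per term across fields (cross_fields) can be sketched with the sample data from test_index4. This models matching only, not scoring or boosts, and the function names are invented for the illustration.

```python
import re

def analyze(text):
    """Simplified stand-in for the standard analyzer (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Sample data from test_index4.
PEOPLE = {
    "1": {"first_name": "Peter", "last_name": "Smith"},
    "2": {"first_name": "Smith", "last_name": "Williams"},
    "3": {"first_name": "Jack", "last_name": "Ma"},
    "4": {"first_name": "Robbin", "last_name": "Li"},
    "5": {"first_name": "Tonny", "last_name": "Peter Smith"},
}

def field_centric_and(query, fields):
    """Model of best_fields/most_fields with operator and:
    all terms must be found together in one single field."""
    terms = analyze(query)
    return [doc_id for doc_id, doc in PEOPLE.items()
            if any(all(t in analyze(doc[f]) for t in terms) for f in fields)]

def term_centric_and(query, fields):
    """Model of cross_fields with operator and:
    every term must be found in at least one of the fields."""
    terms = analyze(query)
    return [doc_id for doc_id, doc in PEOPLE.items()
            if all(any(t in analyze(doc[f]) for f in fields) for t in terms)]

fields = ["first_name", "last_name"]
print(field_centric_and("Peter Smith", fields))  # ['5']
print(term_centric_and("Peter Smith", fields))   # ['1', '5']
```

The per-field version misses document 1, where Peter and Smith sit in different fields, while the per-term version finds it and still excludes document 2.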

bool_prefix

This type scores like most_fields, but uses a match_bool_prefix query instead of a match query.
