Elasticsearch Search in Depth: Full-Text Search

2020-08-27 觉释

Term-Based versus Full-Text

GET /_search
{
    "query": {
        "constant_score": {
            "filter": {
                "term": { "gender": "female" }
            }
        }
    }
}

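The match Query

The match query is the core query. Whatever field you need to query, match should be your first choice. It is a high-level full-text query, meaning that it can handle both full-text fields and exact-value fields.
That said, the main use case for the match query is full-text search. The following simple example shows how full-text search works.
First, index some data: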

DELETE /my_index 

PUT /my_index
{ "settings": { "number_of_shards": 1 }} 

POST /my_index/_bulk
{ "index": { "_id": 1 }}
{ "title": "The quick brown fox" }
{ "index": { "_id": 2 }}
{ "title": "The quick brown fox jumps over the lazy dog" }
{ "index": { "_id": 3 }}
{ "title": "The quick brown fox jumps over the quick dog" }
{ "index": { "_id": 4 }}
{ "title": "Brown fox brown dog" }

Single-Word Query

GET /my_index/_search
{
    "query": {
        "match": {
            "title": "QUICK!"
        }
    }
}
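
To make the query above more concrete, here is a minimal Python sketch of what happens to the string QUICK! before it is looked up. It assumes the title field uses the default standard analyzer, which lowercases terms and drops punctuation. The names analyze, build_index, and match_single are hypothetical helpers written for this illustration; they are not part of any Elasticsearch API, and the regex is only a rough stand-in for the real tokenizer.

```python
import re
from collections import defaultdict

# The four sample documents indexed earlier.
DOCS = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}

def analyze(text):
    """Rough stand-in for the standard analyzer: split on non-word
    characters and lowercase every token."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def build_index(docs):
    """Build a toy inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for term in analyze(title):
            index[term].add(doc_id)
    return index

INDEX = build_index(DOCS)

def match_single(query):
    """Analyze the query string, then look up its single term."""
    terms = analyze(query)
    if not terms:
        return []
    return sorted(INDEX.get(terms[0], set()))

print(analyze("QUICK!"))       # ['quick']
print(match_single("QUICK!"))  # [1, 2, 3]
```

Document 4 is not returned because its title does not contain the term quick.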

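Multiword Queries

If we could search for only one word at a time, full-text search would be rather inflexible. Fortunately, the match query makes multiword queries just as simple: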

GET /my_index/_search
{
    "query": {
        "match": {
            "title": "BROWN DOG!"
        }
    }
}

Improving Precision

GET /my_index/_search
{
    "query": {
        "match": {
            "title": {      
                "query":    "BROWN DOG!",
                "operator": "and"
            }
        }
    }
}
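
The difference between the default or operator and the and operator can be sketched in a few lines of Python. This is an illustration over the four sample titles, not how Elasticsearch is implemented: analyze is a rough stand-in for the standard analyzer, and match is a hypothetical helper that only decides which documents match, without scoring them.

```python
import re

DOCS = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}

def analyze(text):
    """Rough stand-in for the standard analyzer."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def match(query, operator="or"):
    """Return ids of documents matching any ("or") or all ("and") query terms."""
    terms = analyze(query)
    if not terms:
        return []
    hits = []
    for doc_id, title in DOCS.items():
        doc_terms = set(analyze(title))
        found = [term in doc_terms for term in terms]
        if (operator == "and" and all(found)) or (operator != "and" and any(found)):
            hits.append(doc_id)
    return hits

print(match("BROWN DOG!"))                  # [1, 2, 3, 4]
print(match("BROWN DOG!", operator="and"))  # [2, 3, 4]
```

With and, document 1 drops out because it contains brown but not dog.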

Controlling Precision

GET /my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query":                "quick brown dog",
        "minimum_should_match": "75%"
      }
    }
  }
}
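
As a rough sketch of how a percentage such as 75% becomes a concrete number of required terms, the Python below resolves the value by rounding down and then applies it to the sample titles. It handles only positive integers and positive percentages (the real parameter accepts more forms), and required_matches and match are hypothetical helper names used for this illustration.

```python
import re

DOCS = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}

def analyze(text):
    """Rough stand-in for the standard analyzer."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def required_matches(spec, clause_count):
    """Resolve a positive integer or a positive percentage string.
    Percentages are rounded down, e.g. 75% of 3 terms -> 2."""
    if isinstance(spec, str) and spec.endswith("%"):
        return (clause_count * int(spec[:-1])) // 100
    return int(spec)

def match(query, minimum_should_match):
    """Return ids of documents containing at least the required number of terms."""
    terms = analyze(query)
    needed = required_matches(minimum_should_match, len(terms))
    hits = []
    for doc_id, title in DOCS.items():
        doc_terms = set(analyze(title))
        if sum(term in doc_terms for term in terms) >= needed:
            hits.append(doc_id)
    return hits

print(required_matches("75%", 3))        # 2
print(match("quick brown dog", "75%"))   # [1, 2, 3, 4]
print(match("quick brown dog", "100%"))  # [2, 3]
```

With 75%, every sample document contains at least two of the three terms, so all four match; requiring 100% leaves only the two titles that contain quick, brown, and dog.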

Combining Queries

GET /my_index/_search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "quick" }},
      "must_not": { "match": { "title": "lazy"  }},
      "should": [
                  { "match": { "title": "brown" }},
                  { "match": { "title": "dog"   }}
      ]
    }
  }
}
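
The filtering logic of this bool query can be sketched in Python as follows. The count of matching should terms is used here only as a crude stand-in for relevance; it is not the score Elasticsearch computes. bool_search and analyze are hypothetical helpers for this illustration.

```python
import re

DOCS = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}

def analyze(text):
    """Rough stand-in for the standard analyzer."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def bool_search(must, must_not, should):
    """Return (doc_id, matching_should_count) pairs, best first.
    must terms are required, must_not terms exclude a document,
    should terms are optional but raise the rank when present."""
    results = []
    for doc_id, title in DOCS.items():
        terms = set(analyze(title))
        if not all(term in terms for term in must):
            continue
        if any(term in terms for term in must_not):
            continue
        should_hits = sum(term in terms for term in should)
        results.append((doc_id, should_hits))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results

print(bool_search(must=["quick"], must_not=["lazy"], should=["brown", "dog"]))
# [(3, 2), (1, 1)]
```

Document 2 is excluded because it contains lazy, document 4 because it lacks quick, and document 3 ranks above document 1 because it matches both should terms.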

Score Calculation

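The bool query calculates a relevance _score for each document by adding together the _score of every matching must and should clause, and then dividing by the total number of must and should clauses. (This describes older releases; from Elasticsearch 6.0 on, where the coordination factor was removed, the scores of the matching clauses are simply summed.)
The must_not clauses do not affect the score; their only job is to exclude irrelevant documents.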

Controlling Precision

GET /my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "brown" }},
        { "match": { "title": "fox"   }},
        { "match": { "title": "dog"   }}
      ],
      "minimum_should_match": 2 
    }
  }
}

How match Uses bool

The following two queries are equivalent:

{
    "match": { "title": "brown fox"}
}

{
  "bool": {
    "should": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox"   }}
    ]
  }
}

With the and operator, all of the term queries are added as must clauses, so every clause must match. The following two queries are equivalent:

{
    "match": {
        "title": {
            "query":    "brown fox",
            "operator": "and"
        }
    }
}

{
  "bool": {
    "must": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox"   }}
    ]
  }
}

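If the minimum_should_match parameter is specified, it is passed directly through to the bool query, which makes the following two queries equivalent. Because there are only three terms, the 75% in the match query is rounded down to 2, so at least two of the three should clauses must match: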

{
    "match": {
        "title": {
            "query":                "quick brown fox",
            "minimum_should_match": "75%"
        }
    }
}

{
  "bool": {
    "should": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox"   }},
      { "term": { "title": "quick" }}
    ],
    "minimum_should_match": 2 
  }
}
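
The three equivalences above can be summarized as a small rewriting function. The Python below is an illustration of the mapping only, not Elasticsearch's internal implementation. rewrite_match is a hypothetical helper, analyze is a rough stand-in for the standard analyzer, and only positive integers and positive percentages are handled for minimum_should_match.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def required_matches(spec, clause_count):
    """Resolve a positive integer or percentage string, rounding down."""
    if isinstance(spec, str) and spec.endswith("%"):
        return (clause_count * int(spec[:-1])) // 100
    return int(spec)

def rewrite_match(field, text, operator="or", minimum_should_match=None):
    """Express a match query as the equivalent bool query of term clauses."""
    clauses = [{"term": {field: term}} for term in analyze(text)]
    if operator == "and":
        # Every term becomes a required clause.
        return {"bool": {"must": clauses}}
    bool_body = {"should": clauses}
    if minimum_should_match is not None:
        bool_body["minimum_should_match"] = required_matches(
            minimum_should_match, len(clauses)
        )
    return {"bool": bool_body}

print(rewrite_match("title", "brown fox"))
print(rewrite_match("title", "brown fox", operator="and"))
print(rewrite_match("title", "quick brown fox", minimum_should_match="75%"))
```

The clauses come out in the order of the query string; the order of clauses inside should or must does not change which documents match.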

Boosting Query Clauses

A simple bool query lets us express fairly complex logic such as the following:

GET /_search
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "content": { 
                        "query":    "full text search",
                        "operator": "and"
                    }
                }
            },
            "should": [ 
                { "match": { "content": "Elasticsearch" }},
                { "match": { "content": "Lucene"        }}
            ]
        }
    }
}

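We can control the relative weight of any query clause by specifying a boost value. The default boost is 1; a value greater than 1 increases the relative weight of that clause. So we can rewrite the preceding query as follows: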

GET /_search
{
    "query": {
        "bool": {
            "must": {
                "match": {  
                    "content": {
                        "query":    "full text search",
                        "operator": "and"
                    }
                }
            },
            "should": [
                { "match": {
                    "content": {
                        "query": "Elasticsearch",
                        "boost": 3 
                    }
                }},
                { "match": {
                    "content": {
                        "query": "Lucene",
                        "boost": 2 
                    }
                }}
            ]
        }
    }
}

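The must clause on full text search uses the default boost of 1.
The Elasticsearch clause is the most important, because it has the highest boost (3).
The Lucene clause (boost 2) is more important than a clause with the default boost, but less important than the Elasticsearch clause.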
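
When such queries are assembled in application code, a small helper keeps the boost values readable. The following Python sketch builds the same request body as plain dictionaries and leaves boost out wherever the default of 1 applies. match_clause and build_query are hypothetical names for this illustration and do not belong to any client library.

```python
import json

def match_clause(field, text, boost=None, operator=None):
    """Build a match clause; optional parameters are only added when set."""
    body = {"query": text}
    if operator is not None:
        body["operator"] = operator
    if boost is not None:
        body["boost"] = boost
    return {"match": {field: body}}

def build_query():
    """Required full-text clause plus two optional, boosted clauses."""
    return {
        "query": {
            "bool": {
                "must": match_clause("content", "full text search", operator="and"),
                "should": [
                    match_clause("content", "Elasticsearch", boost=3),
                    match_clause("content", "Lucene", boost=2),
                ],
            }
        }
    }

# Print the request body that would be sent to GET /_search.
print(json.dumps(build_query(), indent=2))
```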

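Controlling Analysis

Queries can find only terms that actually exist in the inverted index, so it is important that the same analysis process is applied to the document at index time and to the query string at search time. Only then will the terms in the query match the terms in the inverted index.

Although we say document, analyzers are determined per field. Each field can have a different analyzer, either by configuring a specific analyzer for that field or by falling back on the higher-level type, index, or node defaults. At index time, a field's value is analyzed with the configured or default analyzer for that field.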

For example, add a new field to my_index:

PUT /my_index/_mapping
{
    "properties": {
        "english_title": {
            "type":     "text",
            "analyzer": "english"
        }
    }
}

Now we can use the analyze API on the word Foxes to compare how the english_title and title fields would analyze it at index time:

GET /my_index/_analyze
{
  "field": "title",
  "text": "Foxes"
}

GET /my_index/_analyze
{
  "field": "english_title",
  "text": "Foxes"
}

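The title field, which uses the default standard analyzer, returns the term foxes, while the english_title field, which uses the english analyzer, returns the term fox. This means that a low-level term query for the exact term fox would match the english_title field but not the title field.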
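
The mismatch can be reproduced with a toy model in Python. Note the assumptions: standard_analyze only lowercases and splits on non-word characters, and toy_english_analyze strips a plural ending as a very crude stand-in for the stemming done by the real english analyzer (which also does more, such as stopword removal). All function names here are hypothetical.

```python
import re

def standard_analyze(text):
    """Rough stand-in for the standard analyzer: tokenize and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]

def toy_english_analyze(text):
    """Crude stand-in for the english analyzer: lowercase, then strip
    a trailing 'es' or 's'. The real analyzer uses a proper stemmer."""
    terms = []
    for term in standard_analyze(text):
        if term.endswith("es") and len(term) > 3:
            term = term[:-2]
        elif term.endswith("s") and len(term) > 2:
            term = term[:-1]
        terms.append(term)
    return terms

# Terms stored in the inverted index for each field of one document.
INDEXED = {
    "title": standard_analyze("Foxes"),
    "english_title": toy_english_analyze("Foxes"),
}

def term_query(field, term):
    """A term query does no analysis: the term must match exactly."""
    return term in INDEXED[field]

print(INDEXED)                           # {'title': ['foxes'], 'english_title': ['fox']}
print(term_query("title", "fox"))          # False
print(term_query("english_title", "fox"))  # True
```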

High-level queries such as the match query understand field mappings and apply the correct analyzer to each field being queried. The validate-query API shows this behavior:

GET /my_index/_validate/query?explain
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title":         "Foxes"}},
                { "match": { "english_title": "Foxes"}}
            ]
        }
    }
}