Elasticsearch: Indexes, Mappings, and Document Operations

2022-02-27 xiaogp

Summary: Elasticsearch
Reading notes on Chapter 3 of 《Elasticsearch搜索引擎构建入门与实战》

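Index operations

The main index operations are create, delete, close, open, and alias.

(1) Create an index

Use a PUT request with the following syntax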

PUT /${index_name}
{
    "settings": {
        ...
    },
    "mappings": {
    ...
    }
}

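The settings section holds the index configuration, such as the number of primary shards and replica shards, and the mappings section describes how the data is structured. For example, the following statement creates an index with 3 primary shards, 1 replica, and two fields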

PUT /my_label
{
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
    },
    "mappings": {
        "properties": {
            "ent_name": {
                "type": "keyword"
            },
            "score": {
                "type": "double"
            }
        }
    }
}
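
When the same request has to be issued for many indexes, the body can be generated instead of typed by hand. The following is a minimal sketch, not part of the book's material; build_create_index_body is a hypothetical helper name, and the snippet only builds the JSON body without contacting a cluster.

```python
import json


def build_create_index_body(shards, replicas, fields):
    """Return the JSON body for PUT /<index_name>.

    fields maps a field name to its Elasticsearch type, e.g. {"score": "double"}.
    build_create_index_body is a hypothetical helper, not a client API.
    """
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            "properties": {name: {"type": ftype} for name, ftype in fields.items()}
        },
    }


# Same settings and fields as the my_label example
body = build_create_index_body(3, 1, {"ent_name": "keyword", "score": "double"})
print(json.dumps(body, indent=4))
```

The printed JSON can be pasted after PUT /my_label in the Kibana console.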

In Kibana's Index Management page, the index now shows primaries=3 and replicas=1

(2) Delete an index

To delete an index, use a DELETE request

DELETE /my_index

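(3) Close an index

Once an index is closed, ES only stores its data; the index cannot serve updates or searches until it is opened again. Use a POST request to the _close route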

POST /my_label/_close

(4) Open an index

Likewise, send a POST request to the _open route

POST /my_label/_open

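(5) Index aliases

One or more ES indexes can be given an additional name. This is similar to user names and group names in Linux: you can query several indexes at once through the alias (the group) instead of querying each index (each user) one by one.

As an example, first create three indexes, which will later all be given the same alias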

PUT /my_index_1
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "city": {
        "type": "keyword"
      },
      "price": {
        "type": "double"
      }
    }
  }
}

PUT /my_index_2
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "city": {
        "type": "keyword"
      },
      "price": {
        "type": "double"
      }
    }
  }
}

PUT /my_index_3
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "city": {
        "type": "keyword"
      },
      "price": {
        "type": "double"
      }
    }
  }
}

Then insert three documents

POST /my_index_1/_doc/001
{
  "title":"好再来餐厅",
  "city": "青岛",
  "price": 578.23
}
POST /my_index_3/_doc/001
{
  "title":"好再来网吧",
  "city": "青岛",
  "price": 578.23
}
POST /my_index_2/_doc/001
{
  "title":"好再来浴室",
  "city": "青岛",
  "price": 578.23
}

Give all three indexes, my_index_1, my_index_2, and my_index_3, the alias my_index_all

POST /_aliases
{
    "actions": [
        {
            "add": {
                "index": "my_index_1",
                "alias": "my_index_all"
            }
        },
        {
            "add": {
                "index": "my_index_2",
                "alias": "my_index_all"
            }
        },
        {
            "add": {
                "index": "my_index_3",
                "alias": "my_index_all"
            }
        }
    ]
}

Now try to fetch the document whose id is 001 through the alias

GET /my_index_all/_doc/001

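The request fails: the alias points to more than one index, so ES cannot tell which index a single-document operation should go to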

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "alias [my_index_all] has more than one index associated with it [my_index_2, my_index_3, my_index_1], can't execute a single index op"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "alias [my_index_all] has more than one index associated with it [my_index_2, my_index_3, my_index_1], can't execute a single index op"
  },
  "status" : 400
}

Queries by other conditions, such as a search, do work through the alias

POST /my_index_all/_search
{
    "query":{
            "match":{
                  "title": "好再"
          }                         
    }
}

Three results are returned, and each document reports the index it belongs to (_index)

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "my_index_1",
        "_id" : "001",
        "_score" : 0.5753642,
        "_source" : {
          "title" : "好再来餐厅",
          "city" : "青岛",
          "price" : 578.23
        }
      },
      {
        "_index" : "my_index_2",
        "_id" : "001",
        "_score" : 0.5753642,
        "_source" : {
          "title" : "好再来浴室",
          "city" : "青岛",
          "price" : 578.23
        }
      },
      {
        "_index" : "my_index_3",
        "_id" : "001",
        "_score" : 0.5753642,
        "_source" : {
          "title" : "好再来网吧",
          "city" : "青岛",
          "price" : 578.23
        }
      }
    ]
  }
}

To remove the aliases, use the following syntax

POST /_aliases
{
    "actions":[
         { "remove":{"index": "my_index_1", "alias": "my_index_all"}},
         { "remove":{"index": "my_index_2", "alias": "my_index_all"}},
         { "remove":{"index": "my_index_3", "alias": "my_index_all"}}
   ]
}
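
The add and remove requests differ only in the action name, so both bodies can be produced by one small function. This is an illustrative sketch; build_alias_actions is a hypothetical helper and the snippet only constructs the request body for POST /_aliases.

```python
def build_alias_actions(action, indices, alias):
    """Build the body for POST /_aliases.

    action is "add" or "remove"; one entry is produced per index.
    build_alias_actions is a hypothetical helper, not a client API.
    """
    if action not in ("add", "remove"):
        raise ValueError("action must be 'add' or 'remove'")
    return {"actions": [{action: {"index": name, "alias": alias}} for name in indices]}


indices = ["my_index_1", "my_index_2", "my_index_3"]
add_body = build_alias_actions("add", indices, "my_index_all")
remove_body = build_alias_actions("remove", indices, "my_index_all")
print(add_body)
print(remove_body)
```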

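Mapping operations

A mapping is similar to a table schema in a traditional database. ES can infer data types automatically, but it is recommended to define the mapping by hand.

(1) Create a mapping

The basic syntax is shown below; the mapping is defined directly when the index is created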

PUT /${index_name}
{
    "mappings": {
        "properties": {
               "cols1": {"type": ""} 
            }
    }
}

You can also create the index first and add the mappings afterwards by sending a POST request to the _mapping route

PUT /my_index_4

POST /my_index_4/_mapping
{
    "properties": {
            "title": {"type": "text"},
            "city": {"type": "keyword"},
            "price": {"type": "double"}
    }
}

(2) View a mapping

To view a mapping, send a GET request to the _mapping route

GET /my_index_4/_mapping

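(3) Extend a mapping

The attributes or type of a field that is already defined in a mapping cannot be changed; you can only add new fields. The DSL for adding a field is the same as before: a POST request to the _mapping route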

POST /my_index_4/_mapping
{
    "properties": {
          "degree": {"type": "keyword"}
    }
}

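(4) Basic data types

1. keyword type

keyword is a string type that is not tokenized. When building the index, ES puts the whole keyword string into the inverted index as a single term, rather than indexing each part produced by tokenization. keyword is generally used for exact string comparison in filtering, sorting, and aggregation, and is queried with term in the DSL.
For example, filter for documents where a field equals a given value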

GET /my_index_4/_search
{
    "query":{
        "term": {
          "city": {"value": "扬州"}   
         }
    }
}

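If you run a match query against a keyword field using only part of its content, as in a full-text search, no documents are hit. For example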

GET /my_index_4/_search
{
    "query":{
        "match": {
          "city": "州"
         }
    }
}

2. text type

A text field is tokenized, and each resulting token is added to the inverted index. Matching documents are scored at search time

GET /my_label/_search
{
    "query": {
        "match": {
            "title": "好来药酒"
        }
    }
}

Results are returned in descending order of score

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0577903,
    "hits" : [
      {
        "_index" : "my_label",
        "_id" : "003",
        "_score" : 1.0577903,
        "_source" : {
          "title" : "好再来药店",
          "city" : "青岛",
          "price" : 578.23
        }
      },
      {
        "_index" : "my_label",
        "_id" : "001",
        "_score" : 0.8630463,
        "_source" : {
          "title" : "好再来酒店",
          "city" : "青岛",
          "price" : 578.23
        }
      },
      {
        "_index" : "my_label",
        "_id" : "002",
        "_score" : 0.36464313,
        "_source" : {
          "title" : "好再来饭店",
          "city" : "青岛",
          "price" : 578.23
        }
      }
    ]
  }
}

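A term query on a text field finds nothing, because the text has already been tokenized and the full string is not stored as a single term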

POST /my_label/_search
{
    "query": {
        "term": {
            "title": {"value": "好再来饭店"}
        }
    }
}

An empty result is returned, and max_score is null

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
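
The difference between keyword and text, and between term and match, can be reproduced with a toy inverted index. The sketch below is a simplified model under stated assumptions: the text analyzer is approximated by splitting the string into single characters, scoring is left out, and all function names are hypothetical. The keyword documents are made up for the illustration.

```python
from collections import defaultdict


def analyze_keyword(value):
    # keyword: the whole string is stored as one term
    return [value]


def analyze_text(value):
    # text: simplified stand-in for an analyzer, one term per character
    return list(value)


def build_index(docs, analyzer):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for term in analyzer(value):
            index[term].add(doc_id)
    return index


def term_query(index, value):
    # term: look the value up as-is, without analyzing it
    return set(index.get(value, set()))


def match_query(index, value, analyzer):
    # match: analyze the query text, return docs containing any resulting term
    hits = set()
    for term in analyzer(value):
        hits |= index.get(term, set())
    return hits


titles = {"001": "好再来酒店", "002": "好再来饭店", "003": "好再来药店"}
text_index = build_index(titles, analyze_text)
keyword_index = build_index({"001": "扬州", "002": "青岛"}, analyze_keyword)

print(term_query(keyword_index, "扬州"))                  # exact value: hit
print(match_query(keyword_index, "州", analyze_keyword))  # partial value: no hit
print(term_query(text_index, "好再来饭店"))               # full string is not a term: no hit
print(match_query(text_index, "好来药酒", analyze_text))  # shares characters with every title
```

In this model the text index never contains the full title as a term, which is why the term lookup comes back empty while match still finds all three documents.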

If a text field is declared in the mapping with index: false, the field is not indexed; it can only be displayed and cannot be used for matching in a search

PUT /hotel/_doc/_mapping
{
  "properties": {
    "no_index_col": {"type": "text", "index": false}
  }
}

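Searching on that field fails with an error saying it is not indexed. Setting this attribute on a keyword field makes it unsearchable in the same way

        "reason": "Cannot search on field [no_index_col] since it is not indexed"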

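3. Numeric types

ES supports several numeric types (long, integer, short, byte, double, float, and so on). Choose the type with the smallest range that still meets the business requirement. For example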

PUT /my_index_5
{
    "mappings":{
            "properties": {
                     "name": {"type": "keyword"},
                     "age": {"type": "integer"},
                     "score": {"type": "double"},
                     "no": {"type": "long"}
              }
    }
}

Insert a few documents

POST /my_index_5/_doc/001
{
    "name": "xiaogp", 
    "age":13, 
    "score": 98.5, 
    "no": 123456789
}
POST /my_index_5/_doc/002
{
    "name": "wangfan", 
    "age":92, 
    "score": 33.5, 
    "no": 123456786
}
POST /my_index_5/_doc/003
{
    "name": "xuguangfeng", 
    "age":33, 
    "score": 71.5, 
    "no": 123456788
}

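Numeric fields are mainly used for term queries and range queries. For example, find documents whose score is between 60 and 100 (both bounds exclusive, since gt and lt are used)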

POST /my_index_5/_search
{
    "query": {
        "range": {
            "score": {
                "gt": 60,
                "lt": 100
            } 
        }
    }
}

Two documents are returned

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index_5",
        "_id" : "001",
        "_score" : 1.0,
        "_source" : {
          "name" : "xiaogp",
          "age" : 13,
          "score" : 98.5,
          "no" : 123456789
        }
      },
      {
        "_index" : "my_index_5",
        "_id" : "003",
        "_score" : 1.0,
        "_source" : {
          "name" : "xuguangfeng",
          "age" : 33,
          "score" : 71.5,
          "no" : 123456788
        }
      }
    ]
  }
}
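
The bound semantics of range (gt, gte, lt, lte) can be checked against the three sample documents with a small in-memory filter. This is only a sketch of the comparison logic, with range_filter as a hypothetical name; it does not query ES.

```python
def range_filter(docs, field, gt=None, gte=None, lt=None, lte=None):
    """Return ids of docs whose field satisfies every bound that is given."""
    hits = []
    for doc_id, source in docs.items():
        value = source[field]
        if gt is not None and not value > gt:
            continue
        if gte is not None and not value >= gte:
            continue
        if lt is not None and not value < lt:
            continue
        if lte is not None and not value <= lte:
            continue
        hits.append(doc_id)
    return hits


docs = {
    "001": {"name": "xiaogp", "age": 13, "score": 98.5},
    "002": {"name": "wangfan", "age": 92, "score": 33.5},
    "003": {"name": "xuguangfeng", "age": 33, "score": 71.5},
}
print(range_filter(docs, "score", gt=60, lt=100))  # ['001', '003']
```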

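4. Boolean type

Boolean fields are declared as boolean in the mapping and matched exactly with term at search time. The value can be a literal true or false, or the string form "true" or "false"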

# Add a new field to my_index_5
POST /my_index_5/_mapping
{
    "properties": {
              "is_good": {"type": "boolean"}
    }
}

Re-index document 001 in my_index_5 with a value for the new field

POST /my_index_5/_doc/001
{
    "name": "xiaogp", 
    "age":13, 
    "score": 98.5, 
    "no": 123456789,
    "is_good": "true"
}

Search on the boolean field

GET /my_index_5/_search
{
    "query": {
        "term": {
            "is_good": {"value": "true"}  # the double quotes are optional
          }
    }
}

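5. Date type

The date type in ES is date. The formats it accepts by default do not include yyyy-MM-dd HH:mm:ss; to use that format, add a format attribute when defining the mapping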

PUT /my_label
{
  "mappings": {
    "properties": {
      "ent_name": {"type": "keyword"},
      "update_date": {"type": "date"},
      "score": {"type": "double"}
    }
  }
}

Inserting a yyyy-MM-dd value succeeds

POST /my_label/_doc/001
{
  "ent_name": "xiaogp",
  "score": 23.3,
  "update_date": "2021-01-01"
}

Inserting a yyyyMMdd value also succeeds

POST /my_label/_doc/002
{
  "ent_name": "xiaogp",
  "score": 23.3,
  "update_date": "20210109"
}

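Looking at the inserted data, both values were written successfully, and each is shown in _source exactly as it was sent. Note that the default date format is strict_date_optional_time||epoch_millis, so 2021-01-01 is parsed as a date, while 20210109 is accepted as an epoch-milliseconds number rather than as the date 2021-01-09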

{
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "001",
        "_score" : 1.0,
        "_source" : {
          "ent_name" : "xiaogp",
          "score" : 23.3,
          "update_date" : "2021-01-01"
        }
      },
      {
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "002",
        "_score" : 1.0,
        "_source" : {
          "ent_name" : "xiaogp",
          "score" : 23.3,
          "update_date" : "20210109"
        }
      }

Inserting a yyyy-MM-dd HH:mm:ss value fails

POST /my_label/_doc/003
{
  "ent_name": "xiaogp",
  "score": 23.3,
  "update_date": "2021-01-09 11:11:11"
}

The error message reports a date parsing failure

{
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [update_date] of type [date] in document with id '003'"
      }

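To insert and display yyyy-MM-dd HH:mm:ss values, the mapping has to change. Because the definition of an existing field cannot be modified, delete the index, recreate it, and define the mapping with a format attribute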

PUT /my_label


POST /my_label/_doc/_mapping
{
  "properties": {
      "ent_name": {"type": "keyword"},
      "update_date": 
          {"type": "date", 
            "format": "yyyy-MM-dd HH:mm:ss"
          },
      "score": {"type": "double"}
    }
}
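
A date field can also accept more than one format: the format attribute takes several patterns separated by ||, the same way the default strict_date_optional_time||epoch_millis is written. The sketch below builds such a mapping body; date_field is a hypothetical helper and the particular combination of formats is an assumption chosen for illustration, not something taken from the book.

```python
def date_field(*formats):
    """Return a date field mapping; several formats are joined with '||'.

    date_field is a hypothetical helper, not a client API.
    """
    mapping = {"type": "date"}
    if formats:
        mapping["format"] = "||".join(formats)
    return mapping


# Assumed combination: full timestamp, plain date, or epoch milliseconds
properties = {
    "ent_name": {"type": "keyword"},
    "update_date": date_field("yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd", "epoch_millis"),
    "score": {"type": "double"},
}
print(properties["update_date"]["format"])
```

With a mapping like this, values such as 2021-01-01 and 2021-01-09 11:11:11 would both be expected to parse, since each matches one of the listed patterns.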

Insert the document again; this time it succeeds

# GET /my_label/_doc/003
{
  "_index" : "my_label",
  "_type" : "_doc",
  "_id" : "003",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "ent_name" : "xiaogp",
    "score" : 23.3,
    "update_date" : "2021-01-09 11:11:11"
  }
}

The usual query on a date field is range, for example to find documents within a time range

GET /my_label/_search
{
  "query": {
    "range": {
      "update_date": {
        "gte": "2021-01-01",
        "lte": "2022-01-01"
      }
    }
  }
}

The response is as follows

"hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "003",
        "_score" : 1.0,
        "_source" : {
          "ent_name" : "xiaogp",
          "score" : 33.3,
          "update_date" : "2022-01-09 11:11:11"
        }
      }
    ]
  }

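6. Array type

An array type does not need to be declared; only the type of the array elements is defined, for example keyword. When writing data, use a JSONArray-like format

# Recreate the index; tag is an array field whose elements are all keyword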
PUT /my_label


POST /my_label/_doc/_mapping
{
  "properties": {
      "ent_name": {"type": "keyword"},
      "tag": {"type": "keyword"},
      "score": {"type": "double"}
    }
}

Insert a document; in the DSL the tag field uses the JSONArray format

POST /my_label/_doc/001
{
  "ent_name": "xiaogp",
  "score": 23.3,
  "tag": ["好人", "有钱", "有才"]
}

GET /my_label/_doc/001

The returned data is as follows

{
  "_index" : "my_label",
  "_type" : "_doc",
  "_id" : "001",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "ent_name" : "xiaogp",
    "score" : 23.3,
    "tag" : [
      "好人",
      "有钱",
      "有才"
    ]
  }
}

If the incoming data is a JSONArray but you want to store it as a String, it must be escaped; in Kibana, insert it with triple quotes

POST /my_label/_doc/003
{
  "ent_name": "xiaogp",
  "score": 23.3,
  "tag": """["好人", "有钱", "男人"]"""
}

Kibana also shows the triple quotes when the document is retrieved

# GET /my_label/_doc/003
{
  "_index" : "my_label",
  "_type" : "_doc",
  "_id" : "003",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "ent_name" : "xiaogp",
    "score" : 23.3,
    "tag" : """["好人", "有钱", "男人"]"""
  }
}

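Use the Python client to check whether a value inserted with triple quotes can be told apart from one inserted directly as an Array when the data is read back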

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch(hosts="192.168.61.240", port=8200,  timeout=200)
>>> es.get(index="my_label", doc_type="_doc", id='001')['_source']['tag']
['好人', '有钱', '有才']
>>> es.get(index="my_label", doc_type="_doc", id='003')['_source']['tag']
'["好人", "有钱", "男人"]'

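As shown, the triple quotes are gone on read and the value keeps the string form it was inserted with; if it was inserted as an array, a Python list is returned.
Querying an array field is really an AND/OR/NOT query over the elements of the array. The simplest query searches for documents whose array field contains a given keyword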

GET /my_label/_search
{
  "query": {
    "term": {
      "tag": {
        "value": "有才"
      }
    }
  }
}

This uses a term query: every document whose tag contains '有才' is returned. To search for several values in the array, use terms

GET /my_label/_search
{
  "query": {
    "terms": {
      "tag": [
        "好人",
        "有才"
      ]
    }
  }
}

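terms takes an array; a document is returned if its tag contains any one of the listed values, which is a union (OR) query over the elements. Combining bool with must gives an AND query, the intersection over the array elements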

GET /my_label/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {
          "tag": {
            "value": "好人"
          }
        }},
        {"term": {
          "tag": {
            "value": "有才"
          }
        }}
      ]
    }
  }
}
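
The OR behaviour of terms and the AND behaviour of bool + must over an array field map directly onto set operations. The following sketch models them in memory; terms_any and terms_all are hypothetical names, and documents 002 and 004 are made-up data added so the two queries return different results.

```python
def terms_any(docs, field, values):
    """OR semantics (terms): the field contains at least one of the values."""
    wanted = set(values)
    return [doc_id for doc_id, src in docs.items() if wanted & set(src[field])]


def terms_all(docs, field, values):
    """AND semantics (bool + must of term clauses): the field contains every value."""
    wanted = set(values)
    return [doc_id for doc_id, src in docs.items() if wanted <= set(src[field])]


docs = {
    "001": {"tag": ["好人", "有钱", "有才"]},
    "002": {"tag": ["好人"]},  # hypothetical document
    "004": {"tag": ["有钱"]},  # hypothetical document
}
print(terms_any(docs, "tag", ["好人", "有才"]))  # ['001', '002']
print(terms_all(docs, "tag", ["好人", "有才"]))  # ['001']
```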

Document operations

(1) Write a single document

Documents are written with a POST request, using the following syntax

POST /${index_name}/_doc/${_id}
{
   ...
}

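Here the user supplies the _id value instead of using one generated by ES, and the request body is JSON. You can also omit the _id and simply POST the request body, in which case ES generates the id automatically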

POST /${index_name}/_doc
{
   ...
}

For example

POST /my_label/_doc
{
  "title": "123",
  "city": "234", 
  "price": 23.3 
}

GET /my_label/_search

The _id in the result was generated automatically by ES

{
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "YIygTn8Bxh2kjPU0z9Pg",
        "_score" : 1.0,
        "_source" : {
          "title" : "123",
          "city" : "234",
          "price" : 23.3
        }
      }

(2) Write documents in bulk

Writing several documents in bulk is also a POST request. For example

POST /_bulk
{"index": {"_index": "my_label", "_type": "_doc", "_id": "009"}}
{"title": "123","city": "234", "price": 93.3 }
{"index": {"_index": "my_label", "_type": "_doc", "_id": "010"}}
{"title": "777","city": "567", "price": 123.3 }
{"index": {"_index": "my_label", "_type": "_doc", "_id": "011"}}
{"title": "666","city": "ftyg", "price": 31.3 }

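This writes 3 documents. Each document takes two lines: the first line gives the target index, _type, and _id, and the second line is the document itself. In newer versions _type can be omitted and defaults to _doc, and if _id is omitted a random one is generated. The response below shows the writes succeeded; result is updated with _version 2 because documents with these ids already existed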

{
  "took" : 103,
  "errors" : false,
  "items" : [
    {
      "index" : {
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "009",
        "_version" : 2,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 1,
        "_primary_term" : 1,
        "status" : 200
      }
    },
    {
      "index" : {
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "010",
        "_version" : 2,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 1,
        "_primary_term" : 1,
        "status" : 200
      }
    },
    {
      "index" : {
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "011",
        "_version" : 2,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 2,
        "_primary_term" : 1,
        "status" : 200
      }
    }
  ]
}

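When the data volume is large and there are many lines, it is better to run the bulk insert with curl on Linux. For example, write 3 documents (6 lines) like the ones above to a file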

# vim bulk_data.json
{"index": {"_index": "my_label", "_type": "_doc", "_id": "012"}}
{"title": "1233333","city": "234", "price": 93.3 }
{"index": {"_index": "my_label", "_type": "_doc", "_id": "013"}}
{"title": "777777","city": "567", "price": 123.3 }
{"index": {"_index": "my_label", "_type": "_doc", "_id": "014"}}
{"title": "66666","city": "ftyg", "price": 31.3 }

curl -H "Content-Type: application/json" -X POST '192.168.61.240:8200/_bulk?pretty' --data-binary "@bulk_data.json"

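The effect is the same. The curl options and the request parameter used here are:
*-H: passes a custom header to the server, given as a quoted string
*-X: specifies the HTTP request method; curl defaults to GET
*--data-binary: sends the HTTP POST data as-is, as binary data; when the value is @file_name, the carriage returns and newlines in the file are preserved without any conversion
*pretty: asks ES to pretty-print the response as formatted JSON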

curl prints the response to the Linux console. If you do not want to see it, redirect it to a file with > or use curl's -o option
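
For large data sets the two-line-per-document file is usually generated rather than typed. The sketch below serializes documents into that newline-delimited layout; build_bulk_body is a hypothetical helper, and it assumes a newer ES version where _type is omitted from the action line. The body must end with a newline, which the function adds.

```python
import json


def build_bulk_body(index_name, docs):
    """Serialize {_id: source} pairs into the body of an index-type _bulk request.

    build_bulk_body is a hypothetical helper; _type is left out on the
    assumption that the target ES version defaults it to _doc.
    """
    lines = []
    for doc_id, source in docs.items():
        # Action line: which index and _id the next line is written to
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself, kept on a single line
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk body has to be terminated by a newline
    return "\n".join(lines) + "\n"


docs = {
    "012": {"title": "1233333", "city": "234", "price": 93.3},
    "013": {"title": "777777", "city": "567", "price": 123.3},
}
body = build_bulk_body("my_label", docs)
print(body, end="")
```

The returned string can be written to bulk_data.json and sent with the curl command shown earlier.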

(3) Update a single document

Updating a document is also a POST request; append _update to the end of the request route. For example

POST /my_label/_doc/010/_update
{
  "doc": {
      "title" : "888"
  }
}

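This statement only changes the title field of the document whose _id is 010 and leaves the other fields untouched. Without _update, the request overwrites the existing document 010, leaving only the title field and dropping everything else. Updating an _id that does not exist fails with document_missing_exception, so only existing documents can be updated. To update if present and insert if absent, use upsert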

POST /my_label/_doc/099/_update
{
  "doc": {
      "title" : "888",
      "city": "成都",
      "price": 12.3
  }, 
  "upsert": {
    "title" : "888",
      "city": "成都",
      "price": 12.3
  }
}

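In other words, if the document already exists, the partial update in doc is applied; if it does not exist, the content of upsert is inserted as a new document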
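
The three cases (partial update, missing document, upsert) can be modelled with a dictionary acting as the index. This is a sketch of the semantics only; update and DocumentMissing are hypothetical names standing in for the _update route and the document_missing_exception error, and the initial content of document 010 is made up.

```python
class DocumentMissing(Exception):
    """Stand-in for the document_missing_exception error."""


def update(store, doc_id, doc, upsert=None):
    """Apply a partial update to store[doc_id], in the manner of _update."""
    if doc_id in store:
        store[doc_id].update(doc)  # merge: fields not named in doc are kept
        return "updated"
    if upsert is None:
        raise DocumentMissing(doc_id)
    store[doc_id] = dict(upsert)   # document absent: insert the upsert content
    return "created"


store = {"010": {"title": "777", "city": "567", "price": 123.3}}
print(update(store, "010", {"title": "888"}))
print(update(store, "099", {"title": "888"},
             upsert={"title": "888", "city": "成都", "price": 12.3}))
print(store)
```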

(4) Update documents in bulk

The bulk statement for updating is similar to the one for inserting. For example

POST /_bulk
{"update": {"_index": "my_label", "_type": "_doc", "_id": "010"}}
{"doc": {"title": "999", "city": "郑州"}}
{"update": {"_index": "my_label", "_type": "_doc", "_id": "0123"}}
{"doc": {"title": "999", "city": "郑州"}, "upsert": {"title": "999", "city": "郑州"}}

This updates two documents; for the second one, the upsert content is inserted if the document does not exist

(5) Update documents by condition

Similar to update ... set ... where in a relational database, ES provides _update_by_query. The syntax is

POST /${index_name}/_update_by_query
{
    "query": {   // the query condition
     },
     "script":{   // the update script
     }
}

Here is an example

POST /my_label/_update_by_query
{
  "query": {
    "term": {
      "city": {
        "value": "郑州"
      }
    }
  },
  "script": {
    "source": "ctx._source['city']='苏州'",
    "lang": "painless"
  }
}

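This finds every document whose city field equals 郑州 and changes it to 苏州. The script is written in painless, the default scripting language in ES. If the request body has no query, all documents are updated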
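
The query-then-script flow can be pictured as a filter followed by an in-place mutation of each matching _source. The sketch below mimics that in memory; update_by_query here is a hypothetical Python function, the predicate plays the role of query, and the callable plays the role of the painless script. The sample documents are made up for the illustration.

```python
def update_by_query(docs, predicate, script):
    """Run script on the _source of every doc matched by predicate; return the count."""
    updated = 0
    for source in docs.values():
        if predicate(source):
            script(source)
            updated += 1
    return updated


def set_city_to_suzhou(source):
    # Counterpart of the script: ctx._source['city']='苏州'
    source["city"] = "苏州"


docs = {
    "010": {"title": "999", "city": "郑州"},
    "0123": {"title": "999", "city": "郑州"},
    "099": {"title": "888", "city": "成都"},
}
count = update_by_query(docs, lambda s: s.get("city") == "郑州", set_city_to_suzhou)
print(count, docs)
```

Passing a predicate that always returns True corresponds to leaving out query, in which case every document is touched.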

(6) Delete a single document

Use a DELETE request with the document _id in the request path

DELETE /my_label/_doc/010

(7) Delete documents in bulk

Bulk deletion also uses a POST request to the _bulk route. For example

POST /_bulk
{"delete": {"_index": "my_label", "_type": "_doc", "_id": "009"}}
{"delete": {"_index": "my_label", "_type": "_doc", "_id": "012"}}

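(8) Delete documents by condition

Similar to delete from ... where in a relational database, ES provides the _delete_by_query route. Unlike _update_by_query, _delete_by_query only needs a query and no script, because the operation is always the same: delete. For example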

POST /my_label/_doc/_delete_by_query
{
  "query": {
    "term": {
      "city": {
        "value": "苏州"
      }
    }
  }
}