
Elasticsearch: Index, Mapping, and Document Operations

2022-02-27  xiaogp

Summary: Elasticsearch
Reading notes on Chapter 3 of 《Elasticsearch搜索引擎构建入门与实战》

Index operations

The main index operations are create, delete, close, open, and alias.

(1) Creating an index

The request method is PUT, with the following syntax:

PUT /${index_name}
{
    "settings": {
        ...
    },
    "mappings": {
    ...
    }
}

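settings holds the index configuration, such as the number of primary and replica shards, and mappings describes how the data is structured. For example, the following request creates an index with 3 primary shards, 1 replica, and two fields: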

PUT /my_label
{
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
          "ent_name": {
            "type": "keyword"
             },
          "score": {
             "type": "double"
            }
       }
    }
}

Kibana's Index Management page now shows primaries=3 and replicas=1.
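The same request body can be assembled programmatically. The following is a minimal Python sketch that only builds the body as a dict; the helper name build_index_body is hypothetical and no ES client is involved.

```python
import json


def build_index_body(shards, replicas, fields):
    """Build a create-index body; `fields` maps field name -> ES field type.

    Hypothetical helper for illustration only; it does not talk to ES.
    """
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            "properties": {
                name: {"type": es_type} for name, es_type in fields.items()
            }
        },
    }


# Same shape as the PUT /my_label body above.
body = build_index_body(3, 1, {"ent_name": "keyword", "score": "double"})
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after PUT /my_label in the Kibana console.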

(2) Deleting an index

To delete an index, use a DELETE request:

DELETE /my_index
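(3) Closing an index

Once an index is closed, ES only stores its data; the index cannot serve updates or searches until it is opened again. Use a POST request to the _close endpoint: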

POST /my_label/_close
(4) Opening an index

Likewise, send a POST request to the _open endpoint:

POST /my_label/_open
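(5) Index aliases

An alias gives one or more ES indices an additional name, much like users and user groups on Linux: querying the alias (the group) searches all of its indices at once, instead of querying each index (each user) one by one.

As an example, first create three indices, then point a single alias at all three: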

PUT /my_index_1
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "city": {
        "type": "keyword"
      },
      "price": {
        "type": "double"
      }
    }
  }
}
PUT /my_index_2
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "city": {
        "type": "keyword"
      },
      "price": {
        "type": "double"
      }
    }
  }
}
PUT /my_index_3
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "city": {
        "type": "keyword"
      },
      "price": {
        "type": "double"
      }
    }
  }
}

Then insert three documents:

POST /my_index_1/_doc/001
{
  "title":"好再来餐厅",
  "city": "青岛",
  "price": 578.23
}
POST /my_index_3/_doc/001
{
  "title":"好再来网吧",
  "city": "青岛",
  "price": 578.23
}
POST /my_index_2/_doc/001
{
  "title":"好再来浴室",
  "city": "青岛",
  "price": 578.23
}

Point the alias my_index_all at all three indices, my_index_1, my_index_2, and my_index_3:

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "my_index_1",
        "alias": "my_index_all"
      }
    },
    {
      "add": {
        "index": "my_index_2",
        "alias": "my_index_all"
      }
    },
    {
      "add": {
        "index": "my_index_3",
        "alias": "my_index_all"
      }
    }
  ]
}
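Since the actions list repeats the same shape for every index, it can be generated. This is a small Python sketch that builds the _aliases body only; the function name alias_actions is hypothetical, and sending the request is left out.

```python
def alias_actions(action, indices, alias):
    """Build an _aliases body applying `action` ("add" or "remove") to each index.

    Hypothetical helper; it returns a dict and does not call ES.
    """
    if action not in ("add", "remove"):
        raise ValueError("action must be 'add' or 'remove'")
    return {
        "actions": [{action: {"index": name, "alias": alias}} for name in indices]
    }


indices = ["my_index_1", "my_index_2", "my_index_3"]
add_body = alias_actions("add", indices, "my_index_all")
print(add_body)
```

Passing "remove" instead of "add" produces the body for detaching the alias from the same indices.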

Now query the alias for the document whose id is 001:

GET /my_index_all/_doc/001

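The request fails: the alias points at several indices, so a single-document lookup cannot be resolved to one index.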

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "alias [my_index_all] has more than one index associated with it [my_index_2, my_index_3, my_index_1], can't execute a single index op"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "alias [my_index_all] has more than one index associated with it [my_index_2, my_index_3, my_index_1], can't execute a single index op"
  },
  "status" : 400
}

Searches by other conditions do work on the alias:

POST /my_index_all/_search
{
    "query":{
            "match":{
                  "title": "好再"
          }                         
    }
}

Three hits are returned, and each one shows the index it lives in (_index):

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "my_index_1",
        "_id" : "001",
        "_score" : 0.5753642,
        "_source" : {
          "title" : "好再来餐厅",
          "city" : "青岛",
          "price" : 578.23
        }
      },
      {
        "_index" : "my_index_2",
        "_id" : "001",
        "_score" : 0.5753642,
        "_source" : {
          "title" : "好再来浴室",
          "city" : "青岛",
          "price" : 578.23
        }
      },
      {
        "_index" : "my_index_3",
        "_id" : "001",
        "_score" : 0.5753642,
        "_source" : {
          "title" : "好再来网吧",
          "city" : "青岛",
          "price" : 578.23
        }
      }
    ]
  }
}

To remove the alias, use the following syntax:

POST /_aliases
{
    "actions":[
         { "remove":{"index": "my_index_1", "alias": "my_index_all"}},
         { "remove":{"index": "my_index_2", "alias": "my_index_all"}},
         { "remove":{"index": "my_index_3", "alias": "my_index_all"}}
   ]
}

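Mapping operations

A mapping is similar to a table schema in a traditional database. ES can infer field types automatically, but defining the mapping by hand is recommended.

(1) Creating a mapping

The basic syntax is as follows; the mapping is defined at the same time the index is created: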

PUT /${index_name}
{
    "mappings": {
        "properties": {
               "cols1": {"type": ""} 
            }
    }
}

Alternatively, create the index first and then add the mapping by sending a POST request to the _mapping endpoint:

PUT /my_index_4
POST /my_index_4/_mapping
{
    "properties": {
            "title": {"type": "text"},
            "city": {"type": "keyword"},
            "price": {"type": "double"}
    }
}
(2) Viewing a mapping

To view a mapping, send a GET request to the _mapping endpoint:

GET /my_index_4/_mapping
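(3) Extending a mapping

The type and attributes of a field that is already defined in a mapping cannot be changed; you can only add new fields. The DSL for adding a field is the same as before: a POST request to the _mapping endpoint.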

POST /my_index_4/_mapping
{
    "properties": {
          "degree": {"type": "keyword"}
    }
}
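(4) Basic data types
1. The keyword type

keyword is a string type that is not tokenized. When building the index, ES puts the whole keyword string into the inverted index as a single term, rather than indexing its tokenized parts. keyword is typically used for exact string equality, in filtering, sorting, and aggregation, and is queried with term in the DSL.
For example, filter on a field having a specific value: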

GET /my_index_4/_search
{
    "query":{
        "term": {
          "city": {"value": "扬州"}   
         }
    }
}

A match query that uses only part of a keyword value will not hit any document, because the field is not tokenized. For example:

GET /my_index_4/_search
{
    "query":{
        "match": {
          "city": "州"
         }
    }
}
2. The text type

The text type tokenizes the string and adds each token to the inverted index; matching documents are scored at search time.

GET /my_label/_search
{
  "query": {
    "match": {
      "title": "好来药酒"
    }
  }
}

Results are sorted by _score in descending order:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0577903,
    "hits" : [
      {
        "_index" : "my_label",
        "_id" : "003",
        "_score" : 1.0577903,
        "_source" : {
          "title" : "好再来药店",
          "city" : "青岛",
          "price" : 578.23
        }
      },
      {
        "_index" : "my_label",
        "_id" : "001",
        "_score" : 0.8630463,
        "_source" : {
          "title" : "好再来酒店",
          "city" : "青岛",
          "price" : 578.23
        }
      },
      {
        "_index" : "my_label",
        "_id" : "002",
        "_score" : 0.36464313,
        "_source" : {
          "title" : "好再来饭店",
          "city" : "青岛",
          "price" : 578.23
        }
      }
    ]
  }
}

A term query on a text field for the full original string finds nothing, because the text has already been split into tokens:

POST /my_label/_search
{
  "query": {
    "term": {
      "title": {"value": "好再来饭店"}
    }
  }
}

No documents are returned and max_score is null:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
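The difference between keyword and text can be illustrated with a toy inverted index in Python. This is only a sketch of the idea, not how ES is implemented: it assumes text is split into one token per character (roughly what the standard analyzer does for Chinese), and the function names are hypothetical.

```python
from collections import defaultdict


def keyword_tokens(value):
    # keyword: the whole string is a single term
    return [value]


def text_tokens(value):
    # text: assumed here to yield one token per character
    return list(value)


def build_index(docs, field, tokenize):
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, source in docs.items():
        for token in tokenize(source[field]):
            index[token].add(doc_id)
    return index


def term_query(index, value):
    # term: look up the value as-is, with no analysis
    return set(index.get(value, set()))


def match_query(index, text, tokenize):
    # match: analyze the query text, then OR the per-token hits
    hits = set()
    for token in tokenize(text):
        hits |= index.get(token, set())
    return hits


docs = {
    "001": {"title": "好再来酒店", "city": "青岛"},
    "002": {"title": "好再来饭店", "city": "青岛"},
    "003": {"title": "好再来药店", "city": "青岛"},
}
city_index = build_index(docs, "city", keyword_tokens)
title_index = build_index(docs, "title", text_tokens)

print(sorted(term_query(city_index, "青岛")))                    # whole keyword value
print(sorted(match_query(city_index, "岛", keyword_tokens)))     # partial keyword value
print(sorted(term_query(title_index, "好再来饭店")))              # full string against tokens
print(sorted(match_query(title_index, "好来药酒", text_tokens)))  # analyzed query
```

The toy model ignores scoring; it only shows which documents each kind of query can reach.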

If a text field is mapped with index: false, the field is not indexed: it can be returned for display but cannot be used for searching.

PUT /hotel/_doc/_mapping
{
  "properties": {
    "no_index_col": {"type": "text", "index": false}
  }
}

Searching on that field fails with an error saying it is not indexed. A keyword field with the same setting cannot be searched either.

        "reason": "Cannot search on field [no_index_col] since it is not indexed
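3. Numeric types

ES supports several numeric types (long, integer, short, byte, double, float, and so on). Choose the type with the smallest range that still meets the business requirement. For example: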

PUT /my_index_5
{
    "mappings":{
            "properties": {
                     "name": {"type": "keyword"},
                     "age": {"type": "integer"},
                     "score": {"type": "double"},
                     "no": {"type": "long"}
              }
    }
}

Insert a few documents:

POST /my_index_5/_doc/001
{
    "name": "xiaogp", 
    "age":13, 
    "score": 98.5, 
    "no": 123456789
}
POST /my_index_5/_doc/002
{
    "name": "wangfan", 
    "age":92, 
    "score": 33.5, 
    "no": 123456786
}
POST /my_index_5/_doc/003
{
    "name": "xuguangfeng", 
    "age":33, 
    "score": 71.5, 
    "no": 123456788
}

Numeric types are mainly used with term queries and range queries. For example, find documents whose score is greater than 60 and less than 100:

POST /my_index_5/_search
{
    "query": {
        "range": {
            "score": {
                "gt": 60,
                "lt": 100
            } 
        }
    }
}

Two documents are returned:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index_5",
        "_id" : "001",
        "_score" : 1.0,
        "_source" : {
          "name" : "xiaogp",
          "age" : 13,
          "score" : 98.5,
          "no" : 123456789
        }
      },
      {
        "_index" : "my_index_5",
        "_id" : "003",
        "_score" : 1.0,
        "_source" : {
          "name" : "xuguangfeng",
          "age" : 33,
          "score" : 71.5,
          "no" : 123456788
        }
      }
    ]
  }
}
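The bounds of a range query behave like ordinary comparisons: gt and lt exclude the boundary, while gte and lte include it. A minimal Python sketch of that filter over the three documents above (the helper range_filter is hypothetical and runs in memory, not against ES):

```python
import operator

# Map range-query bound names to comparison functions.
_OPS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def range_filter(docs, field, **bounds):
    """Return ids of documents whose `field` satisfies every given bound."""
    return [
        doc_id
        for doc_id, source in docs.items()
        if all(_OPS[name](source[field], limit) for name, limit in bounds.items())
    ]


docs = {
    "001": {"name": "xiaogp", "score": 98.5},
    "002": {"name": "wangfan", "score": 33.5},
    "003": {"name": "xuguangfeng", "score": 71.5},
}
matched = range_filter(docs, "score", gt=60, lt=100)
print(matched)
```

With gt=60 and lt=100 the filter keeps documents 001 and 003, matching the response above.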
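4. The boolean type

Boolean fields are declared with boolean in the mapping and searched with an exact term match. The value can be a literal true or false, or the strings "true" or "false".

# Add a new field to my_index_5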
POST /my_index_5/_mapping
{
    "properties": {
              "is_good": {"type": "boolean"}
    }
}

Write document 001 in my_index_5 again, this time with a value for the new field:

POST /my_index_5/_doc/001
{
    "name": "xiaogp", 
    "age":13, 
    "score": 98.5, 
    "no": 123456789,
    "is_good": "true"
}

Search on the boolean field:

GET /my_index_5/_search
{
    "query": {
        "term": {
            "is_good": {"value": "true"}  # the double quotes are optional
          }
    }
}
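5. The date type

The date/time type in ES is date. Its default format (strict_date_optional_time||epoch_millis) does not cover yyyy-MM-dd HH:mm:ss, so values in that form need a format property in the mapping. The following mapping leaves format at its default: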

PUT /my_label
{
  "mappings": {
    "properties": {
      "ent_name": {"type": "keyword"},
      "update_date": {"type": "date"},
      "score": {"type": "double"}
    }
  }
}

Inserting a yyyy-MM-dd value succeeds:

POST /my_label/_doc/001
{
  "ent_name": "xiaogp",
  "score": 23.3,
  "update_date": "2021-01-01"
}

Inserting a yyyyMMdd value also succeeds:

POST /my_label/_doc/002
{
  "ent_name": "xiaogp",
  "score": 23.3,
  "update_date": "20210109"
}

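Look at the stored documents: both writes succeeded even though the formats differ, and _source shows each value exactly as it was written. Note that under the default format a value such as 20210109 is most likely accepted as an epoch_millis number rather than as a yyyyMMdd date, so it should not be relied on as a date format.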

{
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "001",
        "_score" : 1.0,
        "_source" : {
          "ent_name" : "xiaogp",
          "score" : 23.3,
          "update_date" : "2021-01-01"
        }
      },
      {
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "002",
        "_score" : 1.0,
        "_source" : {
          "ent_name" : "xiaogp",
          "score" : 23.3,
          "update_date" : "20210109"
        }
      }

Inserting a yyyy-MM-dd HH:mm:ss value fails:

POST /my_label/_doc/003
{
  "ent_name": "xiaogp",
  "score": 23.3,
  "update_date": "2021-01-09 11:11:11"
}

The error reports that the date value could not be parsed:

{
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [update_date] of type [date] in document with id '003'"
      }

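To insert and display yyyy-MM-dd HH:mm:ss values, the mapping has to declare that format. Because an existing field cannot be modified, recreate the index and define the mapping again: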

PUT /my_label


POST /my_label/_doc/_mapping
{
  "properties": {
      "ent_name": {"type": "keyword"},
      "update_date": 
          {"type": "date", 
            "format": "yyyy-MM-dd HH:mm:ss"
          },
      "score": {"type": "double"}
    }
}

Insert the document again; this time it succeeds:

# GET /my_label/_doc/003
{
  "_index" : "my_label",
  "_type" : "_doc",
  "_id" : "003",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "ent_name" : "xiaogp",
    "score" : 23.3,
    "update_date" : "2021-01-09 11:11:11"
  }
}
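The effect of a fixed format can be imitated in Python with datetime.strptime. This is only an analogy: it assumes the ES pattern yyyy-MM-dd HH:mm:ss corresponds to the strptime pattern %Y-%m-%d %H:%M:%S, and it models several accepted formats as a list in the spirit of the || separator that ES format strings allow. The helper name accepts is hypothetical.

```python
from datetime import datetime

# Assumed strptime equivalents of the ES patterns discussed above.
DATETIME_PATTERN = "%Y-%m-%d %H:%M:%S"  # yyyy-MM-dd HH:mm:ss
DATE_PATTERN = "%Y-%m-%d"               # yyyy-MM-dd


def accepts(value, patterns):
    """Return True if `value` parses under at least one pattern."""
    for pattern in patterns:
        try:
            datetime.strptime(value, pattern)
            return True
        except ValueError:
            continue
    return False


# A single strict format rejects anything that does not match it exactly.
print(accepts("2021-01-09 11:11:11", [DATETIME_PATTERN]))
print(accepts("2021-01-01", [DATETIME_PATTERN]))

# Listing several patterns accepts values in any of them.
print(accepts("2021-01-01", [DATETIME_PATTERN, DATE_PATTERN]))
```

If both date-only and date-time values must be stored in one field, declaring more than one pattern in format is the usual direction to look at.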

The most common query on a date field is range, for example to find documents within a time window:

GET /my_label/_search
{
  "query": {
    "range": {
      "update_date": {
        "gte": "2021-01-01",
        "lte": "2022-01-01"
      }
    }
  }
}

The response is:

"hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "003",
        "_score" : 1.0,
        "_source" : {
          "ent_name" : "xiaogp",
          "score" : 33.3,
          "update_date" : "2022-01-09 11:11:11"
        }
      }
    ]
  }
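6. The array type

The array type itself does not need to be declared; only the element type is declared, for example keyword. When writing data, pass the values in JSON array form.

# Recreate the index; tag is an array field whose elements are all keyword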
PUT /my_label


POST /my_label/_doc/_mapping
{
  "properties": {
      "ent_name": {"type": "keyword"},
      "tag": {"type": "keyword"},
      "score": {"type": "double"}
    }
}

Insert a document; the tag field uses JSON array syntax in the DSL:

POST /my_label/_doc/001
{
  "ent_name": "xiaogp",
  "score": 23.3,
  "tag": ["好人", "有钱", "有才"]
}

GET /my_label/_doc/001

The response is:

{
  "_index" : "my_label",
  "_type" : "_doc",
  "_id" : "001",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "ent_name" : "xiaogp",
    "score" : 23.3,
    "tag" : [
      "好人",
      "有钱",
      "有才"
    ]
  }
}

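If the data is a JSON array but should be stored as a single string, the inner quotes have to be escaped; in Kibana this can be done by wrapping the value in triple quotes: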

POST /my_label/_doc/003
{
  "ent_name": "xiaogp",
  "score": 23.3,
  "tag": """["好人", "有钱", "男人"]"""
}

Kibana also shows the triple quotes when the document is retrieved:

# GET /my_label/_doc/003
{
  "_index" : "my_label",
  "_type" : "_doc",
  "_id" : "003",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "ent_name" : "xiaogp",
    "score" : 23.3,
    "tag" : """["好人", "有钱", "男人"]"""
  }
}

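Use the Python client to check whether a value inserted with triple quotes can be told apart from one inserted directly as an array when it is read back: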

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch(hosts="192.168.61.240", port=8200,  timeout=200)
>>> es.get(index="my_label", doc_type="_doc", id='001')['_source']['tag']
['好人', '有钱', '有才']
>>> es.get(index="my_label", doc_type="_doc", id='003')['_source']['tag']
'["好人", "有钱", "男人"]'

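The triple quotes are gone when the value is read: the string keeps the form it was inserted with, while a value inserted as an array comes back as a Python list.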
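If both forms end up in the same field, a reader has to normalize them. A small Python sketch, assuming the string form always holds valid JSON (the helper as_tag_list is hypothetical):

```python
import json


def as_tag_list(value):
    """Return tags as a list, whether stored as an array or as a JSON string."""
    if isinstance(value, str):
        # Stored as one string: decode the JSON array it contains.
        return json.loads(value)
    return list(value)


from_array = as_tag_list(["好人", "有钱", "有才"])
from_string = as_tag_list('["好人", "有钱", "男人"]')
print(from_array, from_string)
```

Note that the string form is indexed as one keyword value, so it cannot be searched element by element the way a real array can.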
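A query on an array field is really an AND/OR/NOT query over the array's elements. The simplest case is finding documents whose array field contains a given keyword: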

GET /my_label/_search
{
  "query": {
    "term": {
      "tag": {
        "value": "有才"
      }
    }
  }
}

With a term query, every document whose tag contains '有才' is returned. To search for several values at once, use terms:

GET /my_label/_search
{
  "query": {
    "terms": {
      "tag": [
        "好人",
        "有才"
      ]
    }
  }
}

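terms takes an array; a document is returned if tag contains any one of the listed values, which is an OR (union) over the elements. Combining bool with must gives an AND query, that is, the intersection: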

GET /my_label/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {
          "tag": {
            "value": "好人"
          }
        }},
        {"term": {
          "tag": {
            "value": "有才"
          }
        }}
      ]
    }
  }
}
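The three array queries above map onto simple membership tests. Here is a Python sketch of that logic over in-memory documents; documents 002 and 003 are hypothetical additions for illustration, and the function names are not part of any ES API.

```python
def as_list(value):
    # A field may hold a single value or an array of values.
    return value if isinstance(value, list) else [value]


def term(source, field, value):
    # term: the field contains this exact value
    return value in as_list(source.get(field, []))


def terms(source, field, values):
    # terms: the field contains ANY of the values (OR / union)
    return any(term(source, field, v) for v in values)


def bool_must_terms(source, field, values):
    # bool + must of several term clauses: ALL values present (AND / intersection)
    return all(term(source, field, v) for v in values)


docs = {
    "001": {"tag": ["好人", "有钱", "有才"]},
    "002": {"tag": ["好人", "有钱"]},  # hypothetical document
    "003": {"tag": "有才"},            # hypothetical single-valued document
}


def search(predicate):
    return [doc_id for doc_id, source in docs.items() if predicate(source)]


term_hits = search(lambda s: term(s, "tag", "有才"))
terms_hits = search(lambda s: terms(s, "tag", ["好人", "有才"]))
must_hits = search(lambda s: bool_must_terms(s, "tag", ["好人", "有才"]))
print(term_hits, terms_hits, must_hits)
```

Only document 001 carries both tags, so it is the only hit for the AND form.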

Document operations

(1) Writing a single document

Documents are written with a POST request, using the following syntax:

POST /${index_name}/_doc/${_id}
{
   ...
}

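Here the user supplies the _id instead of letting ES generate one, and the request body is JSON. The _id can also be omitted by posting just the body, in which case ES generates an id automatically: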

POST /${index_name}/_doc
{
   ...
}

For example:

POST /my_label/_doc
{
  "title": "123",
  "city": "234", 
  "price": 23.3 
}

GET /my_label/_search

The _id in the result was generated automatically by ES:

{
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "YIygTn8Bxh2kjPU0z9Pg",
        "_score" : 1.0,
        "_source" : {
          "title" : "123",
          "city" : "234",
          "price" : 23.3
        }
      }
(2) Writing documents in bulk

Writing several documents in one batch is also a POST request, for example:

POST /_bulk
{"index": {"_index": "my_label", "_type": "_doc", "_id": "009"}}
{"title": "123","city": "234", "price": 93.3 }
{"index": {"_index": "my_label", "_type": "_doc", "_id": "010"}}
{"title": "777","city": "567", "price": 123.3 }
{"index": {"_index": "my_label", "_type": "_doc", "_id": "011"}}
{"title": "666","city": "ftyg", "price": 31.3 }

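This writes three documents. Each document takes two lines: the first names the target index, _type, and _id, and the second is the document itself. In newer versions _type can be omitted (it defaults to _doc), and if _id is omitted one is generated. The response below shows the writes succeeded: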

{
  "took" : 103,
  "errors" : false,
  "items" : [
    {
      "index" : {
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "009",
        "_version" : 2,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 1,
        "_primary_term" : 1,
        "status" : 200
      }
    },
    {
      "index" : {
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "010",
        "_version" : 2,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 1,
        "_primary_term" : 1,
        "status" : 200
      }
    },
    {
      "index" : {
        "_index" : "my_label",
        "_type" : "_doc",
        "_id" : "011",
        "_version" : 2,
        "result" : "updated",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 2,
        "_primary_term" : 1,
        "status" : 200
      }
    }
  ]
}

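For large volumes of data with many lines, using curl on Linux for the bulk insert is recommended. For example, write three documents like the ones above (six lines in total) to a file and post it: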

# vim bulk_data.json
{"index": {"_index": "my_label", "_type": "_doc", "_id": "012"}}
{"title": "1233333","city": "234", "price": 93.3 }
{"index": {"_index": "my_label", "_type": "_doc", "_id": "013"}}
{"title": "777777","city": "567", "price": 123.3 }
{"index": {"_index": "my_label", "_type": "_doc", "_id": "014"}}
{"title": "66666","city": "ftyg", "price": 31.3 }
curl -H "Content-Type: application/json" -X POST '192.168.61.240:8200/_bulk?pretty' --data-binary "@bulk_data.json"

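The effect is the same. The curl options and the URL parameter used here are:
* -H: passes a custom header to the server, given as a quoted string
* -X: sets the HTTP request method; curl defaults to GET
* --data-binary: sends the POST data exactly as given; when the value is @file_name, carriage returns and newlines in the file are preserved without any conversion
* pretty: asks ES to pretty-print the JSON response

curl prints the response to the console; to suppress it, redirect it to a file with > or use curl's -o option.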
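A bulk file like bulk_data.json can be generated rather than typed by hand. The sketch below builds the newline-delimited body in Python; the helper name to_bulk_ndjson is hypothetical, _type is left out on the assumption that a newer ES version is used, and the body ends with a newline because the bulk endpoint expects the last line to be terminated.

```python
import json


def to_bulk_ndjson(index, docs):
    """Build a _bulk body: one action line plus one source line per document."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


docs = {
    "012": {"title": "1233333", "city": "234", "price": 93.3},
    "013": {"title": "777777", "city": "567", "price": 123.3},
    "014": {"title": "66666", "city": "ftyg", "price": 31.3},
}
payload = to_bulk_ndjson("my_label", docs)
print(payload, end="")
```

Writing payload to a file gives something that can be sent with the curl command shown earlier.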

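(3) Updating a single document

Updating a document is also a POST request; append _update to the request path. The typed form shown here is the ES 6.x style; ES 7 and later use POST /${index_name}/_update/${_id}. For example: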

POST /my_label/_doc/010/_update
{
  "doc": {
      "title" : "888"
  }
}

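This request changes only the title field of the document with _id 010 and leaves everything else untouched. Without _update, the request would overwrite document 010 entirely, leaving only the title field. Updating a non-existent _id fails with document_missing_exception, so a plain update only works on existing documents. For update-if-present, insert-if-absent behavior, use upsert: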

POST /my_label/_doc/099/_update
{
  "doc": {
      "title" : "888",
      "city": "成都",
      "price": 12.3
  }, 
  "upsert": {
    "title" : "888",
      "city": "成都",
      "price": 12.3
  }
}

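In other words, if the document already exists, the partial update in doc is applied; if it does not exist, the content of upsert is inserted as a new document.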
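The doc/upsert decision can be modeled in a few lines of Python over an in-memory dict. This is a simplified sketch, not the ES implementation: the function name update is hypothetical, and details such as versioning are ignored.

```python
def update(store, doc_id, doc, upsert=None):
    """Model of a partial update with optional upsert.

    - document exists: merge `doc` into it
    - document missing and `upsert` given: insert `upsert` as the new document
    - document missing and no `upsert`: fail, like document_missing_exception
    """
    if doc_id in store:
        store[doc_id].update(doc)
        return "updated"
    if upsert is None:
        raise KeyError(f"document_missing_exception: {doc_id}")
    store[doc_id] = dict(upsert)
    return "created"


store = {"010": {"title": "777", "city": "567", "price": 123.3}}

# Existing document: only the fields in `doc` change.
print(update(store, "010", {"title": "888"}))

# Missing document: the `upsert` content becomes the new document.
new_doc = {"title": "888", "city": "成都", "price": 12.3}
print(update(store, "099", new_doc, upsert=new_doc))
print(store)
```

In the example request above, doc and upsert carry the same fields, so the resulting document looks the same on either path.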

(4) Updating documents in bulk

The bulk syntax for batch updates is similar to that for batch inserts, for example:

POST /_bulk
{"update": {"_index": "my_label", "_type": "_doc", "_id": "010"}}
{"doc": {"title": "999", "city": "郑州"}}
{"update": {"_index": "my_label", "_type": "_doc", "_id": "0123"}}
{"doc": {"title": "999", "city": "郑州"}, "upsert": {"title": "999", "city": "郑州"}}

This updates two documents; the second one does not exist yet, so its upsert content is used.

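(5) Updating documents by query

This is similar to UPDATE ... SET ... WHERE in a relational database. ES implements it with _update_by_query, using the following syntax: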

POST /${index_name}/_update_by_query
{
    "query": {   // query that selects the documents
     },
     "script":{   // update script
     }
}

Here is an example:

POST /my_label/_update_by_query
{
  "query": {
    "term": {
      "city": {
        "value": "郑州"
      }
    }
  },
  "script": {
    "source": "ctx._source['city']='苏州'",
    "lang": "painless"
  }
}

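This finds every document whose city field equals 郑州 and updates it to 苏州. The script is written in painless, the default scripting language in ES. If the request body has no query, all documents are updated.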
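The query-then-script flow can be sketched in Python with an in-memory store, where a predicate stands in for the query and a function stands in for the painless script. The names update_by_query and set_city_suzhou are hypothetical, and the sample documents are illustrative.

```python
def update_by_query(store, script, query=None):
    """Apply `script` to every document matched by `query`.

    With no query, every document is updated, as in the real endpoint.
    Returns the number of documents updated.
    """
    updated = 0
    for source in store.values():
        if query is None or query(source):
            script(source)
            updated += 1
    return updated


def set_city_suzhou(source):
    # Stand-in for: ctx._source['city']='苏州'
    source["city"] = "苏州"


store = {
    "010": {"title": "999", "city": "郑州"},
    "0123": {"title": "999", "city": "郑州"},
    "011": {"title": "666", "city": "ftyg"},
}
count = update_by_query(
    store, set_city_suzhou, query=lambda s: s.get("city") == "郑州"
)
print(count, store)
```

Leaving out the query argument would rewrite the city of every document, which is why the query part deserves a second look before running such a request.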

(6) Deleting a single document

Use a DELETE request with the document _id in the request path:

DELETE /my_label/_doc/010
(7) Deleting documents in bulk

Batch deletes also use a POST request to the _bulk endpoint, for example:

POST /_bulk
{"delete": {"_index": "my_label", "_type": "_doc", "_id": "009"}}
{"delete": {"_index": "my_label", "_type": "_doc", "_id": "012"}}
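(8) Deleting documents by query

This is similar to DELETE FROM ... WHERE in a relational database. ES uses the _delete_by_query endpoint. Unlike _update_by_query, _delete_by_query needs only a query and no script, because the operation is always the same: delete. For example: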

POST /my_label/_doc/_delete_by_query
{
  "query": {
    "term": {
      "city": {
        "value": "苏州"
      }
    }
  }
}