Elasticsearch Data Modeling (Part 1)
Reposted from: https://www.jianshu.com/p/098236cf3a44
https://blog.csdn.net/napoay/article/details/62233031
Example 1: E-commerce promotion data structure
{
"id": 536600477,
"name": "黑色外穿打底裤女春秋薄款铅笔裤2019新款高腰九分显瘦紧身小脚裤",
"image": "http://img.alicdn.com/bao/uploaded/i4/1687728515/O1CN015vKRk22Clv2z9jVKM_!!0-item_pic.jpg",
"item_url": "http://item.taobao.com/item.htm?id=536600477798",
"shop_name": "XXX旗舰店",
"price": 35.00,
"sales": 12866,
"contact_info": "XXX旗舰店",
"short_url": "https://s.click.taobao.com/6dhjX0w",
"sales_url": "https://s.click.taobao.com/t?e=m%3D2%26s%3DhqNnFErxaS0cQipKwQzePOeEDrYVVa64K7Vc7tFgwiG3bLqV5UHdqSJ215tW5ra7%2Fl0%2B1yuzCtL9CVjm9%2FaTIMEcIrQjme5phH%2FwEhdaGdpwfW9VvJkbiUOLibAxXu8J4DrzI0Q%2Bh5mWydDa%2BK5%2FZ44CXhN9RDLu87eUjW4Ylwlp3E7b2H5imSCyCj9paIOIxiXvDf8DaRs%3D",
"sales_pass": "¥q6vvYNlY15Y¥",
"coupon_total_num": 50000,
"coupon_remaining_num": 49981,
"coupon_quota": "满35减10",
"coupon_start_date": "2019-09-20",
"coupon_end_date": "2019-09-25",
"coupon_url": "https://uland.taobao.com/coupon/edetail?e=EpEKjA4ejsRt3vqbdXnGlgxMgopp14njlHycenxkSuDwJfMHI%2FfVmw2KFrzHTGtgHv69%2F64THFCtOwU1ltpiC5ZrJ2LltVbgH31ZeQAUzbQ%3D&af=1&pid=mm_226490165_153450382_44990650090",
"coupon_pass": "¥b0NmYNlbC8t¥",
"coupon_short_url": "https://s.click.taobao.com/XRkjX0w"
}
"id"为整形,设置为long类型
"name" 名称是字符串类型,需要作为查询条件,并且需要分词。类型设置为"text",指定中文分词器为"ik_max_word",搜索的时候指定"ik_smart"分词器。
注意:1、"type": "text"会分词, "type": "keyword"不会分词
2、"ik_max_word" 为最细粒度分词,"ik_smart"为粗粒度分词,
索引时,为了提高索引的范围,通常会采用"ik_max_word" ,会以最细粒度分词索引,
搜索是,为了提高搜索的准确性,会采用"ik_smart"分词器为粗粒度分词;
ik_max_word: 会将文本做最细粒度的拆分,比如会将“中华人民共和国国歌”拆分为“中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”,会穷尽各种可能的组合;
ik_smart: 会做最粗粒度的拆分,比如会将“中华人民共和国国歌”拆分为“中华人民共和国,国歌”。
字段mapping设置如下:
"name": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
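To see how the two analyzers differ on real text, the _analyze API can be run once with each of them. This is a minimal sketch that assumes the IK analysis plugin is installed on the cluster; the sample text is the one used in the note above.

```console
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}
```

The first request should return many overlapping tokens and the second only a few, which is why the two analyzers are split between index time and search time.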
"image" 字段是一个链接,不需要搜索,只需要显示就可以,索引不必添加索引,节省内存和空间,也不需要做集合分析,可以直接设置"enabled":false。其它类似需要也可以和这个字段一样处理。
"shop_name"是店铺名称,可以和"name"一样使用分词
"coupon_pass"是优惠券推广口令,不需要分词,但是需要放进索引中,设置"keyword"。
The corresponding mapping:
PUT item_index
{
"mappings": {
"dynamic": false,
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"image": {
"enabled": false
},
"item_url": {
"enabled": false
},
"shop_name": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"price": {
"type": "double"
},
"sales": {
"type": "integer"
},
"contact_info": {
"type": "keyword"
},
"short_url": {
"enabled": false
},
"sales_url": {
"enabled": false
},
"sales_pass": {
"type": "keyword"
},
"coupon_total_num": {
"type": "integer"
},
"coupon_remaining_num": {
"type": "integer"
},
"coupon_quota": {
"type": "keyword"
},
"coupon_start_date": {
"type": "date",
"format": "yyyy-MM-dd"
},
"coupon_end_date": {
"type": "date",
"format": "yyyy-MM-dd"
},
"coupon_url": {
"enabled": false
},
"coupon_pass": {
"type": "keyword"
},
"coupon_short_url": {
"enabled": false
}
}
}
}
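With this mapping in place, a typical promotion search might look like the sketch below. The search term, shop name, and date are illustrative values, not taken from a real workload. The "match" clause runs against the analyzed "name" field, while the exact-match conditions go into "filter" and use the "shop_name.keyword" sub-field and the date field.

```console
GET item_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "打底裤" } }
      ],
      "filter": [
        { "term": { "shop_name.keyword": "XXX旗舰店" } },
        { "range": { "coupon_end_date": { "gte": "2019-09-22" } } }
      ]
    }
  },
  "sort": [
    { "sales": "desc" }
  ]
}
```

Fields mapped with "enabled": false, such as "image", cannot be used in the query, but they still come back in each hit's "_source" for display.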
Example 2: Server log data structure
222.67.85.228 - - [14/Nov/2018:14:30:34 +0800] "GET /search?keyword=&hasCoupon=0&pageNum=1&pageSize=100 HTTP/1.1" 200 12268 "-" "Apache-HttpClient/4.5.5 (Java/1.8.0_131)" "-"
After log parsing, the nginx log line is converted into the following data structure:
{
"ip": "222.67.85.228",
"username": "-",
"time": "2018-11-14 14:30:34",
"request_action": "GET",
"request_url": "/search?keyword=&hasCoupon=0&pageNum=1&pageSize=100",
"http_version": "1.1",
"response_status": 200,
"byte": 12268,
"referrer": "-",
"agent": "Apache-HttpClient/4.5.5 (Java/1.8.0_131)",
"http_forward": "-"
}
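Logs are usually queried along two dimensions: time and response status. For example, you may need to find every request since January 1, 2019 whose response status is 500. Almost none of the log fields need to be analyzed, since most of them are only displayed, so string fields are generally mapped as "keyword". For the date field, make sure the format matches the data.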
PUT nginx_log_index
{
"mappings": {
"dynamic": false,
"properties": {
"ip": {
"type": "keyword"
},
"username": {
"type": "keyword"
},
"time": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"request_action": {
"type": "keyword"
},
"request_url": {
"enabled": false
},
"http_version": {
"type": "keyword"
},
"response_status": {
"type": "integer"
},
"bytes": {
"type": "long"
},
"referrer": {
"type": "keyword"
},
"agent": {
"type": "keyword"
},
"http_forward": {
"type": "keyword"
}
}
}
}
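The query described earlier (requests since January 1, 2019 with response status 500) can be sketched as follows. Both conditions are exact matches, so they are placed in "filter" and no relevance score is computed; the date string follows the "yyyy-MM-dd HH:mm:ss" format declared in the mapping.

```console
GET nginx_log_index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "response_status": 500 } },
        { "range": { "time": { "gte": "2019-01-01 00:00:00" } } }
      ]
    }
  },
  "sort": [
    { "time": "desc" }
  ]
}
```

Sorting by "time" in descending order puts the most recent failures first, which is usually what you want when reading logs.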
Example 3: Blog data structure
{
"id": "89546eff3cd0",
"url": "https://www.jianshu.com/p/89546eff3cd0",
"title": "简单剖析代理模式实现原理",
"author": "梦想实现家_Z",
"content": "代理模式在java中随处可见,其他编程语言也一样,它的作用就是用来解耦的。代理模式又分为静态代理和动态代理。......省略剩下的内容",
"time": "2019.04.10 21:08:21",
"word_num": 1056,
"read_num": 161,
"like_num": 1,
"reward_num": 0
}
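Because blog content is very large, avoid returning it with every query: store the fields separately and retrieve only the ones you need at query time. It is therefore recommended to set "_source" to "enabled": false and to set "store": true on each field that needs to be returned.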
PUT blog_index
{
"mappings": {
"dynamic": false,
"_source": {
"enabled": false
},
"properties": {
"id": {
"type": "keyword",
"store": true,
},
"url": {
"type": "keyword",
"store": true,
"ignore_above": 100,
"doc_values": false,
"norms": false,
},
"title": {
"type": "text",
"store": true,
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"author": {
"type": "keyword",
"store": true,
},
"content": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart",
"store": true
},
"time": {
"type": "text",
"format": "yyyy.MM.dd HH:mm:ss",
"store": true
},
"word_num": {
"type": "integer",
"store": true
},
"read_num": {
"type": "integer",
"store": true
},
"like_num": {
"type": "integer",
"store": true
},
"reward_num": {
"type": "integer",
"store": true
}
}
}
}
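Because "_source" is disabled in this mapping, search hits carry no document body by default, and stored fields have to be requested explicitly with "stored_fields". A minimal sketch, with an illustrative search term:

```console
GET blog_index/_search
{
  "stored_fields": ["id", "url", "title", "author", "read_num"],
  "query": {
    "match": { "title": "代理模式" }
  }
}
```

Each hit then contains a "fields" object whose values are returned as arrays. The large "content" field is read from disk only when it is listed in "stored_fields".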
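One more note: "_source" is enabled by default. When a field is particularly large, consider keeping it out of "_source", because storing the content of large fields in Elasticsearch only makes the index bigger. The effect grows with the number of documents: saving a few KB per document adds up to a considerable amount across hundreds of millions of documents. The blog content here is such a case. Keep in mind that disabling "_source" also disables features that depend on it, such as the update and reindex APIs.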
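If disabling "_source" entirely is too drastic, one possible variation is to keep "_source" and exclude only the large field. The sketch below uses a hypothetical index name, blog_index_slim, and is not the mapping used in this article. "content" stays searchable, but its text is no longer returned by Elasticsearch and would have to be fetched from wherever the full post is stored.

```console
PUT blog_index_slim
{
  "mappings": {
    "_source": {
      "excludes": ["content"]
    },
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```

The smaller fields remain available from "_source" as usual, so no "store": true settings are needed in this variant.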
"_source"的使用方法参考
参考:https://blog.csdn.net/napoay/article/details/62233031