Stones from Other Hills

2018-07-24 Rainysong

0. Markdown editor syntax: https://segmentfault.com/markdown/ ; https://blog.csdn.net/wiki_su/article/details/74764731

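1. FlashText: a keyword search-and-replace algorithm

Time complexity: O(N), where N is the number of characters in the document and M is the number of keywords. (A regex-based approach costs O(M*N).)
It matches only whole words, and it prefers the longest match.
It is based on the Trie dictionary data structure and the Aho-Corasick algorithm. It first takes all the relevant keywords as input and builds a trie dictionary from them. start and eot are two special symbols that mark word boundaries, playing the same role as the word boundary (\b) in regular expressions. This trie dictionary is the data structure later used for searching and replacing.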
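
The trie idea above can be sketched in a few lines of plain Python. This is a simplified illustration and not the flashtext library's own code: the names `build_trie`, `replace_keywords`, `is_word_char` and the `EOT` sentinel are hypothetical, and word boundaries are approximated with `str.isalnum`, which suits space-separated text rather than Chinese.

```python
EOT = "_eot_"  # hypothetical sentinel key marking "a keyword ends here"


def build_trie(mapping):
    """Build a character trie from {keyword: replacement} (case-insensitive)."""
    root = {}
    for keyword, replacement in mapping.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[EOT] = replacement
    return root


def is_word_char(ch):
    """Approximation of the regex notion of a word character."""
    return ch.isalnum() or ch == "_"


def replace_keywords(text, trie):
    """Replace whole-word keywords, preferring the longest match."""
    out, i, n = [], 0, len(text)
    while i < n:
        best_end = best_repl = None
        # A keyword may only start at a word boundary.
        if i == 0 or not is_word_char(text[i - 1]):
            node, j = trie, i
            while j < n and text[j].lower() in node:
                node = node[text[j].lower()]
                j += 1
                # Accept a match only if it also ends at a word boundary;
                # later (longer) matches overwrite earlier ones.
                if EOT in node and (j == n or not is_word_char(text[j])):
                    best_end, best_repl = j, node[EOT]
        if best_end is None:
            out.append(text[i])
            i += 1
        else:
            out.append(best_repl)
            i = best_end
    return "".join(out)


trie = build_trie({"New York": "NYC", "New": "old", "java": "Java"})
print(replace_keywords("I love New York and javascript, not java.", trie))
# I love NYC and javascript, not Java.
```

"New York" wins over "New" because the longer match is kept, and "javascript" is left alone because the match for "java" does not end at a word boundary.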

2. Jenkins: launching programs on a schedule

3. Tableau: a visualization tool

4. Database normal forms

5. Algorithms and data structures

6. ElasticSearch

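Official Python elasticsearch documentation (parameter configuration): https://elasticsearch-py.readthedocs.io/en/master/

① String field types

keyword: the value is not tokenized when stored; it is indexed as a single term.

text: the value is tokenized automatically when stored and an index is built on the tokens (this is smart, but useless for some fields, so using text for those fields wastes space).

Exact lookups:

An exact lookup on the zuName field: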

{ "query": { "term": { "zuName": "墙体钢结构" } } }

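This returns empty data, meaning nothing was found. The reason is that the value 墙体钢结构 was tokenized when it was stored, so the inverted index contains only the two terms '墙体' and '钢结构', and the lookup comes back empty.
2018-11-8: Note! When a term/terms query runs against a text field, the field content is tokenized first and the term/terms strings are then compared against those tokens. So if the words you pass to terms differ from the tokens es produced for the text field, the search finds nothing!

The tokenization above happens when the data is stored. There is a second kind, where your search parameters are tokenized at search time before the search runs. es ships many out-of-the-box analyzers, and you can also download a tokenizer developed by someone else, install it under the es plugins directory, and then declare it for use. For the zuName field I use the ik tokenizer, a Chinese tokenizer that nearly everyone ends up using; its git address is https://github.com/medcl/elasticsearch-analysis-ik

An exact lookup on the zuMakert field: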

{ "query": { "term": { "zuMakert": "张三李四" } } }

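This time the record is found, because keyword fields are not tokenized.

These are the results of exact (term) queries. With a tokenized query the same record can also be returned, but the relevance weights of the search results are different.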
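
The behavior of the two field types can be mimicked without a cluster. The sketch below is a toy model, not Elasticsearch itself: `toy_analyze` is a hypothetical stand-in for the ik analyzer with one hard-coded split, and the two query functions only imitate how term and match compare against the indexed terms.

```python
# Toy model of index-time analysis; not the real Elasticsearch or ik behavior.
TOY_SPLITS = {"墙体钢结构": ["墙体", "钢结构"]}  # hypothetical analyzer output


def toy_analyze(value):
    """Stand-in analyzer: use the hard-coded split, else keep the value whole."""
    return TOY_SPLITS.get(value, [value])


def indexed_terms(value, field_type):
    """keyword fields index the raw value; text fields index the tokens."""
    if field_type == "keyword":
        return {value}
    return set(toy_analyze(value))


def term_query(terms, query_value):
    """term: the query string is NOT analyzed and must equal an indexed term."""
    return query_value in terms


def match_query(terms, query_value):
    """match: the query string is analyzed first; any shared token is a hit."""
    return any(token in terms for token in toy_analyze(query_value))


zu_name_terms = indexed_terms("墙体钢结构", "text")
zu_makert_terms = indexed_terms("张三李四", "keyword")

print(term_query(zu_name_terms, "墙体钢结构"))   # False: only the tokens are indexed
print(term_query(zu_name_terms, "钢结构"))       # True: equals one indexed token
print(match_query(zu_name_terms, "墙体钢结构"))  # True: the query is tokenized too
print(term_query(zu_makert_terms, "张三李四"))   # True: keyword keeps the raw value
```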

② ElasticSearch query types
https://www.cnblogs.com/sunfie/p/7019701.html

e.g.:
1) Full-text search

GET /ecommerce/product/_search
{
  "query": {
    "match": {
      "producer": "yagao producer"
    }
  }
}
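2) Phrase search

This is the counterpart of full-text search. Full-text search splits the input string into terms and matches them one by one against the inverted index; a document is returned as soon as any one of the terms matches. Phrase search requires the specified field's text to contain the input string exactly as given, with the same terms in the same order, before the document counts as a match and is returned.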
{
  "query": {
    "match_phrase": {
      "url": "https://m.weibo.cn/status/4218271161401312"
    }
  }
}
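
The contrast between the two modes can be illustrated on token lists. This is a toy model under a loud assumption: text is tokenized by splitting on whitespace, which is not how a real analyzer such as ik works, and `toy_match` / `toy_match_phrase` are hypothetical helpers rather than Elasticsearch APIs.

```python
def toy_match(field_tokens, query_tokens):
    """match (default OR behavior): a hit if any query token occurs in the field."""
    return any(token in field_tokens for token in query_tokens)


def toy_match_phrase(field_tokens, query_tokens):
    """match_phrase: all query tokens must occur consecutively and in order."""
    n = len(query_tokens)
    return n > 0 and any(
        field_tokens[i:i + n] == query_tokens
        for i in range(len(field_tokens) - n + 1)
    )


doc = "best yagao producer in town".split()

print(toy_match(doc, "yagao maker".split()))            # True
print(toy_match_phrase(doc, "yagao producer".split()))  # True
print(toy_match_phrase(doc, "producer yagao".split()))  # False
```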

3) A SQL-style where in against an elasticsearch text field (exact match). should means OR; it is a sibling of filter inside bool, so it cannot be written directly inside filter.
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "url": "https://m.weibo.cn/status/4218271161401312"
          }
        },
        {
          "match_phrase": {
            "url": "https://m.weibo.cn/status/4216916723449408"
          }
        }
      ]
    }
  }
}
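
When the list of values is long, the should clauses are easier to generate than to type. A minimal sketch, assuming the query body is built as a Python dict before being sent; `where_in_phrases` is a hypothetical helper name, and `minimum_should_match` is set explicitly so the OR semantics survive if must or filter clauses are added to the same bool later.

```python
import json


def where_in_phrases(field, values):
    """Build a bool/should query with one match_phrase clause per value."""
    clauses = [{"match_phrase": {field: value}} for value in values]
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


urls = [
    "https://m.weibo.cn/status/4218271161401312",
    "https://m.weibo.cn/status/4216916723449408",
]
body = where_in_phrases("url", urls)
print(json.dumps(body, indent=2))
```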

③ Querying es for documents where two fields have equal values
{
  "query": {
    "bool": {
      "must": [{
        "match_all": {}
      }],
      "filter": [{
        "script": {
          "script": {
            "inline": "doc['weibo_del_num'].value - doc['event_weibo_num'].value == 0",
            "lang": "painless"
          }
        }
      }],
      "must_not": [],
      "should": []
    }
  },
  "from": 0,
  "size": 10
}
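
The same body can be produced for any pair of fields. A small sketch, assuming the query is assembled as a Python dict; `fields_equal_query` is a hypothetical helper. The `script_key` parameter exists because newer Elasticsearch versions reportedly expect `source` where this note uses `inline`, so check which one your version accepts.

```python
def fields_equal_query(field_a, field_b, size=10, script_key="inline"):
    """Build a script filter that keeps documents where two numeric fields match."""
    script = {
        # Same painless expression as in the note: the difference must be zero.
        script_key: "doc['%s'].value - doc['%s'].value == 0" % (field_a, field_b),
        "lang": "painless",
    }
    return {
        "query": {"bool": {"filter": [{"script": {"script": script}}]}},
        "from": 0,
        "size": size,
    }


body = fields_equal_query("weibo_del_num", "event_weibo_num")
print(body["query"]["bool"]["filter"][0]["script"]["script"]["inline"])
```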

④ Note: when a query needs several range / term conditions, write one clause per field. They cannot be combined like this:
{
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "range": {
            "sensitive_word": {"gt": "''"},
            "weibo_del_num": {"gte": 0}
          }
        }
      ],
      "must_not": []
    }
  }
}
This raises an error:

[image: error message]

The correct form:
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {"sensitive_word": {"gt": "''"}}
        },
        {
          "range": {"weibo_del_num": {"gte": "0"}}
        }
      ],
      "must_not": [],
      "should": []
    }
  },
  "from": 0,
  "size": 10,
  "sort": [],
  "aggs": {}
}
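
The one-clause-per-field rule is easy to enforce with a small generator. A minimal sketch, assuming the conditions are kept in a Python dict of the form {field: bounds}; `range_clauses` is a hypothetical helper name.

```python
def range_clauses(conditions):
    """Turn {field: {op: value}} into a list with one range clause per field."""
    return [{"range": {field: bounds}} for field, bounds in conditions.items()]


conditions = {
    "sensitive_word": {"gt": "''"},
    "weibo_del_num": {"gte": 0},
}
body = {
    "query": {"bool": {"must": range_clauses(conditions)}},
    "from": 0,
    "size": 10,
}
print(body["query"]["bool"]["must"])
```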

⑤ [The catch-all fix] When you don't know how to write a query, you can use SQL!
1) Open a URL that carries the SQL statement directly in the browser:
http://192.168.0.135:9200/_sql?sql=select * from zk_social where url in ('https://m.weibo.cn/status/4199823517293095','https://m.weibo.cn/status/4199819234949332')

Returned result:

[image: returned result]

2) Convert the SQL statement into es query syntax:
http://192.168.0.135:9200/_sql/_explain?sql=select * from zk_social where url in ('https://m.weibo.cn/status/4199823517293095','https://m.weibo.cn/status/4199819234949332')

Returned result:

[image: returned result]

3) From yf.L, 2018/12/7: a few es tips Leng Ge left before he moved on
http://192.168.0.135:9200/_sql/?sql=select * from zk_social where gov_id = '2' and pub_time between '2018-05-21 00:00:00' and '2018-05-21 23:59:59' order by pub_time limit 100000

It is installed on every machine in the cluster, so the IP at the front can be swapped for any of these five:
192.168.0.135
192.168.0.133
192.168.0.38
192.168.0.118
192.168.0.88

If you don't know how to write the es query, you can also convert the SQL into an es query:
http://192.168.0.135:9200/_sql/_explain?sql=select * from zk_social where is_deleted is null and gov_id = '2'
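
A browser tolerates the spaces and quotes in these URLs, but a script usually has to percent-encode the SQL first. A minimal sketch using only the standard library; `sql_url` is a hypothetical helper, and it assumes the `_sql` and `_sql/_explain` endpoints decode the `sql` parameter as ordinary URL-encoded text.

```python
from urllib.parse import quote


def sql_url(host, sql, explain=False, port=9200):
    """Build the _sql (or _sql/_explain) URL with the SQL percent-encoded."""
    path = "/_sql/_explain" if explain else "/_sql"
    return "http://%s:%d%s?sql=%s" % (host, port, path, quote(sql, safe=""))


sql = "select * from zk_social where is_deleted is null and gov_id = '2'"
print(sql_url("192.168.0.135", sql))
print(sql_url("192.168.0.133", sql, explain=True))
```

The resulting string can be passed to whatever HTTP client is at hand; any of the five hosts listed above can be used as `host`.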

⑥ ElasticSearch scoring (ranking) formula: https://blog.csdn.net/flashflight/article/details/52187413?utm_source=blogxgwz0

⑦ ES - group by "gov_id":
es_agg_query = {
    "size": 0,
    "aggregations": {
        "gov_id": {
            "terms": {
                "field": "gov_id",
                # size limits how many gov_id groups the group by returns
                "size": 9999
            },
            "aggregations": {
                "sum_count_read": {
                    "sum": {
                        "field": "count_read"
                    }
                },
                "sum_count_share": {
                    "sum": {
                        "field": "count_share"
                    }
                },
                "sum_count_comment": {
                    "sum": {
                        "field": "count_comment"
                    }
                }
            }
        }
    }
}
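
The response to a terms aggregation nests each group under `aggregations.<name>.buckets`, with every sub-aggregation as a `{"value": ...}` object. A minimal sketch for flattening that into rows; `flatten_buckets` is a hypothetical helper and the `sample` response below is made-up data that only mirrors the response shape.

```python
def flatten_buckets(response, agg_name="gov_id"):
    """Turn terms-aggregation buckets into a list of flat dicts."""
    rows = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        row = {agg_name: bucket["key"], "doc_count": bucket["doc_count"]}
        for name, value in bucket.items():
            # Sub-aggregations such as sum arrive as {"value": number}.
            if isinstance(value, dict) and "value" in value:
                row[name] = value["value"]
        rows.append(row)
    return rows


# Made-up sample response, for illustration only.
sample = {
    "aggregations": {
        "gov_id": {
            "buckets": [
                {
                    "key": "2",
                    "doc_count": 3,
                    "sum_count_read": {"value": 120.0},
                    "sum_count_share": {"value": 7.0},
                    "sum_count_comment": {"value": 15.0},
                }
            ]
        }
    }
}
print(flatten_buckets(sample))
```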

⑧ ES - group by over the Weibo posts of a given time period
{
  "size": 0,
  "aggregations": {
    "gov_id": {
      "terms": {"field": "gov_id", "size": 9999},
      "aggregations": {
        "sum_count_read": {
          "sum": {
            "field": "count_read"
          }
        },
        "sum_count_share": {
          "sum": {
            "field": "count_share"
          }
        },
        "sum_count_comment": {
          "sum": {
            "field": "count_comment"
          }
        }
      }
    }
  },
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "pub_time": {
              "lt": "2018-12-01 00:00:00",
              "gte": "2018-11-01 00:00:00"
            }
          }
        }
      ]
    }
  }
}
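
The half-open time window (gte the first day, lt the first day of the next month) can be computed instead of typed by hand. A minimal sketch with the standard library; `month_range` is a hypothetical helper, and the timestamp format is assumed to match the one used for pub_time above.

```python
from datetime import datetime


def month_range(year, month):
    """Return the gte/lt bounds covering one calendar month."""
    start = datetime(year, month, 1)
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
    fmt = "%Y-%m-%d %H:%M:%S"
    return {"gte": start.strftime(fmt), "lt": end.strftime(fmt)}


query = {"bool": {"filter": [{"range": {"pub_time": month_range(2018, 11)}}]}}
print(query)
```

Using lt on the first instant of the next month avoids having to know how many days the month has.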