Elasticsearch Installation and Basic Usage

2017-10-30  秦汉邮侠

1. Search Engine

2. Search Based on Lucene

3. ES vs. Relational DB

4. Elasticsearch Installation

├─bin
├─config
├─lib
└─modules
    ├─lang-expression
    ├─lang-groovy
    └─reindex

A single-node Elasticsearch installation works more or less out of the box and does not need much configuration. If you do need to change something, such as the data directory, the log directory, or the IP address and port the server listens on, edit the elasticsearch.yml file in the config directory; the JVM heap size is normally set through the ES_HEAP_SIZE environment variable in the 2.x releases rather than in this file.
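For illustration only, a few of the settings that are commonly changed in elasticsearch.yml might look like the following; the paths and port shown here are placeholder assumptions, not values from this article:

cluster.name: elasticsearch
node.name: node-1
path.data: /var/data/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 127.0.0.1
http.port: 9200

After the server is started, an HTTP GET on http://localhost:9200 returns basic information about the node: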

{
    "name": "D'Ken",
    "cluster_name": "elasticsearch",
    "version": {
        "number": "2.3.4",
        "build_hash": "e455fd0c13dceca8dbbdbb1665d068ae55dabe3f",
        "build_timestamp": "2016-06-30T11:24:31Z",
        "build_snapshot": false,
        "lucene_version": "5.5.0"
    },
    "tagline": "You Know, for Search"
}

5. Basic Operations of Elasticsearch

Create a document over HTTP with a PUT request:

curl -XPUT http://localhost:9200/blog/article/1 -d '{"title": "New version of ElasticSearch released!", "content": "Version 1.0 released today!", "tags": ["announce", "elasticsearch", "release"]}'

In http://localhost:9200/blog/article/1, blog is the index, article is the type, and 1 is the document id, which must be unique within a given index and type. You can also omit the document id and create the document with POST, in which case Elasticsearch generates a random id, as shown below:

curl -XPOST http://localhost:9200/blog/article -d '{"title": "New version of ElasticSearch released!", "content": "Version 1.0 released today!", "tags": ["announce", "elasticsearch", "release"]}'
curl -XGET http://localhost:9200/blog/article/_search -d '
{
    "fields": [
        "title",
        "author"
    ],
    "query": {
        "match_all": {}
    }
}'

6. Development Process

Add the Elasticsearch Java API to the project, for example with this Maven dependency:

<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>2.3.4</version>
</dependency>
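With the dependency in place, a client can be created. The snippet below is a minimal sketch of connecting to a 2.3.4 node with the TransportClient and indexing one document, equivalent to the curl -XPUT example above; the host, port (9300 is the default transport port, not the 9200 HTTP port) and cluster name are assumptions that must match your installation.

import java.net.InetAddress;

import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class EsClientDemo {
    public static void main(String[] args) throws Exception {
        // cluster.name must match the value reported by http://localhost:9200
        Settings settings = Settings.settingsBuilder()
                .put("cluster.name", "elasticsearch")
                .build();
        TransportClient client = TransportClient.builder()
                .settings(settings)
                .build()
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("localhost"), 9300));

        // index a document into blog/article with id 1
        IndexResponse response = client.prepareIndex("blog", "article", "1")
                .setSource("{\"title\":\"New version of ElasticSearch released!\","
                        + "\"content\":\"Version 1.0 released today!\"}")
                .get();
        System.out.println("indexed, version=" + response.getVersion());

        client.close();
    }
}

The mapping of the blog index can then be checked over HTTP: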
curl -XGET http://localhost:9200/blog/article/_mapping/

Alternatively, view it with the head plugin.

The result looks like this:

{
    "blog": {
        "mappings": {
            "article": {
                "_ttl": {
                    "enabled": false
                },
                "properties": {
                    "author": {
                        "type": "string",
                        "index": "not_analyzed"
                    },
                    "content": {
                        "type": "string",
                        " analyzer ": "ik "
                    },
                    "publish_time": {
                        "type": "date",
                        "format": "strict_date_optional_time||epoch_millis"
                    },
                    "read_count": {
                        "type": "integer"
                    },
                    "title": {
                        "type": "string",
                        "analyzer": "ik"
                    }
                }
            }
        }
    }
}

The main things to specify are each field's type and, for string fields, whether the value should be analyzed and with which analyzer. The mapping above can be built with the Java API like this:

import java.io.IOException;

import org.elasticsearch.common.xcontent.XContentBuilder;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public class BlogMapping {
    public static XContentBuilder getMapping() {
        XContentBuilder mapping = null;
        try {
            mapping = jsonBuilder()
                    .startObject()
                    // the _ttl (time-to-live) feature is explicitly left disabled
                    .startObject("_ttl")
                    .field("enabled", false)
                    .endObject()
                    .startObject("properties")
                    // title: full-text field analyzed with the ik Chinese analyzer
                    .startObject("title")
                    .field("type", "string")
                    .field("analyzer", "ik")
                    .endObject()
                    // content: also analyzed with ik, matching the mapping shown above
                    .startObject("content")
                    .field("type", "string")
                    .field("analyzer", "ik")
                    .endObject()
                    // author: exact-value string, stored without analysis
                    .startObject("author")
                    .field("type", "string")
                    .field("index", "not_analyzed")
                    .endObject()
                    .startObject("publish_time")
                    .field("type", "date")
                    .field("index", "not_analyzed")
                    .endObject()
                    .startObject("read_count")
                    .field("type", "integer")
                    .field("index", "not_analyzed")
                    .endObject()
                    .endObject()
                    .endObject();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return mapping;
    }
}
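The builder above only constructs the mapping JSON; it still has to be applied to the index. A minimal sketch, assuming the client created in section 6 and the elasticsearch-analysis-ik plugin installed on the node, is to register the mapping when the index is created:

// create the blog index and register the article mapping in one call
client.admin().indices()
        .prepareCreate("blog")
        .addMapping("article", BlogMapping.getMapping())
        .get();

With the ik analyzer declared in the mapping, the _analyze API shows how a piece of text is tokenized: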
curl -XGET 'localhost:9200/_analyze' -d '
{
  "analyzer": "ik",
  "text": "百度张亚勤:ABC时代来了,迎战云计算“马拉松”"
}'

On the command line the official documentation recommends GET for this request, though POST also works. The terminal does not handle Chinese characters very well, so it is easier to inspect analysis results through the head plugin or another REST-client GUI; when using a GUI tool, be sure to send the request with POST.
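The same analysis can also be run from Java. The fragment below is a sketch that assumes the client from section 6 and the ik plugin; the exact builder methods may differ slightly between minor 2.x versions.

import org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse;

// ask the cluster to analyze a sentence with the ik analyzer and print the tokens
AnalyzeResponse analyzed = client.admin().indices()
        .prepareAnalyze("百度张亚勤:ABC时代来了,迎战云计算马拉松")
        .setAnalyzer("ik")
        .get();
for (AnalyzeResponse.AnalyzeToken token : analyzed.getTokens()) {
    System.out.println(token.getTerm());
}

Once you know which tokens the field was split into, you can query it, for example with a term query: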

GET http://localhost:9200/blog/article/_search
{
  "query": {
    "term": {
      "title": "百度"
    }
  }
}

When the title value queried is “百度” or “马拉松” the document is retrieved; when it is “张亚勤” or “云计算” nothing is found, because the ik analyzer tokenized the title into [百度,百,度,张,亚,勤,abc,时代,来了,迎战,战云,计算,马拉松,马,拉,松], which contains no token equal to “张亚勤” or “云计算”.
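For reference, the same term query can be issued from Java. This is a sketch assuming the client built in section 6; a term query does not analyze the search word, so it has to match an indexed token exactly.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.QueryBuilders;

// term query: the search word is compared verbatim against the indexed tokens
SearchResponse termResp = client.prepareSearch("blog")
        .setTypes("article")
        .setQuery(QueryBuilders.termQuery("title", "百度"))
        .get();
System.out.println("hits: " + termResp.getHits().getTotalHits());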
If we query with a match query instead:

{
  "query": {
    "match": {
      "title": {
        "query": "张亚勤",
        "operator": "and",
        "minimum_should_match": "3"
      }
    }
  }
}

With a match query, the ik analyzer first splits 张亚勤 into [张, 亚, 勤] and then compares each of those tokens against [百度,百,度,张,亚,勤,abc,时代,来了,迎战,战云,计算,马拉松,马,拉,松]. operator=and means the token comparisons are combined with AND, and minimum_should_match=3 requires at least three tokens to match; under these conditions the document is retrieved.
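A sketch of the same match query through the Java API, again assuming the client from section 6 (in the 2.x client the operator enum lives on MatchQueryBuilder):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.MatchQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// match query: the search word is analyzed, then each token is matched against the field
SearchResponse matchResp = client.prepareSearch("blog")
        .setTypes("article")
        .setQuery(QueryBuilders.matchQuery("title", "张亚勤")
                .operator(MatchQueryBuilder.Operator.AND)   // every token must match
                .minimumShouldMatch("3"))                    // at least 3 matching tokens
        .get();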
multi_match
If we want a document to match as long as either of two fields contains the search words, we use multi_match. A user's search term is rarely matched against a single field; usually several fields are searched together. For example, to find documents whose title or content contains “云计算”, the DSL is:

{
  "query": {
    "multi_match": {
        "query" : "云计算",
        "fields" : ["title", "content"]
    }
  }
}
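The Java equivalent is a one-liner with QueryBuilders.multiMatchQuery (a sketch, same assumptions as the earlier fragments):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.QueryBuilders;

// search 云计算 in either the title or the content field
SearchResponse multiResp = client.prepareSearch("blog")
        .setTypes("article")
        .setQuery(QueryBuilders.multiMatchQuery("云计算", "title", "content"))
        .get();

To see how Elasticsearch arrives at a document's relevance score, add the explain parameter to a search request, as in the following example: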
POST http://localhost:9200/blog/article/_search?explain
{
  "query": {
    "match": {
      "title": {
        "query": "罗一笑",
        "operator": "and",
        "minimum_should_match": "3"
      }
    }
  }
}

Search term: 罗一笑
At query time the search term is split into 4 terms: 罗, 一笑, 一, 笑.
Each term receives a score:
罗: 0.042223614
一笑: 0.11945401
一: 0.11945401
笑: 0.11945401
The total score of the document for 罗一笑 is 0.40058565 = 0.042223614 + 0.11945401 + 0.11945401 + 0.11945401.
Each term is scored as: term score = queryWeight * fieldWeight
Score of 一笑: 0.11945401 = 0.54607546 * 0.21875
Query weight: queryWeight = idf * queryNorm

{
    "value": 0.54607546,
    "description": "queryWeight, product of:",
    "details": [
        {
            "value": 1,
            "description": "idf(docFreq=1, maxDocs=2)",
            "details": []
        },
        {
            "value": 0.54607546,
            "description": "queryNorm",
            "details": []
        }
    ]
}

Field weight: fieldWeight = tf * idf * fieldNorm

{
    "value": 0.21875,
    "description": "fieldWeight in 0, product of:",
    "details": [
        {
            "value": 1,
            "description": "tf(freq=1.0), with freq of:",
            "details": [
                {
                    "value": 1,
                    "description": "termFreq=1.0",
                    "details": []
                }
            ]
        },
        {
            "value": 1,
            "description": "idf(docFreq=1, maxDocs=2)",
            "details": []
        },
        {
            "value": 0.21875,
            "description": "fieldNorm(doc=0)",
            "details": []
        }
    ]
}

Two concepts matter here: term frequency and inverse document frequency.
Term frequency: how often a term appears in a given field of a given document; the higher the frequency, the more relevant the document is to the search.
The formula is tf(q in d) = sqrt(termFreq).
Inverse document frequency: how often a term appears across the index (a single shard). The more documents a term appears in, the less it contributes to the score; a term that occurs in a large share of the documents carries less weight than one that occurs in only a few. In plain words: rarity makes a term valuable.
The formula is:
idf = 1 + ln(maxDocs/(docFreq + 1))
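Plugging the explain output above into these formulas: idf = 1 + ln(2/(1+1)) = 1 and tf = sqrt(1) = 1, so fieldWeight = 1 * 1 * 0.21875 = 0.21875 and queryWeight = idf * queryNorm = 1 * 0.54607546 = 0.54607546, which multiply to the per-term score 0.11945401 shown earlier.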
For the complete formulas, see the Lucene source code or http://www.hankcs.com/program/java/lucene-scoring-algorithm-explained.html
Lucene's scoring machinery is fairly involved, but the formulas above boil down to a few simple rules:
The rarer the matched term, the higher the document's score.
The shorter the matched field, the higher the document's score.
The higher the field's boost, the higher the document's score.

Official reference documentation
