Multi-fields and Custom Analyzers
2020-07-08 · 滴流乱转的小胖子
1. Multi-fields
Exact matching on a field
- Add a keyword sub-field
Using different analyzers
- Different languages
- Searching via a pinyin sub-field
- Different analyzers can also be specified for indexing and for search (see the sketch below)
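A minimal sketch of such a mapping (the index name my_products and the field names are made up for illustration, not from the original):

PUT my_products
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" },
          "english": { "type": "text", "analyzer": "english" }
        }
      },
      "comment": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "standard"
      }
    }
  }
}

Here title.keyword supports exact matching and title.english runs the same source text through an English-specific analyzer, while comment indexes with one analyzer and searches with another via search_analyzer.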
2. Exact Values vs Full Text
Exact Values: numbers, dates, or a specific string (e.g. "Apple Store")
- keyword in ES
Full Text: unstructured text data
- text in ES
Exact Values do not need to be analyzed
- ES creates an inverted index for every field
- At index time, Exact Values need no special analysis
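A quick way to see the difference (expected terms noted in the comments): the keyword analyzer emits the whole input as a single term, while standard splits and lowercases it.

# one term: "Apple Store"
POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}
# two terms: "apple", "store"
POST _analyze
{
  "analyzer": "standard",
  "text": "Apple Store"
}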
An analyzer is made up of three building blocks:
- Character Filters
- Tokenizer
- Token Filter
3. Custom Analyzers
When the analyzers that ship with ES cannot meet your needs, you can build a custom analyzer by combining the components below.
3.1 Character Filters
Character Filters process the text before it reaches the Tokenizer, e.g. to add, remove, or replace characters. Multiple Character Filters can be configured, and they affect the position and offset information the Tokenizer produces.
Some built-in Character Filters (each is demonstrated in the examples further down):
- HTML strip -- removes HTML tags
- Mapping -- string replacement
- Pattern replace -- regex-based replacement
3.2 Tokenizer
- Splits the raw text into terms (tokens) according to a set of rules
- Built-in Tokenizers in ES (path_hierarchy and whitespace are demonstrated below):
whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy
- You can also implement your own Tokenizer as a component written in Java
3.3 Token Filters
- Add to, modify, or remove the terms output by the Tokenizer
- Built-in Token Filters (lowercase, stop, and snowball are demonstrated below):
lowercase / stop / synonym (adds synonyms)
How analysis runs
1. Overall execution order of an analyzer:
char filter -> tokenizer -> token filter
2. When analysis happens:
Analysis is done at index time, i.e. when data is written, to build the inverted index and make search fast. The raw document that was written is kept in _source.

# Write a document; dynamic mapping infers the field type
PUT logs/_doc/1
{"level":"DEBUG"}
# Inspect the generated mapping
GET /logs/_mapping
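To confirm that the raw document is preserved, fetch it back; the response returns the original JSON under _source (a quick check, not from the original post):

GET /logs/_doc/1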
# html_strip removes the tags; the keyword tokenizer keeps the text as one term
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
# path_hierarchy emits one term per path level
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/user/ymruan/a/b/c/d/e"
}
# Use a mapping char filter to replace characters
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}
// Use a char filter to replace emoticons
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [":) => happy", ":( => sad"]
    }
  ],
  "text": ["I am feeling :)", "Feeling :( today"]
}
// whitespace tokenizer with stop and snowball filters
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop", "snowball"],
  "text": ["The girls in China are playing this game!"]
}
// The same filters on another sentence; capitalized "The" survives the stop filter
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop", "snowball"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
// After adding lowercase, "The" becomes "the" and is removed as a stopword
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop", "snowball"],
  "text": ["The girls in China are playing this game!"]
}
// pattern_replace char filter: strip the protocol with a regular expression
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}
Custom analyzer example
The standard skeleton of a custom analyzer is:

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": { ... custom character filters ... },
      "tokenizer": { ... custom tokenizers ... },
      "filter": { ... custom token filters ... },
      "analyzer": { ... custom analyzers ... }
    }
  }
}
============================ Example ===========================
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and "]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}
============================ Example ===========================
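You can test the analyzer by name before attaching it to any field; given the definition above, the expected terms are quick, and, brown, fox:

GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The quick & brown fox"
}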
Suppose the custom analyzer is named my_analyzer; apply it to a new field in this index:
PUT /my_index/_mapping
{
  "properties": {
    "username": {
      "type": "text",
      "analyzer": "my_analyzer"
    },
    "password": {
      "type": "text"
    }
  }
}
================= Insert data ====================
PUT /my_index/_doc/1
{
  "username": "The quick & brown fox ",
  "password": "The quick & brown fox "
}
==== username uses the custom analyzer my_analyzer; password uses the default standard analyzer ====
=== Verify
GET /my_index/_analyze
{
  "field": "username",
  "text": "The quick & brown fox"
}
GET /my_index/_analyze
{
  "field": "password",
  "text": "The quick & brown fox"
}
// The official definitive guide explains this really well. The version is old (Elasticsearch 2.x) and some of its APIs no longer apply. Custom analyzers: https://www.elastic.co/guide/cn/elasticsearch/guide/cn/custom-analyzers.html