ES Multi-field Feature and Custom Analyzers in Index Settings
1. Multi-field Types
1.1 Multi-field Feature
- When a string field needs exact matching, add a keyword sub-field
- Use different analyzers on different sub-fields
  - for different languages
  - a pinyin sub-field for search
  - different analyzers for indexing and for search
Example
PUT products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "comment": {
        "type": "text",
        "fields": {
          "english_comment": {
            "type": "text",
            "analyzer": "english",
            "search_analyzer": "english"
          }
        }
      }
    }
  }
}
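As a quick, illustrative sketch (the document and queries below are hypothetical, not from the original article), the keyword sub-field can serve exact matches while the English-analyzed sub-field benefits from stemming:
// index a sample document (hypothetical data)
PUT products/_doc/1
{
  "company": "Apple Store",
  "comment": "These shoes are running large"
}

// exact match against the un-analyzed keyword sub-field
POST products/_search
{
  "query": {
    "term": { "company.keyword": "Apple Store" }
  }
}

// full-text search against the English-analyzed sub-field ("run" matches "running" via stemming)
POST products/_search
{
  "query": {
    "match": { "comment.english_comment": "run" }
  }
}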
1.2 Exact Values and Full Text
- Exact values: numbers, dates, and specific strings such as "Apple Store"; they correspond to keyword in ES and are not analyzed (no tokenization) at index time.
- Full text: unstructured text data, which corresponds to text in ES and is analyzed.
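The difference is easy to see with _analyze (a small sketch using the built-in keyword and standard analyzers):
// keyword: the whole value is kept as one unmodified term
POST _analyze
{
  "analyzer": "keyword",
  "text": "Apple Store"
}

// text fields are analyzed (standard analyzer by default): split into lowercased terms
POST _analyze
{
  "analyzer": "standard",
  "text": "Apple Store"
}
The first call returns the single term Apple Store; the second returns apple and store.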
2. Custom Analyzers
When the analyzers that ship with ES do not meet your needs, you can build your own by combining different Character Filters, a Tokenizer, and Token Filters. A combined call is sketched below, and each component is then shown separately. For more detail see
https://www.jianshu.com/p/4cff1721934a
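All three stages can also be combined ad hoc in a single _analyze call, without creating an index first (the particular components here are only for illustration):
// char_filter -> tokenizer -> token filter, chained in one request
POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Hello World</b>"
}
This should return the tokens hello and world.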
2.1 Character Filter Examples
// strip HTML
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<p>haha</p>"
}
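The html_strip char filter removes the <p> markup before the keyword tokenizer sees the text, so the single emitted token is essentially just haha.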
// character replacement: turn - into _
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "010-123-1231"
}
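Here the mapping char filter rewrites the text to 010_123_1231 before tokenization; since the standard tokenizer does not split on underscores, the phone number survives as a single term.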
// regex replacement: strip the protocol prefix
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.baidu.com"
}
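The pattern_replace filter turns the input into www.baidu.com before it reaches the tokenizer, so the http:// prefix never gets indexed.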
2.2 Tokenizer Example
Split a path into its hierarchy:
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "a/b/c"
}
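With the default / delimiter this returns the tokens a, a/b and a/b/c, which makes it easy to filter documents by directory prefix.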
2.3 Token Filter
Tokenize on whitespace, lowercase the tokens, then drop stop words:
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": ["The girls in here is singing"]
}
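With the default English stop word list, the, in and is are dropped after lowercasing, leaving tokens along the lines of girls, here and singing.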
2.4 A Complete Custom Analyzer Defined in Index Settings
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => happy",
            ":( => sad"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
Test it:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I am :) person"
}
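For this sentence, the emoticons char filter first rewrites :) to happy, the punctuation tokenizer then splits the text into words, and lowercase yields i, am, happy and person; english_stop would additionally remove any token that appears in the default English stop list.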