Elasticsearch Analysis 02

2019-05-09  歌哥居士

Standard Analyzer

The standard analyzer is built from:

- Tokenizer: Standard Tokenizer
- Token filters: Lower Case Token Filter; Stop Token Filter (disabled by default)

Standard Analyzer example

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

Standard Analyzer configuration

| Parameter | Description |
| --- | --- |
| max_token_length | The maximum token length; a token longer than this is split at max_token_length intervals. Defaults to 255. |
| stopwords | A pre-defined stop-word list (e.g. _english_) or an array of stop words. Defaults to _none_. |
| stopwords_path | The path to a file containing stop words. |

Example

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_array_analyzer": {
          "type": "standard",
          "stopwords": ["the","2","quick","brown","foxes","jumped","over","dog's","bone"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_array_analyzer", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ lazy ]

Simple Analyzer

The simple analyzer is built from:

- Tokenizer: Lowercase Tokenizer

Simple Analyzer example

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Whitespace Analyzer

The whitespace analyzer is built from:

- Tokenizer: Whitespace Tokenizer

Whitespace Analyzer example

POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

Stop Analyzer

The stop analyzer is built from:

- Tokenizer: Lowercase Tokenizer
- Token filter: Stop Token Filter

Stop Analyzer example

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

Stop Analyzer configuration

| Parameter | Description |
| --- | --- |
| stopwords | A pre-defined stop-word list (e.g. _english_) or an array of stop words. Defaults to _english_. |
| stopwords_path | The path to a file containing stop words, relative to the Elasticsearch config directory. |

Example

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer":{
          "type": "stop",
          "stopwords":  ["the","2","quick","brown","foxes","jumped","over","dog","s","bone"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ lazy ]
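A stop list like the one above can also be loaded from a file via stopwords_path. A minimal sketch, assuming a hypothetical file analysis/my_stopwords.txt (one word per line) placed under the Elasticsearch config directory:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_stop_analyzer": {
          "type": "stop",
          "stopwords_path": "analysis/my_stopwords.txt"
        }
      }
    }
  }
}
```

With the same ten words in the file, running _analyze with my_file_stop_analyzer should again produce [ lazy ].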

Keyword Analyzer

The keyword analyzer is built from:

- Tokenizer: Keyword Tokenizer

Keyword Analyzer example

POST _analyze
{
  "analyzer": "keyword", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]
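In practice, a field that should not be analyzed at all is usually mapped with the keyword field type rather than the keyword analyzer; both store the input as a single term. A minimal sketch for comparison (the index and field names are illustrative, assuming Elasticsearch 7.x mappings without types):

```
PUT my_index
{
  "mappings": {
    "properties": {
      "tags": {
        "type": "keyword"
      }
    }
  }
}
```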

Pattern Analyzer

The pattern analyzer is built from:

- Tokenizer: Pattern Tokenizer
- Token filters: Lower Case Token Filter; Stop Token Filter (disabled by default)

Pattern Analyzer example

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Pattern Analyzer configuration

| Parameter | Description |
| --- | --- |
| pattern | A Java regular expression. Defaults to \W+. |
| flags | Java regular-expression flags; separate multiple flags with \|, e.g. "CASE_INSENSITIVE\|COMMENTS". |
| lowercase | Whether to lowercase terms. Defaults to true. |
| stopwords | A pre-defined stop-word list (e.g. _english_) or an array of stop words. Defaults to _none_. |
| stopwords_path | The path to a file containing stop words. |

Example

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_pattern_analyzer", 
  "text": "John_Smith@foo-bar.com"
}

Produces [ john, smith, foo, bar, com ]
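Since pattern accepts any Java regular expression, it can also split on delimiters other than \W, for example commas. A sketch (the analyzer name is illustrative):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_csv_analyzer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_csv_analyzer",
  "text": "red,GREEN,blue"
}
```

Because lowercase defaults to true, this produces [ red, green, blue ].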

Fingerprint Analyzer

The fingerprint analyzer is built from:

- Tokenizer: Standard Tokenizer
- Token filters: Lower Case Token Filter; ASCII Folding Token Filter; Stop Token Filter (disabled by default); Fingerprint Token Filter

Fingerprint Analyzer example

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

Produces [ and consistent godel is said sentence this yes ]

Fingerprint Analyzer configuration

| Parameter | Description |
| --- | --- |
| separator | The character used to join the terms. Defaults to a space. |
| max_output_size | The maximum length of the output token; if the joined fingerprint exceeds this size, the entire output is discarded (it is not truncated). Defaults to 255. |
| stopwords | A pre-defined stop-word list (e.g. _english_) or an array of stop words. Defaults to _none_. |
| stopwords_path | The path to a file containing stop words. |

Example

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer":{
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

Produces [ consistent godel said sentence yes ]

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer":{
          "type": "fingerprint",
          "stopwords": "_english_",
          "separator": "-"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

Produces [ consistent-godel-said-sentence-yes ]

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer":{
          "type": "fingerprint",
          "max_output_size": 30
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

Produces nothing: the joined fingerprint is longer than max_output_size (30), so the entire output is discarded.

Additional notes

Custom Analyzer

Custom Analyzer configuration

| Parameter | Description |
| --- | --- |
| tokenizer | A built-in or custom tokenizer. |
| char_filter | Built-in or custom character filters. Optional. |
| filter | Built-in or custom token filters. Optional. |
| position_increment_gap | When a field value is an array with multiple values, the token positions of later values are bumped by this gap so that phrase queries cannot match across values. Defaults to 100. For example, [ "John Abraham", "Lincoln Smith" ] tokenizes to positions 0, 1, 101, 102, which prevents cross-value matches. See the position_increment_gap section of the Mapping article for details. |
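The effect of position_increment_gap is easiest to see with a phrase query across array values. A sketch (the index, field, and document are illustrative):

```
PUT my_index
{
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "position_increment_gap": 100
      }
    }
  }
}

PUT my_index/_doc/1
{
  "names": [ "John Abraham", "Lincoln Smith" ]
}

GET my_index/_search
{
  "query": {
    "match_phrase": {
      "names": "Abraham Lincoln"
    }
  }
}
```

The phrase query matches nothing: "Abraham" sits at position 1 and "Lincoln" at position 101, so the two terms are not adjacent.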

Example 1:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}

Produces [ is, this, deja, vu ]
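An analyzer defined under settings only takes effect once a field mapping references it. A minimal sketch wiring the same custom analyzer to a field (the field name title is illustrative, assuming Elasticsearch 7.x):

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [ "html_strip" ],
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}
```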

Example 2

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop":{
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text":     "I'm a :) person, and you?"
}

Produces [ i'm, _happy_, person, you ]
