ES7 Tokenizer


I. Examples

1. The standard tokenizer ("tokenizer": "standard")

It divides text on word boundaries, as defined by the Unicode Text Segmentation algorithm, and removes most punctuation.

POST _analyze 
{
    "tokenizer": "standard",
    "text": "Those who dare to fail miserably can achieve greatly."
}
Output: [Those, who, dare, to, fail, miserably, can, achieve, greatly]
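
The standard tokenizer also accepts a max_token_length parameter (default 255); any term longer than the limit is split at max_token_length intervals. A minimal sketch, with made-up index and analyzer names:

PUT my_std_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_std_analyzer": {
                    "tokenizer": "my_std_tokenizer"
                }
            },
            "tokenizer": {
                "my_std_tokenizer": {
                    "type": "standard",
                    "max_token_length": 5
                }
            }
        }
    }
}

With this setting, analyzing the sentence above would split miserably into miser and ably.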

2. The letter tokenizer ("tokenizer": "letter")

It breaks text into terms whenever it encounters a character that is not a letter.

POST _analyze
 {
   "tokenizer": "letter",
   "text": "You're a wizard, Harry."
 }

The text is tokenized as follows (You're is split at the apostrophe, since it is not a letter):

[You, re, a, wizard, Harry]

3. The lowercase tokenizer ("tokenizer": "lowercase")

It behaves like the letter tokenizer, but additionally lowercases every term.

POST _analyze
 {
   "tokenizer": "lowercase",
   "text": "You're a wizard, Harry."
 }
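
Output: [you, re, a, wizard, harry]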

4. The whitespace tokenizer ("tokenizer": "whitespace")

It breaks text into terms whenever it encounters a whitespace character.

POST _analyze
 {
   "tokenizer": "whitespace",
   "text": "You're a wizard, Harry."
 }

Output: [You're, a, wizard, Harry] (the apostrophe is not whitespace, so You're remains a single term)

5. The keyword tokenizer ("tokenizer": "keyword")

It outputs the entire input as a single term.

POST _analyze
 {
   "tokenizer": "keyword",
   "text": "Los Angeles"
 }
Output: [Los Angeles]
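
The keyword tokenizer is usually paired with token filters, which normalize the value without splitting it; for example, adding the lowercase filter gives case-insensitive exact matching:

POST _analyze
{
    "tokenizer": "keyword",
    "filter": ["lowercase"],
    "text": "Los Angeles"
}

Output: [los angeles]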

6. The pattern tokenizer ("tokenizer": "pattern")

The pattern tokenizer uses a regular expression to either split the text into terms or capture the matching text as terms.
The default pattern is \W+, which splits the text whenever it encounters a non-word character.

POST _analyze
 {
   "tokenizer": "pattern",
   "text": "The foo_bar_size's default is 5."
 }
Output: [The, foo_bar_size, s, default, is, 5]

The pattern is configurable through index settings. Note that pattern takes a single Java regular expression string, not a list of separators, so splitting on either a comma or a pipe uses a character class here (\s* also consumes the whitespace that follows a separator):

PUT my_index_name 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer_name": {
                    "tokenizer": "my_tokenizer_name"
                }
            },
            "tokenizer": {
                "my_tokenizer_name": {
                    "type": "pattern",
                    "pattern": [",", "|"]
                }
            }
        }
    }
}

Test:

POST my_index_name/_analyze 
{
    "analyzer": "my_analyzer_name",
    "text": "comma, separated, values|one|two|three-four"
}
The result: [comma, separated, values, one, two, three-four]
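
Besides splitting, the pattern tokenizer can emit captured groups as terms via the group parameter. A minimal sketch (index and analyzer names are made up) that keeps only the quoted values:

PUT my_capture_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_capture_analyzer": {
                    "tokenizer": "my_capture_tokenizer"
                }
            },
            "tokenizer": {
                "my_capture_tokenizer": {
                    "type": "pattern",
                    "pattern": "\"([^\"]+)\"",
                    "group": 1
                }
            }
        }
    }
}

POST my_capture_index/_analyze
{
    "analyzer": "my_capture_analyzer",
    "text": "\"value one\" and \"value two\""
}

This would produce [value one, value two]; text outside the quotes is discarded.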

7. The simple pattern tokenizer ("type": "simple_pattern")

Similar to the pattern tokenizer, but its single pattern matches the terms themselves rather than the separators, and it supports only a restricted subset of regular-expression features, which generally makes it faster than the pattern tokenizer.

PUT my_index_name 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer_name": {
                    "tokenizer": "my_tokenizer_name"
                }
            },
            "tokenizer": {
                "my_tokenizer_name": {
                    "type": "simple_pattern",
                    "pattern": "[0123456789]{3}"
                }
            }
        }
    }
}

Test:

POST my_index_name/_analyze
{
    "analyzer": "my_analyzer_name",
    "text": "asta-313-267-847-mm-309"
}
Output: [313, 267, 847, 309]