Logstash Filters: An Annotated Guide

2017-11-28 Zparkle

1.grok

Description

Parse arbitrary text and structure it.
Grok is a great way to parse unstructured log data into something structured and queryable.
This tool is perfect for syslog logs, apache and other webserver logs, mysql logs, and in general, any log format that is generally written for humans and not computer consumption.
Logstash ships with about 120 patterns by default. You can find them here: https://github.com/logstash-plugins/logstash-patterns-core/tree/master/patterns. You can add your own trivially. (See the 'patterns_dir' setting)

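In short, grok parses arbitrary text and gives it structure.
It works especially well on syslog, Apache and other web server logs, MySQL logs, and more generally on any log format that was written for people to read rather than for programs to consume.
Logstash ships with about 120 default patterns (a quick look at the repository shows they are ordinary regular expressions). You can also write your own patterns and, if they are generally useful, submit them upstream as a pull request.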

Grok Basics

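Grok structures a log line by matching named patterns against its text.
The syntax is %{SYNTAX:SEMANTIC}
SYNTAX is the name of the pattern to match; SEMANTIC is the name of the field that stores the matched text.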

Example
1234 55.3.244.1
is matched by %{NUMBER:IDKFKDATA} %{IP:clientaddr}
The leading number is stored in a field named "IDKFKDATA"; the second value is recognized as an IP address and stored in the clientaddr field.

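By default every captured field is stored as a string. If you want another type, add a conversion suffix:
%{NUMBER:num:int} produces an integer field. Grok supports only the int and float conversions.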
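A minimal sketch of the type suffix inside a full filter block. The field names bytes and duration are chosen for illustration only.

```
filter {
  grok {
    # Without a suffix both captures would be stored as strings.
    # :int and :float convert the captured text on the way in.
    match => { "message" => "%{NUMBER:bytes:int} %{NUMBER:duration:float}" }
  }
}
```

For a message such as "15824 0.043" this should give an integer bytes field and a float duration field.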

Examples: With that idea of a syntax and semantic, we can pull out useful fields from a sample log like this fictional http request log:

55.3.244.1 GET /index.html 15824 0.043

The pattern for this could be:

%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}
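Placed in a pipeline, the pattern above might be used as follows. This is a sketch; the file path is an assumption made for the example.

```
input {
  file {
    # Hypothetical log location
    path => "/var/log/http.log"
  }
}
filter {
  grok {
    match => { "message" => "%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}" }
  }
}
```

After the filter runs on the sample line, the event should carry five extra fields: client (the leading IP address), method (GET), request (/index.html), bytes (15824) and duration (0.043), all stored as strings.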

Custom patterns

(?<field_name>the pattern here)

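This inline named-capture syntax lets you write your own regular expression and store whatever it matches in field_name.
You can also define a named pattern in a pattern file.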
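A sketch of the file-based approach, following the shape of the example in the grok documentation. The directory ./patterns and the file name are assumptions; each line of a pattern file is a pattern name, a space, and the regular expression.

```
# contents of ./patterns/postfix (hypothetical file)
POSTFIX_QUEUEID [0-9A-F]{10,11}
```

```
filter {
  grok {
    # Load every pattern file found in this directory
    patterns_dir => ["./patterns"]
    match => { "message" => "%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}" }
  }
}
```

Once loaded, the custom name is used exactly like a built-in pattern.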

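Options

break_on_match
boolean, default true. Grok stops at the first pattern that matches. If set to false, every pattern is tried on each event; set it to false when you have configured several patterns for the same data and want all of them applied.
keep_empty_captures
boolean, default false. By default, a capture that matched nothing does not produce an empty field.
match
Value type is hash; the default is {}. Each entry is a key-value pair such as "field1" => "value1", mapping a field name to the pattern to match against it.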

Example:
match => {
"field1" => "value1"
"field2" => "value2"
...
}

The most common use is to parse the message field:

filter {
grok { match => { "message" => "Duration: %{NUMBER:duration}" } }

}

To try several patterns against the same field, pass an array:

filter {
grok { match => { "message" => [ "Duration: %{NUMBER:duration}", "Speed: %{NUMBER:speed}" ] } }
}

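Here "Duration: " is literal text matched as an ordinary regular expression, while %{...} is a grok pattern reference. The former only has to match; the latter captures into a field.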
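As a sketch of how break_on_match interacts with a pattern array: with the default (true), grok stops after the first pattern that matches, so a message containing both phrases would yield only duration. Setting it to false lets both patterns run. The sample message in the comment is made up for illustration.

```
filter {
  grok {
    # Hypothetical input: "Duration: 12 Speed: 3"
    break_on_match => false
    match => {
      "message" => [
        "Duration: %{NUMBER:duration}",
        "Speed: %{NUMBER:speed}"
      ]
    }
  }
}
```

With this setting the event should end up with both a duration and a speed field.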

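named_captures_only
boolean, default true.
Only captures that are given a field name are stored on the event.
overwrite
A rather odd option. It lets grok overwrite the value of a field that already exists, and the new value can be one that grok itself captured. (This raises a question: if I capture a new message out of the existing message field and overwrite with it, how many message values are left?)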

Experiment: send an event whose message field is "May 29 16:37:11 sadness logger: hello world"
and apply:
filter {
 grok {
   match => { "message" => "%{SYSLOGBASE} %{DATA:message}" }
   overwrite => [ "message" ]
 }
}
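Result: grok ran, and %{SYSLOGBASE} (defined in the linux-syslog pattern file) extracted its fields correctly.
%{DATA:message}, however, captured nothing.
The reason is that DATA is defined as .*?, a non-greedy match; with nothing after it in the pattern, it is satisfied by the empty string. Replacing it with %{GREEDYDATA:message} (defined as .*) captures the rest of the line.

Without the overwrite setting, message becomes an array: [0] holds the original message and [1] holds the captured "hello world".
With overwrite set, message contains only "hello world".
While testing I also tried the inline custom pattern syntax, and it worked reliably.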
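For reference, a variant of the experiment's config that avoids the empty capture. It is a sketch that swaps DATA for GREEDYDATA and changes nothing else.

```
filter {
  grok {
    # GREEDYDATA is .* and takes the rest of the line;
    # DATA is .*? and can match the empty string at the end of a pattern
    match => { "message" => "%{SYSLOGBASE} %{GREEDYDATA:message}" }
    overwrite => [ "message" ]
  }
}
```

For the sample event above, message should end up as "hello world".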

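pattern_definitions
A slightly curious option. Its value is a hash of name => regular expression pairs that define custom patterns. In practice the inline named-capture syntax shown earlier can usually do the same job; this option is roughly the contents of a pattern file written directly into the config.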
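A small sketch of pattern_definitions. ORDER_ID and its regular expression are invented for the example; the pattern is visible only to this filter instance.

```
filter {
  grok {
    # ORDER_ID is a hypothetical pattern name defined inline
    pattern_definitions => { "ORDER_ID" => "ORD-[0-9]{6}" }
    match => { "message" => "order %{ORDER_ID:order_id} shipped" }
  }
}
```

The equivalent inline form would be (?<order_id>ORD-[0-9]{6}); the named definition pays off mainly when the same expression is reused in several match entries.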
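patterns_dir
As mentioned, patterns can be kept in files; this option tells grok where to find them. The value is an array ([]), so several directories can be listed.
patterns_files_glob
A string: a glob pattern (not a regular expression) selecting which files in patterns_dir are loaded.
The default is *.
tag_on_failure
Default ["_grokparsefailure"]. When no pattern matches, this value is appended to the event's tags field (it is a tag, not a new field) to tell you grok failed.
tag_on_timeout
Works like the option above, but for timeouts; the default is "_groktimeout".
timeout_millis
Sets the match timeout in milliseconds, default 30000. With several patterns, the timeout applies to each pattern separately. It never fires early, but it may fire a little late. A value of 0 disables the timeout.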
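One way to act on the failure tag is a conditional after the grok block. This is a sketch: the option values shown are simply the defaults written out, and parse_status is a made-up field name.

```
filter {
  grok {
    match => { "message" => "Duration: %{NUMBER:duration}" }
    tag_on_failure => ["_grokparsefailure"]  # default value
    timeout_millis => 30000                  # default value
  }
  # Events that no pattern matched carry the failure tag
  if "_grokparsefailure" in [tags] {
    mutate { add_field => { "parse_status" => "failed" } }
  }
}
```

Routing on the tag keeps unparsed lines visible instead of letting them pass through silently.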

Common options (not repeated for the filters that follow)

add_field
Example:

filter {
  grok {
    add_field => { "foo_%{somefield}" => "Hello world, from %{host}" }
  }
}

Very handy, and every filter supports it.

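add_tag
Adds one or more tags to the event.
I wanted to find out how this differs from an add_field with an empty value.

Result
The difference is significant: add_tag appends values to the existing tags field and never creates a new field, whereas add_field creates a new field.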
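A sketch that puts the two side by side. The tag and field names are illustrative; %{host} is expanded from the event, as in the add_field example above.

```
filter {
  mutate {
    # Appended to the existing "tags" array
    add_tag => [ "parsed", "from_%{host}" ]
    # Creates a new top-level field
    add_field => { "pipeline_stage" => "filter" }
  }
}
```

For an event whose host is web01, tags should gain "parsed" and "from_web01", and a separate pipeline_stage field should appear.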

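enable_metric
Enables or disables metric logging for this plugin instance. By default Logstash records metrics for every plugin, but you can switch it off for the ones you do not need.
id
A string that identifies this plugin instance in the logs. If you do not set one, Logstash generates one for you. If you use the same type of filter more than once, always set an id so the instances can be told apart.
periodic_flush
Calls the filter's flush method at regular intervals. I do not yet know whether it matters in practice, and it is probably harmless; if events show up late, try setting it to true. The default is false.
remove_field
Usage is straightforward.

You can also remove multiple fields at once: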
filter {
 grok {
   remove_field => [ "foo_%{somefield}", "my_extraneous_field" ]
 }
}

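As the name says, it removes the listed fields.

remove_tag
Same syntax as above, and it does what the name says.

Note that some of these options only take effect when the filter itself succeeds. If one of them seems to do nothing, check the main body of the filter first (some options apply only when the filter actually matched).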

2.Aggregate

Description

The aim of this filter is to aggregate information available among several events (typically log lines) belonging to a same task, and finally push aggregated information into final task event.

You should be very careful to set Logstash filter workers to 1 (-w 1 flag) for this filter to work correctly otherwise events may be processed out of sequence and unexpected results will occur.

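This filter merges data from several events into one. Its code option lets you run custom logic, for example adding to or subtracting from an integer value, and the result is attached to a final event that then continues to the output stage.

To use it, however, you must set the number of Logstash filter workers to 1 (the -w 1 flag) so that events are processed in order. Otherwise you will lose hair over the results.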
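To make this concrete, here is a sketch modeled on the first example in the aggregate plugin's documentation. It assumes log lines shaped like "INFO - 12345 - SQL - sqlQuery1 - 12", and the field names taskid, logger, duration and sql_duration are illustrative. The code option holds Ruby code; map is the state shared by all events with the same task_id.

```
filter {
  grok {
    match => [ "message", "%{LOGLEVEL:loglevel} - %{NOTSPACE:taskid} - %{NOTSPACE:logger} - %{WORD:label}( - %{INT:duration:int})?" ]
  }
  if [logger] == "TASK_START" {
    aggregate {
      task_id => "%{taskid}"
      code => "map['sql_duration'] = 0"      # initialise the counter
      map_action => "create"
    }
  }
  if [logger] == "SQL" {
    aggregate {
      task_id => "%{taskid}"
      code => "map['sql_duration'] += event.get('duration')"
      map_action => "update"
    }
  }
  if [logger] == "TASK_END" {
    aggregate {
      task_id => "%{taskid}"
      code => "event.set('sql_duration', map['sql_duration'])"
      map_action => "update"
      end_of_task => true                    # discard the map afterwards
      timeout => 120
    }
  }
}
```

The TASK_END event should leave the filter carrying the summed sql_duration. This only works if events arrive in order, which is why a single worker is required.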

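Overall this is a confusing filter. Where possible, aggregate at the source or in Kibana instead; using this filter is extremely cumbersome.

I decline to translate the rest of its documentation.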

3.Mutate

English documentation

Description

The mutate filter allows you to perform general mutations on fields. You can rename, remove, replace, and modify fields in your events.

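In other words, this is the "transform" filter: it makes general changes to the fields of the events you receive.

If you read the grok section carefully, you will have noticed that grok can also remove fields. Many filters do in fact offer overlapping features, but I think configuration stays cleaner when each operation is done with the filter dedicated to it.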

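convert
Converts a field's value to another type, for example a string field to an integer field. If the field is an array, every element is converted. If the field is a hash (key-value pairs), nothing happens.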

Example:

filter {
 mutate {
   convert => { "fieldname" => "integer" }
 }
}

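Conversions involving true and false follow a few specific rules; for the details, follow the English documentation link under the section title. (Yes, go and read it yourself.)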

copy
Value type is hash. Copies an existing field to another field; if the destination field already exists, it is overwritten.

Example:

filter {
 mutate {
    copy => { "source_field" => "dest_field" }
 }
}

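gsub
Value type is array. Applies a regular expression to a field that is a string or an array of strings, and replaces every match with a new string. Each rule takes three array elements: field name, regular expression, replacement.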

Example:

filter {
 mutate {
   gsub => [
     # replace all forward slashes with underscore
     "fieldname", "/", "_",
     # replace backslashes, question marks, hashes, and minuses
     # with a dot "."
     "fieldname2", "[\\?#-]", "."
   ]
 }
}

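You will notice the doubled backslash. That is not a mistake: every backslash in the regular expression has to be escaped with another backslash in the config file.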

join
Joins the elements of an array into a single string using a separator.

Example:

filter {
 mutate {
   join => { "fieldname" => "," }
 }
}

lowercase
Converts the listed string fields to lowercase.

Example:

filter {
 mutate {
   lowercase => [ "fieldname" ]
 }
}

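merge
Value type is hash.
An unusual one: it merges the contents of two fields.
- An array of strings and a string can be merged.
- A string and a string can be merged; the result is an array of two strings.
- An array and a hash cannot be merged.

Two hashes can also be merged.
Experiment with it yourself.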

Example:

filter {
 mutate {
    merge => { "dest_field" => "added_field" }
 }
}

rename
Renames one or more fields. Value type is hash, written in {} as old name => new name.
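For completeness, a minimal rename sketch; the field names are placeholders.

```
filter {
  mutate {
    # old name => new name
    rename => { "HOSTORIP" => "client_ip" }
  }
}
```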
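replace
Replaces the content of a field with a new value. Used fairly often.
When the format cannot easily be changed at the data source, you can fix it here. Grok would also work; the choice is yours.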

Example:

filter {
 mutate {
   replace => { "message" => "%{source_host}: My new message" }
 }
}

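split
If you have a string field whose parts are separated by some character, split turns it into an array.
For example, splitting a string on commas:

filter {
 mutate {
   split => { "fieldname" => "," }
 }
}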

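strip
Value type is array (of field names). Strips whitespace from those fields; only leading and trailing whitespace is removed.
update
Updates the value of a field. In my tests it behaved the same as replace on a field that already exists. When both appear in the same mutate block, the final value is the one from replace, because mutate applies update before replace.
Unlike replace, it only acts on fields that already exist.
uppercase
Converts the listed string fields to uppercase.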
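A combined sketch of the last three options. All field names and the replacement text are placeholders.

```
filter {
  mutate {
    # Trim leading and trailing whitespace
    strip => [ "field1", "field2" ]
    # Only takes effect if "sample" already exists on the event
    update => { "sample" => "My new message" }
    uppercase => [ "fieldname" ]
  }
}
```

Keep in mind that mutate runs its operations in a fixed internal order, not in the order they are written; use separate mutate blocks when the sequence matters.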

4.Date

Description

The date filter is used for parsing dates from fields, and then using that date or timestamp as the logstash timestamp for the event.
For example, syslog events usually have timestamps like this:

"Apr 17 09:32:01"

You would use the date format

MMM dd HH:mm:ss

to parse this.
The date filter is especially important for sorting events and for backfilling old data. If you don’t get the date correct in your event, then searching for them later will likely sort out of order.
In the absence of this filter, logstash will choose a timestamp based on the first time it sees the event (at input time), if the timestamp is not already set in the event. For example, with file input, the timestamp is set to the time of each read.

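In short, the date filter parses a date out of a field and uses it as the event's Logstash timestamp.
It matters most for sorting events and for backfilling old data: if the date is not parsed correctly, later searches will probably return events out of order.
Without this filter, and if the event does not already carry a timestamp, Logstash stamps the event with the time it first saw it; with file input, for example, that is the time each line was read.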

Date Filter Configuration Options

locale
match
tag_on_failure
target
timezone
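A sketch showing the five options together for the syslog timestamp format discussed above. It assumes the date text sits in a field named timestamp (for example, extracted earlier by grok) and that the logs were written in the Asia/Shanghai time zone; both are assumptions. The target and tag_on_failure values shown are the defaults.

```
filter {
  date {
    # Field to parse, followed by one or more candidate formats
    match => [ "timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
    locale => "en"                  # month names are English
    timezone => "Asia/Shanghai"     # zone of the source timestamps
    target => "@timestamp"          # default: overwrite the event timestamp
    tag_on_failure => [ "_dateparsefailure" ]  # default
  }
}
```

Listing two formats covers both single-digit days padded with an extra space and two-digit days.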
