linux basic knowledge

Linux -- 正则表达式

2019-10-05  本文已影响0人  生信摆渡

《Linux命令行与shell脚本编程大全》,4 E -- Chapter 20

一、 什么是正则表达式

1. 定义

$ ls -al da*
-rw-r--r-- 1 rich rich 45 Nov 26 12:42 data
-rw-r--r-- 1 rich rich 25 Dec 4 12:40 data.tst
-rw-r--r-- 1 rich rich 180 Nov 26 12:42 data1
-rw-r--r-- 1 rich rich 45 Nov 26 12:44 data2
-rw-r--r-- 1 rich rich 73 Nov 27 12:31 data3
-rw-r--r-- 1 rich rich 79 Nov 28 14:01 data4
-rw-r--r-- 1 rich rich 187 Dec 4 09:45 datatest

2. 正则表达式的类型

二、 定义 BRE 模式

1. 纯文本

$ echo "This is a test" | sed -n '/test/p'
This is a test
$·· echo "This is a test" | sed -n '/trial/p'
$ echo "This is a test" | gawk '/test/{print $0}'
This is a test
$ echo "This is a test" | gawk '/trial/{print $0}
$ echo "This is a test" | sed -n '/this/p'
$ echo "This is a test" | sed -n '/This/p'
This is a test
$ echo "The books are expensive" | sed -n '/book/p'
The books are expensive
$ cat data1
This is a normal line of text.
This is  a line with too many spaces.
$ sed -n '/  /p' data1
This is  a line with too many spaces.

2. 特殊字符

$ cat data2
The cost is $4.00
$ sed -n '/\$/p' data2
The cost is $4.00
$ echo "\ is a special character" | sed -n '/\\\/p'
\ is a special character
$ echo "3 / 2" | sed -n '///p'
sed: -e expression #1, char 3: unknown command: `/'
$ echo "3 / 2" | sed -n '/\//p'
3 / 2

3. 锚字符

3.1 锁定在行首

$ echo "The book store" | sed -n '/^ book/p'
$ echo "Books are great" | sed -n '/^ Book/p'
Books are great
$ cat data3
This is a test line.
this is another test line.
A line that tests this feature.
Yet more testing of this
$ sed -n '/^ this/p' data3
this is another test line.
$ echo "This ^ is a test" | sed -n '/s ^ /p'
This ^ is a test

说明 :如果指定正则表达式模式时只用了脱字符,就不需要用反斜线来转义。但如果你在模式中先指定了脱字符,随后还有其他一些文本,那么你必须在脱字符前用转义字符。


3.2 锁定在行尾

echo "This is a good book" | sed -n '/book$/p'
This is a good book
echo "This book is good" | sed -n '/book$/p'

3.3 组合锚点

​ 一些常见情况下,可以在同一行中将行首锚点和行尾锚点组合在一起使用。

cat data4
this is a test of using both anchors
I said this is a test
this is a test
I'm sure this is a test.
$ sed -n '/^this is a test$/p' data4
this is a test
$ cat data5
This is one test line.
This is one test line.
This is another test line.
sed '/^ $/d' data5
This is one test line.
This is another test line.

​ 定义的正则表达式模式会查找行首和行尾之间什么都没有的那些行。由于空白行在两个换行符之间没有文本,刚好匹配了正则表达式模式。sed编辑器用删除命令 d 来删除匹配该正则表达式模式的行,因此删除了文本中的所有空白行。这是从文档中删除空白行的有效方法。

3.4 点号字符

$ cat data6
This is a test of a line.
The cat is sleeping.
That is a very nice hat.
This test is at line four.
at ten o'clock we'll go home.
$ sed -n '/.at/p' data6
The cat is sleeping.
That is a very nice hat.
This test is at line four.

​ 行有点复杂。注意,我们匹配了 at ,但在 at 前面并没有任何字符来匹配点号字符。其实是有的!在正则表达式中,空格也是字符,因此 at 前面的空格刚好匹配了该模式。第五行证明了这点,将 at 放在行首就不会匹配该模式了。

3.4 字符组

$ cat data6
This is a test of a line.
The cat is sleeping.
That is a very nice hat.
This test is at line four.
at ten o'clock we'll go home.
$ sed -n '/[ch]at/p' data6
The cat is sleeping.
That is a very nice hat.

​ 首先筛选出了含有at的行,但不包含at ten o'clock we'll go home.这一行,因为这一行的at前面没有字符,而该模式规定除了匹配at字符之外,还要匹配s字符或h字符,因此at ten o'clock we'll go home.淘汰,The cat is sleeping.That is a very nice hat.通过。

$ echo "Yes" | sed -n '/[Yy]es/p'
Yes
$ echo "yes" | sed -n '/[Yy]es/p'
yes
$ echo "Yes" | sed -n '/[Yy][Ee][Ss]/p'
Yes
$ echo "yEs" | sed -n '/[Yy][Ee][Ss]/p'
yEs
$ echo "yeS" | sed -n '/[Yy][Ee][Ss]/p'
yeS
$ echo "Yes" | sed -n '/[Yy]es/p'
Yes
$ echo "yes" | sed -n '/[Yy]es/p'
yes
$ cat data7
This line doesn't contain a number.
This line has 1 number on it.
This line a number 2 on it.
This line has a number 4 on it.
$ sed -n '/[0123]/p' data7
This line has 1 number on it.
This line a number 2 on it.

​ 这个正则表达式模式匹配了任意含有数字0、1、2或3的行。含有其他数字以及不含有数字的行都会被忽略掉。

$ cat data8
60633
46201
223001
4353
22203
$ sed -n '/^[0123456789][0123456789][0123456789][0123456789][0123456789]$/p' data8
60633
46201
22203
$ cat data9
I need to have some maintenence done on my car.
I'll pay that in a seperate invoice.
After I pay for the maintenance my car will be as good as new.
#换行能不能代替为空格?
$ sed -n '/maint[ea]n[ae]nce/p/sep[ea]r[ea]te/p' data9
I need to have some maintenence done on my car.
I'll pay that in a seperate invoice.
After I pay for the maintenance my car will be as good as new.

​ 两个 sed 打印命令利用正则表达式字符组来帮助找到文本中拼错的单词maintenance和separate。同样的,正则表达式模式也能匹配正确拼写的maintenance。

3.5 排除型字符组

$ cat data6
This is a test of a line.
The cat is sleeping.
That is a very nice hat.
This test is at line four.
at ten o'clock we'll go home.
This test is at line four.
$ sed -n '/[^ ch]at/p' data6

​ 首先筛选出含有at的行,之后在这些行必须还要满足条件:at的前面没有(空格)或ch,所以三者均被排除。

3.5 区间

​ 之前演示邮编的例子的时候,必须在每个字符组中列出所有可能的数字,这实在有点麻烦。好在有一种便捷的方法可以让人免受这番劳苦。可以用单破折线符号在字符组中表示字符区间。只需要指定区间的第一个字符单破折线以及区间的最后一个字符就行了。

$ sed -n '/^[0-9][0-9][0-9][0-9][0-9]$/p' data8
60633
46201
45902
$ sed -n '/[c-h]at/p' data6
The cat is sleeping.
That is a very nice hat.

​ 新的模式 [c-h]at 匹配了首字母在字母c和字母h之间的单词。这种情况下,只含有单词 at的行将无法匹配该模式。

$ sed -n '/[a-ch-m]at/p' data6
The cat is sleeping.
That is a very nice hat.

​ 该字符组允许区间ac、hm中的字母出现在 at 文本前,但不允许出现d~g的字母。

$ echo "I'm getting too fat." | sed -n '/[a-ch-m]at/p'

​ 该模式不匹配 fat 文本,因为它没在指定的区间。

3.6 特殊的字符组

$ echo "abc" | sed -n '/[[:digit:]]/p'
$ echo "abc" | sed -n '/[[:alpha:]]/p'
abc
$ echo "abc123" | sed -n '/[[:digit:]]/p'
abc123
$ echo "This is, a test" | sed -n '/[[:punct:]]/p'
This is, a test
$ echo "This is a test" | sed -n '/[[:punct:]]/p'

3.7 星号

$ echo "ik" | sed -n '/ie*k/p'
ik
$ echo "iek" | sed -n '/ie*k/p'
iek
$ echo "ieek" | sed -n '/ie*k/p'
ieek
$ echo "ieeek" | sed -n '/ie*k/p'
ieeek
$ echo "I'm getting a color TV" | sed -n '/colou*r/p'
I'm getting a color TV
$ echo "I'm getting a colour TV" | sed -n '/colou*r/p'
I'm getting a colour TV

​ 模式中的 u* 表明字母u可能出现或不出现在匹配模式的文本中。

$ echo "I ate a potatoe with my lunch." | sed -n '/potatoe*/p'
I ate a potatoe with my lunch.
echo "I ate a potato with my lunch." | sed -n '/potatoe*/p'
I ate a potato with my lunch.

​ 在可能出现的额外字母后面放个星号将允许接受拼错的单词。

$ echo "this is a regular pattern expression" | sed -n '/regular.*expression/p'
this is a regular pattern expression

​ 可以使用这个模式轻松查找可能出现在数据流中文本行内任意位置的多个单词。

$ echo "bt" | sed -n '/b[ae]*t/p'
bt
$ echo "bat" | sed -n '/b[ae]*t/p'
bat
$ echo "bet" | sed -n '/b[ae]*t/p'
bet
$ echo "btt" | sed -n '/b[ae]*t/p'
btt
$ echo "baat" | sed -n '/b[ae]*t/p'
baat
$ echo "baaeeet" | sed -n '/b[ae]*t/p'
baaeeet
$ echo "baeeaeeat" | sed -n '/b[ae]*t/p'
baeeaeeat
$ echo "baakeeet" | sed -n '/b[ae]*t/p'

​ 只要a和e字符以任何组合形式出现在 b 和 t 字符之间(就算完全不出现也行),模式就能够匹配。如果出现了字符组之外的字符,该模式匹配就会不成立。

三、扩展正则表达式

​ POSIX ERE模式包括了一些可供Linux应用和工具使用的额外符号。gawk程序能够识别ERE模式,但sed编辑器不能。


警告:sed编辑器和gawk程序的正则表达式引擎之间是有区别的。gawk程序可以使用大多数扩展正则表达式模式符号,并且能提供一些额外过滤功能,而这些功能都是sed编辑器所不具备的。但正因为如此,gawk程序在处理数据流时通常才比较慢。


​ 本节将介绍可用在gawk程序脚本中的较常见的ERE模式符号。

1. 问号

$ echo "bt" | gawk '/be?t/{print $0}'
bt
$ echo "bet" | gawk '/be?t/{print $0}'
bet
$ echo "beet" | gawk '/be?t/{print $0}'
$ echo "beeet" | gawk '/be?t/{print $0}'
$ echo "bt" | gawk '/b[ae]?t/{print $0}'
bt
$ echo "bat" | gawk '/b[ae]?t/{print $0}'
bat
$ echo "bot" | gawk '/b[ae]?t/{print $0}'
$ echo "bet" | gawk '/b[ae]?t/{print $0}'
bet
$ echo "baet" | gawk '/b[ae]?t/{print $0}'
$ echo "beat" | gawk '/b[ae]?t/{print $0}'
$ echo "beet" | gawk '/b[ae]?t/{print $0}'

2. 加号

#e至少要出现一次
$echo "bt" | gawk '/be+t/{print $0}'
$ echo "bet" | gawk '/be+t/{print $0}'
beeet
$ echo "beet" | gawk '/be+t/{print $0}'
beet
#如果字符组中定义的任一字符出现了,文本就会匹配指定的模式。
$$ echo "beat" | gawk '/b[ae]+t/{print $0}'
beat
$ echo "beet" | gawk '/b[ae]+t/{print $0}'
beet
$ echo "beeat" | gawk '/b[ae]+t/{print $0}'
beeat

3.使用花括号


警告:默认情况下,gawk程序不会识别正则表达式间隔。必须指定gawk程序的 --re- interval命令行选项才能识别正则表达式间隔。


$ echo "bt" | gawk --re-interval '/be{1}t/{print $0}'
$ echo "bet" | gawk --re-interval '/be{1}t/{print $0}'
bet
$ echo "beet" | gawk --re-interval '/be{1}t/{print $0}'

​ 通过指定间隔为1,限定了该字符在匹配模式的字符串中出现的次数。如果该字符出现多次或不出现,模式匹配就不成立。

$ echo "bt" | gawk --re-interval '/be{1,2}t/{print $0}'
$ echo "bet" | gawk --re-interval '/be{1,2}t/{print $0}'
bet
$ echo "beet" | gawk --re-interval '/be{1,2}t/{print $0}'
beet
$ echo "beeet" | gawk --re-interval '/be{1,2}t/{print $0}'

​ 字符 e 可以出现1次或2次,这样模式就能匹配;否则,模式无法匹配。

$ echo "bt" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
$ echo "bat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
bat
$ echo "bet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
bet
$ echo "beat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
beat
$ echo "beet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
beet
$ echo "beeat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
beeat
$ echo "baeet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
baeet
$ echo "baeaet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
baeaet
$ echo "baeaeat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'

​如果字母a或e在文本模式中只出现了1~2次,则正则表达式模式匹配;否则,模式匹配失败。

4. 管道符号

$ echo "The cat is asleep" | gawk '/cat|dog/{print $0}'
The cat is asleep
$ echo "The dog is asleep" | gawk '/cat|dog/{print $0}'
The dog is asleep
$ echo "The sheep is asleep" | gawk '/cat|dog/{print $0}'

这个例子会在数据流中查找正则表达式 cat 或 dog 。正则表达式和管道符号之间不能有空格,否则它们也会被认为是正则表达式模式的一部分。

$ echo "He has a hat." | gawk '/[ch]at|dog/{print $0}'
He has a hat.

5. 表达式分组

$ echo "Sat" | gawk '/Sat(urday)?/{print $0}'
Sat
$ echo "Saturday" | gawk '/Sat(urday)?/{print $0}'
Saturday

​ 结尾的 urday 分组以及问号,使得模式能够匹配完整的 Saturday 或缩写 Sat 。

$ echo "cat" | gawk '/(c|b)a(b|t)/{print $0}'
cat
$ echo "cab" | gawk '/(c|b)a(b|t)/{print $0}'
cab
$ echo "bat" | gawk '/(c|b)a(b|t)/{print $0}'
bat
$ echo "bab" | gawk '/(c|b)a(b|t)/{print $0}'
bab
$ echo "tab" | gawk '/(c|b)a(b|t)/{print $0}'
$ echo "tac" | gawk '/(c|b)a(b|t)/{print $0}'

四、 实战

1. 目录文件计数

2. 验证电话号码

3. 解析邮件地址

上一篇下一篇

猜你喜欢

热点阅读