The Three Swordsmen of Linux Text Processing: grep

2018-05-30 Hye_Lau

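The three text-processing tools: grep, sed, awk

grep: a text filtering tool driven by a pattern
sed: a line editor that works with a pattern space and a hold space
awk: a report generator that produces formatted text output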

I. grep

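Purpose:

A text search tool: it checks the target text line by line against a user-specified "pattern" (the filter condition)
and prints the lines that match.

Pattern:

A filter condition written from regular-expression metacharacters and literal text characters.

Command syntax: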

grep [OPTIONS] PATTERN [FILE...]
grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]
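The second form of the synopsis can be sketched as follows. This is a minimal illustration, not a definitive recipe: the sample data and the variable names (sample, patterns, by_e, by_f) are made up for the example.

```shell
# Hypothetical sample data, created only for this illustration.
sample=$(mktemp)
patterns=$(mktemp)
printf '%s\n' 'apple pie' 'banana split' 'cherry tart' > "$sample"
printf '%s\n' 'apple' 'cherry' > "$patterns"

# -e may be repeated; a line is printed if it matches any of the patterns.
by_e=$(grep -e 'apple' -e 'cherry' "$sample")

# -f reads the patterns from a file, one pattern per line.
by_f=$(grep -f "$patterns" "$sample")

printf '%s\n' "$by_e"
rm -f "$sample" "$patterns"
```

Both commands select the same two lines here; -f is mainly convenient when the list of patterns is long or generated by another program.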

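Common options

--color=auto: highlight the matched text in color;
-i, --ignore-case: ignore the case of characters;
-o, --only-matching: print only the matched string itself, not the whole line;
-v, --invert-match: print the lines that the pattern does not match;
-E, --extended-regexp: interpret the pattern as an extended regular expression;
-q, --quiet, --silent: quiet mode; print nothing and report the result through the exit status only;
-A NUM: after; also print NUM lines after each matching line
-B NUM: before; also print NUM lines before each matching line
-C NUM: context; also print NUM lines before and after each matching line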
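To make the options above concrete, here is a small sketch run against a throwaway file. The file contents and variable names are assumptions chosen for the example, and GNU grep is assumed.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'alpha' 'Beta' 'gamma' 'delta' 'epsilon' > "$sample"

ignore_case=$(grep -i 'beta' "$sample")    # -i: 'beta' also matches 'Beta'
only_match=$(grep -o 'mm' "$sample")       # -o: just the matched text
inverted=$(grep -v 'a' "$sample")          # -v: lines that contain no 'a'
after=$(grep -A 1 'gamma' "$sample")       # the match plus 1 line after it
before=$(grep -B 1 'gamma' "$sample")      # the match plus 1 line before it
context=$(grep -C 1 'gamma' "$sample")     # the match plus 1 line on each side

# -q prints nothing; the exit status alone says whether a match exists.
if grep -q 'delta' "$sample"; then found=yes; else found=no; fi

printf '%s\n' "$ignore_case" "$only_match" "$inverted" "$context" "$found"
rm -f "$sample"
```

The -q form is the one usually seen in scripts, where only the yes/no answer matters and the matching text itself is not needed.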

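II. Basic regular expression metacharacters

1. Character matching

.: matches any single character;
[]: matches any single character inside the specified set;
[^]: matches any single character outside the specified set;

Inside [], a set can also be named with one of the following character classes; the class goes inside the brackets, as in [[:digit:]]:

[:digit:]: all digits
[:lower:]: all lowercase letters
[:upper:]: all uppercase letters
[:alpha:]: all letters, both uppercase and lowercase
[:alnum:]: all letters and digits
[:punct:]: all punctuation characters
[:space:]: all whitespace characters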
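A short sketch of the character classes in use. Note the double brackets: the outer pair is the bracket expression and the inner [:name:] is the class. The sample lines and variable names are invented for the illustration.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'abc' 'ABC' 'a1b2' '   ' 'hi!' > "$sample"

digits=$(grep '[[:digit:]]' "$sample")            # lines containing a digit
upper=$(grep '[[:upper:]]' "$sample")             # lines containing an uppercase letter
punct=$(grep '[[:punct:]]' "$sample")             # lines containing punctuation
no_alpha=$(grep -c '^[^[:alpha:]]*$' "$sample")   # count of lines with no letters at all

printf '%s\n' "$digits" "$upper" "$punct" "$no_alpha"
rm -f "$sample"
```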

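2. Repetition

A quantifier is written after the character whose number of occurrences is being specified, and limits how many times that preceding character may appear;
matching is greedy by default.

*: matches the preceding character any number of times: 0, 1, or more;
     e.g. grep "x*y": abxy, aby, xxxxy, and yab all match
.*: matches any characters, of any length
\?: matches the preceding character 0 or 1 times; i.e. the preceding character is optional;
\+: matches the preceding character 1 or more times; i.e. the preceding character must appear at least once;
\{m\}: matches the preceding character exactly m times;
\{m,n\}: matches the preceding character at least m and at most n times;
\{0,n\}: at most n times;
\{m,\}: at least m times;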
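The quantifiers above can be tried out side by side. The sketch below assumes GNU grep (\? and \+ are GNU extensions to basic regular expressions) and uses made-up sample words; the patterns are anchored with ^ and $ so that each line must match as a whole.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'color' 'colour' 'colouur' 'gg' 'gog' 'goog' 'gooog' > "$sample"

optional=$(grep '^colou\?r$' "$sample")    # u is optional
star=$(grep '^go*g$' "$sample")            # zero or more o
one_plus=$(grep '^go\+g$' "$sample")       # one or more o
exact=$(grep '^go\{2\}g$' "$sample")       # exactly two o
range=$(grep '^go\{1,2\}g$' "$sample")     # one or two o

printf '%s\n' "$optional" "$star" "$one_plus" "$exact" "$range"
rm -f "$sample"
```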

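3. Position anchors

^: start-of-line anchor; placed at the far left of the pattern;
$: end-of-line anchor; placed at the far right of the pattern;
^PATTERN$: PATTERN must match the entire line;
       ^$: an empty line;
       ^[[:space:]]*$: an empty line, or a line containing only whitespace;

Word: a continuous run of word characters (letters, digits, and underscores in GNU grep) is called a word;
\< or \b: start-of-word anchor, placed on the left side of a word pattern;
\> or \b: end-of-word anchor, placed on the right side of a word pattern;
\<PATTERN\>: matches a whole word;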
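The difference between matching a substring, a line start, and a whole word is easiest to see on a few lines of test data. This is a sketch assuming GNU grep (for \< and \>); the sample lines and variable names are invented.

```shell
# Hypothetical sample file: the last two lines are empty and whitespace-only.
sample=$(mktemp)
printf '%s\n' 'the end' 'other things' 'bathe' 'then' '' '  ' > "$sample"

substring=$(grep -c 'the' "$sample")          # 'the' anywhere in the line
line_start=$(grep '^the' "$sample")           # 'the' at the start of the line
whole_word=$(grep '\<the\>' "$sample")        # 'the' as a complete word
word_end=$(grep 'the\>' "$sample")            # a word ending in 'the'
empty=$(grep -c '^$' "$sample")               # truly empty lines
blank=$(grep -c '^[[:space:]]*$' "$sample")   # empty or whitespace-only lines

printf '%s\n' "$substring" "$line_start" "$whole_word" "$word_end" "$empty" "$blank"
rm -f "$sample"
```

GNU grep also offers -w, which has much the same effect as wrapping the pattern in \< and \>.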

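4. Grouping and references

\(\): binds one or more characters together so that they are handled as a single unit;
e.g. \(xy\)*ab means the unit xy may appear any number of times
Note:
The content matched by the pattern inside each pair of group parentheses is recorded automatically by the regular expression engine in internal variables, which are:

 \1: the characters matched by the pattern between the first left parenthesis, counting from the left, and its matching right parenthesis;
 \2: the characters matched by the pattern between the second left parenthesis, counting from the left, and its matching right parenthesis;
 \3:
     ....

Back-reference: refers to the characters matched by the pattern in an earlier group, not to the pattern itself;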
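The point that a back-reference repeats the matched text, not the pattern, is worth a quick check. In the sketch below (sample sentences and variable names are assumptions for the example), \(l..e\) can match either "love" or "like", but \1 must then repeat whichever one was actually matched.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'ab' 'xyab' 'xyxyab' 'xab' \
  'He loves his lover' 'He likes his lover' \
  'She likes her liker' 'She loves her liker' > "$sample"

# The group \(xy\) is repeated as a unit: zero or more copies of "xy", then "ab".
group_repeat=$(grep '^\(xy\)*ab$' "$sample")

# \1 must equal the text the first group matched, so love...love and
# like...like are accepted, while like...love and love...like are not.
backref=$(grep '\(l..e\).*\1' "$sample")

printf '%s\n' "$group_repeat" "$backref"
rm -f "$sample"
```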

Examples

1. Finding a specific string

// (1) Get the lines containing the string 'the' from the file ~/scripts/regular_express.txt
[root@localhost ~]# grep -n 'the' ~/scripts/regular_express.txt 
8:I can't finish the test.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
18:google is the best tools for search keyword.

// (2) Invert the match: print the lines that do not contain the string 'the', i.e. every line other than 8/12/15/16/18
[root@localhost ~]# grep -vn 'the' ~/scripts/regular_express.txt 
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
4:this dress doesn't fit me.
5:However, this dress is about $ 3183 dollars.
6:GNU is free air not free beer.
7:Her hair is very beauty.
9:Oh! The soup taste good.
10:motorcycle is cheap than car.
11:This window is clear.
13:Oh!  My god!
14:The gd software is a library for drafting programs.
17:I like dog.
19:goooooogle yes!
20:go! go! Let's go.
21:# I am VBird

// (3) Match 'the' without regard to case
[root@localhost ~]# grep -in 'the' ~/scripts/regular_express.txt 
8:I can't finish the test.
9:Oh! The soup taste good.
12:the symbol '*' is represented as start.
14:The gd software is a library for drafting programs.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
18:google is the best tools for search keyword.

2. Using brackets [ ] to search for a set of characters

// (1) Find the words test and taste, which share the shape 't?st'
[root@localhost ~]# grep -n 't[ae]st' ~/scripts/regular_express.txt 
8:I can't finish the test.
9:Oh! The soup taste good.

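// (2) Use the negated set [^] to find oo preceded by a character other than g
// (lines 18 and 19 still match: "tools" has oo after t, and the long run of o's has oo after another o)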
[root@localhost ~]# grep -n '[^g]oo' ~/scripts/regular_express.txt 
2:apple is my favorite food.
3:Football game is not use feet only.
18:google is the best tools for search keyword.
19:goooooogle yes!

// (3) Find oo preceded by a character that is not a lowercase letter
[root@localhost ~]# grep -n '[^a-z]oo' ~/scripts/regular_express.txt 
3:Football game is not use feet only.
or
[root@localhost ~]# grep -n '[^[:lower:]]oo' ~/scripts/regular_express.txt 
3:Football game is not use feet only.

// (4) Print the lines that contain a digit
[root@localhost ~]# grep -n '[0-9]' ~/scripts/regular_express.txt 
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.
or
[root@localhost ~]# grep -n '[[:digit:]]' ~/scripts/regular_express.txt 
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.

3. The line-start character ^ and the line-end character $

// (1) Find the lines that begin with the
[root@localhost ~]# grep -n '^the' ~/scripts/regular_express.txt 
12:the symbol '*' is represented as start.

// (2) Find the lines that begin with a lowercase letter
[root@localhost ~]# grep -n '^[a-z]' ~/scripts/regular_express.txt 
2:apple is my favorite food.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
12:the symbol '*' is represented as start.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.
or
[root@localhost ~]# grep -n '^[[:lower:]]' ~/scripts/regular_express.txt 
2:apple is my favorite food.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
12:the symbol '*' is represented as start.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.

// (3) Find empty lines
grep -n '^$' ~/scripts/regular_express.txt
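A common use of '^$' is stripping empty lines and comments from a configuration file. The sketch below uses a made-up file (the settings and variable names are purely illustrative) rather than the regular_express.txt sample.

```shell
# Hypothetical configuration file, created only for this illustration.
conf=$(mktemp)
printf '%s\n' '# main settings' '' 'port=22' '' '# end' 'user=root' > "$conf"

# With -n, each empty line is reported as its line number followed by a colon.
blank_lines=$(grep -n '^$' "$conf")

# Drop empty lines first, then lines that start with '#'.
active=$(grep -v '^$' "$conf" | grep -v '^#')

printf '%s\n' "$blank_lines" "$active"
rm -f "$conf"
```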

4. The any-single-character . and the repetition character *

// (1) . stands for exactly one arbitrary character
[root@localhost ~]# grep -n 'g..d' ~/scripts/regular_express.txt 
1:"Open Source" is a good mechanism to develop programs.
9:Oh! The soup taste good.
16:The world <Happy> is the same with "glad".

// (2) * means zero or more repetitions of the preceding character, so 'ooo*' matches two or more consecutive o's
[root@localhost ~]# grep -n 'ooo*' ~/scripts/regular_express.txt 
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

5. Limiting the repetition range with {}

// (1) Find the strings containing two consecutive o's
[root@localhost ~]# grep -n 'o\{2\}' ~/scripts/regular_express.txt 
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

// (2) Find the strings in which g is followed by 2 to 5 o's and then another g
[root@localhost ~]# grep -n 'go\{2,5\}g' ~/scripts/regular_express.txt 
18:google is the best tools for search keyword.

// (3) Find gooo...g with 2 or more o's
[root@localhost ~]# grep -n 'go\{2,\}g' ~/scripts/regular_express.txt 
18:google is the best tools for search keyword.
19:goooooogle yes!

Exercises

1. Display the lines in /etc/passwd that do not end with /bin/bash;

[root@localhost ~]# grep -v "/bin/bash$" /etc/passwd
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
......
slackware:x:2002:2018::/home/slackware:/bin/tcsh

2. Find the two-digit or three-digit numbers in /etc/passwd;

[root@localhost ~]# grep "\<[0-9]\{2,3\}\>" /etc/passwd
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
......
basher:x:502:502::/home/basher:/bin/bash

3. Find the lines in /etc/rc.d/rc.sysinit that begin with at least one whitespace character followed by a non-whitespace character;

grep "^[[:space:]]\+[^[:space:]]" /etc/rc.d/rc.sysinit
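Since /etc/rc.d/rc.sysinit does not exist on every system, the same pattern can be tried on a stand-in file. This is only a sketch: the script contents below are invented, and GNU grep is assumed for \+.

```shell
# Hypothetical stand-in for an init script, created only for this illustration.
script=$(mktemp)
printf '%s\n' 'start() {' '    echo starting' '}' '   ' ' x=1' > "$script"

# Leading whitespace (one or more), then a non-whitespace character.
indented=$(grep '^[[:space:]]\+[^[:space:]]' "$script")
count=$(grep -c '^[[:space:]]\+[^[:space:]]' "$script")

printf '%s\n' "$indented" "$count"
rm -f "$script"
```

The whitespace-only line is not selected, because nothing non-blank follows its leading whitespace.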

4. In the output of the "netstat -tan" command, find the lines that end with "LISTEN" followed by zero, one, or more whitespace characters;

[root@localhost ~]# netstat -tan | grep "LISTEN[[:space:]]*$"
tcp        0      0 0.0.0.0:43150               0.0.0.0:*                   LISTEN      
tcp        0      0 0.0.0.0:111                 0.0.0.0:*                   LISTEN      
tcp        0      0 0.0.0.0:22                  0.0.0.0:*                   LISTEN      
tcp        0      0 127.0.0.1:631               0.0.0.0:*                   LISTEN      
tcp        0      0 127.0.0.1:25                0.0.0.0:*                   LISTEN      
tcp        0      0 :::111                      :::*                        LISTEN      
tcp        0      0 :::60309                    :::*                        LISTEN      
tcp        0      0 :::22                       :::*                        LISTEN      
tcp        0      0 ::1:631                     :::*                        LISTEN      
tcp        0      0 ::1:25                      :::*                        LISTEN

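III. egrep:

Provides the same text filtering as grep, with support for extended regular expressions;
equivalent to grep -E, which is the form the current GNU grep documentation recommends over the egrep name

Command syntax and common options

egrep [OPTIONS] PATTERN [FILE...]
OPTIONS: -i, -o, -v, -A, -B, -C
 -G: use basic regular expressions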
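As a quick sanity check, the same search can be written in both syntaxes. The sketch below (sample words and variable names are assumptions) uses grep -E rather than egrep and shows that -G is simply grep's default mode.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'gg' 'gog' 'goog' 'gooog' > "$sample"

bre=$(grep 'go\{1,2\}g' "$sample")           # default: basic syntax, braces escaped
explicit=$(grep -G 'go\{1,2\}g' "$sample")   # -G selects the same basic syntax
ere=$(grep -E 'go{1,2}g' "$sample")          # extended syntax, no backslashes

printf '%s\n' "$bre" "$explicit" "$ere"
rm -f "$sample"
```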

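IV. Extended regular expression metacharacters:

1. Character matching

.: any single character
[]: any single character inside the specified set
[^]: any single character outside the specified set

2. Repetition

*: any number of times: 0, 1, or more;
?: 0 or 1 times; the preceding character is optional;
+: the preceding character at least once;
{m}: the preceding character exactly m times;
{m,n}: at least m and at most n times;
{0,m}: at most m times;
{m,}: at least m times;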
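One practical consequence of the two syntaxes: in a basic regular expression an unescaped + is an ordinary character, while with -E it is a quantifier. A small sketch with invented sample lines, assuming GNU grep:

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'a+b' 'ab' 'aab' 'b' > "$sample"

basic_plus=$(grep 'a+b' "$sample")         # basic syntax: + is a literal plus sign
ext_plus=$(grep -E 'a+b' "$sample")        # extended: one or more a, then b
ext_optional=$(grep -E '^a?b$' "$sample")  # extended: optional a, then b
ext_min=$(grep -E '^a{2,}b$' "$sample")    # extended: at least two a, then b

printf '%s\n' "$basic_plus" "$ext_plus" "$ext_optional" "$ext_min"
rm -f "$sample"
```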

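3. Position anchors

^: start-of-line anchor; placed at the far left of the pattern;
$: end-of-line anchor; placed at the far right of the pattern;
\< or \b: start-of-word anchor, placed on the left side of a word pattern;
\> or \b: end-of-word anchor, placed on the right side of a word pattern;
\<PATTERN\>: matches a whole word;

4. Grouping and references

(): grouping;
the characters matched by the pattern inside the parentheses are recorded in the regular expression engine's internal variables;
back-references: \1, \2, ...
Alternation:

a|b: a or b;
C|cat: C or cat
(c|C)at: cat or Cat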
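The difference between C|cat and (c|C)at comes down to how far the | reaches. In the sketch below the -x option (match the whole line) is added only to make that difference visible; the sample words and variable names are assumptions for the example.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf '%s\n' 'cat' 'Cat' 'C' 'at' 'dog' > "$sample"

# Without parentheses, | splits the whole pattern: the line is "C" or "cat".
ungrouped=$(grep -Ex 'C|cat' "$sample")

# With parentheses, | only chooses the first letter: "cat" or "Cat".
grouped=$(grep -Ex '(c|C)at' "$sample")

printf '%s\n' "$ungrouped" "$grouped"
rm -f "$sample"
```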

Exercises:

1. Find all lines in /proc/meminfo that begin with an uppercase or lowercase s, in at least three different ways;

grep -i '^s' /proc/meminfo
grep '^[Ss]' /proc/meminfo
grep -E '^(S|s)' /proc/meminfo

[root@localhost ~]# grep -i '^s' /proc/meminfo
SwapCached:            0 kB
SwapTotal:       1023992 kB
SwapFree:        1023992 kB
Shmem:               236 kB
Slab:              61920 kB
SReclaimable:      31508 kB
SUnreclaim:        30412 kB

[root@localhost ~]# grep -E '^(S|s)' /proc/meminfo
SwapCached:            0 kB
SwapTotal:       1023992 kB
SwapFree:        1023992 kB
Shmem:               236 kB
Slab:              61924 kB
SReclaimable:      31504 kB
SUnreclaim:        30420 kB
[root@localhost ~]# grep  '^[Ss]' /proc/meminfo
SwapCached:            0 kB
SwapTotal:       1023992 kB
SwapFree:        1023992 kB
Shmem:               236 kB
Slab:              61904 kB
SReclaimable:      31504 kB
SUnreclaim:        30400 kB

Reference book
《鸟哥的Linux私房菜--基础学习篇》
