4-19-4 Regular Expressions in Linux --- Grouping

2022-08-26 捌千里路雲和月

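1. Grouping: treat the content to be matched as a single unit. Put simply, bracket a run of characters so that it is matched as a whole.

2. Groups can be nested.

3. Back-references: use \N (N is the group number) to refer to the text matched by an earlier group.

1. Grouping: treat the content to be matched as a single unit. Put simply, bracket a run of characters so that it is matched as a whole.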

① Suppose we want to match appleappleapple, that is, the word apple three times in a row. First create the test file test.txt.

[root@localhost ~]# vim test.txt 

appleappleapple
                                                                                           
~                                                                                                
~                                                                                                
~                                                                                                
:wq

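Key point 1: grep -w cannot pick a word out of a string that has no delimiters.

## grep -w cannot match the three apple words here because nothing separates them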
[root@localhost ~]# cat test.txt 
appleappleapple
[root@localhost ~]# grep -w "apple" test.txt 
[root@localhost ~]#

Only once the apples are separated can grep -w match them as words.

[root@localhost ~]# vim test.txt    ## edit test.txt

apple apple apple    ## apples separated by spaces
                                                                             
~                                                                                                
~                                                                                                
~                                                                                                
:wq    ## save

[root@localhost ~]# cat test.txt    ## view test.txt; the apples are now separated
apple apple apple
[root@localhost ~]# grep -w "apple" test.txt    ## -w matches apple as a word only when it is delimited
apple apple apple
[root@localhost ~]#
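
The same contrast can be reproduced without editing a file. This is a minimal sketch, assuming GNU grep (for the -w and -c options) and a POSIX shell; the variable names joined and spaced are illustrative only.

```shell
# -w accepts a match only when it is bounded by non-word characters
# or by the start/end of the line; -c prints the number of matching lines.
# "|| true" keeps the script going when grep finds nothing (exit status 1).
joined=$(printf 'appleappleapple\n' | grep -cw 'apple' || true)
spaced=$(printf 'apple apple apple\n' | grep -cw 'apple' || true)
echo "joined=$joined spaced=$spaced"
```

The first count is 0 because no apple in the joined string stands alone as a word.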

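Figure: grep -w

Key point 2: to match something repeated a given number of times, use \{n\} to control the count. To match the consecutive appleappleapple, one might write "apple\{3\}", intending apple three times in a row. In fact \{3\} applies only to the single character before it, so the pattern matches appl followed by three consecutive e characters.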

[root@localhost ~]# vim test.txt 

appleappleapple
appleee    ## new line: appleee

~
~
~
:wq    ## save


[root@localhost ~]# 
[root@localhost ~]# cat test.txt     
appleappleapple
appleee    ## <--- the new line
[root@localhost ~]# grep "apple\{3\}" test.txt    ## "apple\{3\}" actually repeats only the preceding character 3 times
appleee    ## the line with 3 consecutive e characters is matched
[root@localhost ~]#

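Key point 3: to match a whole word repeated a number of times, use a group. Enclose the characters in parentheses, as in (apple), so that they are treated as a unit (in basic regular expressions this is written \(apple\)). A repetition count placed after the group then applies to the whole enclosed string.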

[root@localhost ~]# cat test.txt 
appleappleapple

## \( \) are escaped parentheses; \{ \} are escaped braces
[root@localhost ~]# grep "\(apple\)\{1\}" test.txt    ## match 1 apple
appleappleapple
[root@localhost ~]# grep "\(apple\)\{2\}" test.txt    ## match 2 consecutive apples
appleappleapple
[root@localhost ~]# grep "\(apple\)\{3\}" test.txt    ## match 3 consecutive apples
appleappleapple
[root@localhost ~]# grep "\(apple\)\+" test.txt    ## match 1 or more apples
appleappleapple
[root@localhost ~]#
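
All three counts above print the line because the pattern is unanchored: a line with three consecutive apples also contains one and two. The following is a minimal sketch of requiring an exact count, assuming a grep that supports -x (whole-line match) and -c; the variable names are illustrative only.

```shell
line='appleappleapple'
# Unanchored: \{2\} is satisfied by any two consecutive apples in the line.
loose=$(printf '%s\n' "$line" | grep -c '\(apple\)\{2\}' || true)
# -x requires the pattern to match the entire line, so the count is exact.
exact2=$(printf '%s\n' "$line" | grep -cx '\(apple\)\{2\}' || true)
exact3=$(printf '%s\n' "$line" | grep -cx '\(apple\)\{3\}' || true)
echo "loose=$loose exact2=$exact2 exact3=$exact3"
```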

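Figure: filtering with the \(apple\) group

\( \) denotes a group: \(apple\) treats the string apple as a single unit, and \{3\} then counts consecutive occurrences of apple.
Note: in basic regular expressions, the parentheses ( ) and braces { } must be escaped with \.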
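
For comparison, here is a sketch of the same pattern in extended regular expressions, assuming grep -E is available: there the parentheses and braces are written without the backslash. The variable names are illustrative only.

```shell
data=$(printf 'appleappleapple\nappleee\n')
# Basic regular expression: group and interval are escaped.
bre=$(printf '%s\n' "$data" | grep '\(apple\)\{3\}')
# Extended regular expression (-E): same meaning, no backslashes.
ere=$(printf '%s\n' "$data" | grep -E '(apple){3}')
# Without a group, the interval applies to the letter e only.
single=$(printf '%s\n' "$data" | grep 'apple\{3\}')
echo "bre=$bre ere=$ere single=$single"
```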

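2. Groups can be nested. Put simply, a group can contain another group.

Add the word banana. Apart from the leading b and the trailing a, the middle part anan is a regular, repeating pattern, so it can be expressed with a group.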

[root@localhost ~]# vim test.txt    ## edit test.txt

appleappleapple
banana    ## new line

~
~
~
:wq    ## save

[root@localhost ~]#
[root@localhost ~]# grep "\(b\(an\)\{2\}a\)\{1\}" test.txt 
banana

Figure: nested groups, \(b\(an\)\{2\}a\)\{1\}

The same pattern applied to several consecutive bananas.

[root@localhost ~]# vim test.txt 

appleappleapple
bananabananabanana    ## new line
                                                                                       
~                                                                                                
~                                                                                                
~                                                                                                
:wq          

[root@localhost ~]# grep "\(b\(an\)\{2\}a\)\{3\}" test.txt 
bananabananabanana
[root@localhost ~]#
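
To see which part of the line a nested pattern actually consumes, one can print only the matched text. This is a sketch assuming GNU grep's -o option; the variable name matched is illustrative.

```shell
# Outer group = one banana, built from b + (an){2} + a; the outer \{2\}
# asks for two of them. -o prints only the matched part of the line.
matched=$(printf 'bananabananabanana\n' | grep -o '\(b\(an\)\{2\}a\)\{2\}')
echo "$matched"
```

Only the first two bananas are printed; the third one alone cannot satisfy the outer \{2\}.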

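Figure: several bananas

3. Back-references: use \N (N is the group number) to refer to the text matched by an earlier group.
③-① Suppose we have the text apple banana cherry damson apple banana cherry damson.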

[root@localhost ~]# vim test.txt 

apple banana cherry damson apple banana cherry damson
                                                                                              
~                                                                                                
~                                                                                                
~                                                                                                
:wq


[root@localhost ~]# cat test.txt 
apple banana cherry damson apple banana cherry damson
[root@localhost ~]#

To match everything from the first apple to the second, one option is grep "apple.*apple" test.txt.

[root@localhost ~]# grep "apple.*apple" test.txt 
apple banana cherry damson apple banana cherry damson
[root@localhost ~]#

Figure: grep "apple.*apple" test.txt

③-② Another way is a group plus a back-reference. A back-reference can only be used after a group has been defined.

[root@localhost ~]# grep "\(apple\).*\1" test.txt 
apple banana cherry damson apple banana cherry damson
[root@localhost ~]#

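Figure: back-reference

A back-reference refers to the result of an earlier group. In the expression \(apple\).*\1, apple is first enclosed in ( ) to form a group, a single unit (in basic regular expressions the parentheses must be escaped with \). .* stands for any run of characters, and \1 refers to the first group. Here the first group matched apple, so \1 is equivalent to apple.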
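
Here is a small sketch of the difference a back-reference makes, assuming a POSIX grep; the sample lines and variable names are made up for illustration. A line with a single apple contains the group but nothing for \1 to repeat, so it is not selected.

```shell
data=$(printf 'apple banana cherry\napple banana apple\n')
# \(apple\) captures "apple"; \1 must find the same text again later.
twice=$(printf '%s\n' "$data" | grep '\(apple\).*\1')
# Without the back-reference, both lines contain the group and match.
once=$(printf '%s\n' "$data" | grep -c '\(apple\)' || true)
echo "twice=$twice once=$once"
```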

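Figure: back-reference

4. Just as \1 refers to the first group of the expression, \2, \3, ... refer to the second, third, and subsequent groups.
④-① Put the test content into test.txt.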

[root@localhost ~]# vim test.txt 

apple or banana or cherry or damson: apple
apple or banana or cherry or damson: banana
apple or banana or cherry or damson: cherry
apple or banana or cherry or damson: damson                                                                             
~                                                                                                
~                                                                                                
~                                                                                                
:wq

[root@localhost ~]# cat test.txt 
apple or banana or cherry or damson: apple
apple or banana or cherry or damson: banana
apple or banana or cherry or damson: cherry
apple or banana or cherry or damson: damson
[root@localhost ~]#

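Figure: structure of the test data

④-② Group apple, banana, cherry, and damson in the first part of the line, then back-reference a different group in the last part to print the corresponding line. This shows how \1, \2, \3, and \4 each refer to the result of an earlier group.

[root@localhost ~]# cat test.txt    ## view test.txt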
apple or banana or cherry or damson: apple
apple or banana or cherry or damson: banana
apple or banana or cherry or damson: cherry
apple or banana or cherry or damson: damson

## Different back-references print the corresponding lines.
[root@localhost ~]# grep "\(apple\).*\(banana\).*\(cherry\).*\(damson\).*\4" test.txt 
apple or banana or cherry or damson: damson
[root@localhost ~]# grep "\(apple\).*\(banana\).*\(cherry\).*\(damson\).*\1" test.txt 
apple or banana or cherry or damson: apple
[root@localhost ~]# grep "\(apple\).*\(banana\).*\(cherry\).*\(damson\).*\2" test.txt 
apple or banana or cherry or damson: banana
[root@localhost ~]# grep "\(apple\).*\(banana\).*\(cherry\).*\(damson\).*\3" test.txt 
apple or banana or cherry or damson: cherry
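
The four commands differ only in the back-reference number, so they can be driven by a loop. This is a sketch assuming a POSIX shell and grep, plus mktemp for a temporary copy of the sample data; the file and variable names are illustrative.

```shell
tmp=$(mktemp)
for w in apple banana cherry damson; do
  printf 'apple or banana or cherry or damson: %s\n' "$w"
done > "$tmp"

result=""
for n in 1 2 3 4; do
  # Inside double quotes, \\$n expands to \1, \2, ... for grep.
  line=$(grep "\(apple\).*\(banana\).*\(cherry\).*\(damson\).*\\$n" "$tmp")
  # Keep only the text after the last ": " to show which line was selected.
  result="$result$n:${line##*: } "
done
rm -f "$tmp"
echo "$result"
```

Each back-reference number selects the line whose last word equals the text captured by that group.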

Analysis of the output:

Figure: analysis of grep "\(apple\).*\(banana\).*\(cherry\).*\(damson\).*\1" test.txt

Figure: analysis of grep "\(apple\).*\(banana\).*\(cherry\).*\(damson\).*\2" test.txt

Figure: analysis of grep "\(apple\).*\(banana\).*\(cherry\).*\(damson\).*\3" test.txt

Figure: analysis of grep "\(apple\).*\(banana\).*\(cherry\).*\(damson\).*\4" test.txt

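④-③ Matching the first word against the last word. Suppose we need to print lines whose first and last words are the same, using the same test content. When the first word is known, write it literally in the expression as a group and then back-reference that group: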

[root@localhost ~]# cat test.txt 
apple or banana or cherry or damson: apple
apple or banana or cherry or damson: banana
apple or banana or cherry or damson: cherry
apple or banana or cherry or damson: damson
[root@localhost ~]# 
[root@localhost ~]# grep "^\(apple\).*\1$" test.txt 
apple or banana or cherry or damson: apple

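Figure: grep "^\(apple\).*\1$" test.txt prints the line that begins and ends with apple

④-④ Suppose we do not know in advance which words begin the lines that start and end with the same word. The literal word can then no longer be written into the group as above, and without a group there is nothing to back-reference. Instead, group .* to stand for the first word, and back-reference whatever that group matched:

④-④-① Modify test.txt slightly.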

[root@localhost ~]# vim test.txt 

apple or banana or cherry or damson: apple    ## first and last words are the same
apple or banana or cherry or damson: banana
cherry or apple or banana or damson: cherry    ## first and last words are the same
apple or banana or cherry or damson: damson
~
~
~
:wq!    ## save and quit

④-④-② grep "^\(.*\) .* \1$" test.txt prints the lines whose first and last words are the same.

[root@localhost ~]# 
[root@localhost ~]# cat test.txt 
apple or banana or cherry or damson: apple
apple or banana or cherry or damson: banana
cherry or apple or banana or damson: cherry
apple or banana or cherry or damson: damson
[root@localhost ~]# 
[root@localhost ~]# 
[root@localhost ~]# grep "^\(.*\) .* \1$" test.txt 
apple or banana or cherry or damson: apple
cherry or apple or banana or damson: cherry
[root@localhost ~]# 
[root@localhost ~]#

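Figure: grep "^\(.*\) .* \1$" test.txt prints the lines whose first and last words are the same

④-④-③ Caveat: when using .* to match the first and last words, pay attention to the format of the content. For example, with no space after ^\(.*\) and no space before \1$, the output looks like this:

Figure: output of grep "^\(.*\).*\1$" test.txt, with no space after ^\(.*\) and no space before \1$

Because .* matches anything, including the empty string, ^\(.*\) on its own can cover everything from the start of the line to its end.

Figure: ^\(.*\) matches from the start to the end of the line

As described above, nothing in ^\(.*\).*\1$ ties the group to the first word. Without a space after ^\(.*\), the group may capture any prefix of the line, even an empty one, and \1 then simply repeats that capture, so every line matches and is highlighted in full. That is not the intended result of printing only the lines whose first and last words agree.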
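
The effect of the missing spaces can be checked by counting matching lines. This is a sketch assuming a POSIX grep, with the four test lines inlined; the variable names are illustrative.

```shell
data=$(printf '%s\n' \
  'apple or banana or cherry or damson: apple' \
  'apple or banana or cherry or damson: banana' \
  'cherry or apple or banana or damson: cherry' \
  'apple or banana or cherry or damson: damson')
# No delimiters: the group may capture the empty string, so \1 is
# trivially satisfied and every line matches.
no_space=$(printf '%s\n' "$data" | grep -c '^\(.*\).*\1$' || true)
# With spaces: the group must be followed by a space, and \1 must follow
# a space at the end of the line.
with_space=$(printf '%s\n' "$data" | grep -c '^\(.*\) .* \1$' || true)
echo "no_space=$no_space with_space=$with_space"
```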

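④-④-④ Printing the lines whose first and last words agree. The idea is to capture the first word in a group and then have the back-reference at the end of the line compare the last word with it; if they agree, the line is printed. Looking at the text format, the first word apple is followed by a space, so try ^\(.*\) followed by a space.

Figure: effect of ^\(.*\) followed by a space

As the figure shows, with ^\(.*\) followed by a space, everything up to the last space on the line is highlighted. This already shows how to locate the first word. So much is highlighted because .* in grep is greedy: it extends to the last space it can reach, so every word that is followed by a space is included. A small change to the data makes this easier to see.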
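
The greedy behaviour can be made visible without colour output by printing only the matched text. This is a sketch assuming GNU grep's -o option; the variable name matched is illustrative.

```shell
line='apple or banana or cherry or damson: apple'
# '^\(.*\) ' = start of line, as many characters as possible, then a space.
# Greedy .* runs to the last space on the line, not the first.
matched=$(printf '%s\n' "$line" | grep -o '^\(.*\) ')
# Brackets make the trailing space visible.
printf '[%s]\n' "$matched"
```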

[root@localhost ~]# vim test.txt 

apple:or banana or cherry or damson apple    ## a colon now follows the first apple; the colon before the last apple is removed
apple or banana or cherry or damson: banana
cherry or apple or banana or damson: cherry
apple or banana or cherry or damson: damson                                                                                
~                                                                                                 
~                                                                                                 
~                                                                                                 
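:wq    ## save and quit

"^\(.*\):" means that everything before the : is taken as the first word. In the modified first line, the first word is followed by a distinctive delimiter, which makes it easy to see how the first word is found.

Figure: effect of "^\(.*\):" with : as the delimiter

This is how "^\(.*\) " locates the first word: by matching the delimiter that follows it.

④-④-⑤ When the last word is the back-reference, does the pattern need to follow the text format?

With the data so far, \1$ with and without a preceding space produces the same output.

Figure: \1$ with and without a preceding space

Now modify test.txt and add three lines whose last words are xxapple, applexx, and xxapplexx, and look again.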

[root@localhost ~]# 
[root@localhost ~]# vim test.txt 

apple or banana or cherry or damson: apple
apple or banana or cherry or damson: xxapple    ## new line, last word xxapple
apple or banana or cherry or damson: applexx    ## new line, last word applexx
apple or banana or cherry or damson: xxapplexx    ## new line, last word xxapplexx
apple or banana or cherry or damson: banana
cherry or apple or banana or damson: cherry
apple or banana or cherry or damson: damson
                                                                      
~                                                                                        
~                                                                                        
~                                                                                        
:wq

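The different last-word formats give the following results.

Figure: matching different last-word formats

So when the last word is matched by a back-reference, following the text format (here, the space before \1$) gives the more rigorous result.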
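
Since the figure is not reproduced here, the following sketch counts the matches for both variants on the seven test lines. It assumes a POSIX shell and grep; the names strict and loose are illustrative.

```shell
# printf repeats its format once per argument, producing six lines;
# the second printf adds the cherry line.
data=$(printf 'apple or banana or cherry or damson: %s\n' \
         apple xxapple applexx xxapplexx banana damson
       printf 'cherry or apple or banana or damson: cherry\n')
# Space before \1$: the repeated text must be a whole last word.
strict=$(printf '%s\n' "$data" | grep -c '^\(.*\) .* \1$' || true)
# No space before \1$: the line only has to end with the repeated text,
# so the line ending in xxapple is accepted as well.
loose=$(printf '%s\n' "$data" | grep -c '^\(.*\) .*\1$' || true)
echo "strict=$strict loose=$loose"
```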

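5. With nested groups, how do we tell the first group (\1) from the second (\2)?
⑤-① Use bananabanana as the test data.
⑤-② First look at the output of two expressions:
grep "\(b\(an\)\).*\1" and grep "\(b\(an\)\).*\2"

Figure: grep "\(b\(an\)\).*\1"

Figure: grep "\(b\(an\)\).*\2"

⑤-③ With nested groups, back-reference numbers follow the order of the opening parentheses. In \(b\(an\)\), the outer group opens first, so \1 is ban; the inner group opens second, so \2 is an.

Figure: telling the first group (\1) from the second (\2) in nested groups

⑤-④ Figure: \1 back-referencing a nested group

⑤-⑤ Figure: \2 back-referencing a nested group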
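
The numbering can also be checked by printing what each group captured. This is a sketch using sed, which shares the basic regular expression syntax with grep; the bracketed output format is only for illustration.

```shell
# Groups are numbered by their opening parenthesis:
# \1 is the outer \(b\(an\)\), \2 is the inner \(an\).
groups=$(printf 'bananabanana\n' | sed 's/\(b\(an\)\).*/[\1][\2]/')
echo "$groups"
```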

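6. Extension: why is a back-reference said to refer to the result of a group?
⑥-① Test case: applexxabble. In grep "\(a...e\).*\1", \(a...e\) matches a string that starts with a, ends with e, and has any 3 characters in between; here the group matches apple. apple and abble start and end with the same letters and have the same length, yet the back-reference \1 produces no output. The reason is that apple and abble are different strings: \1 does not re-apply the pattern "a, any 3 characters, e"; it must match exactly the text that the group captured.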

[root@localhost ~]# echo "applexxabble" | grep "\(a...e\).*\1"
[root@localhost ~]#

Figure: echo "applexxabble" | grep "\(a...e\).*\1"

⑥-② Change abble to apple, and \1 succeeds.

[root@localhost ~]# echo "applexxapple" | grep "\(a...e\).*\1"
applexxapple

Figure: echo "applexxapple" | grep "\(a...e\).*\1"
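
The distinction between repeating a pattern and repeating a captured result can be sketched as follows, assuming a POSIX grep; the variable names are illustrative.

```shell
text='applexxabble'
# Pattern written twice: each a...e may match different text.
pattern_twice=$(printf '%s\n' "$text" | grep -c 'a...e.*a...e' || true)
# Back-reference: the second part must equal what the group captured.
backref=$(printf '%s\n' "$text" | grep -c '\(a...e\).*\1' || true)
echo "pattern_twice=$pattern_twice backref=$backref"
```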

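7. The alternation operator \|, for example a\|b (a or b) and a\|A (a or A).

⑦-① Test data.

Figure: test data

⑦-② Print lines that start with a or start with b.

Figure: lines that start with a or start with b

⑦-③ Print lines that start with a or end with y.

Figure: lines that start with a or end with y

⑦-④ Print lines that contain pp, an, or rr.

Figure: lines that contain pp, an, or rr

⑦-⑤ Print lines that start with a (followed by any characters) or start with b (followed by any characters).

Add some content to test.txt: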

[root@localhost ~]# vim test.txt 

almond 杏仁
apple 苹果
apricot 杏子
arbutus 杨梅
banana 香蕉
bennet 水杨梅
bergamot 佛手柑
berry 桨果
betelnut 槟榔
cherry 樱桃
damson 洋李子                                                        
~                                                                                        
~                                                                                        
~                                                                                        
:wq    ## save

--------------------------------------------------------------------

[root@localhost ~]# 
[root@localhost ~]# cat test.txt 
almond 杏仁
apple 苹果
apricot 杏子
arbutus 杨梅
banana 香蕉
bennet 水杨梅
bergamot 佛手柑
berry 桨果
betelnut 槟榔
cherry 樱桃
damson 洋李子
[root@localhost ~]#

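grep "^\(a\|b\).*" test.txt matches lines that start with a or b, followed by any characters.

Figure: grep "^\(a\|b\).*" test.txt

Do not write this as grep "^a\|^b.*" test.txt. That expression means "a at the start of the line" or "b at the start of the line followed by any characters", so the trailing .* belongs only to the b branch.

Figure: grep "^a\|^b.*" test.txt

⑦-⑥ Suppose we need to match apple or bpple. grep "^a\|bpple" test.txt actually prints the lines that start with a, not apple or bpple.

Figure: grep "^a\|bpple" test.txt prints lines that start with a

Use a group here: grep "^\(a\|b\)pple" test.txt. Only with a\|b enclosed in a group after ^ does the expression mean apple or bpple.

Figure: grep "^\(a\|b\)pple" test.txt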
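
Here is a sketch of the two readings side by side, assuming GNU grep (\| in a basic regular expression is a GNU extension). The word list is a cut-down version of the test data, and the line bpple is a made-up entry added only so that both branches have something to match.

```shell
data=$(printf '%s\n' almond apple apricot arbutus banana bpple)
# Ungrouped: "starts with a" OR "contains bpple".
ungrouped=$(printf '%s\n' "$data" | grep -c '^a\|bpple' || true)
# Grouped: the alternation covers only the first letter; pple is shared.
grouped=$(printf '%s\n' "$data" | grep '^\(a\|b\)pple' | tr '\n' ' ')
echo "ungrouped=$ungrouped grouped=$grouped"
```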

⑦-⑦ Suppose we need to match the same word regardless of the case of its first letter.

Add Apple, with a capital first letter.

[root@localhost ~]# vim test.txt 

almond 杏仁
apple 苹果
Apple 苹果    ## new line with a capital A
apricot 杏子
arbutus 杨梅
banana 香蕉
bennet 水杨梅
bergamot 佛手柑
berry 桨果
betelnut 槟榔
cherry 樱桃
damson 洋李子
                                                                        
~                                                                                        
~                                                                                        
~                                                                                        
:wq    ## save and quit

In the same way, group a and A.

Figure: grouping a and A

If a and A are not grouped, the condition means "lines that start with a, or lines that contain Apple".
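
The grouped and ungrouped forms can be compared by counting matches. This is a sketch assuming GNU grep for \|; the bracket expression [aA] is an alternative not used in the text above, shown because it expresses the same single-character choice. The word list and variable names are illustrative.

```shell
data=$(printf '%s\n' almond apple Apple apricot banana)
# Grouped alternation: a or A, then pple.
grouped=$(printf '%s\n' "$data" | grep -c '^\(a\|A\)pple' || true)
# Bracket expression: one character that is either a or A.
bracket=$(printf '%s\n' "$data" | grep -c '^[aA]pple' || true)
# Ungrouped: "starts with a" OR "contains Apple".
ungrouped=$(printf '%s\n' "$data" | grep -c '^a\|Apple' || true)
echo "grouped=$grouped bracket=$bracket ungrouped=$ungrouped"
```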
