sed and awk 2022-10-13

2022-10-13 9_SooHyun

sed

sed (stream editor) is typically used to modify files.
sed [OPTION]... {script-only-if-no-other-script} [input-file]...
[] optional
{} required, chosen from the given set
<> required

The sed script-only-if-no-other-script is usually wrapped in single or double quotes.

sed [options] 'command' file(s) # single quotes: sed receives the text exactly as written
sed [options] "command" file(s) # double quotes: the shell expands $variables before sed runs
sed [options] -f scriptfile file(s)
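
The difference between the two quoting styles can be sketched as follows. The variable name word and the sample text are made up for this illustration.

```shell
# 'word' is a hypothetical shell variable used only for this demo.
word=world

# Single quotes: sed receives the literal text $word, which does not occur in the input,
# so the line passes through unchanged.
single=$(printf 'hello world\n' | sed 's/$word/sed/')

# Double quotes: the shell expands $word to "world" before sed runs,
# so sed actually sees s/world/sed/.
double=$(printf 'hello world\n' | sed "s/$word/sed/")

printf '%s\n%s\n' "$single" "$double"
```

Single quotes are usually the safer default; double quotes are worth reaching for only when a shell variable really needs to reach the script.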

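sed processes the data in a text file according to script commands, which are either typed on the command line or stored in a file.
sed handles the input in the following order:

It reads only one line at a time;
It matches and modifies the data according to the supplied commands. Note that by default sed does not modify the source file: it copies each line into a buffer (the pattern space), and changes are applied only to the data in that buffer;
It outputs the result.
Once a line has been processed, sed reads the next line and repeats the cycle until all the data in the file has been handled.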
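
A minimal sketch of the point that the source file is left alone, assuming mktemp is available; the file name and contents are invented for the demo.

```shell
# demo_file is a hypothetical temporary file created only for this sketch.
demo_file=$(mktemp)
printf 'alpha\nbeta\n' > "$demo_file"

# sed edits its own copy of each line and writes the result to standard output.
edited=$(sed 's/alpha/ALPHA/' "$demo_file")

# The file on disk still holds the original text.
on_disk=$(cat "$demo_file")
rm -f "$demo_file"

printf 'output:\n%s\nfile:\n%s\n' "$edited" "$on_disk"
```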

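By default, a sed command applies to every line of the text; in other words, sed's default address is all lines.
To apply a command only to a particular line or set of lines, specify an address in the {script-only-if-no-other-script} part.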

address specified for sed

An address determines which lines a sed command applies to.
A sed command can specify zero, one, or two addresses.
An address can be a line number, a line addressing symbol, or a regular expression that describes a pattern.

If no address is specified, then the command is applied to each line.
If there is only one address, the command is applied to any line matching the address.
If two comma-separated addresses are specified, the command is performed on the first matching line and all succeeding lines up to and including a line matching the second address. This range may match multiple times throughout the input.
If an address is followed by an exclamation mark ( ! ), the command is applied to all lines that do not match the address.

To illustrate how addressing works, let's look at examples using the delete command, d .
A script consisting of simply the d command and no address: d produces no output since it deletes all lines.
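
The zero-address and negated-address cases can be tried on a small made-up input, sketched here with the d command:

```shell
# Sample input invented for this sketch.
input=$(printf 'one\n# note\nthree\n# end')

# No address: d applies to every line, so nothing is printed.
no_addr=$(printf '%s\n' "$input" | sed 'd')

# 2!d deletes every line except line 2.
only_second=$(printf '%s\n' "$input" | sed '2!d')

# /^#/!d deletes every line that does not start with '#'.
only_comments=$(printf '%s\n' "$input" | sed '/^#/!d')

printf '[%s]\n[%s]\n[%s]\n' "$no_addr" "$only_second" "$only_comments"
```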

one-address situation

When a line number is supplied as an address, the command affects only that line. For instance, the following example deletes only the first line:
1d
The line number refers to an internal line count maintained by sed. This counter is not reset for multiple input files. Thus, no matter how many files were specified as input, there is only one line 1 in the input stream.

Similarly, the input stream has only one last line. It can be specified using the addressing symbol, $. The following example deletes the last line of input:
$d
The $ symbol should not be confused with the $ used in regular expressions, where it means the end of the line.

When a regular expression is supplied as an address, the command affects only the lines matching that pattern. The regular expression must be enclosed by slashes ( / ). The following delete command:
/^$/d
deletes only blank lines. All other lines are passed through untouched.
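
A short sketch of the three one-address forms, plus the shared line counter across input files. The sample text and the temporary files a.txt and b.txt are invented for the demo, and mktemp -d is assumed to be available.

```shell
# Sample input invented for this sketch: four lines, the third one blank.
input=$(printf 'first\nmiddle\n\nlast')

drop_first=$(printf '%s\n' "$input" | sed '1d')     # line-number address
drop_last=$(printf '%s\n' "$input" | sed '$d')      # $ = last line of input
drop_blank=$(printf '%s\n' "$input" | sed '/^$/d')  # regular-expression address

# The line counter runs across all input files, so 1d removes only
# the first line of the first file.
tmp_dir=$(mktemp -d)
printf 'a1\na2\n' > "$tmp_dir/a.txt"
printf 'b1\nb2\n' > "$tmp_dir/b.txt"
two_files=$(sed '1d' "$tmp_dir/a.txt" "$tmp_dir/b.txt")
rm -rf "$tmp_dir"

printf '%s\n---\n%s\n' "$drop_blank" "$two_files"
```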

two-address situation

If you supply two addresses, then you specify a range of lines over which the command is executed. The following example shows how to delete all lines surrounded by a pair of macros, in this case, .TS and .TE, that mark a table as tbl input:
/^\.TS/,/^\.TE/d
It deletes all lines beginning with the line matched by the first pattern up to and including the line matched by the second pattern. Lines outside this range are not affected. If there is more than one table (another .TS/.TE pair after the first), those tables will also be deleted.

The following command deletes from line 50 to the last line in the file:
50,$d
You can mix a line address and a pattern address:
1,/^$/d
This example deletes from the first line up to and including the first blank line.
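
The range forms can be checked on small inputs. The sample documents below are invented, and 3,$d stands in for 50,$d so that the input can stay short.

```shell
# A made-up document with two .TS/.TE table blocks.
doc=$(printf 'intro\n.TS\nrow 1\n.TE\nbody\n.TS\nrow 2\n.TE\nend')

# Pattern,pattern: the range matches twice, so both tables are removed.
no_tables=$(printf '%s\n' "$doc" | sed '/^\.TS/,/^\.TE/d')

# Line,$: delete from line 3 through the last line.
head_only=$(printf '%s\n' "$doc" | sed '3,$d')

# Line,pattern: delete from line 1 through the first blank line
# (here, a made-up header block followed by a body).
mail=$(printf 'From: a\nTo: b\n\nhello\n\nbye')
body=$(printf '%s\n' "$mail" | sed '1,/^$/d')

printf '%s\n---\n%s\n---\n%s\n' "$no_tables" "$head_only" "$body"
```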

Advanced sed: hold space & pattern space

h H Copy/append pattern space to hold space.
g G Copy/append hold space to pattern space.
n N Read/append the next line of input into the pattern space.

When sed reads a file line by line, the line that has been currently read is inserted into the pattern buffer (pattern space). Pattern buffer is like the temporary buffer, the scratchpad where the current information is stored. When you tell sed to print, it prints the pattern buffer.

Hold buffer / hold space is like a long-term storage, such that you can catch something, store it and reuse it later when sed is processing another line. You do not directly process the hold space, instead, you need to copy it or append to the pattern space if you want to do something with it. For example, the print command p prints the pattern space only. Likewise, s operates on the pattern space.

Here is an example:

sed -n '1!G;h;$p'
(the -n option suppresses automatic printing of lines)

There are three commands here: 1!G, h and $p. 1!G has an address, 1 (first line), but the ! means that the command will be executed everywhere but on the first line. $p on the other hand will only be executed on the last line. So what happens is this:

1. The first line is read and inserted automatically into the pattern space.
2. On the first line, the first command is not executed; h copies the first line into the hold space.
3. Now the second line replaces whatever was in the pattern space.
4. On the second line, first we execute G, appending the contents of the hold buffer to the pattern buffer, separated by a newline. The pattern space now contains the second line, a newline, and the first line.
5. Then, the h command copies the concatenated contents of the pattern buffer into the hold space, which now holds the reversed lines two and one.
6. We proceed to line number three -- go to point (3) above.
Finally, after the last line has been read and the hold space (containing all the previous lines in reverse order) has been appended to the pattern space, the pattern space is printed with p. As you have guessed, the above does exactly what the tac command does -- prints the file in reverse.

echo -e "1\n2\n3\n4" | sed -n '1!G;h;$p'
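
The walkthrough above can be verified with a short sketch, which also tries the n and N commands from the table. printf is used instead of echo -e only for portability; the sample lines are invented.

```shell
# Reverse the input with the hold space, as traced step by step above.
reversed=$(printf '1\n2\n3\n4\n' | sed -n '1!G;h;$p')

# N appends the next input line to the pattern space (joined by a newline);
# replacing that newline with a space merges lines in pairs.
joined=$(printf 'a\nb\nc\nd\n' | sed 'N;s/\n/ /')

# n replaces the pattern space with the next line; with -n, only the
# explicitly printed lines (every second line) reach the output.
evens=$(printf 'a\nb\nc\nd\n' | sed -n 'n;p')

printf '%s\n---\n%s\n---\n%s\n' "$reversed" "$joined" "$evens"
```

The inputs for N and n have an even number of lines on purpose, so the behavior when no next line exists does not come into play.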

sed for string replacing

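sed s/oldcontent/newcontent/ : replaces oldcontent with newcontent in the pattern space
Speed optimization: when execution speed matters for some reason (for example, a large input file, or a slow processor or disk), consider putting an address expression in front of the substitution command ("s/.../.../"). For example:
sed 's/foo/bar/g' filename # standard substitution command
sed '/foo/s/foo/bar/g' filename # the leading /foo/ address restricts s to matching lines, which can be faster
sed '/foo/s//bar/g' filename # shorthand: an empty pattern reuses the last regular expression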
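
A quick sketch confirming that the three forms produce the same text on a small made-up input. It says nothing about speed, which depends on the sed implementation and the data.

```shell
# Sample input invented for this sketch.
input=$(printf 'foo one foo\nplain\nfood')

plain=$(printf '%s\n' "$input" | sed 's/foo/bar/g')
addressed=$(printf '%s\n' "$input" | sed '/foo/s/foo/bar/g')
# The empty pattern in s//bar/g reuses the regex from the /foo/ address.
shorthand=$(printf '%s\n' "$input" | sed '/foo/s//bar/g')

printf '%s\n' "$plain"
```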

regex for sed

https://www.gnu.org/software/sed/manual/html_node/Extended-regexps.html
The only difference between basic and extended regular expressions is in the behavior of a few characters: ‘?’, ‘+’, parentheses, and braces (‘{}’).
Basic regular expressions require these to be escaped if you want them to behave as special characters; when using extended regular expressions, you must escape them if you want them to match a literal character.
For example, to match one or two consecutive "t" characters (such as "tt"):

In BRE, use t\{1,2\}. In BRE, these characters must be escaped with \ to take their special meaning.
In ERE, use t{1,2}. In ERE, these characters must be escaped with \ to take their literal meaning.
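
The brace example can be run in both dialects. This is a sketch on invented sample words, using the -E option for the ERE cases.

```shell
# Interval expression: special meaning needs \{ \} in BRE, plain { } in ERE.
bre=$(printf 'attic\n' | sed 's/t\{1,2\}/X/')
ere=$(printf 'attic\n' | sed -E 's/t{1,2}/X/')

# Literal braces: plain { } in BRE, escaped \{ \} in ERE.
bre_literal=$(printf 'a{1,2}\n' | sed 's/{1,2}/X/')
ere_literal=$(printf 'a{1,2}\n' | sed -E 's/\{1,2\}/X/')

printf '%s %s %s %s\n' "$bre" "$ere" "$bre_literal" "$ere_literal"
```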

By default, sed supports and uses POSIX.2 BREs.
The -E option enables extended regular expressions in the script.

[root@TENCENT64 /]# echo "abc+def" | sed 's/+/--/g'
abc--def
-> BRE is the default, where + is a literal character, so it is replaced.

[root@TENCENT64 /]# echo "abc+def" | sed -E 's/+/--/g'
sed: -e expression #1, char 8: Invalid preceding regular expression
-> With ERE, + means "one or more of the preceding item"; a lone + has nothing to repeat, so the regular expression is invalid.
[root@TENCENT64 /]# echo "abc+def" | sed -E 's/\+/--/g'
abc--def
-> With ERE, \+ matches the literal "+" character.
[root@TENCENT64 /]# echo "abc+def" | sed -E 's/c+/--/g'
ab--+def
-> With ERE, c+ means "c one or more times".
[root@TENCENT64 /]# echo "abc+def" | sed -E 's/c\+/--/g'
ab--def
-> With ERE, c\+ matches the literal string "c+".
[root@TENCENT64 /]#

For more usage, see info sed.

awk

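awk is a powerful command for searching and extracting text, with rich support for filtering and field extraction. Using awk feels much like using a small database.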

NAME
       awk - pattern-directed scanning and processing language

SYNOPSIS
       awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file ...  ]

Awk scans each input file for lines that match any of a set of patterns
       specified literally in prog or in one or more files specified as -f
       progfile.

-F:
The -F fs option defines the input field separator to be the regular expression fs. 
-F defines the separator used to split a line of text into fields. An input line is normally made up of fields separated by white space, or by the regular expression FS.
The fields are denoted $1, $2, ..., while $0 refers to the entire line.  If FS is null, the input line is split into one field per character.
$1, $2, and so on can be referenced later in the prog program.
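
A small sketch of field splitting with and without -F. The passwd-style line and the key=value string are invented sample data.

```shell
# A made-up passwd-style record, split on ':'.
passwd_line='alice:x:1000:1000:Alice:/home/alice:/bin/bash'
user_shell=$(printf '%s\n' "$passwd_line" | awk -F: '{print $1, $7}')

# Without -F, runs of spaces and tabs separate the fields.
second_word=$(printf 'a   b\tc\n' | awk '{print $2}')

# The separator can be a regular expression: split on '=' or ';'.
values=$(printf 'k1=v1;k2=v2\n' | awk -F'[=;]' '{print $2, $4}')

printf '%s\n%s\n%s\n' "$user_shell" "$second_word" "$values"
```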

-v:
The option -v followed by var=value is an assignment to be done before prog is executed; any number of -v options may be present.  
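
As a sketch, -v is a convenient way to hand a shell value to the awk program. The names threshold, limit, a, and b are hypothetical.

```shell
# threshold is a hypothetical shell variable; inside awk it is visible as 'limit'.
threshold=10
above=$(printf '5\n12\n30\n' | awk -v limit="$threshold" '$1 > limit {print $1}')

# Several -v options may be given; the assignments happen before prog starts,
# so the values are already available in BEGIN.
sum=$(awk -v a=1 -v b=2 'BEGIN {print a + b}')

printf '%s\n%s\n' "$above" "$sum"
```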

prog:
An awk prog (program) has the following form:
pattern1 {action1} pattern2 {action2} …

With each pattern there can be an associated action that will be performed when a line of a file matches the pattern. 
Each line is matched against the pattern portion of every pattern-action statement; the associated action is performed for each matched pattern.  
For each line, action1 runs if the line matches pattern1, action2 runs if it matches pattern2, and so on.

e.g.
awk '$6 != 0 {print $0}' filetest   # print the line if its sixth field is not 0
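
A sketch with two pattern-action statements on invented input, showing that a single line can trigger more than one action:

```shell
# Sample input invented for this sketch: a name and a numeric code per line.
input=$(printf 'a 0\nb 3\nerr 0\nerr 5')

# Each line is tested against both patterns; "err 5" matches both.
report=$(printf '%s\n' "$input" | awk '
    $2 != 0 {print "nonzero:", $1}
    /^err/  {print "error:", $1}
')

printf '%s\n' "$report"
```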

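awk built-in variables

NR: the number of records (lines, by default) read so far across all input, not the number of lines output (Number of Records)
FNR: the record number within the current input file (File Number of Records)
NF: the number of fields in the current line (Number of Fields)
Trick: with multiple input files, NR==FNR means the line belongs to the first file, and NR>FNR means it does not.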
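
The NR==FNR trick is often used to load one file into an array and then filter a second file against it. A minimal sketch, assuming mktemp -d is available; the file names wanted.txt and stock.txt and their contents are invented.

```shell
tmp_dir=$(mktemp -d)
printf 'apple\ncherry\n' > "$tmp_dir/wanted.txt"
printf 'apple 3\nbanana 5\ncherry 7\n' > "$tmp_dir/stock.txt"

# NR, FNR and NF for every line of both files.
counters=$(awk '{print NR, FNR, NF}' "$tmp_dir/wanted.txt" "$tmp_dir/stock.txt")

# While reading the first file (NR == FNR), remember its keys and skip ahead;
# for the second file, print only the lines whose first field was remembered.
matched=$(awk 'NR == FNR {want[$1] = 1; next} $1 in want {print $1, $2}' \
    "$tmp_dir/wanted.txt" "$tmp_dir/stock.txt")
rm -rf "$tmp_dir"

printf '%s\n---\n%s\n' "$counters" "$matched"
```

The trick relies on the first file being non-empty; with an empty first file, NR==FNR would also hold for the second file.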

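awk built-in special patterns

BEGIN: matches the point before the first line of the first input file is read
END: matches the point after the last line of the last input file has been read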
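
A typical use, sketched on made-up numbers: print a header in BEGIN, accumulate in the main action, and report in END.

```shell
summary=$(printf '3\n4\n5\n' | awk '
    BEGIN {print "start"}                      # runs once, before any input
    {sum += $1}                                # runs for every input line
    END {print "lines:", NR, "sum:", sum}      # runs once, after all input
')

printf '%s\n' "$summary"
```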

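awk actions

Actions can perform arithmetic, including the operators + - * / %. Variables can be used directly, with no declaration needed.
Multiple statements in an action are separated with ;
awk has only two scalar types: numbers (double-precision floating point) and strings. A variable can be assigned the result of string concatenation; to concatenate, simply write the operands next to each other, separated by a space.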
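
These points can be tried in a BEGIN block, which needs no input. The variable names in this sketch are arbitrary.

```shell
# The five arithmetic operators; statements are separated by ';'.
calc=$(awk 'BEGIN {a = 7; b = 2; print a + b, a - b, a * b, a / b, a % b}')

# Concatenation: operands written side by side, separated by spaces.
concat=$(awk 'BEGIN {first = "sed"; second = "awk"; both = first " and " second; print both; print length(both)}')

# Undeclared variables start out as 0 in numeric context and "" in string context.
fresh=$(awk 'BEGIN {n++; print n, "[" s "]"}')

printf '%s\n%s\n%s\n' "$calc" "$concat" "$fresh"
```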

awk also supports the control structures common to general-purpose languages (if, while, for), written the same way as in C:

if(){}else{}
while(){}
for( ; ;){}
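
Filled-in versions of the three templates, as a sketch on invented data:

```shell
# if / else: classify each input number.
classified=$(printf '4\n7\n' | awk '{
    if ($1 % 2 == 0) { print $1, "even" } else { print $1, "odd" }
}')

# while: build a countdown string.
countdown=$(awk 'BEGIN {
    i = 3
    while (i > 0) { s = s i " "; i-- }
    print s "go"
}')

# for: collect the first three squares, comma-separated.
squares=$(awk 'BEGIN {
    for (i = 1; i <= 3; i++) { out = out sep i * i; sep = "," }
    print out
}')

printf '%s\n%s\n%s\n' "$classified" "$countdown" "$squares"
```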

[root@VM-165-116-centos ~]# free
              total        used        free      shared  buff/cache   available
Mem:       16132456     2242816     1608600      274992    12281040    13378556
Swap:       1048572       64256      984316
[root@VM-165-116-centos ~]# free | awk '{print $2}'
used
16132456
1048572
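# Here the "used" header ends up above values from the total column: the header row has no label in its first column, so its $2 is "used", while $2 on the remaining rows is the total value.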

[root@VM-165-116-centos ~]# free | awk '{if(NR==1){print $2}else{print $3}}'
used
2245456
64256
[root@VM-165-116-centos ~]#