
Text Processing Commands (Part 2): awk Study Notes

2018-05-02  Chuck_Hu

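The previous post covered the grep text-processing tool. This one summarizes another, more powerful tool: awk.
AWK is an interpreted programming language designed specifically for processing text data. Its name comes from the initials of its designers: Alfred Aho, Peter Weinberger, and Brian Kernighan. Besides text processing, awk can generate formatted text reports, do arithmetic, manipulate strings, and more.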

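Workflow

awk's workflow has just three steps: read, execute, and repeat.

(1) Read
AWK reads one line from the input stream (a file, a pipe, or standard input) and stores it in memory.
(2) Execute
For each input line, all AWK commands are executed in order. By default the commands apply to every input line, but they can be restricted to lines that match a given pattern.
(3) Repeat
The two steps above repeat until the end of the input.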

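awk program structure

A program has three blocks: the BEGIN block, the body block, and the END block.

BEGIN block

Syntax:

BEGIN {awk-commands}

It runs exactly once, at program startup and before any input is read, and is typically used for initialization such as assigning variables. The keyword BEGIN must be uppercase.
The BEGIN block is optional.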

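Body block

Syntax:

/pattern/ {awk-commands}

The body block is the part of the program that processes each line of the input. It runs for every line, or only for lines that match the pattern when one is given.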

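END block

Syntax:

END {awk-commands}

Like the BEGIN block, the END block runs only once, after all input has been processed. It is also optional, and the keyword END must be uppercase.

Example
Contents of test.txt (this file is used throughout the rest of the article):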

1)    Amit     Physics    80
2)    Rahul    Maths      90
3)    Shyam    Biology    87
4)    Kedar    English    85
5)    Hari     History    89

awk command example

awk 'BEGIN{printf "NO\tName\tSubject\tMark\n"} {print $0}' test.txt

Output:

NO    Name    Subject    Mark
1)    Amit     Physics    80
2)    Rahul    Maths      90
3)    Shyam    Biology    87
4)    Kedar    English    85
5)    Hari     History    89
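
The example above only uses BEGIN and the body block. As a minimal sketch of all three blocks working together, the following sums the marks in the fourth column and reports an average. The inline data simply mirrors test.txt, and the variable names (data, report, total) are made up for illustration.

```shell
# Sample data in the same layout as test.txt (No, Name, Subject, Mark)
data='1)    Amit     Physics    80
2)    Rahul    Maths      90
3)    Shyam    Biology    87
4)    Kedar    English    85
5)    Hari     History    89'

# BEGIN runs once before any input, the body runs once per line,
# END runs once after the last line has been read
report=$(printf '%s\n' "$data" | awk '
BEGIN { total = 0 }
      { total += $4 }
END   { printf "students=%d total=%d average=%.1f\n", NR, total, total / NR }')
echo "$report"
```

In the END block, NR still holds the number of lines read, which is why it can serve as the divisor.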

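Basic syntax

The basic form is awk [awk commands] file: the awk keyword comes first, followed by the awk program (the blocks described above), and finally the file to process.
When the program is typed on the command line, it must be enclosed in single quotes '', and each action must be enclosed in {}. For example, to print all of test.txt:

awk '{print}' test.txt

An awk program can be run from the command line or from a file. To repeat the operation above from a file, create exe.awk:

{print}

A program file is run with the -f option, awk -f awkcommandfile targetfile:

awk -f exe.awk test.txt

This prints the contents of test.txt, just like the command-line version.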
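
As a rough end-to-end sketch of the -f workflow, the snippet below writes a two-line program to a temporary file and runs it against piped input. The temporary file and the sample input are assumptions chosen so the example is self-contained.

```shell
# Write a small awk program to a temporary file (the path is arbitrary)
prog=$(mktemp)
cat > "$prog" <<'EOF'
# print the line number followed by the second field
{ print NR ": " $2 }
EOF

# Run the program file with -f; the input comes from a pipe instead of a file
out=$(printf 'a Amit\nb Rahul\n' | awk -f "$prog")
echo "$out"
rm -f "$prog"
```

A program file can contain comments and span several lines, which makes it easier to maintain than a long one-liner.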

Common awk options

Run awk --help to list all available options (the output below is from GNU awk):

Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:      GNU long options:
    -f progfile     --file=progfile
    -F fs           --field-separator=fs
    -v var=val      --assign=var=val
    -m[fr] val
    -W compat       --compat
    -W copyleft     --copyleft
    -W copyright        --copyright
    -W dump-variables[=file]    --dump-variables[=file]
    -W gen-po       --gen-po
    -W help         --help
    -W lint[=fatal]     --lint[=fatal]
    -W lint-old     --lint-old
    -W non-decimal-data --non-decimal-data
    -W profile[=file]   --profile[=file]
    -W posix        --posix
    -W re-interval      --re-interval
    -W source=program-text  --source=program-text
    -W traditional      --traditional
    -W usage        --usage
    -W version      --version

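① -f: runs a program file.
② -F: sets the field separator. By default fields are separated by whitespace (runs of spaces and tabs). For example, -F':' makes : the separator; for a simple separator like this the single quotes can be omitted. Several separators can be used at once by writing them inside square brackets: awk -F'[:\t]' splits on both colons and tabs. Here the quotes are required, otherwise the shell strips the backslash.
③ -v: assigns a variable. Besides assigning in BEGIN, awk provides the -v option to assign a variable outside the program text.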

awk -v name=Chuck 'BEGIN{printf "my name is %s\n", name}'
Output:
my name is Chuck
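
The -F and -v options can be combined in one command. The sketch below splits on either a colon or a comma and passes a value in from the shell; the input line and the variable name prefix are invented for illustration.

```shell
# Split on either ":" or "," and pass a shell value into awk with -v
line='root:x:0,admin'
out=$(printf '%s\n' "$line" | awk -F'[:,]' -v prefix='user' '{ print prefix "=" $1, "fields=" NF }')
echo "$out"
```

Quoting the bracket expression keeps the shell from treating it as a glob pattern.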

④ --dump-variables[=file]: writes awk's global variables to a file

# dump awk's global variables
awk --dump-variables ''
# written to awkvars.out by default
cat awkvars.out
Output:
ARGC: number (1)
ARGIND: number (0)
ARGV: array, 1 elements
BINMODE: number (0)
CONVFMT: string ("%.6g")
ERRNO: number (0)
FIELDWIDTHS: string ("")
FILENAME: string ("")
FNR: number (0)
FS: string (" ")
IGNORECASE: number (0)
LINT: number (0)
NF: number (0)
NR: number (0)
OFMT: string ("%.6g")
OFS: string (" ")
ORS: string ("\n")
RLENGTH: number (0)
RS: string ("\n")
RSTART: number (0)
RT: string ("")
SUBSEP: string ("\034")
TEXTDOMAIN: string ("messages")

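⑤ --profile: pretty-prints the awk program, writing a formatted copy of the program given on the command line to a file

# run a program from the command line; --profile writes to awkprof.out by default
awk --profile 'BEGIN{print "This is BEGIN"} {print $1,$2,$3} END{print "AWK command end"}' test.txt 
# view the formatted program
cat awkprof.out
Output: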
# gawk profile, created Sat Apr  7 09:39:57 2018

    # BEGIN block(s)

    BEGIN {
        print "This is BEGIN"
    }

    # Rule(s)

    {
        print $1, $2, $3
    }

    # END block(s)

    END {
        print "AWK command end"
    }

Neat, isn't it? To write to a specific file, use --profile=filename; the same works for --dump-variables in ④.

Built-in variables

①ARGC
The number of command-line arguments, counting the program name awk itself.

awk 'BEGIN{print "number of arguments is", ARGC}' one two three four
Output:
number of arguments is 5

②ARGV
The array that holds the command-line arguments, indexed from 0 to ARGC - 1.

awk 'BEGIN{for(i = 0; i < ARGC; i++){printf "ARGV[%d] = %s\n", i, ARGV[i]}}' one two three four
Output:
ARGV[0] = awk
ARGV[1] = one
ARGV[2] = two
ARGV[3] = three
ARGV[4] = four

③ ENVIRON
This variable is an array of the system environment variables, comparable to what the Linux env command prints.

awk 'BEGIN{printf "environment params user is %s\n", ENVIRON["USER"]}'
Output:
environment params user is Chuck

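④ FILENAME
The name of the file currently being read.

awk 'END{printf "execute file name is %s\n", FILENAME}' test.txt
Output:
execute file name is test.txt

Note that FILENAME cannot be printed from the BEGIN block: no input file has been opened at that point, so it is still empty. Try it and see.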
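⑤FS
The field separator. The default is a single space, which makes awk split on runs of spaces and tabs; it can be changed with the -F option.

awk 'BEGIN{print "FS =", FS}'
Output:
FS = 
# specify a separator
awk -F: 'BEGIN{print "FS =", FS}'
Output: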
FS = :
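
FS can also be assigned inside BEGIN, and its counterpart OFS (listed in the variable dump earlier) controls how fields are joined on output. A small sketch with a made-up input line:

```shell
# Read colon-separated fields and write them back joined with "-"
# Assigning $1 = $1 forces awk to rebuild the record using OFS
out=$(printf 'root:x:0:0\n' | awk 'BEGIN { FS = ":"; OFS = "-" } { $1 = $1; print }')
echo "$out"
```

Without the $1 = $1 assignment the record would be printed unchanged, because awk only rebuilds $0 when a field is modified.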

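⑥NF
The number of fields in the current input record. A field is one of the columns a line is split into by the separator, so NF is the column count.

echo -e "One Two\nOne Two Three\nOne Two Three Four" | awk 'BEGIN{i = 0}{printf "line %d NF = %d\n", i++, NF}'
Output:
line 0 NF = 2
line 1 NF = 3
line 2 NF = 4
# NF can also be used as a condition
echo -e "One Two\nOne Two Three\nOne Two Three Four" | awk 'NF > 2'
Output:
One Two Three
One Two Three Four

NF > 2 is a pattern; when no action follows it, awk performs the default action, print, and outputs the current line.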
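
Because NF is a number, $NF refers to the last field of the line, however many fields it has. A short sketch that combines the NF > 2 pattern with $NF, reusing the input from the example above:

```shell
# For lines with more than two fields, print the field count and the last field
out=$(printf 'One Two\nOne Two Three\nOne Two Three Four\n' | awk 'NF > 2 { print NF, $NF }')
echo "$out"
```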
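⑦ NR
The number of records read so far, that is, the current line number.
Example ⑥ tracked the line number in a user-defined variable i; the built-in NR does the same job.

echo -e "One Two\nOne Two Three\nOne Two Three Four" | awk '{printf "now line number is %d\n", NR}'
Output:
now line number is 1
now line number is 2
now line number is 3

As shown, NR starts counting at 1.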
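⑧ FNR
When several files are read at once, NR keeps counting from the first line of the first file through the last line of the last file.
FNR, by contrast, restarts from 1 each time awk moves on to a new file.

awk '{printf "now file is %s and NR = %d content is %s\n", FILENAME, NR, $0}' test.txt data.txt
Output: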
now file is test.txt and NR = 1 content is 1)    Amit     Physics    80
now file is test.txt and NR = 2 content is 2)    Rahul    Maths      90
now file is test.txt and NR = 3 content is 3)    Shyam    Biology    87
now file is test.txt and NR = 4 content is 4)    Kedar    English    85
now file is test.txt and NR = 5 content is 5)    Hari     History    89
now file is data.txt and NR = 6 content is root:x:0:0:root:/root:/bin/bash
now file is data.txt and NR = 7 content is bin:x:1:1:bin:/bin:/bin/false,aaa,bbbb,cccc,aaaaaa
now file is data.txt and NR = 8 content is DADddd:x:2:2:daemon:/sbin:/bin/false
now file is data.txt and NR = 9 content is mail:x:8:12:mail:/var/spool/mail:/bin/false
now file is data.txt and NR = 10 content is ftp:x:14:11:ftp:/home/ftp:/bin/false
now file is data.txt and NR = 11 content is &nobody:$:99:99:nobody:/:/bin/false
now file is data.txt and NR = 12 content is zhangy:x:1000:100:,,,:/home/zhangy:/bin/bash
now file is data.txt and NR = 13 content is http:x:33:33::/srv/http:/bin/false
now file is data.txt and NR = 14 content is dbus:x:81:81:System message bus:/:/bin/false
now file is data.txt and NR = 15 content is hal:x:82:82:HAL daemon:/:/bin/false
now file is data.txt and NR = 16 content is mysql:x:89:89::/var/lib/mysql:/bin/false
now file is data.txt and NR = 17 content is aaa:x:1001:1001::/home/aaa:/bin/bash
now file is data.txt and NR = 18 content is ba:x:1002:1002::/home/zhangy:/bin/bash
now file is data.txt and NR = 19 content is test:x:1003:1003::/home/test:/bin/bash
now file is data.txt and NR = 20 content is @zhangying:*:1004:1004::/home/test:/bin/bash
now file is data.txt and NR = 21 content is policykit:x:102:1005:Po

# using FNR
awk '{printf "now file is %s and NR = %d content is %s\n", FILENAME, FNR, $0}' test.txt data.txt
Output:
now file is test.txt and NR = 1 content is 1)    Amit     Physics    80
now file is test.txt and NR = 2 content is 2)    Rahul    Maths      90
now file is test.txt and NR = 3 content is 3)    Shyam    Biology    87
now file is test.txt and NR = 4 content is 4)    Kedar    English    85
now file is test.txt and NR = 5 content is 5)    Hari     History    89
now file is data.txt and NR = 1 content is root:x:0:0:root:/root:/bin/bash
now file is data.txt and NR = 2 content is bin:x:1:1:bin:/bin:/bin/false,aaa,bbbb,cccc,aaaaaa
now file is data.txt and NR = 3 content is DADddd:x:2:2:daemon:/sbin:/bin/false
now file is data.txt and NR = 4 content is mail:x:8:12:mail:/var/spool/mail:/bin/false
now file is data.txt and NR = 5 content is ftp:x:14:11:ftp:/home/ftp:/bin/false
now file is data.txt and NR = 6 content is &nobody:$:99:99:nobody:/:/bin/false
now file is data.txt and NR = 7 content is zhangy:x:1000:100:,,,:/home/zhangy:/bin/bash
now file is data.txt and NR = 8 content is http:x:33:33::/srv/http:/bin/false
now file is data.txt and NR = 9 content is dbus:x:81:81:System message bus:/:/bin/false
now file is data.txt and NR = 10 content is hal:x:82:82:HAL daemon:/:/bin/false
now file is data.txt and NR = 11 content is mysql:x:89:89::/var/lib/mysql:/bin/false
now file is data.txt and NR = 12 content is aaa:x:1001:1001::/home/aaa:/bin/bash
now file is data.txt and NR = 13 content is ba:x:1002:1002::/home/zhangy:/bin/bash
now file is data.txt and NR = 14 content is test:x:1003:1003::/home/test:/bin/bash
now file is data.txt and NR = 15 content is @zhangying:*:1004:1004::/home/test:/bin/bash
now file is data.txt and NR = 16 content is policykit:x:102:1005:Po

As you can see, with FNR the count restarts at 1 when awk switches to the next file.
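
A common use of the difference between the two counters is the NR == FNR test, which is true only while the first file is being read. The sketch below loads names from one file and filters a second file by them; the file names wanted.txt and marks.txt and their contents are hypothetical.

```shell
# Create two small input files in a temporary directory
dir=$(mktemp -d)
printf 'Amit\nHari\n' > "$dir/wanted.txt"
printf 'Amit 80\nRahul 90\nHari 89\n' > "$dir/marks.txt"

# First file (NR == FNR): remember each name, then skip to the next line
# Second file: print the per-file line number and the line if the name was remembered
out=$(awk 'NR == FNR { want[$1] = 1; next } ($1 in want) { print FNR, $0 }' "$dir/wanted.txt" "$dir/marks.txt")
echo "$out"
rm -rf "$dir"
```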
⑨ RLENGTH
The length of the substring matched by the most recent match() call.

awk 'BEGIN{if(match("three", "re")){printf "regex length is %d\n", RLENGTH}}'
Output:
regex length is 2

⑩ RSTART
The position at which the matched substring starts.

awk 'BEGIN{if(match("three", "re")){printf "start pos of str is %d\n", RSTART}}'
Output:
start pos of str is 3
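
RSTART and RLENGTH are most useful together with substr, since between them they describe exactly which part of the string matched. A minimal sketch, with an input line invented for illustration:

```shell
# Locate the first run of digits and extract it with substr
out=$(echo 'order id=12345 shipped' | awk '{
    if (match($0, /[0-9]+/)) {
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    }
}')
echo "$out"
```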

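⑪$n
Refers to a field: $0 is the whole line, and $n with n > 0 is the n-th field after splitting on the separator.
⑫ IGNORECASE
Controls whether matching is case-sensitive. It is specific to gawk; a non-zero value makes regular expression matching ignore case.

awk 'BEGIN{IGNORECASE=1}  /amit/' test.txt
Output:
1)    Amit     Physics    80

# without setting IGNORECASE
awk '/amit/' test.txt
No records match.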
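
Since IGNORECASE only works in gawk, a portable way to get a similar effect is to lowercase the line before matching with the tolower function. This is a sketch of an alternative, not what the example above does internally; the two input lines mirror test.txt.

```shell
# Case-insensitive match that does not depend on IGNORECASE
out=$(printf '1) Amit Physics 80\n2) Rahul Maths 90\n' | awk 'tolower($0) ~ /amit/')
echo "$out"
```

The line itself is printed unchanged, because tolower returns a lowercased copy and leaves $0 alone.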

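Common awk built-in functions

1. String functions

①sub, gsub
sub matches the regular expression against the leftmost, longest substring of the target and replaces it with the substitution string. If no target string is given, the whole record ($0) is used. Only the first match is replaced.

sub(regular expression, substitution string)
sub(regular expression, substitution string, target string)
Example:
echo "hello i am Chuck i am" | awk '{sub(/am/, "ok"); print $0}'
echo "hello i am Chuck i am" | awk '{sub(/am/, "ok", $6); print $0}'
Output: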
hello i ok Chuck i am
hello i am Chuck i ok

sub only replaces the first match. To replace every match, use gsub:

# replace every am in the record with ok
echo "hello i am Chuck i am" | awk '{gsub(/am/, "ok"); print $0}'
# replace every am within field 6 with ok
echo "hello i am Chuck i am" | awk '{gsub(/am/, "ok", $6); print $0}'
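
Both sub and gsub return the number of replacements they made, which is handy for checking whether anything changed. A small sketch using the same input line:

```shell
# gsub returns how many substitutions were performed
out=$(echo 'hello i am Chuck i am' | awk '{ n = gsub(/am/, "ok"); print n ": " $0 }')
echo "$out"
```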

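②index
Returns the position of the first occurrence of a substring within a string. Positions start at 1, and 0 means the substring was not found.

index(string, substring)

③length
Returns the length of a string.

length    # number of characters in the whole record
length(string)    # number of characters in string

④substr
Extracts part of a string.

substr(string, startpos);  # all characters from startpos onward
substr(string, startpos, length);  # length characters starting at startpos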
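
The three functions above have no examples in the text, so here is a short sketch that exercises each of them on one string. The string "hello awk" is arbitrary.

```shell
out=$(awk 'BEGIN {
    s = "hello awk"
    print index(s, "awk")      # position of the substring
    print length(s)            # number of characters
    print substr(s, 7)         # from position 7 to the end
    print substr(s, 1, 5)      # 5 characters starting at position 1
}')
echo "$out"
```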

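⑤match
Matches a regular expression against a string and returns the position where the match starts, or 0 if there is no match. It also sets RSTART and RLENGTH.

match(string, regular expression);

⑥split
Splits a string into an array and returns the number of elements.

split(string, array);  # split on FS by default
split(string, array, separator);  # split on the given separator
Example:
awk 'BEGIN{ split( "20:18:00", time, ":" ); print time[2] }'
Output:
18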
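
The array filled by split is indexed from 1, and the return value gives the element count, so the pieces can be walked with an ordinary counting loop. A minimal sketch with the same time string (the array name t is arbitrary):

```shell
out=$(awk 'BEGIN {
    n = split("20:18:00", t, ":")
    for (i = 1; i <= n; i++) {
        printf "t[%d]=%s\n", i, t[i]
    }
}')
echo "$out"
```

Looping from 1 to n keeps the pieces in their original order, which a for-in loop does not guarantee.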
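2. Time functions

①systime
Returns the current timestamp, in seconds since the epoch.
②strftime
Formats a timestamp as a string.

The format string uses the same conversion specifiers as the C strftime() function, for example %Y, %m, %d, %H, %M, %S, and %D.
strftime(format, [timestamp]);
Example:
awk 'BEGIN{now = strftime("%D"); print now}'
awk 'BEGIN{now = strftime("%D", systime()); print now}'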
3. Math functions
awk's built-in math functions are atan2, cos, sin, exp, log, sqrt, int, rand, and srand.
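
As a quick sketch of a few of awk's standard math functions, the values below were picked so that every result is a whole number. The ^ operator is included for exponentiation.

```shell
# int truncates toward zero, sqrt is the square root,
# exp and log are the natural exponential and logarithm
out=$(awk 'BEGIN { print int(3.9), sqrt(16), exp(0), log(1), 2 ^ 10 }')
echo "$out"
```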

awk operators

1. Arithmetic operators

① Addition

awk 'BEGIN{a = 10; b = 30; print "(a + b) =", a + b}'
Output:
(a + b) = 40

② Subtraction

awk 'BEGIN{a = 10; b = 30; print "(a - b) =", a - b}'
Output:
(a - b) = -20

③ Multiplication

awk 'BEGIN{a = 10; b = 30; print "a * b =", a * b}'
Output:
a * b = 300

④ Division

awk 'BEGIN{a = 10; b = 20; print "a / b = ", a / b}'
Output:
a / b =  0.5

⑤ Modulus

awk 'BEGIN{a = 10; b = 20; print "a % b = ", a % b}'
Output:
a % b =  10
2. Increment and decrement operators

As in most programming languages, both come in prefix and postfix forms.

# post-increment
awk 'BEGIN{a = 10; printf "the res of a++ is %d then print a is %d\n", a++, a}'
Output:
the res of a++ is 10 then print a is 11

# pre-increment
awk 'BEGIN{a = 10; printf "the res of ++a is %d then print a is %d\n", ++a, a}'
Output:
the res of ++a is 11 then print a is 11

# post-decrement
awk 'BEGIN{a = 10; printf "the res of a-- is %d then print a is %d\n", a--, a}'
Output:
the res of a-- is 10 then print a is 9

# pre-decrement
awk 'BEGIN{a = 10; printf "the res of --a is %d then print a is %d\n", --a, a}'
Output:
the res of --a is 9 then print a is 9
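3. Assignment operators

This section covers simple assignment and the addition, subtraction, multiplication, division, modulus, and exponent assignment operators.

# simple assignment
awk 'BEGIN{a = 10; printf "a is %d\n", a}'
Output:
a is 10

# addition assignment
awk 'BEGIN{a = 10; printf "a += 10 is %d\n", a += 10}'
Output:
a += 10 is 20

# subtraction assignment
awk 'BEGIN{a = 10; printf "a -= 5 is %d\n", a -= 5}'
Output:
a -= 5 is 5

# multiplication assignment
awk 'BEGIN{a = 10; printf "a *= 5 is %d\n", a *= 5}'
Output:
a *= 5 is 50

# division assignment
awk 'BEGIN{a = 10; printf "a /= 5 is %d\n", a /= 5}'
Output:
a /= 5 is 2

# modulus assignment (a literal % in a printf format is written %%)
awk 'BEGIN{a = 10; printf "a %%= 3 is %d\n", a %= 3}'
Output:
a %= 3 is 1

# exponent assignment
awk 'BEGIN{a = 10; printf "a ^= 3 is %d\n", a ^= 3}'
Output:
a ^= 3 is 1000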
4. Relational operators

① Equal to

awk 'BEGIN { a = 10; b = 10; if (a == b) print "a == b" }'
Output:
a == b

② Not equal to

awk 'BEGIN{ a = 10; b = 5; if(a != b){print "a != b"} }'
Output:
a != b

③ Less than

awk 'BEGIN{ a = 5; b = 10; if(a < b){print "a < b"} }'
Output:
a < b

④ Less than or equal to

awk 'BEGIN{ a = 5; b = 10; if(a <= b){print "a <= b"} }'
Output:
a <= b

⑤ Greater than

awk 'BEGIN{a = 5; b = 10; if(a > b){print "a > b"}}'
Output:
(none)

⑥ Greater than or equal to

awk 'BEGIN{a = 5; b = 10; if(a >= b){print "a >= b"}}'
Output:
(none)

Control flow

1.if-else
awk '{if($4 > 85){print "Y"}else{print "N"}}' test.txt
2.while
awk '{i = 0; while(i < $4){i++} print i}' test.txt
3.for
awk '{for(i = 1; i <= NF; i++){print $i}}' test.txt
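
To show control flow with visible output, here is a sketch that classifies each mark with if-else and draws a simple bar with a for loop. The pass threshold of 85 and the star bar are made up for illustration; the data mirrors the first three lines of test.txt.

```shell
data='1) Amit Physics 80
2) Rahul Maths 90
3) Shyam Biology 87'

out=$(printf '%s\n' "$data" | awk '{
    # if-else: classify the mark in field 4
    if ($4 >= 85) {
        grade = "pass"
    } else {
        grade = "retry"
    }
    # for: build a bar with one star per 10 marks
    bar = ""
    for (i = 0; i < int($4 / 10); i++) {
        bar = bar "*"
    }
    print $2, grade, bar
}')
echo "$out"
```

Field 4 looks like a number, so awk compares it numerically. A text field such as the subject in field 3 would be compared as a string instead.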

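awk arrays

awk arrays are associative, much like PHP arrays: each element is a key-value pair. Subscripts can be numbers or strings (awk stores every subscript as a string), values can be numbers or strings, and multidimensional arrays are supported in simulated form.

Declaring an array

An array element is declared by assignment: arrayname[key] = value

arr[0] = "Chuck"
arr["Craig"] = 1;
arr[$0] = $1;
arr["a", "b"] = 100;  # two-dimensional subscript, stored under the single key "a" SUBSEP "b"
Printing elements
Method 1: print
awk 'BEGIN{arr[0] = 1; arr["a"] = 2; print arr[0]}'
Output:
1
Printing a subscript that does not exist in the array:
awk 'BEGIN{arr[0] = 1; arr["a"] = 2; print arr[100]}'
Output:

An empty string is printed.

Method 2: for
awk 'BEGIN{arr[0] = 1; arr["a"] = 2; for(i in arr){print i}}'
Output:
0
a
awk 'BEGIN{arr[0] = 1; arr["a"] = 2; for(i in arr){print arr[i]}}'
Output:
1
2

For a key that does not exist, awk prints an empty string. In a for loop, i is the array key, and the value can only be printed through arr[i]; the order in which keys are visited is not guaranteed. Writing print arr is an error!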
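
One subtlety worth knowing: merely reading a nonexistent element creates it with an empty value. To test for a key without that side effect, awk has the in operator. A minimal sketch (the keys a and b are arbitrary):

```shell
out=$(awk 'BEGIN {
    arr["a"] = 2
    if ("b" in arr) { print "b exists" } else { print "b missing" }
    x = arr["b"]    # reading arr["b"] creates the element
    if ("b" in arr) { print "b exists" } else { print "b missing" }
}')
echo "$out"
```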

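Deleting array elements

awk can delete a single element or an entire array with the delete statement.

# delete a single element
awk 'BEGIN{arr[0] = 1; arr["a"] = 2; print arr["a"]; delete arr["a"]; print arr["a"];}'
Output:
2

# delete the whole array
awk 'BEGIN{arr[0] = 1; arr["a"] = 2; print arr["a"]; delete arr; print arr[0]; print arr["a"];}'
Output:
2


After deletion, printing the element shows an empty string. Deleting a key that does not exist is not an error.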

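Multidimensional arrays

A multidimensional array is declared as arrayname[index1, index2, ...]

awk 'BEGIN{arr["a", "b"] = 1; print arr["a", "b"];}'
Output:
1

Printing a multidimensional array
Besides print as above, a for loop also works. Unlike most programming languages, where an n-dimensional array needs n nested for loops, awk handles a multidimensional array with a single for loop.

awk 'BEGIN{arr["a", "b"] = 1; arr["c", "d"] = 2; for(i in arr){print i, arr[i];}}'
Output:
ab 1
cd 2

awk joins the dimensions with the value of SUBSEP, which defaults to the non-printing character "\034"; that is why the keys above look like ab and cd. Assign SUBSEP to change the separator between dimensions. SUBSEP must be set before the array elements are assigned, otherwise it has no effect on them.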

awk 'BEGIN{SUBSEP = ":"; arr["a", "b"] = 1; arr["c", "d"] = 2; for(i in arr){print i, arr[i];}}'
Output:
a:b 1
c:d 2

When choosing SUBSEP, avoid any character that occurs in the indexes themselves, or keys may collide. For example:

awk 'BEGIN{SUBSEP = ":"; arr["a", "b:c"] = 1; arr["a:b", "c"] = 2; for(i in  arr){print i, arr[i];}}'
Output:
a:b:c 2
awk 'BEGIN{SUBSEP = "~"; arr["a", "b:c"] = 1; arr["a:b", "c"] = 2; for(i in  arr){print i, arr[i];}}'
Output:
a:b~c 2
a~b:c 1

Joined with ':', both elements get the key 'a:b:c', so the second assignment overwrites the first.
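
Because a multidimensional key is just the indexes joined by SUBSEP, the individual indexes can be recovered inside a for loop by splitting the key on SUBSEP. A short sketch, where the array name part is arbitrary:

```shell
out=$(awk 'BEGIN {
    arr["a", "b"] = 1
    for (k in arr) {
        split(k, part, SUBSEP)       # part[1] and part[2] hold the two indexes
        print part[1], part[2], arr[k]
    }
}')
echo "$out"
```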

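awk user-defined functions

Like other programming languages, awk lets you define your own functions, in the following form:

function funcName(parameter1, parameter2, parameter3, ...){
    statements;
    [return xxx;]
}
Example:
awk 'function add(a, b){a += 5; res = a + b; return res;}BEGIN{print add(10, 20)}'
Output:
35

Function definitions must be placed at the top level of the program, outside any rule; they may appear before or after the rules that call them.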
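
In the add example, res is a global variable, because awk variables are global unless they appear in the parameter list. The usual convention for a local variable is to declare it as an extra, unused parameter. The sketch below rewrites add that way and also shows that scalar arguments are passed by value; the variable x is made up for the demonstration.

```shell
out=$(awk '
# res is declared as an extra parameter, which makes it local to add()
function add(a, b,    res) {
    a += 5
    res = a + b
    return res
}
BEGIN {
    x = 10
    print add(x, 20)
    print x            # still 10: scalars are passed by value
    print "[" res "]"  # empty: res did not leak into the global scope
}')
echo "$out"
```

The extra spaces before res in the parameter list are only a readability convention that separates real parameters from locals.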
