Pig from Beginner to Master 6: Pig Latin Statements
Pig Latin statements are similar to SQL statements, but they have their own syntax. This section introduces how to use Pig Latin statements.
1. Pig Latin Statements
The commonly used Pig Latin statements are:
load: loads data and produces a relation (Pig's equivalent of a table).
order ... by ...: sorts a relation.
group ... by ...: groups records.
filter ... by ...: filters records that satisfy a condition, similar to a WHERE clause.
generate: projects columns, like specifying columns in a SELECT list.
foreach: applies an operation to every record in a relation.
join: joins relations for multi-table queries.
union: set operation that computes the union of two relations.
dump: prints the data to the screen.
store: saves the data to HDFS.
Notes:
(1) Pig Latin statements are very similar to Spark's operators (APIs).
(2) Running Pig Latin statements requires the HistoryServer (the MapReduce JobHistoryServer) to be started.
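The statements listed above can be chained into a complete script. The following is a minimal sketch of a typical load → filter → foreach → order → store pipeline; the output path and the salary threshold are hypothetical, while the input file and schema follow the example data used later in this section:

```pig
-- Load the data, declaring a schema (comma-delimited)
emp = load '/input/emp.csv' using PigStorage(',')
      as (empno:int, ename:chararray, job:chararray, mgr:int,
          hiredate:chararray, sal:int, comm:int, deptno:int);
-- Keep only records satisfying a condition (like SQL's WHERE)
high_paid = filter emp by sal > 2000;
-- Project the columns we need (like a SELECT list)
names = foreach high_paid generate ename, sal;
-- Sort by salary, descending
sorted = order names by sal desc;
-- Write the result back to HDFS
store sorted into '/output/high_paid' using PigStorage(',');
```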
2. Hands-On Examples
2.1 Preparing the Environment
(1) Start the Hadoop cluster
[root@bigdata ~]# start-all.sh
[root@bigdata ~]# jps
2096 NameNode
2422 SecondaryNameNode
2232 DataNode
2586 ResourceManager
2813 NodeManager
3037 Jps
(2) Start the HistoryServer
[root@bigdata ~]# mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /root/trainings/hadoop-2.7.3/logs/mapred-root-historyserver-bigdata.out
[root@bigdata ~]# jps
2096 NameNode
3123 Jps
2422 SecondaryNameNode
2232 DataNode
2586 ResourceManager
3084 JobHistoryServer
2813 NodeManager
(3) Start Pig in cluster (MapReduce) mode
[root@bigdata ~]# pig
18/09/26 00:02:32 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
18/09/26 00:02:32 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
18/09/26 00:02:32 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2018-09-26 00:02:32,804 [main] INFO org.apache.pig.Main - Apache Pig version 0.17.0 (r1797386) compiled Jun 02 2017, 15:41:58
2018-09-26 00:02:32,804 [main] INFO org.apache.pig.Main - Logging error messages to: /root/pig_1537891352803.log
2018-09-26 00:02:32,830 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /root/.pigbootup not found
2018-09-26 00:02:33,289 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2018-09-26 00:02:33,289 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://bigdata:9000
2018-09-26 00:02:33,812 [main] INFO org.apache.pig.PigServer - Pig Script ID for the session: PIG-default-4db6a82f-0910-4950-a889-e1d7ee031cce
2018-09-26 00:02:33,812 [main] WARN org.apache.pig.PigServer - ATS is disabled since yarn.timeline-service.enabled set to false
grunt>
(4) Upload the test data to HDFS
grunt> copyFromLocal /root/input/emp.csv /input
grunt> copyFromLocal /root/input/dept.csv /input
grunt> ls /input
hdfs://bigdata:9000/input/dept.csv<r 1> 84
hdfs://bigdata:9000/input/emp.csv<r 1> 617
grunt> cat /input/dept.csv
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON
grunt> cat /input/emp.csv
7369,SMITH,CLERK,7902,1980/12/17,800,,20
7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30
7566,JONES,MANAGER,7839,1981/4/2,2975,,20
7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30
7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30
7782,CLARK,MANAGER,7839,1981/6/9,2450,,10
7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20
7839,KING,PRESIDENT,,1981/11/17,5000,,10
7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30
7876,ADAMS,CLERK,7788,1987/5/23,1100,,20
7900,JAMES,CLERK,7698,1981/12/3,950,,30
7902,FORD,ANALYST,7566,1981/12/3,3000,,20
7934,MILLER,CLERK,7782,1982/1/23,1300,,10
2.2 Examples
Example 1: Create the employee relation, specifying its schema, data types, and delimiter (the default delimiter is tab; CSV files are comma-delimited).
grunt> emp = load '/input/emp.csv' using PigStorage(',') as
(empno:int,ename:chararray,job:chararray,mgr:int,hiredate:chararray,sal:int,comm:int,deptno:int);
View the schema (the output uses a JSON-like format that can be nested; the default column type is bytearray):
grunt> describe emp;
emp: {empno: int,ename: chararray,job: chararray,mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int}
View the data:
grunt> dump emp;
(MapReduce job log omitted)
(7369,SMITH,CLERK,7902,1980/12/17,800,,20)
(7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30)
(7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30)
(7566,JONES,MANAGER,7839,1981/4/2,2975,,20)
(7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30)
(7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30)
(7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)
(7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20)
(7839,KING,PRESIDENT,,1981/11/17,5000,,10)
(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30)
(7876,ADAMS,CLERK,7788,1987/5/23,1100,,20)
(7900,JAMES,CLERK,7698,1981/12/3,950,,30)
(7902,FORD,ANALYST,7566,1981/12/3,3000,,20)
(7934,MILLER,CLERK,7782,1982/1/23,1300,,10)
Create the department relation:
grunt> dept = load '/input/dept.csv' using PigStorage(',') as (deptno:int,dname:chararray,loc:chararray);
grunt> describe dept;
dept: {deptno: int,dname: chararray,loc: chararray}
grunt> dump dept;
(MapReduce job log omitted)
(10,ACCOUNTING,NEW YORK)
(20,RESEARCH,DALLAS)
(30,SALES,CHICAGO)
(40,OPERATIONS,BOSTON)
Example 2: Query employee information: employee number, name, and salary.
grunt> emp1 = foreach emp generate empno,ename,sal;
grunt> dump emp1;
(MapReduce job log omitted)
(7369,SMITH,800)
(7499,ALLEN,1600)
(7521,WARD,1250)
(7566,JONES,2975)
(7654,MARTIN,1250)
(7698,BLAKE,2850)
(7782,CLARK,2450)
(7788,SCOTT,3000)
(7839,KING,5000)
(7844,TURNER,1500)
(7876,ADAMS,1100)
(7900,JAMES,950)
(7902,FORD,3000)
(7934,MILLER,1300)
Example 3: Query the employees in department 10:
grunt> emp2 = filter emp by deptno == 10; -- note: use two equals signs for comparison
grunt> dump emp2;
(MapReduce job log omitted)
(7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)
(7839,KING,PRESIDENT,,1981/11/17,5000,,10)
(7934,MILLER,CLERK,7782,1982/1/23,1300,,10)
Example 4: Query employee information sorted by monthly salary:
grunt> emp3 = order emp by sal;
grunt> dump emp3;
(MapReduce job log omitted)
(7369,SMITH,CLERK,7902,1980/12/17,800,,20)
(7900,JAMES,CLERK,7698,1981/12/3,950,,30)
(7876,ADAMS,CLERK,7788,1987/5/23,1100,,20)
(7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30)
(7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30)
(7934,MILLER,CLERK,7782,1982/1/23,1300,,10)
(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30)
(7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30)
(7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)
(7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30)
(7566,JONES,MANAGER,7839,1981/4/2,2975,,20)
(7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20)
(7902,FORD,ANALYST,7566,1981/12/3,3000,,20)
(7839,KING,PRESIDENT,,1981/11/17,5000,,10)
Example 5: Find the maximum salary in each department:
grunt> emp41 = group emp by deptno;
grunt> describe emp41;
emp41: {group: int,emp: {(empno: int,ename: chararray,job: chararray,
mgr: int,hiredate: chararray,sal: int,comm: int,deptno: int)}}
Note: the schema shows that the relation emp41 contains a nested bag of emp records.
grunt> dump emp41;
(MapReduce job log omitted)
(10,{(7934,MILLER,CLERK,7782,1982/1/23,1300,,10),
(7839,KING,PRESIDENT,,1981/11/17,5000,,10),
(7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)})
(20,{(7876,ADAMS,CLERK,7788,1987/5/23,1100,,20),
(7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20),
(7369,SMITH,CLERK,7902,1980/12/17,800,,20),
(7566,JONES,MANAGER,7839,1981/4/2,2975,,20),
(7902,FORD,ANALYST,7566,1981/12/3,3000,,20)})
(30,{(7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30),
(7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30),
(7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30),
(7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30),
(7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30),
(7900,JAMES,CLERK,7698,1981/12/3,950,,30)})
grunt> emp42 = foreach emp41 generate group,MAX(emp.sal); -- note: built-in function names must be uppercase
grunt> describe emp42;
emp42: {group: int,int}
grunt> dump emp42;
(MapReduce job log omitted)
(10,5000)
(20,3000)
(30,2850)
Example 6: Multi-table query: query the department name and employee name.
grunt> emp51 = join dept by deptno, emp by deptno;
grunt> describe emp51;
emp51: {dept::deptno: int,dept::dname: chararray,dept::loc: chararray,
emp::empno: int,emp::ename: chararray,emp::job: chararray,emp::mgr: int,
emp::hiredate: chararray,emp::sal: int,emp::comm: int,emp::deptno: int}
grunt> dump emp51;
(MapReduce job log omitted)
(10,ACCOUNTING,NEW YORK,7934,MILLER,CLERK,7782,1982/1/23,1300,,10)
(10,ACCOUNTING,NEW YORK,7839,KING,PRESIDENT,,1981/11/17,5000,,10)
(10,ACCOUNTING,NEW YORK,7782,CLARK,MANAGER,7839,1981/6/9,2450,,10)
(20,RESEARCH,DALLAS,7876,ADAMS,CLERK,7788,1987/5/23,1100,,20)
(20,RESEARCH,DALLAS,7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20)
(20,RESEARCH,DALLAS,7369,SMITH,CLERK,7902,1980/12/17,800,,20)
(20,RESEARCH,DALLAS,7566,JONES,MANAGER,7839,1981/4/2,2975,,20)
(20,RESEARCH,DALLAS,7902,FORD,ANALYST,7566,1981/12/3,3000,,20)
(30,SALES,CHICAGO,7844,TURNER,SALESMAN,7698,1981/9/8,1500,0,30)
(30,SALES,CHICAGO,7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30)
(30,SALES,CHICAGO,7698,BLAKE,MANAGER,7839,1981/5/1,2850,,30)
(30,SALES,CHICAGO,7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30)
(30,SALES,CHICAGO,7521,WARD,SALESMAN,7698,1981/2/22,1250,500,30)
(30,SALES,CHICAGO,7900,JAMES,CLERK,7698,1981/12/3,950,,30)
grunt> emp52 = foreach emp51 generate dept::dname, emp::ename;
grunt> dump emp52;
(ACCOUNTING,MILLER)
(ACCOUNTING,KING)
(ACCOUNTING,CLARK)
(RESEARCH,ADAMS)
(RESEARCH,SCOTT)
(RESEARCH,SMITH)
(RESEARCH,JONES)
(RESEARCH,FORD)
(SALES,TURNER)
(SALES,ALLEN)
(SALES,BLAKE)
(SALES,MARTIN)
(SALES,WARD)
(SALES,JAMES)
Example 7: Query the employees in departments 10 and 20 and save the result to HDFS.
grunt> emp61 = filter emp by deptno == 10;
grunt> emp62 = filter emp by deptno == 20;
grunt> emp6 = union emp61, emp62;
grunt> store emp6 into '/output/emp6.txt' using PigStorage(',');
Output(s):
Successfully stored 8 records (334 bytes) in: "/output/emp6.txt"
grunt> ls /output/emp6.txt
hdfs://bigdata:9000/output/emp6.txt/_SUCCESS<r 1> 0
hdfs://bigdata:9000/output/emp6.txt/part-m-00000<r 1> 209
hdfs://bigdata:9000/output/emp6.txt/part-m-00001<r 1> 125
grunt> cat /output/emp6.txt
7369,SMITH,CLERK,7902,1980/12/17,800,,20
7566,JONES,MANAGER,7839,1981/4/2,2975,,20
7788,SCOTT,ANALYST,7566,1987/4/19,3000,,20
7876,ADAMS,CLERK,7788,1987/5/23,1100,,20
7902,FORD,ANALYST,7566,1981/12/3,3000,,20
7782,CLARK,MANAGER,7839,1981/6/9,2450,,10
7839,KING,PRESIDENT,,1981/11/17,5000,,10
7934,MILLER,CLERK,7782,1982/1/23,1300,,10
Among Pig Latin statements, only dump and store trigger MapReduce jobs; the other statements do not. This is similar to Spark's lazy evaluation: in Spark, only actions trigger computation, while transformations do not.
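This lazy behavior is easy to observe in the grunt shell: defining aliases returns immediately because Pig only builds a logical plan, and a job is compiled and submitted only when dump or store is issued. A minimal sketch using the emp data from this section:

```pig
-- None of these lines starts a MapReduce job; Pig only records the plan
emp  = load '/input/emp.csv' using PigStorage(',')
       as (empno:int, ename:chararray, job:chararray, mgr:int,
           hiredate:chararray, sal:int, comm:int, deptno:int);
emp2 = filter emp by deptno == 10;
emp3 = foreach emp2 generate ename, sal;
-- Only this statement triggers the MapReduce computation
dump emp3;
```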