Hive Custom Functions
2019-01-07 · 小月半会飞
I. Built-in functions provided by Hive
Hive ships with a number of built-in functions, for example for taking substrings and converting between upper and lower case.
Here we take substr as an example.
1. First, create a dummy table called dual, mimicking the dual table in SQL
create table dual(id string);
2. Prepare the data
Create a file dual.txt locally whose content is a single space or an empty line.
3. Load the data into the table
load data local inpath '/root/dual.txt' into table dual;
4. Run a test
0: jdbc:hive2://localhost:10000> select substr('sichuan',1,3) from dual;
5. The function can also be called directly, without creating a table:
select substr('sichuan',1,3);
Either way the result is sic, since substr uses a 1-based start position and a length of 3.
II. Custom functions
1. Add the Maven dependencies (hive-exec is the artifact that provides the UDF base class)
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-metastore -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-metastore</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-common -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-common</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-service -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-service</artifactId>
    <version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
</dependency>
2. Uppercase to lowercase
1) Create an UpperToLowerCase class that extends UDF and defines an evaluate method
package com.neusoft.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UpperToLowerCase extends UDF {
    /*
     * evaluate may be overloaded;
     * it must be public so Hive can find it via reflection
     */
    public String evaluate(String word) {
        String lowerWord = word.toLowerCase();
        return lowerWord;
    }
}
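Before packaging, the evaluate method can be sanity-checked with a plain main method (a minimal sketch; the test class below is purely illustrative and is not needed by Hive):

package com.neusoft.hive;

public class UpperToLowerCaseTest {

    public static void main(String[] args) {
        // should print "hello"
        System.out.println(new UpperToLowerCase().evaluate("HELLO"));
    }
}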
2) Package the project as a jar and upload it to the Hadoop cluster
Here my project, and therefore the jar, is named hive.
3) Add the jar to Hive's classpath
0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
4) Create a temporary function, giving the fully qualified class name
The fully qualified name is the package name plus the class name.
0: jdbc:hive2://localhost:10000> create temporary function tolower as 'com.neusoft.hive.UpperToLowerCase';
5) Use the temporary function
select tolower('HELLO');
3. Showing the home location of a phone number
1) Create the UDF class
package com.scu.hive;

import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

public class PhoneNumParse extends UDF {

    // phone-number prefix -> location lookup table
    static HashMap<String, String> phoneMap = new HashMap<String, String>();

    static {
        phoneMap.put("136", "beijing");
        phoneMap.put("137", "shanghai");
        phoneMap.put("138", "shenzhen");
    }

    public static String evaluate(int phoneNum) {
        String num = String.valueOf(phoneNum);
        String province = phoneMap.get(num.substring(0, 3));
        return province == null ? "foreign" : province;
    }

    // local test
    public static void main(String[] args) {
        String string = evaluate(136666);
        System.out.println(string);
    }
}
2) Rebuild the jar and upload it to the Hadoop cluster again
Here my project is still named hive.
3) Add the jar to Hive's classpath
0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
4) Create a temporary function, giving the fully qualified class name
Again, the fully qualified name is the package name plus the class name.
create temporary function getprovince as 'com.scu.hive.PhoneNumParse';
5) Create local test data
Create flow.txt with the following content:
[root@mini1 ~]# vi flow.txt
1367788,1
1367788,10
1377788,80
1377788,97
1387788,98
1387788,99
1387788,100
1555118,99
6) Create the table and load the data
0: jdbc:hive2://localhost:10000> create table flow(phonenum int,flow int)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.143 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/root/flow.txt' into table flow;
7) Query the result
0: jdbc:hive2://localhost:10000> select phonenum,getprovince(phonenum),flow from flow;
+-----------+-----------+-------+--+
| phonenum | _c1 | flow |
+-----------+-----------+-------+--+
| 1367788 | beijing | 1 |
| 1367788 | beijing | 10 |
| 1377788 | shanghai | 80 |
| 1377788 | shanghai | 97 |
| 1387788 | shenzhen | 98 |
| 1387788 | shenzhen | 99 |
| 1387788 | shenzhen | 100 |
| 1555118 | foreign | 99 |
+-----------+-----------+-------+--+
4. Developing a JSON-parsing UDF
1) Create the source data file
The content is shown below. Make sure there are no blank lines or stray spaces, otherwise parsing errors will occur.
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
2) Create the table and load the data
Create the table:
create table json(line string);
Load the data:
load data local inpath '/root/json.txt' into table json;
3) A JavaBean matching the JSON records
package com.neusoft.hive;

public class MovieRateBean {

    private String movie;
    private String rate;      // rating
    private String timeStamp;
    private String uid;

    @Override
    public String toString() {
        return this.movie + "\t" + this.rate + "\t" + this.timeStamp + "\t" + this.uid;
    }

    // getters and setters
}
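By default fastjson populates a bean like this through its public setters, so the getters and setters elided above do need to be generated (any IDE can do it). As a sketch, the pair for the movie field looks like the following; rate, timeStamp and uid follow the same pattern:

// inside MovieRateBean
public String getMovie() {
    return movie;
}

public void setMovie(String movie) {
    this.movie = movie;
}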
4) The UDF class
package com.neusoft.hive;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.TypeReference;

import org.apache.hadoop.hive.ql.exec.UDF;

public class JsonParse extends UDF {

    public String evaluate(String jsonStr) {
        // parse one JSON line into the bean and flatten it to a tab-separated string
        MovieRateBean movieRateBean = JSON.parseObject(jsonStr, new TypeReference<MovieRateBean>() {});
        return movieRateBean.toString();
    }
}
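As before, a quick local check can be done before packaging (a minimal sketch; the sample line is the first record from json.txt, and the test class is illustrative only):

package com.neusoft.hive;

public class JsonParseTest {

    public static void main(String[] args) {
        String line = "{\"movie\":\"1193\",\"rate\":\"5\",\"timeStamp\":\"978300760\",\"uid\":\"1\"}";
        // expected output: 1193, 5, 978300760 and 1, separated by tabs
        System.out.println(new JsonParse().evaluate(line));
    }
}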
5) Build the jar and upload it to the Hadoop cluster
6) Add the jar to Hive's classpath
0: jdbc:hive2://localhost:10000> add jar /root/hive.jar;
7) Add the fastjson jar to Hive's classpath as well
0: jdbc:hive2://localhost:10000> add jar /root/fastjson-1.1.41.jar;
8) Create the temporary function:
The class name must be fully qualified, i.e. the package name plus the class name.
0: jdbc:hive2://localhost:10000> create temporary function parsejson as 'com.neusoft.hive.JsonParse';
9) Run a query
0: jdbc:hive2://localhost:10000> select parsejson(line) from json limit 10;
+---------------------+--+
| _c0 |
+---------------------+--+
| 1193 5 978300760 1 |
| 661 3 978302109 1 |
| 914 3 978301968 1 |
| 3408 4 978300275 1 |
| 2355 5 978824291 1 |
| 1197 3 978302268 1 |
| 1287 5 978302039 1 |
| 2804 5 978300719 1 |
| 594 4 978302268 1 |
| 919 4 978301368 1 |
+---------------------+--+
10) Giving the fields names
As the result above shows, the data has been split into columns, but the columns have no names. We fix that by creating a table from the query:
0: jdbc:hive2://localhost:10000> create table movie as
0: jdbc:hive2://localhost:10000> select split(parsejson(line),'\t')[0] as movieid,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[1] as rate,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[2] as timestring,
0: jdbc:hive2://localhost:10000> split(parsejson(line),'\t')[3] as uid
0: jdbc:hive2://localhost:10000> from json;
Run the query again and check the result:
0: jdbc:hive2://localhost:10000> select * from movie;
+----------------+------------+-------------------+-----------+--+
| movie.movieid  | movie.rate | movie.timestring  | movie.uid |
+----------------+------------+-------------------+-----------+--+
| 919            | 4          | 978301368         | 1         |
| 594            | 4          | 978302268         | 1         |
| 2804           | 5          | 978300719         | 1         |
| 1287           | 5          | 978302039         | 1         |
| 1197           | 3          | 978302268         | 1         |
| 2355           | 5          | 978824291         | 1         |
| 3408           | 4          | 978300275         | 1         |
| 914            | 3          | 978301968         | 1         |
| 661            | 3          | 978302109         | 1         |
| 1193           | 5          | 978300760         | 1         |
+----------------+------------+-------------------+-----------+--+
III. Using the TRANSFORM keyword
If we want to turn a timestamp column into the day of the week, we do not have to implement a UDF.
We can work directly on the movie table created above and convert its third column, the timestamp, into the weekday.
1. Write a Python script
Create a Python script locally named trans.py
vi trans.py
The script code is as follows (it uses the Python 2 print statement):
import sys
import datetime
for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rating, str(weekday), userid])
2. Use the script
1) Add the script file to Hive
0: jdbc:hive2://localhost:10000> add FILE /root/trans.py;
2) Run a TRANSFORM query with the script
0: jdbc:hive2://localhost:10000> select TRANSFORM(movieid,rate,timestring,uid) USING 'python trans.py' as (mov,rat,tim,uid) from movie;
Output:
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1546821616463_0002
INFO : The url to track the job: http://hadoop4:8088/proxy/application_1546821616463_0002/
INFO : Starting Job = job_1546821616463_0002, Tracking URL = http://hadoop4:8088/proxy/application_1546821616463_0002/
INFO : Kill Command = /opt/modules/app/hadoop/bin/hadoop job -kill job_1546821616463_0002
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2019-01-05 00:35:37,847 Stage-1 map = 0%, reduce = 0%
INFO : 2019-01-05 00:35:56,316 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.55 sec
INFO : MapReduce Total cumulative CPU time: 1 seconds 550 msec
INFO : Ended Job = job_1546821616463_0002
+-------+------+------+------+--+
| mov | rat | tim | uid |
+-------+------+------+------+--+
| 1197 | 3 | 1 | 1 |
| 2355 | 5 | 7 | 1 |
| 3408 | 4 | 1 | 1 |
| 914 | 3 | 1 | 1 |
| 661 | 3 | 1 | 1 |
| 1193 | 5 | 1 | 1 |
+-------+------+------+------+--+
6 rows selected (36.201 seconds)