Hive使用总结

2018-11-14 本文已影响995人 TedHacker

Hive是基于Hadoop的一个数据仓库工具，可以将结构化的数据文件映射为一张数据库表，并提供简单的sql查询功能，它专为OLAP设计，可以执行复杂的分析操作。本文总结了Hive的常用操作。

基础类型

Hive支持的数据类型如下:

数据类型	长度	例子
TINYINT	1byte有符号整数	20
SMALLINT	2byte有符号整数	20
INT	4byte有符号整数	20
BIGINT	8byte有符号整数	20
BOOLEAN	布尔类型，true或者false	TRUE
FLOAT	单精度浮点数	3.14159
DOUBLE	双精度浮点数	3.14159
STRING	字符序列	'now is the time'
TIMESTAMP	整数，浮点数或者字符串	1327882394
BINARY	字节数组

Hive同时也支持如下算术运算符：

运算符	类型	描述
A+B	数值	A和B相加
A-B	数值	A减去B
A*B	数值	A和B相乘
A/B	数值	A除以B。如果能整除，那么返回商数。
A%B	数值	A除以B的余数。
A&B	数值	A和B按位取与。
A\|B	数值	A和B按位取或
A^B	数值	A和B按位取异或。
~A	数值	A按位取反。

类型转换

语法

CAST(expr as <type>)

例如

select CAST('1' as BIGINT);

创建表

语法

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

例如

CREATE TABLE IF NOT EXISTS `employee`(
user_id int COMMENT 'employee id',
name string COMMENT 'employee name'
)COMMENT 'employee info'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

执行结果

创建表.png

删除表

语法

DROP TABLE [IF EXISTS] table_name;

例如

DROP TABLE IF EXISTS employee;

执行结果

删除表.png

加载数据

在Hive中，一般使用LOAD DATA插入数据，数据来源一般为两种，一种是从本地文件系统，第二种是从hadoop文件系统。

语法

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

例如

LOAD DATA LOCAL INPATH '/Users/ted/Desktop/hivetest/employee' OVERWRITE INTO TABLE employee;

LoadData.png

注意：文件的分隔符要和建表语句定义的一致，不然会出现异常的null数值。

新增数据

Hive新增数据有两种语法：insert into和insert overwrite。

语法

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;

insert into 只是简单的插入，不考虑原始表的数据，直接追加到表中。

insert overwrite 会覆盖已经存在的数据，会先现将原始表的数据remove，再插入新数据。

例如

CREATE TABLE `employee_backup` as
SELECT * FROM employee;

CreateAs.png

INSERT INTO TABLE employee_backup
SELECT * FROM employee WHERE user_id=10;

InsertInto.png

INSERT OVERWRITE TABLE employee_backup
SELECT * FROM employee WHERE user_id!=10;

InsertOverwrite.png

表连接

原始数据

原始数据.png

等值连接

join即等值连接，只有某个值在两个表中同时存在才会被检索出来。

例如

select * from employee a join work_list b on a.user_id=b.user_id;

等值连接.png

左外连接

left outer join即左外连接，左边表中的值无论是否存在右表中，都会输出，但是右表中的记录只有在左表中存在时才会输出。

例如

select * from employee a left outer join work_list b on a.user_id=b.user_id;

左外连接.png

右外连接

right outer join即右外连接，右边表中的值无论是否存在左表中，都会输出，但是左表中的记录只有在右边中存在时才会输出。

例如

select * from employee a right outer join work_list b on a.user_id=b.user_id;

右外连接.png

左半连接

left semi join即左半连接，类似于存在判断。

例如

select * from employee a left semi join work_list b on a.user_id=b.user_id;

左半连接.png

逻辑类似于

select * from employee where user_id in (select user_id from work_list);

条件函数

IF

语法

IF(boolean condition, T trueValue, T falseValue)

当条件condition为TRUE时，返回trueValue；否则返回falseValue。

例如

select *,IF(name='ted',true,false) from employee;

if.png

NVL

语法

NVL(T v1,T v2)

如果v1为空那么显示v2值，如果v1的值不为空，则显示v1本来的值。

例如

select NVL(null,user_id) from employee where user_id=10;

nvl.png

COALESCE

语法

COALESCE(T v1, T v2, …)

返回参数中的第一个非空值；如果所有值都为NULL，那么返回NULL。

例如

select COALESCE(null,user_id,name) from employee where user_id=10;

coalesce.png

CASE

语法

CASE  [ expression ] 
  WHEN condition1 THEN result1 
  WHEN condition2 THEN result2 
  ... 
  WHEN conditionn THEN resultn 
  ELSE result 
END

如果expression满足某个condition，就返回对应result，不然返回ELSE里面的result。

例如

select work_id,work_time,
CASE
WHEN work_time>=60 THEN 'A'
WHEN work_time>=30 THEN 'B'
ELSE 'C'
END as grade
from work_list;

case.png

日期函数

名称	返回值类型	描述	例如
unix_timestamp()	bigint	获得当前时间戳	unix_timestamp()
unix_timestamp(string date)	bigint	获得date表示的时间戳	unix_timestamp('2018-08-08 21:00:00')
from_unixtime(int unixtime)	string	将时间戳转换为日期时间字符串	from_unixtime(1533762000)
to_date(string timestamp)	bigint	返回日期字符串	select to_date('2018-08-08 21:00:00')
year(string date)	int	返回年	year('2018-08-08 21:59:58')
month(string date)	int	返回月	month('2018-08-08 21:59:58')
day(string date)	int	返回日	day('2018-08-08 21:59:58')
dayofmonth(string date)	int	返回日	dayofmonth('2018-08-08 21:59:58')
hour(string date)	int	返回小时	hour('2018-08-08 21:59:58')
minute(string date)	int	返回分钟	minute('2018-08-08 21:59:58')
second(string date)	int	返回秒	second('2018-08-08 21:59:58')
weekofyear(string date)	int	返回周	weekofyear('2018-08-08 21:59:58')
datediff(string endDate, string startDate)	int	返回天数差	datediff('2018-08-11 21:59:58','2018-08-08 12:59:58')
date_add(string startdate, int days)	string	startDate加days天数	date_add('2018-08-08 21:59:58',5)
date_sub(string startdate, int days)	string	startDate减days天数	date_sub('2018-08-08 21:59:58',5)

字符串转换

名称	返回值类型	描述	例如
length(string s)	int	返回字符串长度	length('hello world')
reverse(string s)	string	反转字符串	reverse('hello world')
concat(string a, string b...)	string	合并字符串	concat('hello','+','world')
substr(string s, int start)	string	返回子串	substr('hello world',7)
upper(string s)	string	转换为大写	upper('hello')
lower(string s)	string	转换为小写	lower('HELLO')
trim(string s)	string	去除空格符	trim(' world ')

开窗函数

普通的聚合函数聚合的行集是组，开窗函数聚合的行集是窗口。

因此，普通的聚合函数每组只返回一个值，而开窗函数则可为窗口中的每行都返回一个值。

insert overwrite table statistics_base_orders
select id,order_id,billno,warehouse_id,warehouse_type,logis_company_id,logis_company_code,from
(select *,row_number() over (partition by order_id,billno order by db_update_time desc) as rows
from statistics_base_orders) a
where rows=1

WITH查询

Hive 可以通过with查询来提高查询性能。先通过with语法将重复使用的数据查询到内存，后面其它查询可以直接使用。

with tmp as (select * from employee where name='ted')
select * from tmp;

参考

Hive保留关键字
Hive聚合函数
《Hive编程指南》