Hive Supplementary Notes
2019-04-20
蓝Renly
1. Overview
1. Hive is a MapReduce client: it spares you from writing tedious MapReduce programs;
2. metastore: keeps the mapping between the original datasets and their field names/table metadata;
3. Server-Hadoop: while working with Hive, Hadoop's HDFS and YARN must be running and MapReduce must be configured.
2. Avoiding MapReduce for Simple Queries (fetch task)
Setting:
<property>
<name>hive.fetch.task.conversion</name>
<value>more</value>
<description>
Some select queries can be converted to single FETCH task minimizing latency.
Currently the query should be single sourced not having any subquery and should not have
any aggregations or distincts (which incurs RS), lateral views and joins.
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (TABLESAMPLE, virtual columns)
</description>
</property>
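The same switch can also be toggled per session from the Hive CLI; with more, a plain SELECT/FILTER/LIMIT query is answered as a fetch task instead of launching a MapReduce job (emp is the sample table used elsewhere in these notes):
hive> set hive.fetch.task.conversion=more;
hive> SELECT * FROM emp LIMIT 10;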
3. Hive Metastore
Common errors: when starting Hive, the metastore database cannot be initialized, connections cannot be created, or sessions cannot be created. Typical causes:
1. The Hive metastore database was lost, e.g. it was dropped or its files are corrupted
2. The metastore version number does not match
3. Problems with the remote metastore service
Basic backup syntax:
$ mysqldump -uroot -p metastore > metastore.sql
Basic restore syntax:
$ mysql -uroot -p metastore < metastore.sql
Use the find command to locate where metastore.sql is stored by default
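A minimal sketch of that lookup (the search root and error redirection are just one reasonable choice):
$ find / -name 'metastore.sql' 2>/dev/null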
4. Hive Job Scheduling
Oozie / Azkaban / crontab
hive -e "<HQL statement>" ;
hive -f <filename>.hql ;
# Examples
bin/hive -e "SELECT * FROM emp";
bin/hive -f emp.hql;
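For the crontab route, an entry along these lines runs an HQL file every evening (the 18:00 schedule comes from the case study later in these notes; the HQL and log paths are illustrative):
0 18 * * * /opt/modules/cdh/hive-0.13.1-cdh5.3.6/bin/hive -f /opt/modules/hql/emp.hql >> /tmp/hive_cron.log 2>&1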
5. Where the Command History Is Stored
cat ~/.hivehistory;
6. Settings that Only Take Effect for the Current Session
set hive.cli.print.header = false;
7.HiveServer2
Configuration (hive-site.xml):
hive.server2.thrift.port --> 10000
hive.server2.thrift.bind.host --> jdf.node1.com
hive.server2.long.polling.timeout --> 5000 (drop the trailing "L" from the default value)
hive.server2.enable.doAs --> false
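Spelled out in hive-site.xml property form (the same XML format as the fetch-task property in section 2; the values are the ones listed above):
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>jdf.node1.com</value>
</property>
<property>
<name>hive.server2.long.polling.timeout</name>
<value>5000</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>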
Check the port:
$ sudo netstat -antp | grep 10000
Start the service:
$ bin/hive --service hiveserver2
Connect to the service:
$ bin/beeline
beeline> !connect jdbc:hive2://jdf.node1.com:10000
Heads-up: MR jobs cannot be submitted at this point; they fail with:
Job Submission failed with exception 'org.apache.hadoop.security.AccessControlException(Permission denied: user=anonymous, access=EXECUTE, inode="/tmp/hadoop-yarn":admin:supergroup:drwxrwx---
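One common workaround (an assumption, not something stated in the original notes) is to stop connecting as anonymous and authenticate as an OS/HDFS user that can access /tmp/hadoop-yarn, for example:
beeline> !connect jdbc:hive2://jdf.node1.com:10000 admin <password>
Alternatively, relax the permissions on that HDFS directory before submitting MR queries.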
8.UDF
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>0.13.1</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>0.13.1</version>
</dependency>
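These dependencies are what a custom UDF is compiled against; once the class is packaged into a jar, registering and calling it from the Hive CLI typically looks like the sketch below (the jar path, class name, and function/column names are hypothetical placeholders):
hive> add jar /opt/modules/jars/my_udf.jar;
hive> create temporary function my_lower as 'com.example.hive.udf.MyLower';
hive> select my_lower(name) from emp;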
9. Creating Tables in Hive
# Create a brand-new partitioned table (partition columns are declared only in PARTITIONED BY)
CREATE TABLE IF NOT EXISTS mydb.tb1(id int, size string)
PARTITIONED BY (date string, hour string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
# Create a table from an existing table's columns (the data is copied as well)
CREATE TABLE IF NOT EXISTS mydb.tb2 AS SELECT * FROM tb1;
# Create a table with the same schema as an existing one (no data)
CREATE TABLE IF NOT EXISTS mydb.tb3 LIKE mydb.tb1;
10. Importing / Exporting Data
Importing data
# Load from the local filesystem
load data local inpath '../../' into table table_name;
# Load from HDFS
load data inpath 'hdfs_path/../../' into table table_name;
# Load with overwrite
load data [local] inpath '../../' overwrite into table table_name;
# Create a table from a query (CTAS)
create table tb1 as select * from mydb.tb1;
# Append with INSERT
insert into table table_name select * from tb1;
# Overwrite with INSERT
insert overwrite table table_name select * from tb1;
Exporting data
# Export to the local filesystem
insert overwrite local directory '../../..'
row format delimited fields terminated by '\t'
select * from mydb.t1;
# Export to HDFS
insert overwrite directory '../../..' select * from t1;
# Overwrite/append export via bash redirection
$ bin/hive -e "select * from mydb.t1" > /../../..
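Whether the redirection overwrites or appends is decided by the shell operator, and column headers can be added with the setting from section 6; a small sketch (output paths are placeholders):
$ bin/hive --hiveconf hive.cli.print.header=true -e "select * from mydb.t1" > /tmp/t1.tsv    # overwrite, with header
$ bin/hive -e "select * from mydb.t1" >> /tmp/t1.tsv                                         # append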
Script for automatically loading log data
#!/bin/bash
# Source the system profile to initialize environment variables
. /etc/profile
# Hive installation directory
HIVE_DIR=/opt/modules/cdh/hive-0.13.1-cdh5.3.6
# Directory where the web logs are stored
WEB_LOG=/opt/modules/weblog
# Yesterday's date, used to locate the log directory
YESTERDAY=$(date --date="1 day ago" +%Y%m%d)
# Iterate over yesterday's log files
for i in `ls $WEB_LOG/$YESTERDAY`
do
DATE=${i:0:8}
HOUR=${i:8:2}
#$HIVE_DIR/bin/hive -e "load data local inpath "$WEB_LOG/$YESTERDAY/$i"
#into table db_web_log.track_log partition (date='$DATE', hour='$HOUR')"
$HIVE_DIR/bin/hive \
--hiveconf LOADFILE_NEW=$WEB_LOG/$YESTERDAY/$i \
--hiveconf DATE_NEW=$DATE \
--hiveconf HOUR_NEW=$HOUR \
-f $HIVE_DIR/hql/auto.hql
done
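auto.hql itself is not listed above; judging from the commented-out hive -e line and the --hiveconf variables the script passes in, it is presumably something along these lines (a sketch, not the original file):
load data local inpath '${hiveconf:LOADFILE_NEW}'
into table db_web_log.track_log
partition (date='${hiveconf:DATE_NEW}', hour='${hiveconf:HOUR_NEW}');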
Case Study
* Business case walk-through
Requirement: run a periodic job every day at 6 p.m. that executes an automated script to load yesterday's log files into HDFS, then analyze the site's multidimensional data (PV and UV, grouped by province and hour). Store the result in a temporary Hive table (columns: date, hour, provinceId, pv, uv), and finally copy everything in that temporary table to MySQL so back-end developers can query and display it the next day.
1. Periodically load the local data into HDFS; this involves auto.sh and crontab.
2. Clean the data, package the cleaning job as a jar, and run it on a schedule; the cleaned output lands under paths such as:
/user/hive/warehouse/db_web_data.db/track_log/date=20150828/hour=18
part-000001
/user/hive/warehouse/db_web_data.db/track_log/date=20150828/hour=19
part-000001
3. Create the track_log table; the partitions do not need to exist up front, just point them at the cleaned data afterwards:
alter table track_log add partition(date='20150828',hour='18') location
"/user/hive/warehouse/db_web_data.db/track_log/date=20150828/hour=18";
alter table track_log add partition(date='20150828',hour='19') location
"/user/hive/warehouse/db_web_data.db/track_log/date=20150828/hour=19";
4. Run the analysis and store the result in a temporary Hive table.
Create the temporary table:
create table if not exists temp_track_log(date string, hour string, provinceId string, pv string, uv string) row format delimited fields terminated by '\t';
Insert data into the temporary table:
insert overwrite table temp_track_log select date, hour, provinceId, count(url) pv, count(distinct guid) uv from track_log where date='20150828' group by date, hour, provinceId;
5. Use a custom JAR to load the locally exported file into MySQL, or use Sqoop (a sketch follows below).
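A hedged Sqoop version of step 5 (the MySQL database name, the export directory, and the assumption that a matching temp_track_log table already exists in MySQL are all illustrative; the flags mirror the export example in section 11):
$ bin/sqoop export \
--connect jdbc:mysql://hadoop-senior01.itguigu.com:3306/db_web_log \
--username root \
--password 123456 \
--table temp_track_log \
--num-mappers 1 \
--export-dir /user/hive/warehouse/temp_track_log \
--input-fields-terminated-by "\t"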
11.Sqoop
I. SQL-TO-HADOOP
II. Configuration:
1. Start Zookeeper
2. Start the cluster services
3. Configuration files:
** sqoop-env.sh
#export HADOOP_COMMON_HOME=
export HADOOP_COMMON_HOME=/opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/
#Set path to where hadoop-*-core.jar is available
#export HADOOP_MAPRED_HOME=
export HADOOP_MAPRED_HOME=/opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/
#set the path to where bin/hbase is available
#export HBASE_HOME=
#Set the path to where bin/hive is available
#export HIVE_HOME=
export HIVE_HOME=/opt/modules/cdh/hive-0.13.1-cdh5.3.6/
#Set the path for where zookeper config dir is
#export ZOOCFGDIR=
export ZOOCFGDIR=/opt/modules/cdh/zookeeper-3.4.5-cdh5.3.6/
export ZOOKEEPER_HOME=/opt/modules/cdh/zookeeper-3.4.5-cdh5.3.6/
4. Copy the JDBC driver into Sqoop's lib directory
cp -a mysql-connector-java-5.1.27-bin.jar /opt/modules/cdh/sqoop-1.4.5-cdh5.3.6/lib/
5. Start Sqoop
$ bin/sqoop help    # view the help
6. Test whether Sqoop can connect successfully
$ bin/sqoop list-databases --connect jdbc:mysql://hadoop-senior01.itguigu.com:3306/metastore --username root \
--password 123456
III. Examples
Use Sqoop to import data from MySQL into HDFS
Step1. Make sure the MySQL service is running
Step2. Create a table in MySQL
mysql> create database company;
mysql> create table staff(
id int(4) primary key not null auto_increment,
name varchar(255) not null,
sex varchar(255) not null);
mysql> insert into staff(name, sex) values('Thomas', 'Male');
Step3. Move the data
RDBMS --> HDFS
Import data into HDFS with Sqoop
** Full import
$ bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.itguigu.com:3306/company \
--username root \
--password 123456 \
--table staff \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"
** Query import
$ bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.itguigu.com:3306/company \
--username root \
--password 123456 \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query 'select name,sex from staff where id >= 2 and $CONDITIONS'
** Import specific columns
$ bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.itguigu.com:3306/company \
--username root \
--password 123456 \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--columns id,sex \
--table staff
** Filtered import with a WHERE clause
$ bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.itguigu.com:3306/company \
--username root \
--password 123456 \
--target-dir /user/company \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--table staff \
--where "id=3"
RDBMS --> Hive
1. Create the table in Hive (not strictly required up front; Sqoop can create it automatically)
hive (company)> create table staff_hive(id int, name string, sex string) row format delimited fields terminated by '\t';
2. Import the data into Hive
$ bin/sqoop import \
--connect jdbc:mysql://hadoop-senior01.itguigu.com:3306/company \
--username root \
--password 123456 \
--table staff \
--num-mappers 1 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-table company.staff_hive
Hive/HDFS --> RDBMS
1. Create the target table in MySQL first (see the sketch after this section), then run the export:
$ bin/sqoop export \
--connect jdbc:mysql://hadoop-senior01.itguigu.com:3306/company \
--username root \
--password 123456 \
--table staff_mysql \
--num-mappers 1 \
--export-dir /user/hive/warehouse/company.db/staff_hive \
--input-fields-terminated-by "\t"
Package the Sqoop command in an .opt file, then execute it
Step1. Create a .opt file
Step2. Write the Sqoop script (in an options file, each option and its value go on their own line)
export
--connect
jdbc:mysql://hadoop-senior01.itguigu.com:3306/company
--username
root
--password
123456
--table
staff_mysql
--num-mappers
1
--export-dir
/user/hive/warehouse/company.db/staff_hive
--input-fields-terminated-by
"\t"
Step3. Run the script
$ bin/sqoop --options-file opt/job_hffs2rdbms.opt
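One step glossed over above: before the Hive-to-MySQL export, the staff_mysql table has to exist on the MySQL side. A minimal sketch, assuming its schema mirrors staff/staff_hive:
mysql> use company;
mysql> create table staff_mysql(
id int(4) primary key not null auto_increment,
name varchar(255) not null,
sex varchar(255) not null);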