Hive Installation and Operations

2018-01-26  BlackChen

Installation

Metadata store selection

By default Hive uses the embedded Derby database, which cannot serve multiple users at the same time and is mostly used for testing.
For production environments, store the metadata in a MySQL database.

HDFS data warehouse directory
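Hive stores table data under a warehouse directory in HDFS (by default /user/hive/warehouse, controlled by hive.metastore.warehouse.dir). A typical preparation step, assuming the default path:

hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /user/hive/warehouse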

Hive installation

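The unpack-and-configure steps are not shown in the original; a typical install (a sketch, paths assumed to match the log output later in this article) unpacks the release and points HIVE_HOME at it:

tar -zxvf apache-hive-1.2.2-bin.tar.gz -C /home/hadoop/apps
# append to /etc/profile (hypothetical paths):
export HIVE_HOME=/home/hadoop/apps/apache-hive-1.2.2-bin
export PATH=$PATH:$HIVE_HOME/bin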
Make the environment variables take effect:
source /etc/profile

Switch to the hadoop user.
Edit $HIVE_HOME/conf/hive-site.xml (create the file if it does not exist):

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://192.168.183.101:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive123</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>

Note: change the IP address and password to match your environment. Also make sure the MySQL JDBC driver jar (mysql-connector-java-*.jar) is placed in $HIVE_HOME/lib, otherwise Hive cannot connect to the metastore.
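The MySQL side also needs the database user prepared. A minimal setup matching the credentials above (a sketch; MySQL 5.x syntax, and latin1 matters here — see the troubleshooting section at the end):

mysql> create database hive character set latin1;
mysql> grant all privileges on hive.* to 'hive'@'%' identified by 'hive123';
mysql> flush privileges;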

Start Hive:
$HIVE_HOME/bin/hive

[hadoop@hadoop4 bin]$ hive

Logging initialized using configuration in jar:file:/home/hadoop/apps/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
hive>

Startup succeeded.

To start Hive with debug logging printed to the console:
hive -hiveconf hive.root.logger=DEBUG,console

Hive operations

hive> show databases;
OK
default
Time taken: 0.02 seconds, Fetched: 1 row(s)
Create a table (the session had evidently switched to a database named mytestdb first, e.g. via use mytestdb;):

create table user_info(
user_id string,
area_id string,
age int,
occupation string
)
row format delimited fields terminated by '\t' 
lines terminated by '\n'
stored as textfile; 

After the table is created, a matching directory is created in HDFS:
/user/hive/warehouse/mytestdb.db/user_info

Create an internal (managed) table

create table student_info(
student_id string comment 'student id',
name string comment 'name',
age int comment 'age',
origin string comment 'origin region'
)
comment 'student information table'
row format delimited 
fields terminated by '\t' 
lines terminated by '\n'
stored as textfile; 
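The load step is not shown in the original; the file was presumably loaded from the local filesystem first, e.g. (path hypothetical):

load data local inpath '/home/hadoop/student_info_data.txt' into table student_info;

Afterwards the file appears under the table's HDFS directory: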
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/mydb.db/student_info
Found 1 items
-rwxrwxrwx   3 hadoop supergroup        341 2018-01-26 10:34 /user/hive/warehouse/mydb.db/student_info/student_info_data.txt
hive> select * from student_info where origin='11'
    > ;
OK
1   xiaoming    20  11
6   zhangsan    20  11
7   lisi    19  11
Time taken: 0.473 seconds, Fetched: 3 row(s)
hive>

Load data in append mode:
load data inpath '/student_info_data.txt' into table student_info;
Two student_info_data.txt files now appear under the table's HDFS directory:

[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/mydb.db/student_info
Found 2 items
-rwxrwxrwx   3 hadoop supergroup        341 2018-01-26 10:34 /user/hive/warehouse/mydb.db/student_info/student_info_data.txt
-rwxrwxrwx   3 hadoop supergroup        341 2018-01-26 10:39 /user/hive/warehouse/mydb.db/student_info/student_info_data_copy_1.txt

In addition, /student_info_data.txt in HDFS is moved (cut), not copied, into the table's HDFS path /user/hive/warehouse/mydb.db/student_info.

Load data in overwrite mode:
load data inpath '/student_info_data.txt' overwrite into table student_info;
This replaces the table's existing data.

Complex data types

create table employee(
user_id string,
salary int,
worked_citys array<string>,
social_security map<string,float>,
welfare struct<meal_allowance:float,if_regular:boolean,commercial_insurance:float>
)
row format delimited fields terminated by '\t' 
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile; 
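For reference, a matching data row would look like the line below (a sketch reconstructed from the table's delimiters and the query output that follows; the original data file is not shown). Columns are separated by tabs, array items by ',', and map keys from values by ':':

zhangsan    10800    beijing,shanghai    养老:1000.0,医疗:600.0    2000.0,true,500.0

load data local inpath '/home/hadoop/employee_data.txt' into table employee;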
hive> select * from employee;
OK
zhangsan    10800   ["beijing","shanghai"]  {"养老":1000.0,"医疗":600.0}    {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
lisi    20000   ["beijing","nanjing"]   {"养老":2000.0,"医疗":1200.0}   {"meal_allowance":2000.0,"if_regular":false,"commercial_insurance":500.0}
wangwu  17000   ["shanghai","nanjing"]  {"养老":1800.0,"医疗":1100.0}   {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
hive> select user_id,
    > salary,
    > worked_citys[0],
    > social_security['养老'],
    > welfare.meal_allowance
    > from employee
    > where welfare.if_regular=true;
OK
zhangsan    10800   beijing 1000.0  2000.0
wangwu  17000   shanghai    1800.0  2000.0

Create an external table

The HDFS path can be created in advance:
hadoop fs -mkdir -p /user/hive/warehouse/data/student_school_info
If it is not created ahead of time, it will be created automatically at the specified location when the external table is created.

create external table rel.student_school_info(
student_id string,
name string,
institute_id string,
major_id string,
school_year string
)
row format delimited 
fields terminated by '\t' 
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info';
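Because the external table simply points at that location, it can also be populated by placing files there directly with HDFS commands (filename hypothetical):

hadoop fs -put student_school_info_data.txt /user/hive/warehouse/data/student_school_info/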

Create an internal partitioned table

Create a student enrollment table.
Columns: student id, name, institute id
Partition column: major id

create table student_school_info_partition_maj(
student_id string,
name string,
institute_id string
)
partitioned by(major_id string) 
row format delimited
fields terminated by '\t' 
lines terminated by '\n'
stored as textfile; 
Static partition insert (note the partition value must be quoted):
insert into table student_school_info_partition_maj partition(major_id='bigdata')
select t1.student_id,t1.name,t1.institute_id
from student_school_info t1
where t1.major_id = 'bigdata';
Dynamic partition insert:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table student_school_info_partition_maj partition(major_id)
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1;
hive> show partitions student_school_info_partition_maj;
OK
major_id=bigdata
major_id=computer
major_id=software
Time taken: 0.114 seconds, Fetched: 3 row(s)
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition_maj
Found 3 items
drwxrwxrwx   - hadoop supergroup          0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=bigdata
drwxrwxrwx   - hadoop supergroup          0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=computer
drwxrwxrwx   - hadoop supergroup          0 2018-01-26 11:30 /user/hive/warehouse/rel.db/student_school_info_partition_maj/major_id=software

Three directories are added, one per partition, each holding that partition's data.
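Queries that filter on the partition column read only the matching directory (partition pruning), e.g.:

select * from student_school_info_partition_maj where major_id='bigdata';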

Create an external partitioned table

Create a student enrollment table.
Columns: student id, name, institute id
Partition column: major id

create external table rel.student_school_info_external_partition_maj(
student_id string,
name string,
institute_id string
)
partitioned by(major_id string) 
row format delimited 
fields terminated by '\t' 
lines terminated by '\n'
stored as textfile
location '/user/hive/warehouse/data/student_school_info_external_partition_maj';

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table student_school_info_external_partition_maj partition(major_id)
select t1.student_id,t1.name,t1.institute_id,t1.major_id
from student_school_info t1;
Dropping the internal partitioned table removes its HDFS data as well:
drop table student_school_info_partition_maj;

[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/rel.db/
Found 1 items
drwxrwxrwx   - hadoop supergroup          0 2018-01-26 11:23 /user/hive/warehouse/rel.db/student_school_info_partition
hive> drop table student_school_info_external_partition_maj;
OK
Time taken: 0.63 seconds
hive> show tables;
OK
student_school_info
student_school_info_partition
Time taken: 0.027 seconds, Fetched: 2 row(s)

Check HDFS: dropping the external table removed only its metadata, and the data files still exist:

[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/
Found 2 items
drwxrwxrwx   - hadoop supergroup          0 2018-01-26 11:06 /user/hive/warehouse/data/student_school_info
drwxrwxrwx   - hadoop supergroup          0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition_maj
Found 3 items
drwxrwxrwx   - hadoop supergroup          0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=bigdata
drwxrwxrwx   - hadoop supergroup          0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=computer
drwxrwxrwx   - hadoop supergroup          0 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software
[hadoop@hadoop1 sbin]$ hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software
Found 1 items
-rwxrwxrwx   3 hadoop supergroup         46 2018-01-26 11:47 /user/hive/warehouse/data/student_school_info_external_partition_maj/major_id=software/000000_0

Create tables with LIKE and AS, rename tables, add/modify/drop columns
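The original leaves this section without examples; typical forms are (a sketch, with hypothetical table and column names):

-- copy only the schema of an existing table
create table student_info_like like student_info;
-- create a table from a query result (schema and data)
create table student_info_as as select student_id,name from student_info;
-- rename a table
alter table student_info_as rename to student_info_new;
-- add columns
alter table student_info_new add columns (phone string comment 'phone number');
-- rename/retype a column
alter table student_info_new change phone mobile string;
-- drop or redefine columns by replacing the full column list
alter table student_info_new replace columns (student_id string, name string);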

Create a bucketed table

Rows are assigned to buckets by the hash of the specified column.
Create a bucketed student information table.
Columns: student id, name, age, origin
Bucketing: by student id into 4 buckets, sorted within each bucket by student id ascending

create table rel.student_info_bucket(
student_id string,
name string,
age int,
origin string
)
clustered by (student_id) sorted by (student_id asc) into 4 buckets 
row format delimited 
fields terminated by '\t' 
lines terminated by '\n' 
stored as textfile;
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket 
select student_id,name,age,origin 
from student_info 
cluster by(student_id);

A bucketed table should generally not be populated with load: load only copies the file into the table's storage directory; Hive does not parse the data and split it by the bucketing column, so the whole file is imported into the table unbucketed. Instead, insert data into a bucketed table from another table.

Creating the table only defines its layout; each bucketed insert additionally requires the following in the current session:
- set hive.enforce.bucketing = true; to declare that this is a bucketed insert;
- set the reducer count equal to the number of buckets (set mapreduce.job.reduces=4;) so that exactly one file is produced per bucket;
- if the buckets are declared sorted, sort the queried rows by the bucketing column before inserting them; the bucketed table does not sort the data within buckets by itself.

When the sort column and the bucket column are the same, use cluster by (col); cluster by distributes by the column and sorts ascending within each bucket by default. For descending order, combine distribute by (col) sort by (col desc), as below:

set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;
insert overwrite table student_info_bucket
select student_id,name,age,origin 
from student_info 
distribute by (student_id) sort by (student_id desc); 
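One payoff of bucketing is efficient sampling: TABLESAMPLE can read a single bucket instead of scanning the whole table, e.g.:

select * from student_info_bucket tablesample(bucket 1 out of 4 on student_id);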

Export data
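The original leaves this section empty; common ways to export query results (a sketch, output paths hypothetical):

-- export a query result to a local directory
insert overwrite local directory '/tmp/student_info_export'
row format delimited fields terminated by '\t'
select * from student_info;
-- export to an HDFS directory
insert overwrite directory '/tmp/student_info_export_hdfs'
select * from student_info;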

Joins

select * from a join b on a.id=b.id;
is equivalent to
select * from a inner join b on a.id=b.id;

In Hive, the statement
select a.id, a.name from a where a.id in (select b.id from b);
can be written as a left semi join:
select a.id,a.name from a left semi join b on a.id = b.id;

Hive built-in functions

Syntax 1: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
If a equals b, return c; if a equals d, return e; otherwise return f.

hive> select case 1 when 2 then 'two' when 1 then 'one' else 'zero' end;
one

Syntax 2: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
If a is TRUE, return b; if c is TRUE, return d; otherwise return e.

hive> select case when 1=2 then 'two' when 1=1 then 'one' else 'zero' end;
one
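A few other commonly used built-ins work the same way (all standard Hive functions):

select if(1=1,'yes','no');        -- conditional: returns 'yes'
select coalesce(null,'fallback'); -- first non-null argument: returns 'fallback'
select concat('a','-','b');       -- string concatenation: returns 'a-b'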

Custom UDFs

When Hive's built-in functions cannot satisfy your processing needs, consider writing a user-defined function (UDF).
A UDF operates on a single row and produces a single row as output.
Steps (see the sketch below):
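The original leaves the steps empty; the usual procedure is (a sketch, class and jar names hypothetical):

1. Write a Java class that extends org.apache.hadoop.hive.ql.exec.UDF and implements one or more evaluate() methods.
2. Package the class into a jar and copy the jar to the machine running Hive.
3. Register and use the function in the Hive session:

add jar /home/hadoop/my_udf.jar;
create temporary function my_lower as 'com.example.hive.MyLower';
select my_lower(name) from student_info;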

Problems encountered while installing and using Hive

  1. Creating a table fails

Cause: a MySQL character-set problem; the character set of the hive database in MySQL must be latin1.

  2. Dropping a table hangs

Cause: also a character-set problem. The database character set was still utf-8 when the tables were created and was only changed to latin1 afterwards, which is not enough. Reset it properly: drop the hive database in MySQL, recreate it, and set the character set before Hive creates any tables.
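The corresponding fix on the MySQL side (a sketch; this wipes the metastore, so only do it on a fresh install):

mysql> drop database hive;
mysql> create database hive character set latin1;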