Hive Basic Operations

2018-06-30, by 袭明

Preparation

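1. Check that all required services are running
[x] HDFS is started (start-dfs.sh)
[x] YARN is started (start-yarn.sh)
[x] MySQL is running (check with: service mysqld status; start with: service mysqld start)
[x] HiveServer2 is started (hiveserver2)
2. Prepare the development tools
[x] Xshell 5 (to connect to Linux)
[x] SQuirrel SQL Client (to connect to Hive and send SQL to Hive for processing)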

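It also helps to open http://master:50070 in a browser (the default NameNode web UI address on Hadoop 2.x) so that you can browse the files on HDFS visually. Here master is the hostname mapped to the IP of the node where Hive is installed.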

1. Database operations

1.1 Create a database

create database wj123;
-- or, equivalently
create schema wj111;
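
The two statements above fail if the database already exists. A minimal sketch of a repeatable variant follows; the name wj_demo and the comment text are assumptions made up for illustration, not part of the original setup.

```sql
-- Hypothetical database name; IF NOT EXISTS makes the statement safe to rerun.
create database if not exists wj_demo
comment 'scratch database for the examples';

-- Shows the comment, the HDFS location and the owner of the database.
describe database wj_demo;
```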

1.2 List databases

show databases;
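
When many databases exist, the list can be narrowed with a pattern. This sketch assumes the wj prefix used by the database names in this article; in Hive the wildcard in this clause is `*`.

```sql
-- List only the databases whose names start with "wj".
show databases like 'wj*';
```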

1.3 Drop a database

drop database wj123;
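
The plain form above is rejected when the database still contains tables, because Hive defaults to RESTRICT. A hedged sketch of the more forgiving variant, to be used with care since it also removes the tables inside:

```sql
-- IF EXISTS avoids an error when the database is already gone.
-- CASCADE drops the tables in the database first; the default (RESTRICT)
-- refuses to drop a non-empty database.
drop database if exists wj123 cascade;
```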

1.4 Switch to a database

use wj111;
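
To confirm which database the session is now using, something like the following should work, assuming Hive 0.13 or later, where current_database() is available:

```sql
-- Returns the name of the database selected by the last USE statement.
select current_database();

-- Lists the tables in that database.
show tables;
```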

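2. Table operations

2.1 Create a table and load data

The official documentation gives the full syntax for creating tables; consult it when you need the details.

Only a few simple examples are listed here for reference. My approach is to get the workflow running end to end first and polish the details later :)

Example 1: Given the file user-logs-large.txt, create the table user_logs in Hive and load the data into it.

1. Create the table to match the format of the txt file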

create table user_logs (
user_name string,
action_type string,
ip_address string
)
row format delimited
fields terminated by '\t'
stored as textfile;

2. Load the data

load data inpath '/test/user/log/user-logs-large.txt' overwrite into table user_logs;
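
Note that LOAD DATA INPATH with an HDFS path moves the file into the table's directory instead of copying it, so the source path is empty afterwards. The sketch below shows one way to check the result. The commented-out LOCAL variant uses a hypothetical local path; when the statement goes through HiveServer2, that path is resolved on the HiveServer2 host.

```sql
-- Variant: load from the local filesystem instead of HDFS (hypothetical path).
-- load data local inpath '/tmp/user-logs-large.txt' overwrite into table user_logs;

-- Check the column layout and where Hive stores the table's files.
describe formatted user_logs;

-- Peek at a few rows to confirm the tab delimiter split the columns correctly.
select * from user_logs limit 5;

-- Count events per action type.
select action_type, count(*) as cnt
from user_logs
group by action_type;
```

If every row shows the whole line in user_name and NULL in the other columns, the delimiter in the file probably does not match the '\t' declared in the table.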

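2.2 Create an external table and load data

First, distinguish external tables from managed (internal) tables.

External table
- 1. EXTERNAL is usually paired with LOCATION, and usually used together with ROW FORMAT.
- 2. External tables are usually built on raw source files; tables in the ODS layer of a data model are usually external tables.
- 3. Dropping the table removes only the table metadata in Hive; the underlying data files are not deleted.
Managed (internal) table
- 1. Managed tables are usually derived from raw data by computation or transformation, rather than loaded directly from raw data files.
- 2. Dropping a managed table deletes both the table metadata and the data files.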
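
The difference in drop behaviour described above can be observed with a small experiment. This is only a sketch: the table names demo_ext and demo_managed and the path /test/demo/ext are hypothetical.

```sql
-- Hypothetical external table over an HDFS directory.
create external table demo_ext (line string)
location '/test/demo/ext';

-- Hypothetical managed table; Hive chooses its directory under the warehouse.
create table demo_managed (line string);

-- "Table Type" reads EXTERNAL_TABLE for the first and MANAGED_TABLE for the second.
describe formatted demo_ext;
describe formatted demo_managed;

-- Removes only the metadata; files under /test/demo/ext stay on HDFS.
drop table demo_ext;

-- Removes the metadata and the table's data directory.
drop table demo_managed;
```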

Template for creating an external table:

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User',
     country STRING COMMENT 'country of origination')
 COMMENT 'This is the staging page view table'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
 STORED AS TEXTFILE
 LOCATION '<hdfs_location>';

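Example 2: Given the file province_user.txt, create an external table over the province_user data.

As in Example 1, the file must be on HDFS first; upload it to the HDFS directory /test/user/province_user.

Table creation statement: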

create external table if not exists province_user(
  user_id string,
  user_name string,
  province_type int,
  gender_type string,
  province_name string
)
row format delimited
fields terminated by ';'
location '/test/user/province_user';
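
Because the table points at a directory, files already present under /test/user/province_user become queryable without a separate LOAD DATA step. A quick sanity check might look like this:

```sql
-- Confirm that the ';' delimiter splits the five columns as expected.
select * from province_user limit 5;

-- Count users per province.
select province_name, count(*) as users
from province_user
group by province_name;
```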

2.3 Load the access_log data file with a regular-expression SerDe

drop table if exists access_log;
create external table access_log(
  host STRING,
  identity STRING,
  auser STRING,
  access_time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE
location '/external/access_log_20160429';

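Summary: ROW FORMAT DELIMITED suits delimited text files; a SerDe suits binary files and raw data files that have to be parsed with a regular expression.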
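
Once the regex table exists, a couple of queries can show whether the pattern fits the log lines. This sketch relies on the usual behaviour of the built-in RegexSerDe, where a line that does not match the pattern comes back with NULL in every column; treat that as an assumption to verify on your Hive version.

```sql
-- Lines the regex failed to parse (all columns NULL, so host is NULL too).
select count(*) as unparsed
from access_log
where host is null;

-- Requests per HTTP status code, most frequent first.
select status, count(*) as hits
from access_log
group by status
order by hits desc;
```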

2.4 Copy a table

Copy the table structure only:

create table access_log_copy like access_log;
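
A short check that only the definition was copied; the new table should have the same columns but no rows:

```sql
-- Same column list as access_log.
describe access_log_copy;

-- Expected to be empty, since LIKE copies the definition and not the data.
select count(*) from access_log_copy;
```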

2.5 Clone a table

Copy both the table structure and the data:

create table access_log_clone as
select * from access_log;
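
CREATE TABLE AS SELECT copies the rows through a query, so the clone is a managed table with the default storage format and SerDe, not a regex table like the source. A sketch for verifying the clone:

```sql
-- The two counts should match right after the clone is created.
select count(*) from access_log;
select count(*) from access_log_clone;

-- Shows the table type, the SerDe and the storage location of the clone.
describe formatted access_log_clone;
```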

2.6 Put the access_log rows that match a condition into a separate table stored as ORC

create table access_log_s
stored as orc
as
select * from access_log where <some condition>;
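
As a concrete illustration of the pattern above, the sketch below fills in a hypothetical condition (requests that returned HTTP 404) and a hypothetical table name, access_log_404. Neither comes from the original data set.

```sql
-- Hypothetical condition: keep only requests whose status is 404.
-- status is declared as STRING in access_log, hence the quoted literal.
create table access_log_404
stored as orc
as
select * from access_log where status = '404';

-- The InputFormat and OutputFormat lines should name the ORC classes.
describe formatted access_log_404;
```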