Impala Partitioned Tables

2020-11-12  xiaogp

Summary: Impala partitioned tables, HDFS

Partitioned tables

A partitioned table stores the data for each partition separately. When a WHERE clause filters on a partition column, Impala scans only the matching partitions, which greatly reduces the amount of data read from disk (an EXPLAIN example of this pruning appears after the sample data below).

Creating a regular partitioned table

Use PARTITIONED BY to declare the partition columns. Partition columns are not part of the regular table columns, and both the partition columns and the regular columns need their types specified in the CREATE TABLE statement.

[cloudera01:21000] > create table logs (f1 STRING, f2 STRING, f3 STRING)
partitioned by (year STRING, month STRING, day STRING, host STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

To insert data into a partition, use the PARTITION (...) clause to specify the row's partition-column values, and list the VALUES in order for the remaining table columns.

[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="28", host="host1") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="28", host="host2") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="28", host="host3") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="27", host="host3") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="27", host="host2") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="27", host="host1") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="26", host="host1") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="26", host="host2") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="26", host="host3") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2012", month="12", day="26", host="host3") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="11", day="26", host="host3") values ("foo", "foo", "foo");
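Inserting rows one at a time like this is fine for a demo, but each statement writes its own small file. As a sketch of the more common bulk path, Impala also supports dynamic partition inserts, where the partition values come from the SELECT list rather than literals; raw_logs here is a hypothetical unpartitioned staging table with the same columns:

-- raw_logs is a hypothetical staging table with columns f1, f2, f3, year, month, day, host
[cloudera01:21000] > insert into logs partition (year, month, day, host)
select f1, f2, f3, year, month, day, host from raw_logs;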

Query the partitioned table after the inserts; the partition columns also appear in the result rows.

[cloudera01:21000] > select * from logs;
Query: select * from logs
Query submitted at: 2020-11-12 13:16:13 (Coordinator: http://cloudera01:25000)
Query progress can be monitored at: http://cloudera01:25000/query_plan?query_id=7146e71798da0ed0:f8e8cd0100000000
+-----+-----+-----+------+-------+-----+-------+
| f1  | f2  | f3  | year | month | day | host  |
+-----+-----+-----+------+-------+-----+-------+
| foo | foo | foo | 2013 | 12    | 28  | host1 |
| foo | foo | foo | 2013 | 12    | 26  | host1 |
| foo | foo | foo | 2013 | 12    | 28  | host2 |
| foo | foo | foo | 2012 | 12    | 26  | host3 |
| foo | foo | foo | 2013 | 12    | 26  | host2 |
| foo | foo | foo | 2013 | 12    | 27  | host2 |
| foo | foo | foo | 2013 | 12    | 27  | host3 |
| foo | foo | foo | 2013 | 11    | 26  | host3 |
| foo | foo | foo | 2013 | 12    | 27  | host1 |
| foo | foo | foo | 2013 | 12    | 28  | host3 |
| foo | foo | foo | 2013 | 12    | 26  | host3 |
+-----+-----+-----+------+-------+-----+-------+
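As noted at the start, the payoff of this layout is partition pruning: a WHERE clause on the partition columns limits the scan to the matching partitions. A minimal check (the exact plan text varies by Impala version, but EXPLAIN should report that only a subset of the partitions is scanned):

[cloudera01:21000] > select * from logs where year = "2013" and month = "12" and day = "28";
[cloudera01:21000] > explain select * from logs where year = "2013" and month = "12" and day = "28";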

Look at how the Impala partitioned table is stored on HDFS. Directly under the table directory is the first partition column, year: each distinct value of year becomes the name of a next-level subdirectory (year=<value>).

[hdfs@cloudera01 gp]$  hdfs dfs -ls /user/hive/warehouse/test_gp.db/logs
Found 3 items
drwxrwx--T   - impala hive          0 2020-11-12 11:36 /user/hive/warehouse/test_gp.db/logs/_impala_insert_staging
drwxr-xr-x   - impala hive          0 2020-11-12 11:35 /user/hive/warehouse/test_gp.db/logs/year=2012
drwxr-xr-x   - impala hive          0 2020-11-12 11:36 /user/hive/warehouse/test_gp.db/logs/year=2013

Open the directory for the first partition column; the next level of directories corresponds to the second partition column, month.

[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/hive/warehouse/test_gp.db/logs/year=2013
Found 2 items
drwxr-xr-x   - impala hive          0 2020-11-12 11:36 /user/hive/warehouse/test_gp.db/logs/year=2013/month=11
drwxr-xr-x   - impala hive          0 2020-11-12 11:33 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12

Open the directory for the second partition column; the third level of directories corresponds to day.

[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/hive/warehouse/test_gp.db/logs/year=2013/month=12
Found 3 items
drwxr-xr-x   - impala hive          0 2020-11-12 11:33 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=26
drwxr-xr-x   - impala hive          0 2020-11-12 11:33 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=27
drwxr-xr-x   - impala hive          0 2020-11-12 11:32 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28

Finally comes the directory level for the fourth partition column, host.

[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28
Found 3 items
drwxr-xr-x   - impala hive          0 2020-11-12 11:32 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28/host=host1
drwxr-xr-x   - impala hive          0 2020-11-12 11:32 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28/host=host2
drwxr-xr-x   - impala hive          0 2020-11-12 11:32 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28/host=host3
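The same partition layout can be inspected from inside impala-shell without walking the HDFS directories; SHOW PARTITIONS also lists per-partition row counts, file counts, sizes and locations (the exact columns depend on the Impala version):

[cloudera01:21000] > show partitions logs;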

Creating an external partitioned table

A partitioned table can also be created directly from an existing HDFS partition directory structure: create an external table in Impala that points at the HDFS path, then add each partition manually.
First create the table directory on HDFS (external_partition_test) and one subdirectory per partition (p=?).

[hdfs@cloudera01 gp]$ hdfs dfs -mkdir -p /user/impala/external_partition_test/p=1
[hdfs@cloudera01 gp]$ hdfs dfs -mkdir -p /user/impala/external_partition_test/p=2
[hdfs@cloudera01 gp]$ hdfs dfs -mkdir -p /user/impala/external_partition_test/p=3

Create three CSV-format data files.

[root@cloudera01 gp]# echo "a,b,c" > block_data1
[root@cloudera01 gp]# echo "d,e,f" > block_data2
[root@cloudera01 gp]# echo "g,h,i" > block_data3

Put the three files into the three partition directories.

[hdfs@cloudera01 gp]$ hdfs dfs -put block_data1 /user/impala/external_partition_test/p=1
[hdfs@cloudera01 gp]$ hdfs dfs -put block_data2 /user/impala/external_partition_test/p=2
[hdfs@cloudera01 gp]$ hdfs dfs -put block_data3 /user/impala/external_partition_test/p=3

Create the external partitioned table in Impala.

[cloudera01:21000] > create external table external_partition_test
(f1 string,
f2 string,
f3 string
) partitioned by (p string)
row format delimited fields terminated by ',' location '/user/impala/external_partition_test';

At this point, querying the table returns no rows; each partition value has to be added manually.

[cloudera01:21000] > alter table external_partition_test add partition (p="1");
[cloudera01:21000] > alter table external_partition_test add partition (p="2");
[cloudera01:21000] > alter table external_partition_test add partition (p="3");
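Adding partitions one by one is fine for three directories. On newer Impala versions there is also ALTER TABLE ... RECOVER PARTITIONS, which scans the table's HDFS location and adds any partition directories it finds; a hedged alternative to the three statements above, assuming your version supports it:

-- assumes an Impala version that supports RECOVER PARTITIONS
[cloudera01:21000] > alter table external_partition_test recover partitions;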

Once the partitions have been added manually, the table can be queried.

[cloudera01:21000] > select * from external_partition_test;
+----+----+----+---+
| f1 | f2 | f3 | p |
+----+----+----+---+
| d  | e  | f  | 2 |
| a  | b  | c  | 1 |
| g  | h  | i  | 3 |
+----+----+----+---+