Impala分区表

2020-11-12 本文已影响0人 xiaogp

摘要：Impala，分区表，hdfs

分区表

分区表就是将某个分区的数据的单独存放，当使用where语句是针对某个分区查询时，impala只会在该分区扫描，大大减少了从磁盘读取的数据量。

创建普通分区表

使用partitioned by指定分区字段，分区字段不进入表字段，分区字段和表字段在建表语句中都需要指定字段类型

[cloudera01:21000] > create table logs (f1 STRING, f2 STRING, f3 STRING)
partitioned by (year STRING, month STRING, day STRING, host STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

往分区中插入数据，使用partition(...)关键字指定该条数据在分区字段的值，直接顺序指定values作为该条数据各字段的值。

[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="28", host="host1") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="28", host="host2") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="28", host="host3") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="27", host="host3") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="27", host="host2") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="27", host="host1") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="26", host="host1") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="26", host="host2") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="26", host="host3") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2012", month="12", day="26", host="host3") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="11", day="26", host="host3") values ("foo", "foo", "foo");

查看插入数据后的分区表数据，分区字段也存在在表数据中

[cloudera01:21000] > select * from logs;
Query: select * from logs
Query submitted at: 2020-11-12 13:16:13 (Coordinator: http://cloudera01:25000)
Query progress can be monitored at: http://cloudera01:25000/query_plan?query_id=7146e71798da0ed0:f8e8cd0100000000
+-----+-----+-----+------+-------+-----+-------+
| f1  | f2  | f3  | year | month | day | host  |
+-----+-----+-----+------+-------+-----+-------+
| foo | foo | foo | 2013 | 12    | 28  | host1 |
| foo | foo | foo | 2013 | 12    | 26  | host1 |
| foo | foo | foo | 2013 | 12    | 28  | host2 |
| foo | foo | foo | 2012 | 12    | 26  | host3 |
| foo | foo | foo | 2013 | 12    | 26  | host2 |
| foo | foo | foo | 2013 | 12    | 27  | host2 |
| foo | foo | foo | 2013 | 12    | 27  | host3 |
| foo | foo | foo | 2013 | 11    | 26  | host3 |
| foo | foo | foo | 2013 | 12    | 27  | host1 |
| foo | foo | foo | 2013 | 12    | 28  | host3 |
| foo | foo | foo | 2013 | 12    | 26  | host3 |
+-----+-----+-----+------+-------+-----+-------+

查看impala分区表在hdfs上存储结构，整个表名文件夹下首先是分区的第一列yaer，等于year的值作为下一级目录名

[hdfs@cloudera01 gp]$  hdfs dfs -ls /user/hive/warehouse/test_gp.db/logs
Found 3 items
drwxrwx--T   - impala hive          0 2020-11-12 11:36 /user/hive/warehouse/test_gp.db/logs/_impala_insert_staging
drwxr-xr-x   - impala hive          0 2020-11-12 11:35 /user/hive/warehouse/test_gp.db/logs/year=2012
drwxr-xr-x   - impala hive          0 2020-11-12 11:36 /user/hive/warehouse/test_gp.db/logs/year=2013

打开分区的第一个字段目录，下一级目录是分区的第二个字段month

[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/hive/warehouse/test_gp.db/logs/year=2013
Found 2 items
drwxr-xr-x   - impala hive          0 2020-11-12 11:36 /user/hive/warehouse/test_gp.db/logs/year=2013/month=11
drwxr-xr-x   - impala hive          0 2020-11-12 11:33 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12

打开第二个分区字段目录，第三个分区字段目录是day

[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/hive/warehouse/test_gp.db/logs/year=2013/month=12
Found 3 items
drwxr-xr-x   - impala hive          0 2020-11-12 11:33 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=26
drwxr-xr-x   - impala hive          0 2020-11-12 11:33 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=27
drwxr-xr-x   - impala hive          0 2020-11-12 11:32 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28

最后是第四个分区目录

[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28
Found 3 items
drwxr-xr-x   - impala hive          0 2020-11-12 11:32 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28/host=host1
drwxr-xr-x   - impala hive          0 2020-11-12 11:32 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28/host=host2
drwxr-xr-x   - impala hive          0 2020-11-12 11:32 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28/host=host3

分区表的分区会以分区字段的前后顺序，在hdfs上创建树结构目录，每个最细的分区单独存放在一个子目录下
表名目录下的子目录代表分区，所有子目录下的文件组合成完整的表数据
impala的库名和表名在/user/hive/warehouse上，在表名目录下是表的分区目录，子目录下的最终数据文件是每一块分区的数据
删除分区表之后hdfs上的表目录和其下的所有分区子目录都会被级联删除。

创建外部分区表

分区表也可以从现有的hdfs分区目录结构中直接创建，在impala中创建外部表再指向hdfs路径，然后手动添加每条数据的分区。
创建hdfs表名目录（external_partition_test）和子目录作为分区（p=?）

[hdfs@cloudera01 gp]$ hdfs dfs -mkdir -p /user/impala/external_partition_test/p=1
[hdfs@cloudera01 gp]$ hdfs dfs -mkdir -p /user/impala/external_partition_test/p=2
[hdfs@cloudera01 gp]$ hdfs dfs -mkdir -p /user/impala/external_partition_test/p=3

创建三个csv格式数据文件

[root@cloudera01 gp]# echo "a,b,c" > block_data1
[root@cloudera01 gp]# echo "d,e,f" > block_data2
[root@cloudera01 gp]# echo "g,h,i" > block_data3

将三个文件分别put到三个分区目录下

[hdfs@cloudera01 gp]$ hdfs dfs -put block_data1 /user/impala/external_partition_test/p=1
[hdfs@cloudera01 gp]$ hdfs dfs -put block_data2 /user/impala/external_partition_test/p=2
[hdfs@cloudera01 gp]$ hdfs dfs -put block_data3 /user/impala/external_partition_test/p=3

在impala中创建外部分区表

[cloudera01:21000] > create external table external_partition_test
(f1 string,
f2 string,
f3 string
) partitioned by (p string)
row format delimited fields terminated by ',' location '/user/impala/external_partition_test';

此时查询此表无数据，手动add每条数据的分区值

[cloudera01:21000] > alter table external_partition_test add partition (p="1");
[cloudera01:21000] > alter table external_partition_test add partition (p="2");
[cloudera01:21000] > alter table external_partition_test add partition (p="3");

在手动增加分区值后可以进行查询操作

[cloudera01:21000] > select * from external_partition_test;
+----+----+----+---+
| f1 | f2 | f3 | p |
+----+----+----+---+
| d  | e  | f  | 2 |
| a  | b  | c  | 1 |
| g  | h  | i  | 3 |
+----+----+----+---+

在删除外部分区表之后，hdfs文件依旧存在，实际上删除外部表就是仅仅删除impala的表定义。

Impala分区表

分区表

创建普通分区表

创建外部分区表

猜你喜欢

热点阅读