Impala分区表
2020-11-12 本文已影响0人
xiaogp
摘要:Impala
,分区表
,hdfs
分区表
分区表就是将某个分区的数据的单独存放,当使用where
语句是针对某个分区查询时,impala只会在该分区扫描,大大减少了从磁盘读取的数据量。
创建普通分区表
使用partitioned by
指定分区字段,分区字段不进入表字段
,分区字段和表字段在建表语句中都需要指定字段类型
[cloudera01:21000] > create table logs (f1 STRING, f2 STRING, f3 STRING)
partitioned by (year STRING, month STRING, day STRING, host STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
往分区中插入数据,使用partition(...)
关键字指定该条数据在分区字段的值,直接顺序指定values作为该条数据各字段的值。
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="28", host="host1") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="28", host="host2") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="28", host="host3") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="27", host="host3") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="27", host="host2") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="27", host="host1") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="26", host="host1") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="26", host="host2") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="12", day="26", host="host3") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2012", month="12", day="26", host="host3") values ("foo", "foo", "foo");
[cloudera01:21000] > insert into logs partition (year="2013", month="11", day="26", host="host3") values ("foo", "foo", "foo");
查看插入数据后的分区表数据,分区字段也存在在表数据中
[cloudera01:21000] > select * from logs;
Query: select * from logs
Query submitted at: 2020-11-12 13:16:13 (Coordinator: http://cloudera01:25000)
Query progress can be monitored at: http://cloudera01:25000/query_plan?query_id=7146e71798da0ed0:f8e8cd0100000000
+-----+-----+-----+------+-------+-----+-------+
| f1 | f2 | f3 | year | month | day | host |
+-----+-----+-----+------+-------+-----+-------+
| foo | foo | foo | 2013 | 12 | 28 | host1 |
| foo | foo | foo | 2013 | 12 | 26 | host1 |
| foo | foo | foo | 2013 | 12 | 28 | host2 |
| foo | foo | foo | 2012 | 12 | 26 | host3 |
| foo | foo | foo | 2013 | 12 | 26 | host2 |
| foo | foo | foo | 2013 | 12 | 27 | host2 |
| foo | foo | foo | 2013 | 12 | 27 | host3 |
| foo | foo | foo | 2013 | 11 | 26 | host3 |
| foo | foo | foo | 2013 | 12 | 27 | host1 |
| foo | foo | foo | 2013 | 12 | 28 | host3 |
| foo | foo | foo | 2013 | 12 | 26 | host3 |
+-----+-----+-----+------+-------+-----+-------+
查看impala分区表在hdfs上存储结构,整个表名文件夹下首先是分区的第一列yaer,等于year的值作为下一级目录名
[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/hive/warehouse/test_gp.db/logs
Found 3 items
drwxrwx--T - impala hive 0 2020-11-12 11:36 /user/hive/warehouse/test_gp.db/logs/_impala_insert_staging
drwxr-xr-x - impala hive 0 2020-11-12 11:35 /user/hive/warehouse/test_gp.db/logs/year=2012
drwxr-xr-x - impala hive 0 2020-11-12 11:36 /user/hive/warehouse/test_gp.db/logs/year=2013
打开分区的第一个字段目录,下一级目录是分区的第二个字段month
[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/hive/warehouse/test_gp.db/logs/year=2013
Found 2 items
drwxr-xr-x - impala hive 0 2020-11-12 11:36 /user/hive/warehouse/test_gp.db/logs/year=2013/month=11
drwxr-xr-x - impala hive 0 2020-11-12 11:33 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12
打开第二个分区字段目录,第三个分区字段目录是day
[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/hive/warehouse/test_gp.db/logs/year=2013/month=12
Found 3 items
drwxr-xr-x - impala hive 0 2020-11-12 11:33 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=26
drwxr-xr-x - impala hive 0 2020-11-12 11:33 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=27
drwxr-xr-x - impala hive 0 2020-11-12 11:32 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28
最后是第四个分区目录
[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28
Found 3 items
drwxr-xr-x - impala hive 0 2020-11-12 11:32 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28/host=host1
drwxr-xr-x - impala hive 0 2020-11-12 11:32 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28/host=host2
drwxr-xr-x - impala hive 0 2020-11-12 11:32 /user/hive/warehouse/test_gp.db/logs/year=2013/month=12/day=28/host=host3
- 分区表的分区会以分区字段的
前后顺序
,在hdfs上创建树结构
目录,每个最细的分区单独存放
在一个子目录
下 - 表名目录下的子目录代表分区,所有子目录下的文件组合成完整的表数据
- impala的
库名
和表名
在/user/hive/warehouse
上,在表名目录下是表的分区目录
,子目录下的最终数据文件是每一块分区的数据 - 删除分区表之后hdfs上的表目录和其下的所有分区子目录都会被级联删除。
创建外部分区表
分区表也可以从现有的hdfs分区目录结构中直接创建,在impala中创建外部表再指向hdfs路径,然后手动添加每条数据的分区。
创建hdfs表名目录(external_partition_test)和子目录作为分区(p=?)
[hdfs@cloudera01 gp]$ hdfs dfs -mkdir -p /user/impala/external_partition_test/p=1
[hdfs@cloudera01 gp]$ hdfs dfs -mkdir -p /user/impala/external_partition_test/p=2
[hdfs@cloudera01 gp]$ hdfs dfs -mkdir -p /user/impala/external_partition_test/p=3
创建三个csv格式数据文件
[root@cloudera01 gp]# echo "a,b,c" > block_data1
[root@cloudera01 gp]# echo "d,e,f" > block_data2
[root@cloudera01 gp]# echo "g,h,i" > block_data3
将三个文件分别put到三个分区目录下
[hdfs@cloudera01 gp]$ hdfs dfs -put block_data1 /user/impala/external_partition_test/p=1
[hdfs@cloudera01 gp]$ hdfs dfs -put block_data2 /user/impala/external_partition_test/p=2
[hdfs@cloudera01 gp]$ hdfs dfs -put block_data3 /user/impala/external_partition_test/p=3
在impala中创建外部分区表
[cloudera01:21000] > create external table external_partition_test
(f1 string,
f2 string,
f3 string
) partitioned by (p string)
row format delimited fields terminated by ',' location '/user/impala/external_partition_test';
此时查询此表无数据,手动add每条数据的分区值
[cloudera01:21000] > alter table external_partition_test add partition (p="1");
[cloudera01:21000] > alter table external_partition_test add partition (p="2");
[cloudera01:21000] > alter table external_partition_test add partition (p="3");
在手动增加分区值后可以进行查询操作
[cloudera01:21000] > select * from external_partition_test;
+----+----+----+---+
| f1 | f2 | f3 | p |
+----+----+----+---+
| d | e | f | 2 |
| a | b | c | 1 |
| g | h | i | 3 |
+----+----+----+---+
- 在删除外部分区表之后,hdfs文件依旧存在,实际上删除外部表就是仅仅删除impala的表定义。