大数据学习

Impala数据加载与查询

2020-11-11  本文已影响0人  xiaogp

摘要:Impala,hdfs,impala-shellhive

数据加载

Impala数据加载分为两种方式

使用外部表指向hdfs路径

新建一个csv文件

[hdfs@cloudera01 gp]$ cat load_data_test.txt 
1,true,123
2,false,222
3,true,121
4,false,231

切换为hdfs用户将文件上传到/user/impala/load_data_test目录下

[hdfs@cloudera01 gp]$ hdfs dfs -mkdir -p /user/impala/load_data_test
[hdfs@cloudera01 gp]$ hdfs dfs -put load_data_test.txt /user/impala/load_data_test
[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/impala/load_data_test
Found 1 items
-rw-r--r--   3 hdfs impala         46 2020-11-11 20:01 /user/impala/load_data_test/load_data_test.txt

指定数据库,在impala中创建外部表,ROW FORMAT DELIMITED FIELDS TERMINATED BY指定分隔符,指定LOCATION为对应的hdfs路径,路径可以是任意hdfs路径,两个对应即可

[cloudera01:21000] > use test_gp;
[cloudera01:21000] > create external table load_data_test2
(
id INT,
f1 BOOLEAN,
f2 INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/load_data_test';

查看映射表数据

[cloudera01:21000] > select * from load_data_test;
Query: select * from load_data_test
Query submitted at: 2020-11-11 20:07:46 (Coordinator: http://cloudera01:25000)
Query progress can be monitored at: http://cloudera01:25000/query_plan?query_id=d84f5d1e86cacc93:2eb4a04d00000000
+----+-------+-----+
| id | f1    | f2  |
+----+-------+-----+
| 1  | true  | 123 |
| 2  | false | 222 |
| 3  | true  | 121 |
| 4  | false | 231 |
+----+-------+-----+
使用INSERT加载数据

创建一张内部表insert_test

[cloudera01:21000] > create table insert_test
                   > (id INT,
                   > f1 BOOLEAN,
                   > f2 INT
                   > )
                   > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

内部表不指定LOCATION则使用impala默认的存储位置存储。使用INSERT语句将load_data_test的数据加载到insert_test

[cloudera01:21000] > INSERT OVERWRITE TABLE insert_test
                   > select * from load_data_test;

OVERWRITE关键字代表将查询结果覆写到原有数据表,查询加载到的数据

[cloudera01:21000] > select * from insert_test;
Query: select * from insert_test
Query submitted at: 2020-11-12 10:54:28 (Coordinator: http://cloudera01:25000)
Query progress can be monitored at: http://cloudera01:25000/query_plan?query_id=274b2819ac05c6ce:d762875f00000000
+----+-------+-----+
| id | f1    | f2  |
+----+-------+-----+
| 1  | true  | 123 |
| 2  | false | 222 |
| 3  | true  | 121 |
| 4  | false | 231 |
+----+-------+-----+
[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/hive/warehouse/test_gp.db/insert_test
Found 2 items
-rw-r--r--   3 impala hive         46 2020-11-12 10:54 /user/hive/warehouse/test_gp.db/insert_test/5d490d43985cb7ec-aae0ea5500000000_774296641_data.0.
drwxrwx--T   - impala hive          0 2020-11-12 10:54 /user/hive/warehouse/test_gp.db/insert_test/_impala_insert_staging

数据文件可以直接用cat命令查看

[hdfs@cloudera01 gp]$ hdfs dfs -cat /user/hive/warehouse/test_gp.db/insert_test/5d490d43985cb7ec-aae0ea5500000000_774296641_data.0.
1,true,123
2,false,222
3,true,121
4,false,231

数据查询

impala支持大部分传统数据库支持的关联,聚合,子查询等

[cloudera01:21000] > insert into insert_test (id, f1, f2) values(2, false, 333);
[cloudera01:21000] > select id, max(f2) from insert_test group by id;
+----+---------+
| id | max(f2) |
+----+---------+
| 4  | 231     |
| 2  | 333     |
| 1  | 123     |
| 3  | 121     |
+----+---------+

在执行一条insert命令后,对应的hdfs目录下同时也会新增数据文件

[hdfs@cloudera01 gp]$ hdfs dfs -ls /user/hive/warehouse/test_gp.db/insert_test
Found 3 items
-rw-r--r--   3 impala hive         46 2020-11-12 10:54 /user/hive/warehouse/test_gp.db/insert_test/5d490d43985cb7ec-aae0ea5500000000_774296641_data.0.
drwxrwx--T   - impala hive          0 2020-11-12 11:13 /user/hive/warehouse/test_gp.db/insert_test/_impala_insert_staging
-rw-r--r--   3 impala hive         12 2020-11-12 11:13 /user/hive/warehouse/test_gp.db/insert_test/cb4e67acb0a52b1e-6a847bc500000000_1316431619_data.0.

查看新增加的数据文件cb4e67acb0a52b1e-6a847bc500000000_1316431619_data.0.为新insert的数据,可见整个insert_test目录下的不同数据文件组合在一起形成impala中对应表的数据

[hdfs@cloudera01 gp]$ hdfs dfs -cat /user/hive/warehouse/test_gp.db/insert_test/cb4e67acb0a52b1e-6a847bc500000000_1316431619_data.0.

2,false,333
上一篇 下一篇

猜你喜欢

热点阅读