大数据

DataX+Phoenix+Hbase大数据分析平台整合

2019-07-29  本文已影响0人  DreamsonMa

Phoenix是一个在hbase上面实现的基于hadoop的OLTP技术,具有低延迟、事务性、可使用sql、提供jdbc接口的特点。 而且phoenix还提供了hbase二级索引的解决方案,丰富了hbase查询的多样性,继承了hbase海量数据快速随机查询的特点。但是在生产环境中,不可以用在OLTP中。在线事务处理的环境中,需要低延迟,而Phoenix在查询HBase时,虽然做了一些优化,但延迟还是不小。所以依然是用在OLAT中,再将结果返回存储下来。

Phoenix完全使用Java编写,作为HBase内嵌的JDBC驱动。Phoenix查询引擎会将SQL查询转换为一个或多个HBase扫描,并编排执行以生成标准的JDBC结果集。直接使用HBase API、协同处理器与自定义过滤器,对于简单查询来说,其性能量级是毫秒,对于百万级别的行数来说,其性能量级是秒。

由于资源问题,这里我们部署单节点测试环境。

Hbase整合Phoenix

下载Phoenix,这里使用5.0.0-HBase-2.0版本。download
下载Hbase,经测试2.2.0不兼容新版的Phoenix,这里我们使用2.1.5版本。download

1、 解压hbase、phoenix的tar包

[admin@mvxl2429 yst]$ tar -xf hbase-2.1.5-bin.tar.gz
[admin@mvxl2429 yst]$ tar -xf apache-phoenix-5.0.0-HBase-2.0-bin.tar.gz 
[admin@mvxl2429 yst]$ mv apache-phoenix-5.0.0-HBase-2.0-bin phoenix-5.0.0
[admin@mvxl2429 yst]$ ls
apache-phoenix-5.0.0-HBase-2.0-bin.tar.gz  hbase-2.1.5  hbase-2.1.5-bin.tar.gz  phoenix-5.0.0

2、hbse整合phoenix

[admin@mvxl2429 yst]$ cp phoenix-5.0.0/phoenix-5.0.0-HBase-2.0-server.jar  hbase-2.1.5/lib
[admin@mvxl2429 yst]$ ll hbase-2.1.5/lib | grep phoenix
-rw-r--r-- 1 admin admin 41800313 Jul 29 10:08 phoenix-5.0.0-HBase-2.0-server.jar

3、新建hbase数据目录和zk数据目录

[admin@mvxl2429 yst]$ mkdir -p /tmp/hbase/{zk,data}
[admin@mvxl2429 yst]$ ls /tmp/hbase
data  zk

4、修改hbase相关配置

[admin@mvxl2429 yst]$ sed -i '/# export JAVA_HOME=/cexport JAVA_HOME=/usr/lib/jvm/jre-1.8.0' hbase-2.1.5/conf/hbase-env.sh
[admin@mvxl2429 yst]$ grep JAVA_HOME hbase-2.1.5/conf/hbase-env.sh  
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0
[admin@mvxl2429 yst]$ cat > hbase-2.1.5/conf/hbase-site.xml <<EOF
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <configuration>
>         <property>
>                 <name>hbase.rootdir</name>
>                 <value>/tmp/hbase/data</value>
>         </property>
>         <property>
>                 <name>hbase.zookeeper.property.dataDir</name>
>                 <value>/tmp/hbase/zk</value>
>         </property>
>         <property>
>                 <name>hbase.zookeeper.property.clientPort</name>
>                 <value>2182</value>
>         </property>
> </configuration>
> EOF
-bash: [: too many arguments
[admin@mvxl2429 yst]$ cat hbase-2.1.5/conf/hbase-site.xml 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
        <property>
                <name>hbase.rootdir</name>
                <value>/tmp/hbase/data</value>
        </property>
        <property>
                <name>hbase.zookeeper.property.dataDir</name>
                <value>/tmp/hbase/zk</value>
        </property>
        <property>
                <name>hbase.zookeeper.property.clientPort</name>
                <value>2182</value>
        </property>
</configuration>

5、修改hosts

[admin@mvxl2429 bin]$ cat /etc/hosts
10.16.72.63 mvxl2429
127.0.0.1   localhost.localdomain localhost4    localhost4.localdomain4 localhost
127.0.0.1   localhost

6、启动hbase

[admin@mvxl2429 yst]$ sh hbase-2.1.5/bin/start-hbase.sh
running master, logging to /mnt/yst/hbase-2.1.5/bin/../logs/hbase-admin-master-mvxl2429.out
[admin@mvxl2429 yst]$ ps aux | grep hbase
admin    31823  0.0  0.0 110868  1700 pts/0    S    11:06   0:00 bash /mnt/yst/hbase-2.1.5/bin/hbase-daemon.sh --config /mnt/yst/hbase-2.1.5/bin/../conf foreground_start master
admin    31837 50.3  2.3 6489552 384212 pts/0  Sl   11:06   0:16 /usr/lib/jvm/jre-1.8.0/bin/java -Dproc_master -XX:OnOutOfMemoryError=kill -9 %p -XX:+UseConcMarkSweepGC -Dhbase.log.dir=/mnt/yst/hbase-2.1.5/bin/../logs -Dhbase.log.file=hbase-admin-master-mvxl2429.log -Dhbase.home.dir=/mnt/yst/hbase-2.1.5/bin/.. -Dhbase.id.str=admin -Dhbase.root.logger=INFO,RFA -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.master.HMaster start
admin    32237  0.0  0.0 107884  1068 pts/0    S+   11:07   0:00 grep hbase

7、验证一下整合效果

可以看到已经自动生成数据仓库用到的一些表。

[admin@mvxl2429 yst]$ ./phoenix-5.0.0/bin/sqlline.py localhost:2182
Setting property: [incremental, false]
Setting property: [isolation, TRANSACTION_READ_COMMITTED]
issuing: !connect jdbc:phoenix:localhost:2182 none none org.apache.phoenix.jdbc.PhoenixDriver
Connecting to jdbc:phoenix:localhost:2182
19/07/29 11:24:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connected to: Phoenix (version 5.0)
Driver: PhoenixEmbeddedDriver (version 5.0)
Autocommit status: true
Transaction isolation: TRANSACTION_READ_COMMITTED
Building list of tables and columns for tab-completion (set fastconnect to true to skip)...
133/133 (100%) Done
Done
sqlline version 1.2.0
0: jdbc:phoenix:localhost:2182> !table
+------------+--------------+-------------+---------------+----------+------------+----------------------------+-----------------+--------------+--------+
| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  |  TABLE_TYPE   | REMARKS  | TYPE_NAME  | SELF_REFERENCING_COL_NAME  | REF_GENERATION  | INDEX_STATE  | IMMUTA |
+------------+--------------+-------------+---------------+----------+------------+----------------------------+-----------------+--------------+--------+
|            | SYSTEM       | CATALOG     | SYSTEM TABLE  |          |            |                            |                 |              | false  |
|            | SYSTEM       | FUNCTION    | SYSTEM TABLE  |          |            |                            |                 |              | false  |
|            | SYSTEM       | LOG         | SYSTEM TABLE  |          |            |                            |                 |              | true   |
|            | SYSTEM       | SEQUENCE    | SYSTEM TABLE  |          |            |                            |                 |              | false  |
|            | SYSTEM       | STATS       | SYSTEM TABLE  |          |            |                            |                 |              | false  |
+------------+--------------+-------------+---------------+----------+------------+----------------------------+-----------------+--------------+--------+
0: jdbc:phoenix:localhost:2182> !quit
Closing: org.apache.phoenix.jdbc.PhoenixConnection

安装SQuirrel客户端

SQuirrel是一个图形化界面工具。由于Phoenix是一个JDBC驱动程序,因此与此类工具的集成是无缝的。通过SQuirrel,您可以在SQL选项卡中发出SQL语句(创建表,插入数据,运行查询),并在“对象”选项卡中检查表元数据(即列表,列,主键和类型)。

1、下载客户端 download

2、安装SQuirrel

java -jar squirrel-sql-3.9.1-standard.jar

3、配置Phoenix客户端jar包
将apache-phoenix-5.0.0-HBase-2.0-bin.tar.gz中的phoenix-5.0.0-HBase-2.0-client.jarjar包拷贝到squirrel-sql-3.9.1\lib文件夹下。

4、启动SQuirrel

在安装目录下,双击打开 squirrel-sql.bat


squirrel目录结构

5、添加一个Phoenix驱动

设置zk地址和Phoenix驱动包信息

添加Phoenix驱动

6、添加本地hosts

打开文件 c:\windows\system32\drivers\etc\hosts
添加以下内容:

10.16.72.63 mvxl2429

7、测试Phoenix连接

测试Phoenix连接

8、连接Phoenix

连接Phoenix

实现DataX数据同步

DataX是阿里开源的一个数据同步中间件。在阿里巴巴集团内被广泛使用的离线数据同步工具/平台,实现包括 MySQL、Oracle、SqlServer、Postgre、HDFS、Hive、ADS、HBase、TableStore(OTS)、MaxCompute(ODPS)、DRDS 等各种异构数据源之间高效的数据同步功能。Phoenix数据同步文档点击这里

1、开启Phoenix的瘦客户端

使用DataX同步需要用到Phoenix的瘦客户端连接。这里似乎有个BUG,如果该表了zk的端口,则无法连接。

[root@local14 bin]# ./queryserver.py start
starting Query Server, logging to /tmp/phoenix/phoenix-root-queryserver.log

2、添加DataX插件

下载打包好的DataX并没有phoenix5的同步插件,需要下载源码重新编译。DataX下载地址

[root@gz-tencent ~]# git clone https://github.com/alibaba/DataX.git
Cloning into 'DataX'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 1657 (delta 1), reused 4 (delta 0), pack-reused 1650
Receiving objects: 100% (1657/1657), 11.55 MiB | 113.00 KiB/s, done.
Resolving deltas: 100% (303/303), done.
[root@gz-tencent ~]# cd DataX ; mvn clean package -DskipTests assembly:assembly
[root@gz-tencent writer]# ll /root/DataX/hbase20xsqlwriter/target/datax/plugin/writer/hbase20xsqlwriter
total 32
-rw-r--r-- 1 root root 20401 Jul 29 23:30 hbase20xsqlwriter-0.0.1-SNAPSHOT.jar
drwxr-xr-x 2 root root  4096 Jul 29 23:30 libs
-rw-r--r-- 1 root root   262 Jul 29 23:30 plugin_job_template.json
-rwxr-xr-x 1 root root   263 Jul 29 23:30 plugin.json

3、添加DataX同步配置

[admin@mvxl2429 ~]$ cat /mnt/yst/datax/job/mysql2phoenix.json 
{
    "job": {
        "setting": {
            "speed": {
                "channel": 16,
                "byte": 1048576,
                "record": 10000
            }
        },
        "content": [{
            "reader": {
                "name": "mysqlreader",
                "parameter": {
                    "username": "root",
                    "password": "root632",
                    "column": ["*"],
                    "splitPk": "id",
                    "connection": [{
                        "table": ["f_order_report_701"],
                        "jdbcUrl": ["jdbc:mysql://你的地址:3306/你的库"]
                    }]
                }
            },
            "writer": {
                "name": "hbase20xsqlwriter",
                "parameter": {
                    "batchSize": "100",
                    "column": ["ID","OUTER_ORDER_ID","SHOP_NAME"],
                    "queryServerAddress": "http://127.0.0.1:8765",
                    "nullMode": "skip",
                    "table": "F_ORDER_REPORT_701"
                }
            }
        }]
    }
}

3、跑一个验证下

[root@local14 bin]# ./datax.py /root/data/datax/job/mysql2phoenix.json 

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2019-07-26 22:56:05.927 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2019-07-26 22:56:05.936 [main] INFO  Engine - the machine info  => 

    osInfo: Oracle Corporation 1.8 25.181-b13
    jvmInfo:    Linux amd64 3.10.0-862.11.6.el7.x86_64
    cpu num:    1

    totalPhysicalMemory:    -0.00G
    freePhysicalMemory: -0.00G
    maxFileDescriptorCount: -1
    currentOpenFileDescriptorCount: -1

    GC Names    [Copy, MarkSweepCompact]

    MEMORY_NAME                    | allocation_size                | init_size                      
    Eden Space                     | 273.06MB                       | 273.06MB                       
    Code Cache                     | 240.00MB                       | 2.44MB                         
    Survivor Space                 | 34.13MB                        | 34.13MB                        
    Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
    Metaspace                      | -0.00MB                        | 0.00MB                         
    Tenured Gen                    | 682.69MB                       | 682.69MB                       


2019-07-26 22:56:05.978 [main] INFO  Engine - 
{
    "content":[
        {
            "reader":{
                "name":"mysqlreader",
                "parameter":{
                    "column":[
                        "id",
                        "outer_order_id",
                        "shop_name"
                    ],
                    "connection":[
                        {
                            "jdbcUrl":[
                                ""jdbc:mysql://你的地址:3306/你的库""
                            ],
                            "table":[
                                "f_order_report_701"
                            ]
                        }
                    ],
                    "password":"*******",
                    "splitPk":"id",
                    "username":"root"
                }
            },
            "writer":{
                "name":"hbase20xsqlwriter",
                "parameter":{
                    "batchSize":"100",
                    "column":[
                        "ID",
                        "OUTER_ORDER_ID",
                        "SHOP_NAME"
                    ],
                    "nullMode":"skip",
                    "queryServerAddress":"http://127.0.0.1:8765",
                    "table":"F_ORDER_REPORT_701"
                }
            }
        }
    ],
    "entry":{
        "jvm":"-Xms2048m -Xmx2048m"
    },
    "setting":{
        "speed":{
            "byte":1048576,
            "channel":16,
            "record":10000
        }
    }
}

2019-07-26 22:56:06.008 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2019-07-26 22:56:06.011 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2019-07-26 22:56:06.011 [main] INFO  JobContainer - DataX jobContainer starts job.
2019-07-26 22:56:06.023 [main] INFO  JobContainer - Set jobId = 0
2019-07-26 22:56:06.850 [job-0] INFO  OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://172.16.14.155:3306/ins_632_prd?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true.
2019-07-26 22:56:06.925 [job-0] INFO  OriginalConfPretreatmentUtil - table:[f_order_report_701] has columns:[id,outer_create_time,order_apply_id,outer_order_id,order_refund_type,shop_name,item_id,item_name,item_cnt,item_amount,adjust_amount,discount_amount,retail_price,settlement_amount,packing_return_id,allocated_time,pay_order_no,invoice_code,store_name,store_type,logistics_com_code,current_status,paragraph_time,receiver_province_name,invoice_type,shiped_return_time,invoice_time,is_return_invoice,category_name,source,volume,weight,create_time,trade_from,operate_time,tag,cate_id,outer_shop_id,platform_id,shop_id,order_item_id,store_id,franchiser,seller_memo,parent_order_id,write_off_amount,sales_order_id,replace_order_id,f_order_type,status,order_type,business_type,purchase_type,order_attribute,order_identifying,customer_code,customer_name,supplier_code,franchiser_name,item_type,received_time,ec_reimbursement_code,apply_type,sales_channel,gift_card_amount,customer_bill,purchase_price,purchase_amount,payment_amount,update_time,order_create_time,business_name,sale_type,order_payment_amount,order_purchase_amount,document_id,item_price,order_id,parent_outer_order_id,order_fx_relation_id,real_order_id,cims_entity_id,cims_entity_name,sales_center_code,outer_store_code,is_invoice,ou_id,outer_store_name,sales_center_name,compare_id,account_time,is_adjust,unique_id,is_group,operator_mip,commodity_type,receiver_city_name,tax_code,cashback_amount,wait_settle_store,brand].
2019-07-26 22:56:07.937 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2019-07-26 22:56:07.937 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
2019-07-26 22:56:07.938 [job-0] INFO  JobContainer - DataX Writer.Job [hbase20xsqlwriter] do prepare work .
2019-07-26 22:56:07.938 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2019-07-26 22:56:07.939 [job-0] INFO  JobContainer - Job set Max-Byte-Speed to 1048576 bytes.
2019-07-26 22:56:07.939 [job-0] INFO  JobContainer - Job set Max-Record-Speed to 10000 records.
2019-07-26 22:56:07.944 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
2019-07-26 22:56:07.944 [job-0] INFO  JobContainer - DataX Writer.Job [hbase20xsqlwriter] splits to [1] tasks.
2019-07-26 22:56:07.980 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2019-07-26 22:56:07.986 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2019-07-26 22:56:07.987 [job-0] INFO  JobContainer - Running by standalone Mode.
2019-07-26 22:56:08.029 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2019-07-26 22:56:08.033 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2019-07-26 22:56:08.034 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2019-07-26 22:56:08.099 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2019-07-26 22:56:08.116 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Begin to read record by Sql: [select id,outer_order_id,shop_name from f_order_report_701 
] jdbcUrl:[jdbc:mysql://172.16.14.155:3306/ins_632_prd?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2019-07-26 22:56:08.253 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Finished read record by Sql: [select id,outer_order_id,shop_name from f_order_report_701 
] jdbcUrl:[jdbc:mysql://172.16.14.155:3306/ins_632_prd?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2019-07-26 22:56:08.511 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[461]ms
2019-07-26 22:56:08.512 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2019-07-26 22:56:18.027 [job-0] INFO  StandAloneJobContainerCommunicator - Total 30 records, 829 bytes | Speed 82B/s, 3 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2019-07-26 22:56:18.027 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2019-07-26 22:56:18.027 [job-0] INFO  JobContainer - DataX Writer.Job [hbase20xsqlwriter] do post work.
2019-07-26 22:56:18.028 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do post work.
2019-07-26 22:56:18.028 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2019-07-26 22:56:18.029 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /root/data/datax/hook
2019-07-26 22:56:18.030 [job-0] INFO  JobContainer - 
     [total cpu info] => 
        averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
        -1.00%                         | -1.00%                         | -1.00%
                        

     [total gc info] => 
         NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
         Copy                 | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
         MarkSweepCompact     | 1                  | 1                  | 1                  | 0.075s             | 0.075s             | 0.075s             

2019-07-26 22:56:18.031 [job-0] INFO  JobContainer - PerfTrace not enable!
2019-07-26 22:56:18.031 [job-0] INFO  StandAloneJobContainerCommunicator - Total 30 records, 829 bytes | Speed 82B/s, 3 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2019-07-26 22:56:18.032 [job-0] INFO  JobContainer - 
任务启动时刻                    : 2019-07-26 22:56:06
任务结束时刻                    : 2019-07-26 22:56:18
任务总计耗时                    :                 12s
任务平均流量                    :               82B/s
记录写入速度                    :              3rec/s
读出记录总数                    :                  30
读写失败总数                    :                   0

4、查看Phoenix中的数据记录

0: jdbc:phoenix:thin:url=http://localhost:876> select count(1) from F_ORDER_REPORT_701;
+-----------+
| COUNT(1)  |
+-----------+
| 30        |
+-----------+
1 row selected (0.069 seconds)
上一篇下一篇

猜你喜欢

热点阅读