05 HBase Application Architecture


HBase Dependencies

Only the distributed deployment is discussed here.
HDFS (not strictly required; HBase can, for example, run on Amazon S3), the JDK, and ZooKeeper (note that HBase 2.0 has been trying to reduce its dependency on ZooKeeper, with the HMaster taking over some of the information tracking that ZooKeeper currently does).

Every node where HBase is deployed also needs to run HDFS, but not every node running HDFS has to run HBase. It is nonetheless recommended to run HBase on all of them; otherwise the cluster may end up unbalanced.

I. HBase Principles

Table Format

Tip: columns can be created dynamically at insert time; only the column families need to be defined when the table is created.
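As a minimal sketch with the Java client (the "user" table, the "info" family, and the qualifiers below are made-up names), only the column family has to exist in advance; the qualifiers come into existence with the Put itself:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicColumnExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            Put put = new Put(Bytes.toBytes("row-001"));
            // "age" and "city" were never declared anywhere:
            // the qualifiers are created by this write.
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("30"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Beijing"));
            table.put(put);
        }
    }
}
```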

Table Storage

[Figure: table storage layout]

Note that the blocks mentioned in the figure above are not the same thing as HDFS blocks: one HDFS block can contain several HFile blocks.

HFiles

Block

Larger blocks produce relatively fewer index entries and favor sequential access to the table, while smaller blocks produce more index entries and favor random access.
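The block size can be tuned per column family. Below is a minimal sketch with the HBase 2.x Java API (the table name, family name, and the 16 KB value are arbitrary illustration choices, not recommendations; the default block size is 64 KB):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("metrics"))
                    .setColumnFamily(
                        ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("d"))
                            // 16 KB blocks: more index entries, better random reads
                            .setBlocksize(16 * 1024)
                            .build())
                    .build());
        }
    }
}
```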
The main HFile block types are data blocks, index blocks (leaf, intermediate, and root), Bloom filter blocks, and the file-info and trailer blocks.

Cells

Each column is stored individually instead of an entire row being stored on its own. Because those values can be inserted at different times, they might end up in different files in HDFS.

HBase has a key-compression mechanism: at a high level, only the delta between the current key and the previous key is stored. For every table this compression saves space, at the small cost of having to rebuild the current key from the previous one when reading. Because columns are stored separately, many cells of a wide table share most of their key, so shrinking the keys saves a lot of space. This feature is called data block encoding.

Data block encoding

This is an HBase feature whereby keys are encoded and compressed based on the previous key. One of the encoding options (FAST_DIFF) asks HBase to store only the difference between the current key and the previous one. HBase stores each cell individually, with its key and value; when a row has many cells, much space can be consumed by writing the same key for each cell, so activating data block encoding can allow important space savings. It is almost always helpful, so if you are not sure, activate FAST_DIFF.
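A minimal sketch of activating FAST_DIFF on an existing column family with the HBase 2.x Java API ("mytable" and "cf" are placeholder names); existing HFiles pick the encoding up as compactions rewrite them:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableFastDiff {
    public static void main(String[] args) throws Exception {
        TableName table = TableName.valueOf("mytable");
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Start from the current family definition so other settings are preserved.
            ColumnFamilyDescriptor current =
                admin.getDescriptor(table).getColumnFamily(Bytes.toBytes("cf"));
            admin.modifyColumnFamily(table,
                ColumnFamilyDescriptorBuilder.newBuilder(current)
                    .setDataBlockEncoding(DataBlockEncoding.FAST_DIFF)
                    .build());
        }
    }
}
```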

Internal Table Operations

HBase scales horizontally based on three mechanisms: compactions, splits (the opposite of compactions), and balancing.

Compaction

When a memstore is full, it is flushed to disk, which over time produces many small files in HDFS. When certain conditions are met, HBase selects some of those files and compacts them together into one bigger file.
There are two types of compaction: minor compactions, which merge only a subset of a store's files, and major compactions, which rewrite all the files of a store into a single one and drop deleted or expired cells.

If you really have a very big cluster with many tables and regions, it is recommended to implement a process that checks the number of files per region and the age of the oldest one, and triggers compactions at the region level only if there are more files than you want or if the oldest file (even if there is just one) is older than a configured period (a week is a good starting point).
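A rough sketch of such a process with the HBase 2.x Admin API (the file-count threshold is arbitrary; checking the age of the oldest HFile would additionally require listing the region's files in HDFS and is left out here):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionMetrics;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SelectiveCompaction {
    public static void main(String[] args) throws Exception {
        final int maxFilesPerRegion = 6; // arbitrary starting point
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            for (ServerName server : admin.getClusterMetrics().getLiveServerMetrics().keySet()) {
                for (RegionMetrics region : admin.getRegionMetrics(server)) {
                    if (region.getStoreFileCount() > maxFilesPerRegion) {
                        // Compact only this region rather than the whole table.
                        admin.majorCompactRegion(region.getRegionName());
                    }
                }
            }
        }
    }
}
```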

Splits (Auto-Sharding)

Split operations are the opposite of compactions.
During a compaction, if not too many values are dropped, the result is a bigger file; and the bigger the input files, the longer they take to parse.

When one of the column families of a region (a region can contain several column families) reaches this size (hbase.hregion.max.filesize, 10 GB by default since HBase 0.94), HBase triggers a split of the given region into two new regions in order to improve load balancing. (In other words, even if the other column families take up very little space, one column family growing too large is enough to get the whole region split.)

Keep in mind that HBase will split all the column families. Even if your first column family has reached the 10 GB threshold while the second one contains only a few rows or kilobytes, both of them will be split.

Different columns of the same row are never split into different regions. So be aware that if you have very many columns, or very large ones, such that a single row is bigger than the maximum configured region size, HBase will not split it.
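The split threshold can also be configured per table rather than cluster-wide (hbase.hregion.max.filesize). A minimal sketch with the HBase 2.x API ("events" is a placeholder table name and 20 GB an arbitrary value):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class SplitThresholdExample {
    public static void main(String[] args) throws Exception {
        TableName table = TableName.valueOf("events");
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.modifyTable(
                TableDescriptorBuilder.newBuilder(admin.getDescriptor(table))
                    // Split regions of this table at ~20 GB instead of the default.
                    .setMaxFileSize(20L * 1024 * 1024 * 1024)
                    .build());
        }
    }
}
```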

Splits come at a cost. When a region is split, it loses data locality until the next compaction, and this affects read performance: the client still reaches the RegionServer hosting the region, but from there the data has to be queried over the network to serve the request. Also, the more regions you have, the more pressure you put on the master, the hbase:meta table, and the RegionServers.

Balancing

Regions get split, servers might fail, and new servers might join the cluster, so at some point the load may no longer be well distributed across all your RegionServers. To help maintain a good distribution, every five minutes (the default schedule) the HBase Master runs a load balancer to ensure that all the RegionServers are managing and serving a similar number of regions.

When a region is moved by the balancer from one server to a new one, it will be unavailable for a few milliseconds, and it will lose its data locality until it gets major compacted.
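The balancer can also be driven manually, for example to pause it around a bulk load. A minimal sketch with the HBase 2.x Admin API (the five-minute schedule itself is controlled by hbase.balancer.period):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class BalancerExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Disable the balancer, waiting for in-flight region moves to finish.
            admin.balancerSwitch(false, true);
            // ... bulk load, maintenance window, etc. ...
            admin.balancerSwitch(true, false); // re-enable it
            admin.balance();                   // ask for an immediate balancing pass
        }
    }
}
```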


HBase Roles


HMaster

A cluster can have more than one HMaster. Unlike a RegionServer, the HMaster does not carry much load, so it can be installed on a machine with little memory and few cores.

The disks on the machines hosting RegionServers generally do not need RAID or dual power supplies, but making the HBase Masters more reliable is worthwhile: building the HBase Masters (and the other master services such as the NameNodes, ZooKeeper, etc.) on robust hardware, with the OS on RAID drives, dual power supplies, and so on, is highly recommended.

A cluster can survive without a master server as long as no RegionServer fails and no region splits occur during that time.

RegionServer

Note: reads and writes of HBase data do not have to go through the HMaster each time; most of the time the client talks directly to the RegionServers.

When a client tries to read data from HBase for the first time, it will first go to ZooKeeper to find the master server and locate the hbase:meta region where it will locate the region and RegionServer it is looking for. In subsequent calls from the same client to the same region, all those extra calls are skipped, and the client will talk directly with the related RegionServer. This is why it is important, when possible, to reuse the same client for multiple operations.
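A minimal sketch of that pattern with the Java client (table and row keys are placeholders): the heavyweight Connection is created once and shared, while Table instances are lightweight and can be opened per operation. The region locations resolved through ZooKeeper and hbase:meta are cached inside the Connection, so later calls go straight to the right RegionServer:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReuseConnectionExample {
    public static void main(String[] args) throws Exception {
        // One Connection for the whole application lifetime.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            for (int i = 0; i < 100; i++) {
                // Table handles are cheap; cached region locations are reused.
                try (Table table = conn.getTable(TableName.valueOf("user"))) {
                    Result r = table.get(new Get(Bytes.toBytes("row-" + i)));
                    System.out.println(r.isEmpty() ? "miss" : "hit");
                }
            }
        }
    }
}
```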

Technically, more than one RegionServer can run on a single physical host, but it is usually recommended to deploy at most one RegionServer per physical host.

II. HBase Ecosystem

1. Monitoring Tools

Hadoop and HBase are configured through XML files, so you can install them by hand, but in practice automated configuration management tools such as Puppet or Chef are used, together with monitoring tools such as Ganglia or Cacti.

That said, there are two tools in the Hadoop ecosystem that can help deploy an HBase cluster: Cloudera Manager and Apache Ambari. Both of them can deploy, monitor, and manage the entire Hadoop suite.

2. SQL

Most of the tools offering SQL functionality on the Hadoop market are aimed primarily at business intelligence (BI) use cases.

2.1 Phoenix

Even with only a couple of years in the Apache Foundation, Phoenix has seen a nice adoption rate and is quickly becoming the de facto tool for SQL queries over HBase. Phoenix's main competitors are Hive and Impala. (Impala is a query engine developed under Cloudera's lead that offers SQL semantics and can query petabyte-scale data stored in HDFS and HBase. Hive also offers SQL semantics, but because its execution layer is the MapReduce engine it remains a batch process and has trouble delivering interactive queries; by contrast, Impala's biggest feature and selling point is its speed.)

Phoenix has been able to establish itself as a superior tool through tighter integration with HBase, leveraging coprocessors, range scans, and custom filters. Hive and Impala were both built for full file scans in HDFS, which can greatly impact performance, whereas HBase was designed for single point gets and range scans.
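A minimal sketch of querying HBase through Phoenix's JDBC driver (the ZooKeeper quorum, table, and columns are placeholders; the phoenix-client jar has to be on the classpath): the SQL is translated into HBase scans and gets, with filtering and aggregation pushed down to the RegionServers via coprocessors:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQueryExample {
    public static void main(String[] args) throws Exception {
        // The JDBC URL points at the ZooKeeper quorum of the HBase cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT host, COUNT(*) FROM web_stat GROUP BY host")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}
```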

Finally, Hive and Apache Impala are query engines designed to run full-table or partitioned scans against HDFS. Both have HBase storage handlers that allow them to connect to HBase and perform SQL queries. They tend to pull more data than the other systems, which greatly increases query times. Hive or Impala make sense when a small set of reference data lives in HBase, or when the queries are not bound by SLAs.
