Analyzing Logs Seen on a Hadoop Cluster


HBase warning log: Slow ReadProcessor

While going through the cluster logs today I noticed quite a few warning messages. The log entry looks like this:

2020-06-28 21:46:53,769 WARN  [ResponseProcessor for block BP-213784990-192.168.69.71-1571291895338:blk_1081938655_8197927] hdfs.DFSClient: Slow ReadProcessor read fields took 127841ms (threshold=30000ms); ack: seqno: 4 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 2520459 flag: 0 flag: 0 flag: 0, targets: [DatanodeInfoWithStorage[192.168.69.71:50010,DS-ee07bf58-7041-4cf9-996a-564371406acb,DISK], DatanodeInfoWithStorage[192.168.69.72:50010,DS-954cd3c5-4efb-45e6-bea8-85284338d1e4,DISK], DatanodeInfoWithStorage[192.168.69.73:50010,DS-7495e832-f9ad-4532-94d6-ae5ff9b9d9bc,DISK]]

From the log above you can see that the warning clearly points to "hdfs.DFSClient". Since HBase stores its data files on HDFS, the warning shows up in the HBase RegionServer log, but the actual problem lies in HDFS. If you dig into the DataNode logs on HDFS you will usually find plenty of messages containing "Slow BlockReceiver", "Slow flushOrSync" and the like. In most cases this means the network or the disks are a bottleneck or failing, although it can also be a single problematic DataNode causing these warnings across the whole cluster.
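
A quick way to check whether one DataNode dominates the warnings is to count, per pipeline target, how often the message appears in the RegionServer log. This is only a sketch: regionserver.log is a placeholder for your actual RegionServer log file, and the pattern simply relies on the DatanodeInfoWithStorage[ip:port,...] targets shown in the warning above.

# count Slow ReadProcessor warnings per DataNode listed in the pipeline targets
grep "Slow ReadProcessor" regionserver.log \
  | grep -o "DatanodeInfoWithStorage\[[0-9.]*:[0-9]*" \
  | sort | uniq -c | sort -rn

If one node shows up far more often than the others, that DataNode's disks and network are the first place to look.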

Messages that suggest a network problem:
Slow BlockReceiver write packet to mirror took / Slow PacketResponder send ack to upstream took

Messages that suggest a disk problem:
Slow manageWriterOsCache took / Slow BlockReceiver write data to disk cost / Slow flushOrSync took

To get concrete numbers, run the following command against the DataNode logs:

egrep -o "Slow.*?(took|cost)"  DataNodeLogPath | sort | uniq -c

to tally the different slow operations; that usually tells you whether the current HDFS bottleneck is the disks or the network. You can also use tools such as iotop, nload, and top to look at the machine's load, I/O, and network traffic directly.
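
Putting it together, a rough end-to-end check might look like the sketch below. The DataNode log path is only a placeholder for whatever your distribution uses, and iostat/iotop/nload are the usual system tools mentioned above (iostat comes from the sysstat package):

DN_LOG=/var/log/hadoop/hdfs/hadoop-hdfs-datanode.log   # placeholder path, adjust to your install
# tally the different kinds of "Slow ... took/cost" messages in the DataNode log
egrep -o "Slow.*?(took|cost)" "$DN_LOG" | sort | uniq -c | sort -rn

# then look at the machine itself
iostat -x 1 5     # per-disk utilization and latency
iotop -o          # processes currently doing I/O
nload             # live network throughput per interface

If the disk-related messages dominate the tally, start with the disks; if the mirror/ack messages dominate, start with the network between the DataNodes.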

HBase log: InvalidToken

The log entries look like this:

 2020-06-28 17:46:34,768 INFO  [RpcServer.FifoWFPBQ.default.handler=107,queue=2,port=16020] hdfs.DFSClient: Access token was invalid when connecting to /192.168.69.71:50010 : org.apache.hadoop.security.token.SecretManager$InvalidToken: access control error while attempting to set up short-circuit access to /apps/hbase/data/data/default/OBSV_SURF_AWOS_POINT_ELE/f503b07a80b8847be28a4728ee80cedf/0/ce5bfc46cdc34523b91375603d6fda1f

2020-06-28 17:46:34,766 INFO  [RpcServer.FifoWFPBQ.default.handler=107,queue=2,port=16020] shortcircuit.ShortCircuitCache: ShortCircuitCache(0x8bc36df): could not load 1081613847_BP-213784990-192.168.69.71-1571291895338 due to InvalidToken exception.
org.apache.hadoop.security.token.SecretManager$InvalidToken: access control error while attempting to set up short-circuit access to /apps/hbase/data/data/default/OBSV_SURF_AWOS_POINT_ELE/f503b07a80b8847be28a4728ee80cedf/0/ce5bfc46cdc34523b91375603d6fda1f
        at org.apache.hadoop.hdfs.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:589)
        at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:488)
        at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
        at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:669)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:888)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:945)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:683)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1428)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1642)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1515)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:439)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:269)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:636)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:586)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:264)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:170)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:363)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:217)
        at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2136)
        at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2124)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:5707)
        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2716)
        at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2702)
        at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2684)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2401)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32385)

This is not a warning, just an INFO message, but it also shows up fairly often in the RegionServer logs. This noob spent half a day fiddling with the HDFS short-circuit configuration and chasing through the source code, and still got nowhere. Luckily, I found a fairly solid answer on the Cloudera community forum, where the reply was:

this looks scarier than it is. This is an HDFS client-level exception which is expected and you should never see happening. The client will automatically fetch a new token when this happens. It is a bug that it is logged for you to see. This is not a problem nor anything you need to fix.

In other words: the message looks scary, but it is not actually a problem. It is merely logged where you happen to see it; when the token expires, the client fetches a new one automatically. This is not an issue and there is nothing you need to fix. iヽ(o_ _)o *falls over*
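
Since I had already been poking at the short-circuit read settings, one small thing worth doing is simply confirming what they are actually set to on a client node, rather than changing anything. A minimal sketch using the standard hdfs getconf command; the two property names are the usual HDFS short-circuit read settings:

# check whether short-circuit reads are enabled and which domain socket path they use
hdfs getconf -confKey dfs.client.read.shortcircuit
hdfs getconf -confKey dfs.domain.socket.path

Either way, as the answer above says, the InvalidToken INFO message by itself does not mean short-circuit reads are broken.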

Phoenix client log: AsyncProcess

The output looks like this:

2020-06-29 19:00:35 [pool-7-thread-1] [INFO ] AsyncProcess[1566 line] - #1, waiting for 10000  actions to finish
2020-06-29 19:02:23 [pool-7-thread-1] [INFO ] AsyncProcess[1566 line] - #1, waiting for 10000  actions to finish
2020-06-29 19:02:43 [pool-7-thread-1] [INFO ] AsyncProcess[1566 line] - #1, waiting for 10000  actions to finish
2020-06-29 19:03:38 [pool-7-thread-1] [INFO ] AsyncProcess[1566 line] - #1, waiting for 10000  actions to finish

This output actually comes from the HBase client; it is just that I am using Phoenix here. Seeing a large number of these messages means the RegionServers cannot keep up: data is being written faster than the servers can process it. This may be because the table is in the middle of a split/compaction, or because of GC pauses. The concrete tuning depends on your workload and on what is actually going wrong on the server side, so it needs further investigation.
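
When I run into this, a quick first pass on the server side is to check whether the RegionServers were pausing for GC or busy compacting around those timestamps. A rough sketch, with the RegionServer log path below used only as a placeholder for your own install:

RS_LOG=/var/log/hbase/hbase-regionserver.log   # placeholder path, adjust to your install
# long JVM/host pauses (usually GC) reported by the pause monitor
grep "JvmPauseMonitor" "$RS_LOG" | tail
# compaction and split activity in the same time window
grep -iE "compaction|split" "$RS_LOG" | tail
# cluster-wide view of per-RegionServer load
echo "status 'detailed'" | hbase shell

If the server side looks healthy, the other direction is to slow the client down, for example by writing smaller batches or committing less data per flush on the Phoenix side.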

Wrapping up

This post is mainly a record of a few log messages I saw on the cluster today. If you are interested, feel free to leave a message on my WeChat official account "喜欢打游戏的摸鱼怪". iヾ(o◕∀◕)ノヾ
