记一次Hive任务hang住的问题(2)

2020-03-21  本文已影响0人  井地儿

1 背景

线上Hive任务偶尔出现hang住的现象.
经排查确认是触发了Hive的bug, 该bug在Hive-10569(https://issues.apache.org/jira/browse/HIVE-10569中已修复.

2 现象

用户beeline任务运行至Ended Job之后,一直未结束,查看yarn日志,该job已经运行成功.

INFO  : 2020-03-20 20:25:21,401 Stage-36 map = 100%,  reduce = 0%, Cumulative CPU 3.96 sec
INFO  : 2020-03-20 20:25:30,224 Stage-36 map = 100%,  reduce = 100%, Cumulative CPU 9.44 sec
INFO  : MapReduce Total cumulative CPU time: 9 seconds 440 msec
INFO  : Ended Job = job_1582793079899_13676666INFO  : 2020-03-20 20:25:21,401 Stage-36 map = 100%,  reduce = 0%, Cumulative CPU 3.96 sec
INFO  : 2020-03-20 20:25:30,224 Stage-36 map = 100%,  reduce = 100%, Cumulative CPU 9.44 sec
INFO  : MapReduce Total cumulative CPU time: 9 seconds 440 msec
INFO  : Ended Job = job_1582793079899_13676666

3 排查步骤

3.1 查看hiveserver2日志

找到该job在hive server端的日志,没有发现任何异常.

3.2 抓取现场

分别通过jstack -l <pid>jmap -dump:[live,]format=b,file=<filename> <pid>获取server端的堆栈信息.

3.3 定位排查

本案例中线程栈如下:

"HiveServer2-Background-Pool: Thread-10663039" #10663039 prio=5 os_prio=0 tid=0x00007fcd94c87000 nid=0x213ed in Object.wait() [0x00007fcbec2fb000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.hive.ql.DriverContext.pollFinished(DriverContext.java:118)
        - locked <0x00007fd29eff91d0> (a org.apache.hadoop.hive.ql.DriverContext)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1574)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1113)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1108)
        at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:157)
        at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:70)
        at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:209)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1923)
        at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:221)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

3.4 源码剖析

查看源码位置, 分析可知runner.isRunning状态未更新会导致hang住不结束.

  /**
   * Polls running tasks to see if a task has ended.
   *
   * @return The result object for any completed/failed task
   */
  public synchronized TaskRunner pollFinished() throws InterruptedException {
    while (!shutdown) {
      Iterator<TaskRunner> it = running.iterator();
      while (it.hasNext()) {
        TaskRunner runner = it.next();
        if (runner != null && !runner.isRunning()) {
          it.remove();
          return runner;
        }
      }
      wait(SLEEP_TIME);
    }
    return null;
  }

4 解决方案

查找社区issue发现, 该bug已经在1.3版本中进行了修复.(https://issues.apache.org/jira/browse/HIVE-10569
后续 backport 该patch即可.

上一篇 下一篇

猜你喜欢

热点阅读