Notes on Analyzing a Production Crash

2020-03-03, by 天地一蜉蝣_6e86

After an upgrade, one of our applications kept crashing. A Java process can crash for several reasons:

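A problem in the Java program itself: an OOM causes the process to crash
Troubleshooting steps:
    1. Check whether the JVM options -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=*/java.hprof are set;
    2. Check whether a dump file was produced at the path given by HeapDumpPath;
    3. If a dump file exists, analyze it with a tool such as jhat or VisualVM.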
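The first two steps can be checked mechanically. Below is a minimal Python sketch, assuming the JVM command line has already been captured as a string (for example from ps output). The function names heap_dump_settings and dump_exists and the sample command line are hypothetical, introduced only for illustration.

```python
import os
import shlex


def heap_dump_settings(cmdline):
    """Return (dump_on_oom, dump_path) parsed from a JVM command line string."""
    dump_on_oom = False
    dump_path = None
    for arg in shlex.split(cmdline):
        if arg == "-XX:+HeapDumpOnOutOfMemoryError":
            dump_on_oom = True
        elif arg == "-XX:-HeapDumpOnOutOfMemoryError":
            dump_on_oom = False
        elif arg.startswith("-XX:HeapDumpPath="):
            dump_path = arg.split("=", 1)[1]
    return dump_on_oom, dump_path


def dump_exists(dump_path):
    """Step 2: report whether a heap dump file is present at the configured path."""
    return dump_path is not None and os.path.isfile(dump_path)


if __name__ == "__main__":
    # Hypothetical command line, for illustration only.
    sample = ("java -Xmx2g -XX:+HeapDumpOnOutOfMemoryError "
              "-XX:HeapDumpPath=/tmp/java.hprof -jar app.jar")
    enabled, path = heap_dump_settings(sample)
    print(enabled, path, dump_exists(path))
```

If HeapDumpPath names a directory rather than a file, the JVM writes java_pid<pid>.hprof inside it; this sketch does not look for that case.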
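A JVM fault: a bug in the JVM or the JDK itself causes the crash
When the JVM hits a fatal error it writes an error file, hs_err_pid<pid>.log (called hs_err_pid.log in the rest of this post). The file contains the key information about what caused the JVM to crash, and analyzing it can lead to the root cause, which can then be fixed to keep the system stable. By default the file is written to the process's working directory, but the location can be changed with the JVM option -XX:ErrorFile.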
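Collecting these files from a directory is easy to script. The following is a small Python sketch, assuming -XX:ErrorFile was not set so the files sit in the working directory under the default hs_err_pid*.log name. The helper name find_error_files is hypothetical.

```python
import glob
import os


def find_error_files(directory):
    """Return hs_err_pid*.log paths found in `directory`, newest first."""
    pattern = os.path.join(directory, "hs_err_pid*.log")
    return sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)


if __name__ == "__main__":
    # Assumes the JVM was started from the current directory;
    # pass a different directory if -XX:ErrorFile points elsewhere.
    for path in find_error_files("."):
        print(path)
```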
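The process is killed by the operating system's oom-killer
Check the operating system log with sudo grep --color "java" /var/log/messages to determine whether the Java process was killed by the operating system.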
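The same check can be made a little more specific than a plain grep for "java". The Python sketch below looks for the kernel's "Killed process <pid> (<name>)" message. The exact wording varies between kernel versions, so the pattern is an assumption, and the sample log lines are made up for illustration rather than taken from the incident.

```python
import re

# Assumption: the kernel logs "Killed process <pid> (<name>)" when the
# OOM killer acts. Other kernel versions may word this differently.
KILLED = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def oom_killed(lines, name="java"):
    """Return the PIDs of processes called `name` reported as killed."""
    pids = []
    for line in lines:
        match = KILLED.search(line)
        if match and match.group(2) == name:
            pids.append(int(match.group(1)))
    return pids


if __name__ == "__main__":
    # Illustrative log lines only, not real output from the incident.
    sample = [
        "Mar  3 10:00:01 host kernel: Killed process 1234 (java) total-vm:8388608kB",
        "Mar  3 10:00:02 host kernel: Killed process 4242 (python) total-vm:1024kB",
    ]
    print(oom_killed(sample))
```

To run it against the real log, pass an open file object for /var/log/messages instead of the sample list; reading that file usually requires root.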
On the production hosts there were a large number of hs_err_pid.log files:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f847f9bf641, pid=36367, tid=0x00007f844b3f5700
#
# JRE version: OpenJDK Runtime Environment (Zulu 8.38.0.13-CA-linux64) (8.0_212-b04) (build 1.8.0_212-b04)
# Java VM: OpenJDK 64-Bit Server VM (25.212-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libc.so.6+0x16f641]  __strlen_sse2_pminub+0x11
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   http://www.azulsystems.com/support/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

---------------  T H R E A D  ---------------

Current thread (0x00007f834c001000):  JavaThread "AgentMonitor-42" [_thread_in_native, id=36538, stack(0x00007f844b2f5000,0x00007f844b3f6000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000

Registers:
RAX=0x0000000000000000, RBX=0x0000000000000016, RCX=0x0000000000000000, RDX=0x0000000000000000
RSP=0x00007f844b3f16b8, RBP=0x00007f845033a330, RSI=0x00007f844b3f1b90, RDI=0x33261d74e6c20600
R8 =0x00007f844b3f1720, R9 =0x00007f847f89d27d, R10=0x0000000000000002, R11=0x00007f847f9d3df4
R12=0x0000000000000041, R13=0x00007f845033a518, R14=0x00007f84804607e0, R15=0x0000000000000004
RIP=0x00007f847f9bf641, EFLAGS=0x0000000000010283, CSGSFS=0x0000000000000033, ERR=0x0000000000000000
  TRAPNO=0x000000000000000d

Top of Stack: (sp=0x00007f844b3f16b8)
0x00007f844b3f16b8:   00007f848025b6bd 00007f8340567120
0x00007f844b3f16c8:   00007f844b3f1f90 00007f844b3f1f90
0x00007f844b3f16d8:   00007f848025dffb 00007f844b3f1720
...
0x00007f844b3f1898:   00007f845184aba8 00007f834c001000
0x00007f844b3f18a8:   00007f834c001000 00007f844b3f1920 

Instructions: (pc=0x00007f847f9bf641)
0x00007f847f9bf621:   c0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48
0x00007f847f9bf631:   31 c0 89 f9 83 e1 3f 66 0f ef c0 83 f9 30 77 1d
0x00007f847f9bf641:   f3 0f 6f 0f 66 0f 74 c1 66 0f d7 d0 85 d2 0f 85
0x00007f847f9bf651:   4e 02 00 00 48 89 f8 48 83 e0 f0 eb 24 48 89 f8 

Register to memory mapping:

RAX=0x0000000000000000 is an unknown value
RBX=0x0000000000000016 is an unknown value
RCX=0x0000000000000000 is an unknown value
RDX=0x0000000000000000 is an unknown value
RSP=0x00007f844b3f16b8 is pointing into the stack for thread: 0x00007f834c001000
RBP=0x00007f845033a330 is pointing into the stack for thread: 0x00007f838c01a000
RSI=0x00007f844b3f1b90 is pointing into the stack for thread: 0x00007f834c001000
RDI=0x33261d74e6c20600 is an unknown value
R8 =0x00007f844b3f1720 is pointing into the stack for thread: 0x00007f834c001000
R9 =0x00007f847f89d27d: _IO_vfprintf+0x4ccd in /lib64/libc.so.6 at 0x00007f847f850000
R10=0x0000000000000002 is an unknown value
R11=0x00007f847f9d3df4: <offset 0x183df4> in /lib64/libc.so.6 at 0x00007f847f850000
R12=0x0000000000000041 is an unknown value
R13=0x00007f845033a518 is pointing into the stack for thread: 0x00007f838c01a000
R14=0x00007f84804607e0: snoopy_inputdatastorage_data+0 in /usr/lib64/libsnoopy.so at 0x00007f8480255000
R15=0x0000000000000004 is an unknown value

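The file contains the following kinds of key information:
- Log header
- Information about the thread that caused the crash
- Information about all threads
- Safepoint and lock information
- Heap information
- Native code cache
- Compilation events
- GC-related records
- JVM memory map
- JVM startup arguments
- Server information
For a detailed walkthrough, see: https://my.oschina.net/xionghui/blog/498785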
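When there are many such files, the header fields are the quickest thing to compare. Here is a minimal Python sketch that pulls the signal line and the problematic frame out of a header. The regular expressions are written against the header shown in this post and are an assumption about the format, not a complete parser; the name parse_header is hypothetical.

```python
import re

SIGNAL = re.compile(
    r"^#\s+(SIG\w+) \((0x[0-9a-f]+)\) at pc=(0x[0-9a-f]+), pid=(\d+), tid=(\w+)")
FRAME = re.compile(r"^# ([A-Za-z])\s+\[(.+)\+(0x[0-9a-f]+)\]\s*(.*)$")

# Header lines copied from the log shown in this post.
SAMPLE = "\n".join([
    "#  SIGSEGV (0xb) at pc=0x00007f847f9bf641, pid=36367, tid=0x00007f844b3f5700",
    "# Problematic frame:",
    "# C  [libc.so.6+0x16f641]  __strlen_sse2_pminub+0x11",
])


def parse_header(text):
    """Extract the signal, pid and problematic frame from an hs_err header."""
    info = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        sig = SIGNAL.match(line)
        if sig:
            info["signal"] = sig.group(1)
            info["pc"] = sig.group(3)
            info["pid"] = int(sig.group(4))
        # The frame itself is on the line after the "Problematic frame:" label.
        if line.startswith("# Problematic frame:") and i + 1 < len(lines):
            frame = FRAME.match(lines[i + 1])
            if frame:
                info["frame_type"] = frame.group(1)
                info["library"] = frame.group(2)
                info["offset"] = frame.group(3)
                info["symbol"] = frame.group(4).strip()
    return info


if __name__ == "__main__":
    print(parse_header(SAMPLE))
```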
The stack section shows:

Stack: [0x00007f844b2f5000,0x00007f844b3f6000],  sp=0x00007f844b3f16b8,  free space=1009k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)

The problematic frame is marked C (C=native code), which means the JVM crashed while executing native code. strace can be used to trace the system calls (strace: https://blog.csdn.net/rigete/article/details/50055783).

28100 22:03:35.247322 [00007f1a9305b710] open(0x7f1a95f56110, O_RDONLY) = -1 ENOENT (No such file or directory)
28100 22:03:35.247354 [00007f1a9305b710] open(0x7f1a95f571a0, O_RDONLY) = -1 ENOENT (No such file or directory)
28100 22:03:35.247383 [00007f1a9305b710] open(0x7f1a95f561a0, O_RDONLY) = -1 ENOENT (No such file or directory)
28100 22:03:35.247414 [00007f1a9305b710] open(0x7f1a95f57120, O_RDONLY) = -1 ENOENT (No such file or directory)
28100 22:03:35.247444 [00007f1a9305b710] open(0x7f1a95f57230, O_RDONLY) = -1 ENOENT (No such file or directory)
28100 22:03:35.247473 [00007f1a9305b710] open(0x7f1a95f56220, O_RDONLY) = -1 ENOENT (No such file or directory)
28100 22:03:35.247508 [00007f1a9305b9b0] write(2, 0x7ffc86a540b8, 4) = -1 EPIPE (Broken pipe)
28100 22:03:35.247540 [00007f1a9305b9b0] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=28100, si_uid=1001} ---
28100 22:03:35.247622 [????????????????] +++ killed by SIGPIPE +++
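A long strace log is easier to read once it is summarized. The Python sketch below counts failed system calls by name and errno and records the signal that ended the process. The patterns are written against the line format shown above and are assumptions; the name summarize is hypothetical.

```python
import re
from collections import Counter

# A failed call looks like: name(args) = -1 ERRNO (description)
FAILED = re.compile(r"(\w+)\(.*\)\s+= -1 (E[A-Z0-9]+)")
KILLED_BY = re.compile(r"killed by (SIG[A-Z0-9]+)")

# Lines copied from the trace shown in this post.
SAMPLE = [
    "28100 22:03:35.247322 [00007f1a9305b710] open(0x7f1a95f56110, O_RDONLY) = -1 ENOENT (No such file or directory)",
    "28100 22:03:35.247508 [00007f1a9305b9b0] write(2, 0x7ffc86a540b8, 4) = -1 EPIPE (Broken pipe)",
    "28100 22:03:35.247622 [????????????????] +++ killed by SIGPIPE +++",
]


def summarize(lines):
    """Count failed syscalls by (name, errno) and return the fatal signal, if any."""
    failures = Counter()
    fatal = None
    for line in lines:
        failed = FAILED.search(line)
        if failed:
            failures[(failed.group(1), failed.group(2))] += 1
        killed = KILLED_BY.search(line)
        if killed:
            fatal = killed.group(1)
    return failures, fatal


if __name__ == "__main__":
    counts, signal_name = summarize(SAMPLE)
    print(dict(counts), signal_name)
```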

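The strace output was no more informative than the SIGSEGV record in hs_err_pid.log (SIGSEGV means an invalid memory access, such as touching unmapped memory or writing to read-only memory). So I googled the entries in the Register to memory mapping section, where R14 points into /usr/lib64/libsnoopy.so, and found:
https://stackoverflow.com/questions/44922588/hadoop-nodemanager-killed-by-sigsegv
The error information there was identical to ours, so I tried stopping snoopy, and that resolved the crash.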
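Spotting an unexpected library such as libsnoopy.so in the register mapping can also be scripted. This is a minimal Python sketch that lists which shared library each register resolves to. The pattern is an assumption based on the lines shown in this post, and the name libraries_by_register is hypothetical.

```python
import re

# Matches lines such as:
# R14=0x...: snoopy_inputdatastorage_data+0 in /usr/lib64/libsnoopy.so at 0x...
MAPPED = re.compile(r"^(\w+)\s*=0x[0-9a-f]+: .* in (\S+) at 0x[0-9a-f]+")

# Lines copied from the "Register to memory mapping" section in this post.
SAMPLE = "\n".join([
    "RAX=0x0000000000000000 is an unknown value",
    "R9 =0x00007f847f89d27d: _IO_vfprintf+0x4ccd in /lib64/libc.so.6 at 0x00007f847f850000",
    "R11=0x00007f847f9d3df4: <offset 0x183df4> in /lib64/libc.so.6 at 0x00007f847f850000",
    "R14=0x00007f84804607e0: snoopy_inputdatastorage_data+0 in /usr/lib64/libsnoopy.so at 0x00007f8480255000",
])


def libraries_by_register(text):
    """Map register name to the shared library its value points into."""
    result = {}
    for line in text.splitlines():
        match = MAPPED.match(line)
        if match:
            result[match.group(1)] = match.group(2)
    return result


if __name__ == "__main__":
    for register, library in libraries_by_register(SAMPLE).items():
        print(register, library)
```

Any library in the result that is not part of the JDK or libc is a reasonable first search term.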
