面试

抽丝剥茧之JVM-线程堆栈

2018-10-16  本文已影响382人  大方一号

分享是一种精神,是加深理解最好的方式,没有之一!

前言

本次分享以及后续章节旨在提升同学们对JVM的认知度,hold住她!干!
1、解决一般线上JVM故障(cpu,内存,gc等问题)
2、了解JVM基本结构和原理,以及常用的性能优化手段
3、掌握常用的JVM-分析工具
4、十分钟定位问题

开车了!

常用工具

线程堆栈

线程堆栈也称线程调用堆栈,是虚拟机中线程(包括锁)状态的一个瞬间快照,即系统在某一个时刻所有线程的运行状态,包括每一个线程的调用堆栈,锁的持有情况。
线程堆栈一般包含以下关键信息

so!借助堆栈信息可以帮助分析很多问题,如线程死锁,锁竞争,死循环,large线程耗时操作等等。在多数情况下,无须介入代码亦可对系统进行分析和故障定位!
线程堆栈一般可解决CPU相关的如下问题

vmstat pidstat 亦可分析CPU消耗情况,结合线程堆栈来分析更佳,但是定位问题还是优先分析dump文件,本文先忽略......

如何分析进程or线程堆栈?
获取进程id
jps 
获取进程堆栈
jstack –l <pid> >> thread.dump 
根据进程pid获取线程tid详情
top -H -p <pid>
根据线程tid 搜索进程堆栈
cat thread.dump  | grep tid...
查询进程对应的线程数 
cat /proc/进程id/status | grep Threads
10754 identify-webapi-0.0.1.war -verbose:gc -Xloggc:/data/program/cn.xxx/xxx-api/0.0.1/gc.log -
XX:+PrintGCDetails -XX:+PrintGCDateStamps -
XX:+HeapDumpOnOutOfMemoryError -
XX:HeapDumpPath=/data/program/cn.xxx/xxx-webapi/0.0.1/java.hprof -
-Xms2g -Xmx2g 
-XX:MetaspaceSize=38m -XX:MaxMetaspaceSize=380m
XX:+UseConcMarkSweepGC -XX:CMSMaxAbortablePrecleanTime=5000 -XX:+CMSClassUnlo
28985 Jps -Dapplication.home=/xx/xxx/java8 -Xms8m

上述JVM参数 其实内容丰富
1、gc -Xloggc....配置了gc 日志的路径
2、XX:PrintGCDetails,就开启了详细GC日志模式,文件内容比较冗杂,亦可结合jvisualvm来分析。
3、XX:+HeapDumpOnOutOfMemoryError :JVM在遇到OOM(OutOfMemoryError)时生成Dump文件,同时可以指定文件路径
4、-Xms2g -Xmx2g 启动堆内存和最大堆内存一样
5、-XX:MetaspaceSize=38m -XX:MaxMetaspaceSize=380m 元空间初始值和最大值
6、UseConcMarkSweepGC:使用了cms垃圾收集器
........

可视化工具jvisualvm 初探...
$jps -l
9904 org.jetbrains.jps.cmdline.Launcher
9905 com.xxx.xxx.xxx.router.xxxApp
9878 org.jetbrains.idea.maven.server.RemoteMavenServer
......略.
VS

上图重点关注的细节

ps:本次重点不在可视化工具,如果能理解JVM-堆栈内容,其实比可视化工具更强大,jvisualvm就是dump文件的的阉割版。

案例分析一 新手入门

线上CPU负载飙升到100%以上
top监控结果

$top
top - 11:37:34 up 100 days, 23:25,  1 user,  load average: 0.75, 0.70, 0.63 
Tasks:  32 total,   3 running,  29 sleeping,   0 stopped,   0 zombie
Cpu(s): 55.6%us,  3.0%sy,  0.0%ni, 41.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.1%st
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18001 root      0 -20 5631m 4.0g 102m S 93.2 49.6   1024:13 java

其中CPU负载较高 达到了93% 最近十五分钟的负载都在60+%

如何找到它?

根据常龟操作
1、jps定位到进程pid
2、根据进程id定位线程列表并根据cpu排序

$ top -H -p 10754


如图 表明线程tid=11979 cpu占比达99% 内存 60% ,
3、进制转换,11979是10进制数,在线程dump中线程id是十六进制,需要进一步转换

linux 终端 printf "%x\n" 11979 -> 2ecb -> 0x2ecb
OR
在线进制转换 十进制转换十六进制.

4、导出当前堆栈到文件中

jstack -l 10754 >> x.dump

5、搜索查询对应线程id运行明细

cat x.dump | grep 0x2ecb -A 30

结果如下

"http-nio-8848-exec-6" #31 daemon prio=5 os_prio=31 tid=0x00007ff87dae5800 nid=0x6b03 runnable [0x0000700010ef0000]
   java.lang.Thread.State: RUNNABLE
    at com.test.xxxx.identify.router.Looper.cpu_check(TestController.java:77)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)

    //模拟一段死循环
    Long i=0L;
    while(true){
      i++;
      if(i==-99){
        break;
      }
    }
for(....){
Thread.sleep(0);
or
Thread.sleep(1);
}

sleep(0)释放当前线程所剩余的时间片,让出CPU执行权,不要再"霸占"CPU",这样可以让操作系统切换其他线程来执行,也可以避免太多线程"饥饿" 提升效率

案例分析二

PID    USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     
15707 root    20   0 6213m 2.5g 7844 S  1.3 65.5   
19811 root    20   0 6213m 2.5g 7844 S  1.0 65.5 
19992 root    20   0 6213m 2.5g 7844 S  1.0 65.5 
15693 root    20   0 6213m 2.5g 7844 S  1.0 65.5    
16362 root    20   0 6213m 2.5g 7844 S  1.0 65.5   
16443 root    20   0 6213m 2.5g 7844 S  1.0 65.5    
19812 root    20   0 6213m 2.5g 7844 S  0.7 65.5  
19814 root    20   0 6213m 2.5g 7844 S  0.7 65.5 
....略上百条....
"http-nio-8080-exec-261" #15738 daemon prio=5 os_prio=0 tid=0x00007fef60117800 nid=0x406b runnable [0x00007feefb50d000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.ibatis.trace.TraceIntercept.replaceFirst(TraceIntercept.java:219)
    at org.apache.ibatis.trace.TraceIntercept.replacePlaceholder(TraceIntercept.java:196)
    at org.apache.ibatis.trace.TraceIntercept.getSql(TraceIntercept.java:172)
    at org.apache.ibatis.trace.TraceIntercept.intercept(TraceIntercept.java:98)
    at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:61)
    at com.sun.proxy.$Proxy162.update(Unknown Source)
    at org.apache.ibatis.session.defaults.DefaultSqlSession.update$impl(DefaultSqlSession.java:198)
    at org.apache.ibatis.session.defaults.DefaultSqlSession.update(DefaultSqlSession.java)
    at org.apache.ibatis.session.defaults.DefaultSqlSession.insert(DefaultSqlSession.java:185)
    at sun.reflect.GeneratedMethodAccessor138.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:433)
    at com.sun.proxy.$Proxy130.insert(Unknown Source)
    at org.mybatis.spring.SqlSessionTemplate.insert(SqlSessionTemplate.java:278)
    at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:57)
    at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:59)
    at com.sun.proxy.$Proxy149.batchInsert(Unknown Source)
    at cn.xxx.xxx.xxx.dao.RawDataDAO.batchInsert(RawDataDAO.java:72)
    at cn.xxx.xxx.xxx.xxxx.batchInsert(RawDataManager.java:113)
    at cn.xxx.xxx.xxx.xxxx.batchRawData(RawdataStorageServiceImpl.java:101)
    at cn.xxx.xxx.controller.xxxxController.batchRawData(RawDataManagerController.java:60)
$ grep -c 'org.apache.ibatis.trace.TraceIntercept.getSql' k1.dump k2.dump k3.dump
k1.dump:89
k2.dump:76
k3.dump:98
如何分析线程堆栈?
名词解释
"http." prio=1 tid=0x000000000238e800 nid=0xc940 runnable [0x00000000027af000].
    |       |   |                       |           |           |
    |       |   |                       |           |           +--线程占用内存地址
    |       |   |                       |           +-----------线程的状态
    |       |   |                       +----线程对应的本地线程id号
    |       |   +-------------------线程id
    |       +--------------------------线程优先级
    +-------------------------------线程名称
    

备注:

//查询tomcat-http线程数
grep -c 'http-nio-8080-exec' k2.dump
200
ps:200 是tomcat默认的最大线程数,多次堆栈查出http线程数都达到了200,此处说明max-threads有待增大...
//查询当前堆栈dubbo线程数
grep -c 'DubboServerHandler-10.3.10.110:20888' k3.dump
200
ps:200是dubbo默认的最大线程数,优化策略
dubbo线程命名方式:DubboServerHandler+ip+端口+...
//查询当前堆栈线程数
grep -c 'nid' k2.dump
537
//查询运行中的线程数
grep -c 'java.lang.Thread.State: RUNNABLE' k2.dump
228
//springboot tomcat-http线程数配置
server.tomcat.max-threads=400
//dubbo线程数配置
<dubbo:protocol name="dubbo" port="20788" threadpool="cached" threads="400"/>
锁的解读
"http-nio-8080-exec-38" #908 daemon prio=5 os_prio=0 tid=0x00007fa0fc040000 nid=0x440c runnable [0x00007fa0dc35d000]
   java.lang.Thread.State: RUNNABLE
    at java.xxx...
    ....略
    - locked <0x00000000937f16c0> (a java.io.BufferedInputStream)
"processor-13" daemon prio=5 tid=0x003edf98 nid=0xca waiting for monitor entry [0x000000000825f000]
java.lang.Thread.State: BLOCKED (on object monitor)
                at beans.ConnectionPool.getConnection(ConnectionPool.java:102)
                - waiting to lock <0xe0375410> (a beans.ConnectionPool)
                at beans.cus.ServiceCnt.getTodayCount(ServiceCnt.java:111)
                at beans.cus.ServiceCnt.insertCount(ServiceCnt.java:43)
Found one Java-level deadlock:
=============================
Java stack information for the threads listed above:
===================================================
"com.xxx.xxx.jvm.deadlock.TestThread2":
    at com.xxx.xxx.jvm.deadlock.TestThread2.fun(TestThread2.java:35)
    - waiting to lock <0x00000000d77363d0> (a java.lang.Object)
    - locked <0x00000000d77363e0> (a java.lang.Object)
    at com.xxx.xxx.jvm.deadlock.TestThread2.run(TestThread2.java:22)
"com.xxx.xxx.jvm.deadlock.TestThread1":
    at com.xxx.xxx.jvm.deadlock.TestThread1.fun(TestThread1.java:33)
    - waiting to lock <0x00000000d77363e0> (a java.lang.Object)
    - locked <0x00000000d77363d0> (a java.lang.Object)
    at com.xxx.xxx.jvm.deadlock.TestThread1.run(TestThread1.java:20)

Found 1 deadlock.

ps :死锁线程如果不是很消耗cpu的线程,如高浮点计算,字符串处理,正则匹配、大循环等cpu密集型线程,cpu是不会飙升的!系统也不会崩溃。它只是挂在那里!同理 mysql、oracle出现死锁也不会影响其他sql执行。

线程状态

线程堆栈分析重点关注一下关键字出现频率
Runnable:太多-> 说明系统负载合理,吞吐量高
BLOCKED :线程阻塞,锁竞争激烈
WAITING/TIMED_WAITING: 表明大多数线程在睡觉,cpu利用率不高...

//查询堆栈中的线程状态分布列表
grep java.lang.Thread.State k2.dump | awk '{print $2$3$4$5}' | sort | uniq -c
      1 BLOCKED(onobjectmonitor)
    228 RUNNABLE
      5 TIMED_WAITING(onobjectmonitor)
     33 TIMED_WAITING(parking)
      9 TIMED_WAITING(sleeping)
      1 WAITING(onobjectmonitor)
    253 WAITING(parking)

案例分析三

接收到告警信息,一直持续....


登录到线上

查询进程15880对应的线程数
三次cat结果如下...
cat /proc/15880/status | grep Threads
Threads:    1809 /2310 /1980
$ top
top - 17:31:49 up 84 days,  2:59,  1 user,  load average: 0.59, 0.51, 0.49
Tasks: 160 total,   1 running, 158 sleeping,   1 stopped,   0 zombie
Cpu(s):  1.4%us,  34%sy, 39.1%wa
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15880 appweb    20   0 7605m 2.5g 7136 S 57.8 64.6 854:33.19 java
SocketInputStream.socketRead0...屡见不鲜..
[0x0000xxxx1111] java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method) 

说明当前应用线程间上下文切换比较严重(不算特别严重,没到崩溃边缘) --> 主要原因之一 创建了太多线程.(猜测)

"thread-129" #122 prio=5 os_prio=0 tid=0x00007fba682bc000 nid=0x3eee runnable [0x00007fb9ff622000]
   java.lang.Thread.State: RUNNABLE
    at okhttp3.RealCall.execute(RealCall.java:77)
    at cn.xxx.xxx.xxx.MyClient.send(MyClient.java:129)
    at cn.xxx.xxx.crawler.xxxCrawler.getDetail(FayuanCrawler.java:411)
    at cn.xxx.xxx.crawler.xxxxCrawler.parseSearchHtml(FayuanCrawler.java:236)
    at cn.xxx.xxx.crawler.xxxxCrawler.exec(FayuanCrawler.java:140)
    at cn.xxx.xxx.service.xx.lambda$handle$0(MessageHandle.java:41)
    at cn.xxx.xxx.service.MessageHandle$$Lambda$46/1501251393.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

堆栈关键字搜索次数

grep -c 'xxxxCrawler.exec' jstack.log
863
这还只是一部分....

最终定位到代码 xxxxCrawler.exec(.....
可能存在无节制的创建线程业务逻辑

   Executors.newCachedThreadPool();
   
   @RequestMapping("/start")
    public String startxxxxx() {
        new Thread(this::naxy).start();
        return "success";
    }


案例分析四

应用创建了合理的较多的线程,但是cpu使用率一直不高,吞吐量上不去!

grep -c 'p-kafka-http-consumer' kop.log
100

随便打开一个kafka消费线程堆栈

"p-kafka-http-consumer-t5-for-wodlee.visitrecord" #81 daemon prio=5 os_prio=0 tid=0x00007f39e4bf6800 nid=0x1a37 waiting on condition [0x00007f3943ffe000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at com.xxx.xxx.xxx.consumer.Looper$1.perform(Looper.java:154)
    at com.xxx.xxx.kafka.consumer.Looper.performAction(Looper.java:106)
    at com.xxx.xxx.kafka.consumer.Looper.loop(Looper.java:78)
    at com.xxx.xxx.kafka.consumer.Looper.run(Looper.java:43)

发现大量的kafka消费线程 都处于 TIMED_WAITING (sleeping) 睡觉状态
查看处于RUNNABLE状态的 kafka线程,多次dump 发现
p-xxx-http-consumer处于运行中的kafka消费线程个数处于[0,1]之间
以上分析表明
kafka消费端,无论开启多少线程,进程中都只会有一个线程处于运行状态,其他剩余线程都在睡大觉。

                switch (status) {
                case OK:
                    next = LooperState.PROCESSING;
                    break;
                case DEFER:
                case SERVER_ERROR:
                //表明出现了锁冲突,消费端rt 5ms左右,100个线程竞争,锁冲突的概率很高,
               //这里直接用sleep 来缓解,降低cpu负载
              //其实可以考虑采用CAS的方式fetch对应的offset消息
                case LOCK_CONFLICT:
                    next = LooperState.WAITING;
                    break;


        //.....略
       actions.put(LooperState.WAITING, new StateAction() {
            public LooperState perform() throws Exception {
                try {
                    int sleepTime = 1100;
                    if (consumer instanceof ParallelKafkaConsumer) {
                        ParallelKafkaConsumer pc = (ParallelKafkaConsumer) consumer;
                        int alive = pc.getLooperThreadNum() - pc.getDeadThreadNum();
                        if (alive > 1 && alive < 5) {
                            sleepTime = 1500;
                        } else if (alive >= 5) {
                            sleepTime = 2000;
                        }
                    }
                    Thread.sleep(sleepTime);
                } catch (InterruptedException ie) {
                }
                return LooperState.FETCHING;
            }
        });

kafka消费端fetch消息,如果发现当前状态为LOCK_CONFLICT,会执行一段1.5s-2s的sleep

MORE

三克油

上一篇下一篇

猜你喜欢

热点阅读