服务启动时出现大量 Hystrix timed-out 第一次排

2021-07-21  本文已影响0人  蓝笔头

异常日志

com.netflix.hystrix.exception.HystrixRuntimeException: UserSettingClient#get(long,String)  timed-out and no fallback available.
com.netflix.hystrix.exception.HystrixRuntimeException: UserSettingClient#get(long,String)  short-circuited 

有人说是要开启 Ribbon 饥饿加载,看了下服务配置,已经开启了。

ribbon.eager-load.enabled=true
ribbon.eager-load.clients=user-service

排查过程

因为服务启动时添加了 JMX 参数,所以可以通过 Java VisualVM 连接到 jmx 端口。

-Dcom.sun.management.jmxremote.port=12345 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false

通过线程监控可以发现存在大量的线程(1000 多个)在短时间(50-24 = 16s)内被创建。

大量线程创建开始时间节点 大量线程创建结束时间节点
meikai@build-1:~/test$ while true; do pidstat -wt -p $(pgrep -f ^/usr/bin/java.*athena-web) 1 1 | awk -f stat.awk; done
2021-07-21 11:30:01 cswch:6782.32 nvcswch:252.90
2021-07-21 11:30:02 cswch:3625.42 nvcswch:35.28
2021-07-21 11:30:03 cswch:1984.16 nvcswch:229.38
2021-07-21 11:30:04 cswch:2103.82 nvcswch:19.60
2021-07-21 11:30:05 cswch:2064.64 nvcswch:11.76
2021-07-21 11:30:07 cswch:13307.72 nvcswch:315.58
2021-07-21 11:30:08 cswch:3503.60 nvcswch:73.06
2021-07-21 11:30:09 cswch:1713.68 nvcswch:19.60
2021-07-21 11:30:10 cswch:1101.86 nvcswch:1.96
2021-07-21 11:30:11 cswch:1419.46 nvcswch:0.00
2021-07-21 11:30:12 cswch:2515.64 nvcswch:9.80
2021-07-21 11:30:13 cswch:3562.66 nvcswch:21.56
2021-07-21 11:30:14 cswch:4803.94 nvcswch:237.60
2021-07-21 11:30:15 cswch:2090.08 nvcswch:9.80
2021-07-21 11:30:16 cswch:4843.42 nvcswch:223.74
2021-07-21 11:30:17 cswch:2543.02 nvcswch:11.76
2021-07-21 11:30:18 cswch:1915.60 nvcswch:3.92
2021-07-21 11:30:19 cswch:2853.28 nvcswch:277.22
2021-07-21 11:30:20 cswch:2060.68 nvcswch:9.80
2021-07-21 11:30:22 cswch:2435.06 nvcswch:156.82
2021-07-21 11:30:23 cswch:3158.64 nvcswch:5.88
2021-07-21 11:30:24 cswch:4964.60 nvcswch:7.84
2021-07-21 11:30:25 cswch:36033.14 nvcswch:356.78
2021-07-21 11:30:26 cswch:38675.18 nvcswch:192.16
2021-07-21 11:30:27 cswch:27278.76 nvcswch:592.28
2021-07-21 11:30:28 cswch:28185.20 nvcswch:1245.52
2021-07-21 11:30:30 cswch:27249.28 nvcswch:300.46
2021-07-21 11:30:31 cswch:27757.72 nvcswch:235.06
2021-07-21 11:30:32 cswch:35811.64 nvcswch:845.80
2021-07-21 11:30:33 cswch:45824.74 nvcswch:4289.28
2021-07-21 11:30:35 cswch:33469.40 nvcswch:921.34
2021-07-21 11:30:36 cswch:33097.62 nvcswch:794.70
2021-07-21 11:30:37 cswch:35508.16 nvcswch:407.34
2021-07-21 11:30:39 cswch:31193.58 nvcswch:244.96
2021-07-21 11:30:40 cswch:35123.04 nvcswch:4207.68
2021-07-21 11:30:41 cswch:34204.14 nvcswch:14115.80
2021-07-21 11:30:43 cswch:33899.98 nvcswch:549.66
2021-07-21 11:30:44 cswch:38476.98 nvcswch:899.94
2021-07-21 11:30:45 cswch:33318.62 nvcswch:264.74
2021-07-21 11:30:47 cswch:32771.18 nvcswch:1146.34
2021-07-21 11:30:48 cswch:31128.38 nvcswch:949.74
2021-07-21 11:30:49 cswch:25169.86 nvcswch:240.10
2021-07-21 11:30:51 cswch:23371.16 nvcswch:264.12
2021-07-21 11:30:52 cswch:24892.78 nvcswch:98.06
2021-07-21 11:30:53 cswch:23915.64 nvcswch:181.92
2021-07-21 11:30:55 cswch:28137.40 nvcswch:406.04
2021-07-21 11:30:56 cswch:24536.02 nvcswch:86.48
2021-07-21 11:30:57 cswch:28038.82 nvcswch:284.46
2021-07-21 11:30:59 cswch:23471.14 nvcswch:53.52
2021-07-21 11:31:00 cswch:24460.46 nvcswch:203.50
2021-07-21 11:31:01 cswch:23419.50 nvcswch:60.64
2021-07-21 11:31:03 cswch:26726.46 nvcswch:169.56
2021-07-21 11:31:04 cswch:24444.54 nvcswch:187.92
2021-07-21 11:31:05 cswch:23241.24 nvcswch:30.60
2021-07-21 11:31:07 cswch:27613.88 nvcswch:74.90
2021-07-21 11:31:08 cswch:26154.86 nvcswch:173.02
2021-07-21 11:31:09 cswch:25369.32 nvcswch:42.82
2021-07-21 11:31:11 cswch:24101.26 nvcswch:55.28
2021-07-21 11:31:12 cswch:26191.50 nvcswch:69.14
^C

我们重点关注 (11:30:24 ~ 2021-07-21 11:30:51)的输出:

2021-07-21 11:30:24 cswch:4964.60 nvcswch:7.84
2021-07-21 11:30:25 cswch:36033.14 nvcswch:356.78
2021-07-21 11:30:26 cswch:38675.18 nvcswch:192.16
2021-07-21 11:30:27 cswch:27278.76 nvcswch:592.28
2021-07-21 11:30:28 cswch:28185.20 nvcswch:1245.52
2021-07-21 11:30:30 cswch:27249.28 nvcswch:300.46
2021-07-21 11:30:31 cswch:27757.72 nvcswch:235.06
2021-07-21 11:30:32 cswch:35811.64 nvcswch:845.80
2021-07-21 11:30:33 cswch:45824.74 nvcswch:4289.28
2021-07-21 11:30:35 cswch:33469.40 nvcswch:921.34
2021-07-21 11:30:36 cswch:33097.62 nvcswch:794.70
2021-07-21 11:30:37 cswch:35508.16 nvcswch:407.34
2021-07-21 11:30:39 cswch:31193.58 nvcswch:244.96
2021-07-21 11:30:40 cswch:35123.04 nvcswch:4207.68
2021-07-21 11:30:41 cswch:34204.14 nvcswch:14115.80
2021-07-21 11:30:43 cswch:33899.98 nvcswch:549.66
2021-07-21 11:30:44 cswch:38476.98 nvcswch:899.94
2021-07-21 11:30:45 cswch:33318.62 nvcswch:264.74
2021-07-21 11:30:47 cswch:32771.18 nvcswch:1146.34
2021-07-21 11:30:48 cswch:31128.38 nvcswch:949.74
2021-07-21 11:30:49 cswch:25169.86 nvcswch:240.10
2021-07-21 11:30:51 cswch:23371.16 nvcswch:264.12

可以发现,每秒非自愿上下文切换数(在下一节中有描述)比前后都高非常多,最多的达到了 14115.80 次,平时基本都是在 100 次内,偶尔有一两个大于 100 的。

** 因此,可以断定是因为服务瞬时创建大量线程,使线程上下文切换太频繁,线程无法连续执行,从而导致出现大量 timed-out。**

所用到的命令

Linux 进度调度测试 一文中的测试代码为例。

(1)获取进程的 pid

$ pgrep -f TestThread
242241

通过 pgrep -f 进程名称 获取进程的 pid。

可以和 ps -ef 命令中输出的 pid 对比一下。

$ ps -ef | grep TestThread
meikai   242241 247192  0 12:28 pts/38   00:00:00 java TestThread
meikai   242359 147324  0 12:28 pts/19   00:00:00 grep TestThread

(2)获取进程的上下文切换信息


meikai@dev:~/test$ pidstat -wt -p $(pgrep -f TestThread) 1 1
Linux 4.19.0-17-amd64 (dev.wumii.net)   07/21/2021      _x86_64_        (8 CPU)

12:31:59 PM   UID      TGID       TID   cswch/s nvcswch/s  Command
12:32:00 PM 19200733    242241         -      0.00      0.00  java
12:32:00 PM 19200733         -    242241      0.00      0.00  |__java
12:32:00 PM 19200733         -    242242      0.00      0.00  |__java
12:32:00 PM 19200733         -    242243      1.00      0.00  |__java
12:32:00 PM 19200733         -    242244      0.00      0.00  |__java
12:32:00 PM 19200733         -    242245      0.00      0.00  |__java
12:32:00 PM 19200733         -    242247      0.00      0.00  |__java
12:32:00 PM 19200733         -    242248      0.00      0.00  |__java
12:32:00 PM 19200733         -    242249      0.00      0.00  |__java
12:32:00 PM 19200733         -    242250      0.00      0.00  |__java
12:32:00 PM 19200733         -    242251     20.00      0.00  |__java
12:32:00 PM 19200733         -    242416      4.00      5.00  |__java
12:32:00 PM 19200733         -    242417      4.00      1.00  |__java
12:32:00 PM 19200733         -    242418      4.00      1.00  |__java
12:32:00 PM 19200733         -    242419      3.00      7.00  |__java
12:32:00 PM 19200733         -    242420      5.00      0.00  |__java
12:32:00 PM 19200733         -    242421      4.00      9.00  |__java
12:32:00 PM 19200733         -    242422      4.00      0.00  |__java
12:32:00 PM 19200733         -    242423      4.00      0.00  |__java
12:32:00 PM 19200733         -    242424      4.00      2.00  |__java
12:32:00 PM 19200733         -    242425      4.00      0.00  |__java

Average:      UID      TGID       TID   cswch/s nvcswch/s  Command
...

参数简介:

(3)统计进程中所有线程的上下文切换数量

自定义 awk 脚本,脚本名为 stat.awk

#!/bin/bin/awk -f

#运行前
BEGIN {
    cswch = 0
    nvcswch = 0
}
#运行中
{
    if(NF == 0){
        next;
    }
    if ($1 == "Average:") {
        exit;
    }
    
    
    # 注意:所有用作算术运算符进行操作,操作数自动转为数值,所有非数值都变为0
    cswch+=$(NF-2) # cswch 是倒数第3列
    nvcswch+=$(NF-1) # nvcswch 是倒数第2列
    
    # 调试语句
    # print
    # printf "cswch:%.2f nvcswch:%.2f\n", cswch, nvcswch
}
#运行后
END {
    time = strftime("%Y-%m-%d %H:%M:%S", systime()) # 获取当前时间
    printf "%s cswch:%.2f nvcswch:%.2f\n", time, cswch, nvcswch
}

统计进程 TestThread 的上下文切换总数:

$ pidstat -wt -p $(pgrep -f TestThread) 1 1 | awk -f stat.awk
2021-07-21 12:39:23 cswch:120.00 nvcswch:94.00

(4)循环执行命令

先测试一下是否正常。

$ while true; do date; sleep 1; done
Wed 21 Jul 2021 12:39:42 PM CST
Wed 21 Jul 2021 12:39:43 PM CST

实际使用:

$ while true; do pidstat -wt -p $(pgrep -f TestThread) 1 1 | awk -f stat.awk; done
2021-07-21 12:41:53 cswch:114.00 nvcswch:44.00
2021-07-21 12:41:55 cswch:116.00 nvcswch:78.00
2021-07-21 12:41:56 cswch:120.00 nvcswch:50.00
2021-07-21 12:41:57 cswch:122.76 nvcswch:87.12
2021-07-21 12:41:58 cswch:112.86 nvcswch:73.26
2021-07-21 12:41:59 cswch:114.00 nvcswch:60.00

参考

上一篇下一篇

猜你喜欢

热点阅读