服务启动时出现大量 Hystrix timed-out 第一次排
2021-07-21 本文已影响0人
蓝笔头
异常日志
com.netflix.hystrix.exception.HystrixRuntimeException: UserSettingClient#get(long,String) timed-out and no fallback available.
com.netflix.hystrix.exception.HystrixRuntimeException: UserSettingClient#get(long,String) short-circuited
有人说是要开启 Ribbon 饥饿加载,看了下服务配置,已经开启了。
ribbon.eager-load.enabled=true
ribbon.eager-load.clients=user-service
排查过程
因为服务启动时添加了 JMX 参数,所以可以通过 Java VisualVM
连接到 jmx 端口。
-Dcom.sun.management.jmxremote.port=12345 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
通过线程监控可以发现存在大量的线程(1000 多个)在短时间(50-24 = 16s)内被创建。
大量线程创建开始时间节点 大量线程创建结束时间节点meikai@build-1:~/test$ while true; do pidstat -wt -p $(pgrep -f ^/usr/bin/java.*athena-web) 1 1 | awk -f stat.awk; done
2021-07-21 11:30:01 cswch:6782.32 nvcswch:252.90
2021-07-21 11:30:02 cswch:3625.42 nvcswch:35.28
2021-07-21 11:30:03 cswch:1984.16 nvcswch:229.38
2021-07-21 11:30:04 cswch:2103.82 nvcswch:19.60
2021-07-21 11:30:05 cswch:2064.64 nvcswch:11.76
2021-07-21 11:30:07 cswch:13307.72 nvcswch:315.58
2021-07-21 11:30:08 cswch:3503.60 nvcswch:73.06
2021-07-21 11:30:09 cswch:1713.68 nvcswch:19.60
2021-07-21 11:30:10 cswch:1101.86 nvcswch:1.96
2021-07-21 11:30:11 cswch:1419.46 nvcswch:0.00
2021-07-21 11:30:12 cswch:2515.64 nvcswch:9.80
2021-07-21 11:30:13 cswch:3562.66 nvcswch:21.56
2021-07-21 11:30:14 cswch:4803.94 nvcswch:237.60
2021-07-21 11:30:15 cswch:2090.08 nvcswch:9.80
2021-07-21 11:30:16 cswch:4843.42 nvcswch:223.74
2021-07-21 11:30:17 cswch:2543.02 nvcswch:11.76
2021-07-21 11:30:18 cswch:1915.60 nvcswch:3.92
2021-07-21 11:30:19 cswch:2853.28 nvcswch:277.22
2021-07-21 11:30:20 cswch:2060.68 nvcswch:9.80
2021-07-21 11:30:22 cswch:2435.06 nvcswch:156.82
2021-07-21 11:30:23 cswch:3158.64 nvcswch:5.88
2021-07-21 11:30:24 cswch:4964.60 nvcswch:7.84
2021-07-21 11:30:25 cswch:36033.14 nvcswch:356.78
2021-07-21 11:30:26 cswch:38675.18 nvcswch:192.16
2021-07-21 11:30:27 cswch:27278.76 nvcswch:592.28
2021-07-21 11:30:28 cswch:28185.20 nvcswch:1245.52
2021-07-21 11:30:30 cswch:27249.28 nvcswch:300.46
2021-07-21 11:30:31 cswch:27757.72 nvcswch:235.06
2021-07-21 11:30:32 cswch:35811.64 nvcswch:845.80
2021-07-21 11:30:33 cswch:45824.74 nvcswch:4289.28
2021-07-21 11:30:35 cswch:33469.40 nvcswch:921.34
2021-07-21 11:30:36 cswch:33097.62 nvcswch:794.70
2021-07-21 11:30:37 cswch:35508.16 nvcswch:407.34
2021-07-21 11:30:39 cswch:31193.58 nvcswch:244.96
2021-07-21 11:30:40 cswch:35123.04 nvcswch:4207.68
2021-07-21 11:30:41 cswch:34204.14 nvcswch:14115.80
2021-07-21 11:30:43 cswch:33899.98 nvcswch:549.66
2021-07-21 11:30:44 cswch:38476.98 nvcswch:899.94
2021-07-21 11:30:45 cswch:33318.62 nvcswch:264.74
2021-07-21 11:30:47 cswch:32771.18 nvcswch:1146.34
2021-07-21 11:30:48 cswch:31128.38 nvcswch:949.74
2021-07-21 11:30:49 cswch:25169.86 nvcswch:240.10
2021-07-21 11:30:51 cswch:23371.16 nvcswch:264.12
2021-07-21 11:30:52 cswch:24892.78 nvcswch:98.06
2021-07-21 11:30:53 cswch:23915.64 nvcswch:181.92
2021-07-21 11:30:55 cswch:28137.40 nvcswch:406.04
2021-07-21 11:30:56 cswch:24536.02 nvcswch:86.48
2021-07-21 11:30:57 cswch:28038.82 nvcswch:284.46
2021-07-21 11:30:59 cswch:23471.14 nvcswch:53.52
2021-07-21 11:31:00 cswch:24460.46 nvcswch:203.50
2021-07-21 11:31:01 cswch:23419.50 nvcswch:60.64
2021-07-21 11:31:03 cswch:26726.46 nvcswch:169.56
2021-07-21 11:31:04 cswch:24444.54 nvcswch:187.92
2021-07-21 11:31:05 cswch:23241.24 nvcswch:30.60
2021-07-21 11:31:07 cswch:27613.88 nvcswch:74.90
2021-07-21 11:31:08 cswch:26154.86 nvcswch:173.02
2021-07-21 11:31:09 cswch:25369.32 nvcswch:42.82
2021-07-21 11:31:11 cswch:24101.26 nvcswch:55.28
2021-07-21 11:31:12 cswch:26191.50 nvcswch:69.14
^C
我们重点关注 (11:30:24 ~ 2021-07-21 11:30:51
)的输出:
2021-07-21 11:30:24 cswch:4964.60 nvcswch:7.84
2021-07-21 11:30:25 cswch:36033.14 nvcswch:356.78
2021-07-21 11:30:26 cswch:38675.18 nvcswch:192.16
2021-07-21 11:30:27 cswch:27278.76 nvcswch:592.28
2021-07-21 11:30:28 cswch:28185.20 nvcswch:1245.52
2021-07-21 11:30:30 cswch:27249.28 nvcswch:300.46
2021-07-21 11:30:31 cswch:27757.72 nvcswch:235.06
2021-07-21 11:30:32 cswch:35811.64 nvcswch:845.80
2021-07-21 11:30:33 cswch:45824.74 nvcswch:4289.28
2021-07-21 11:30:35 cswch:33469.40 nvcswch:921.34
2021-07-21 11:30:36 cswch:33097.62 nvcswch:794.70
2021-07-21 11:30:37 cswch:35508.16 nvcswch:407.34
2021-07-21 11:30:39 cswch:31193.58 nvcswch:244.96
2021-07-21 11:30:40 cswch:35123.04 nvcswch:4207.68
2021-07-21 11:30:41 cswch:34204.14 nvcswch:14115.80
2021-07-21 11:30:43 cswch:33899.98 nvcswch:549.66
2021-07-21 11:30:44 cswch:38476.98 nvcswch:899.94
2021-07-21 11:30:45 cswch:33318.62 nvcswch:264.74
2021-07-21 11:30:47 cswch:32771.18 nvcswch:1146.34
2021-07-21 11:30:48 cswch:31128.38 nvcswch:949.74
2021-07-21 11:30:49 cswch:25169.86 nvcswch:240.10
2021-07-21 11:30:51 cswch:23371.16 nvcswch:264.12
可以发现,每秒非自愿上下文切换数(在下一节中有描述)比前后都高非常多,最多的达到了 14115.80
次,平时基本都是在 100
次内,偶尔有一两个大于 100
的。
** 因此,可以断定是因为服务瞬时创建大量线程,使线程上下文切换太频繁,线程无法连续执行,从而导致出现大量 timed-out
。**
所用到的命令
以 Linux 进度调度测试 一文中的测试代码为例。
(1)获取进程的 pid
$ pgrep -f TestThread
242241
通过 pgrep -f 进程名称
获取进程的 pid。
可以和 ps -ef
命令中输出的 pid 对比一下。
$ ps -ef | grep TestThread
meikai 242241 247192 0 12:28 pts/38 00:00:00 java TestThread
meikai 242359 147324 0 12:28 pts/19 00:00:00 grep TestThread
(2)获取进程的上下文切换信息
meikai@dev:~/test$ pidstat -wt -p $(pgrep -f TestThread) 1 1
Linux 4.19.0-17-amd64 (dev.wumii.net) 07/21/2021 _x86_64_ (8 CPU)
12:31:59 PM UID TGID TID cswch/s nvcswch/s Command
12:32:00 PM 19200733 242241 - 0.00 0.00 java
12:32:00 PM 19200733 - 242241 0.00 0.00 |__java
12:32:00 PM 19200733 - 242242 0.00 0.00 |__java
12:32:00 PM 19200733 - 242243 1.00 0.00 |__java
12:32:00 PM 19200733 - 242244 0.00 0.00 |__java
12:32:00 PM 19200733 - 242245 0.00 0.00 |__java
12:32:00 PM 19200733 - 242247 0.00 0.00 |__java
12:32:00 PM 19200733 - 242248 0.00 0.00 |__java
12:32:00 PM 19200733 - 242249 0.00 0.00 |__java
12:32:00 PM 19200733 - 242250 0.00 0.00 |__java
12:32:00 PM 19200733 - 242251 20.00 0.00 |__java
12:32:00 PM 19200733 - 242416 4.00 5.00 |__java
12:32:00 PM 19200733 - 242417 4.00 1.00 |__java
12:32:00 PM 19200733 - 242418 4.00 1.00 |__java
12:32:00 PM 19200733 - 242419 3.00 7.00 |__java
12:32:00 PM 19200733 - 242420 5.00 0.00 |__java
12:32:00 PM 19200733 - 242421 4.00 9.00 |__java
12:32:00 PM 19200733 - 242422 4.00 0.00 |__java
12:32:00 PM 19200733 - 242423 4.00 0.00 |__java
12:32:00 PM 19200733 - 242424 4.00 2.00 |__java
12:32:00 PM 19200733 - 242425 4.00 0.00 |__java
Average: UID TGID TID cswch/s nvcswch/s Command
...
参数简介:
-
TID
:线程 ID -
cswch/s
:每秒自愿上下文切换数。- 进程运行时由于系统资源不足,如 IO,内存等原因不得不进行切换。
- 自愿的切换的多了,表明在等待资源,如 IO
-
nvcswch/s
:每秒非自愿上下文切换数。- 比如时间片用完,系统调度让其他任务运行,或者竞争 CPU 的时候也会发生。
- 非自愿的切换的多了,表明 CPU 被任务竞争
(3)统计进程中所有线程的上下文切换数量
自定义 awk 脚本,脚本名为 stat.awk
。
#!/bin/bin/awk -f
#运行前
BEGIN {
cswch = 0
nvcswch = 0
}
#运行中
{
if(NF == 0){
next;
}
if ($1 == "Average:") {
exit;
}
# 注意:所有用作算术运算符进行操作,操作数自动转为数值,所有非数值都变为0
cswch+=$(NF-2) # cswch 是倒数第3列
nvcswch+=$(NF-1) # nvcswch 是倒数第2列
# 调试语句
# print
# printf "cswch:%.2f nvcswch:%.2f\n", cswch, nvcswch
}
#运行后
END {
time = strftime("%Y-%m-%d %H:%M:%S", systime()) # 获取当前时间
printf "%s cswch:%.2f nvcswch:%.2f\n", time, cswch, nvcswch
}
统计进程 TestThread 的上下文切换总数:
$ pidstat -wt -p $(pgrep -f TestThread) 1 1 | awk -f stat.awk
2021-07-21 12:39:23 cswch:120.00 nvcswch:94.00
(4)循环执行命令
先测试一下是否正常。
$ while true; do date; sleep 1; done
Wed 21 Jul 2021 12:39:42 PM CST
Wed 21 Jul 2021 12:39:43 PM CST
实际使用:
$ while true; do pidstat -wt -p $(pgrep -f TestThread) 1 1 | awk -f stat.awk; done
2021-07-21 12:41:53 cswch:114.00 nvcswch:44.00
2021-07-21 12:41:55 cswch:116.00 nvcswch:78.00
2021-07-21 12:41:56 cswch:120.00 nvcswch:50.00
2021-07-21 12:41:57 cswch:122.76 nvcswch:87.12
2021-07-21 12:41:58 cswch:112.86 nvcswch:73.26
2021-07-21 12:41:59 cswch:114.00 nvcswch:60.00