线程池引发的故障到底该怎么排查?
前情提要
最近读了一篇关于线程池故障排查的文章,收货颇丰.文章地址如下:
https://mp.weixin.qq.com/s/0U_xmzvmnY354Lgv0IwZ6g
这里简要回顾一下,感兴趣的同学可以仔细读一下这篇文章:
-
故障场景:dubbo线程池打满,服务处于夯死状态.但是5分钟之后却自动恢复了.
-
排查过程:略
-
故障原因:
- 项目使用RestTemplate访问某个外部接口.
- RestTemplate使用的是HttpClient的实现,HttpClient实现了连接池,但是默认的最大连接数只有5.
基于以上两点,一旦该外部接口超时很有可能在HttpClient获取连接的时候阻塞当前线程,最终造成dubbo线程池全部打满.
排查过程优化
问题是找到了,但是原文排查过程颇为艰辛.那么有没有更为有效的方法来快速定位线程池打满的问题呢?
可以通过jstack dump线程,然后对线程进行分析,从而快速定位问题!
场景构建
这里构建一个简单的环境复现上述场景:
-
构建一个spring boot项目,简单起见使用Tmocat线程池代替案例中的dubbo线程池.项目引入了httpclient, spring boot会自动将RestTemplate配置为httpclient的实现.
<dependencies> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-web</artifactId> </dependency> <dependency> <groupId>org.apache.httpcomponents</groupId> <artifactId>httpclient</artifactId> <version>4.5.4</version> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-test</artifactId> <scope>test</scope> </dependency> </dependencies>
设置Tomcat工作线程池,最小10个线程,最大20个线程.这里只是为了模拟场景,所以最大线程池设置的比较小.
server.tomcat.min-spare-threads=10 server.tomcat.max-threads=20
-
编写一个http接口模拟案例中的外部接口,这里睡眠3秒模拟接口超时
@GetMapping("/slowApi") public String slowApi() throws InterruptedException { TimeUnit.SECONDS.sleep(3); return "ok!"; }
-
构建一个RestTemplate,连接超时,读取超时设置为2秒
@Bean public RestTemplate restTemplate(RestTemplateBuilder builder) { return builder.setConnectTimeout(Duration.ofSeconds(2)) .setReadTimeout(Duration.ofSeconds(2)).build(); }
-
提供一个接口,使用RestTemplate访问外部接口
@GetMapping("/getMsg") public String getMsg() { try { String demo = restTemplate.getForObject("http://localhost:8080/slowApi", String.class); return ">>>" + demo; } catch (Exception e) { return "failed!"; } }
-
编写一个简单的测试模拟并发访问
System.setProperty("http.maxConnections", String.valueOf(THREAD_POOL_SIZE));
这里是设置HttpClient最大连接数,否则将达不到我们所期望的并发量.public class ConcurrenceTest { private static final int TOTAL_REQUEST = 100000; private static final int THREAD_POOL_SIZE = 50; @Test public void test() throws InterruptedException { System.setProperty("http.maxConnections", String.valueOf(THREAD_POOL_SIZE)); RestTemplateBuilder builder = new RestTemplateBuilder(); RestTemplate restTemplate = builder.build(); ExecutorService executorService = Executors.newFixedThreadPool(THREAD_POOL_SIZE); CountDownLatch countDownLatch = new CountDownLatch(TOTAL_REQUEST); for (int i = 0; i < TOTAL_REQUEST; i++) { executorService.execute(()->{ restTemplate.getForObject("http://localhost:8080/getMsg", String.class); countDownLatch.countDown(); }); } countDownLatch.await(); executorService.shutdown(); } }
以上代码已经上传至:https://github.com/sunshujie1990/thread-troubleshooting
故障定位
- 启动项目
- 通过jps -l找到项目的进程号, 如下是16555
ssj@ssj-PC$: jps -l
16555 com.shujie.threadtroubleshooting.ThreadTroubleshootingApplication
- 通过jstack dump线程信息,如下会将线程信息dump到一个名为thread.txt的文件中
jstack 16555 > thread.txt
- 启动并发测试,通过浏览器访问
http://localhost:8080/getMsg
一直等待,说明Tomcat线程池已经被打满了,无法对外提供访问. - 再次dump线程信息到另外一个文件
jstack 16555 > thread2.txt
- 分析dump线程信息
首先打开thread.txt,看下健康的Tomcat线程池是什么样子的:
"http-nio-8080-exec-10" #34 daemon prio=5 os_prio=0 tid=0x00007fa500bf9000 nid=0x40ea waiting on condition [0x00007fa4e02dc000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007844d1038> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:107)
at org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:33)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
- http-nio-8080-exec-10表示当前是第10个线程,因为我们设置了Tomcat初始化10个线程.
- waiting on condition说明线程在等待某个条件.
- 线程栈从上往下看,排除JDK的方法,第一个是
org.apache.tomcat.util.threads.TaskQueue.take(TaskQueue.java:107)
从方法名称可以看出,这里线程处于Tomcat的任务队列中,等待的条件就是一个新的请求的到来.
然后打开thread2.txt,再来看一下线程池被打满之后的线程信息
"http-nio-8080-exec-20" #54 daemon prio=5 os_prio=0 tid=0x00007fa468011800 nid=0x421c waiting on condition [0x00007fa4cd6ff000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007067027e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:379)
at org.apache.http.pool.AbstractConnPool.access$200(AbstractConnPool.java:69)
at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:245)
- locked <0x00000007824713a0> (a org.apache.http.pool.AbstractConnPool$2)
at org.apache.http.pool.AbstractConnPool$2.get(AbstractConnPool.java:193)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:303)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:279)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:191)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at org.springframework.http.client.HttpComponentsClientHttpRequest.executeInternal(HttpComponentsClientHttpRequest.java:87)
at org.springframework.http.client.AbstractBufferingClientHttpRequest.executeInternal(AbstractBufferingClientHttpRequest.java:48)
at org.springframework.http.client.AbstractClientHttpRequest.execute(AbstractClientHttpRequest.java:53)
at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:735)
at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:670)
at org.springframework.web.client.RestTemplate.getForObject(RestTemplate.java:311)
at com.shujie.threadtroubleshooting.ThreadTroubleshootingApplication.getMsg(ThreadTroubleshootingApplication.java:35)
- http-nio-8080-exec-20 说明当前是第20个线程,线程池确实被打满了,因为我们设置的最大线程数就是20个.
- waiting on condition同上,线程处于等待状态,等待某个状态.
- 同样线程栈从上往下看,排除JDK的方法,第一个是
org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:379)
从方法名可以看出是获取池化的资源阻塞. - 再往下看
PoolingHttpClientConnectionManager.leaseConnection
可以看到HttpClient使用了连接池,每次从连接池租借一个连接执行任务,如果获取不到链接会使用J.U.C中的AQS阻塞当前线程.当前线程所等待的条件就是其他线程释放连接.
至此排查过程结束.
总结
线程池被打满,无法对外提供服务是非常严重的故障.产生的原因也是多种多样,不仅仅限于本文提到的这个场景.本文希望能够通过这个简单的例子帮助大家在遇到类似的问题时快速定位BUG,而不至于一脸懵逼.