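The Disaster Caused by a Gateway Waiting Without a Timeout
The gateway is Zuul
What happened:
October 24, 2018 (1024). It should have been a peaceful, quiet Wednesday, a Programmers' Day for sitting back and enjoying a massage~
But the nightmare began before the workday had even started.
I got to the office at 8:30 a.m. and was idly browsing Kr when the business chat group suddenly reported that requests from the APP were spinning forever for users.
Half an hour later we had pinned the problem down: the API gateway service was down.
The gateway service was down... the gateway service... was down...
A quick jstack showed 250 threads running and 100 threads suspended.
The trace logs showed that more than half of the requests had taken over 10 minutes. 10 minutes??? WTF? Are you kidding me?
In the end we traced it to calls to one service timing out like crazy, which left requests to other services waiting for thread resources.
Analysis
The first thing that made no sense was why so many calls ran for 10 minutes (measured in the call log from the start of the call to its end) and still timed out, given that the gateway service has a 60s timeout configured.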
zuul.host.socket-timeout-millis=60000
zuul.host.connect-timeout-millis=2000
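The service's nginx also has a timeout configured, 75s.
But the times recorded by the program cannot be lying. Something else had to be eating that time. Where exactly?
After frantically tracing through the code, I found it.
The time was spent waiting. That's right: request threads sat blocked, waiting for a connection from the pool, as the code below shows.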
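To make the point concrete, here is a minimal sketch of why a per-request timeout does not bound the time a caller observes. It is not Zuul or HttpClient code: the class PoolWaitTimingSketch and its method are hypothetical, and a Semaphore stands in for the connection pool. Each simulated request takes a fixed time, but callers queue for a permit with no time limit, so the last caller's wall-clock time also includes everyone ahead of it.

```java
import java.util.concurrent.Semaphore;

// Hypothetical stand-in for a connection pool; not Zuul or HttpClient code.
final class PoolWaitTimingSketch {

    // Starts `callers` threads that each lease one of `poolSize` permits without
    // a timeout, hold it for `holdMillis`, and record their own wall-clock time.
    // Returns the longest time any caller observed, in milliseconds.
    static long slowestCallerMillis(int poolSize, int callers, long holdMillis) {
        final Semaphore pool = new Semaphore(poolSize, true);
        final long[] elapsed = new long[callers];
        final Thread[] threads = new Thread[callers];
        for (int i = 0; i < callers; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                final long start = System.nanoTime();
                try {
                    pool.acquire();               // unbounded wait for a "connection"
                    try {
                        Thread.sleep(holdMillis); // the request itself takes holdMillis
                    } finally {
                        pool.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                elapsed[id] = (System.nanoTime() - start) / 1_000_000;
            });
            threads[i].start();
        }
        long slowest = 0;
        for (int i = 0; i < callers; i++) {
            try {
                threads[i].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            slowest = Math.max(slowest, elapsed[i]);
        }
        return slowest;
    }

    public static void main(String[] args) {
        // Every request takes 100 ms, yet with one permit the last caller waits for the others first.
        System.out.println("pool of 1: " + slowestCallerMillis(1, 4, 100) + " ms");
        System.out.println("pool of 4: " + slowestCallerMillis(4, 4, 100) + " ms");
    }
}
```

With a single permit the slowest caller should see roughly the sum of all requests, which is the same shape as a 60s socket timeout turning into minutes in the call log.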
SimpleHostRoutingFilter->run()->forward()->forwardRequest()->CloseableHttpClient.execute()->InternalHttpClient.doExecute()
InternalHttpClient creates or obtains a RequestConfig. If the request carries no RequestConfig, the HttpClient's defaultConfig is used, applied via setupContext,
as follows:
try {
final HttpRequestWrapper wrapper = HttpRequestWrapper.wrap(request, target);
final HttpClientContext localcontext = HttpClientContext.adapt(
context != null ? context : new BasicHttpContext());
RequestConfig config = null;
if (request instanceof Configurable) {
config = ((Configurable) request).getConfig();
}
// If config is null, a basic config is created
if (config == null) {
final HttpParams params = request.getParams();
if (params instanceof HttpParamsNames) {
if (!((HttpParamsNames) params).getNames().isEmpty()) {
config = HttpClientParamConfig.getRequestConfig(params, this.defaultConfig);
}
} else {
config = HttpClientParamConfig.getRequestConfig(params, this.defaultConfig);
}
}
if (config != null) {
localcontext.setRequestConfig(config);
}
// Copy the HttpClient's default attributes into the request context
setupContext(localcontext);
final HttpRoute route = determineRoute(target, wrapper, localcontext);
return this.execChain.execute(route, wrapper, localcontext, execAware);
} catch (final HttpException httpException) {
throw new ClientProtocolException(httpException);
}
private void setupContext(final HttpClientContext context) {
if (context.getAttribute(HttpClientContext.TARGET_AUTH_STATE) == null) {
context.setAttribute(HttpClientContext.TARGET_AUTH_STATE, new AuthState());
}
if (context.getAttribute(HttpClientContext.PROXY_AUTH_STATE) == null) {
context.setAttribute(HttpClientContext.PROXY_AUTH_STATE, new AuthState());
}
if (context.getAttribute(HttpClientContext.AUTHSCHEME_REGISTRY) == null) {
context.setAttribute(HttpClientContext.AUTHSCHEME_REGISTRY, this.authSchemeRegistry);
}
if (context.getAttribute(HttpClientContext.COOKIESPEC_REGISTRY) == null) {
context.setAttribute(HttpClientContext.COOKIESPEC_REGISTRY, this.cookieSpecRegistry);
}
if (context.getAttribute(HttpClientContext.COOKIE_STORE) == null) {
context.setAttribute(HttpClientContext.COOKIE_STORE, this.cookieStore);
}
if (context.getAttribute(HttpClientContext.CREDS_PROVIDER) == null) {
context.setAttribute(HttpClientContext.CREDS_PROVIDER, this.credentialsProvider);
}
// Key point: set the default config
if (context.getAttribute(HttpClientContext.REQUEST_CONFIG) == null) {
context.setAttribute(HttpClientContext.REQUEST_CONFIG, this.defaultConfig);
}
}
Since the context passed in is null, a BasicHttpContext is created.
Next come
RedirectExec.execute
RetryExec.execute
ProtocolExec.execute
and finally MainClientExec.execute runs,
which gets a connection from the connection pool:
try {
    // Read the connection request timeout (how long to wait for a pooled connection)
    final int timeout = config.getConnectionRequestTimeout();
    // Lease the connection; a non-positive timeout is passed on as 0
    managedConn = connRequest.get(timeout > 0 ? timeout : 0, TimeUnit.MILLISECONDS);
} catch (final InterruptedException interrupted) {
    Thread.currentThread().interrupt();
    throw new RequestAbortedException("Request aborted", interrupted);
} catch (final ExecutionException ex) {
    Throwable cause = ex.getCause();
    if (cause == null) {
        cause = ex;
    }
    throw new RequestAbortedException("Request execution failed", cause);
}
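The expression `timeout > 0 ? timeout : 0` in the snippet above is easy to skim past, so here is a small sketch that isolates it. The class and method names are hypothetical. The only outside fact it relies on is that an unset connectionRequestTimeout in HttpClient 4.x's RequestConfig is -1; the "0 means no deadline" rule is taken from the pool code shown later in this post.

```java
// Illustrative only: isolates the timeout normalisation used by MainClientExec.
final class LeaseTimeoutSketch {

    // Value RequestConfig reports when connectionRequestTimeout was never set.
    static final int UNSET = -1;

    // Same normalisation as `timeout > 0 ? timeout : 0` in the snippet above.
    static long effectiveLeaseTimeoutMillis(int connectionRequestTimeout) {
        return connectionRequestTimeout > 0 ? connectionRequestTimeout : 0;
    }

    // The pool only builds a deadline for a positive timeout, so 0 means "wait forever".
    static boolean waitsForever(int connectionRequestTimeout) {
        return effectiveLeaseTimeoutMillis(connectionRequestTimeout) == 0;
    }

    public static void main(String[] args) {
        System.out.println(waitsForever(UNSET)); // true: unset becomes 0, no deadline
        System.out.println(waitsForever(2000));  // false: waits at most 2000 ms
    }
}
```

In other words, leaving the setting alone does not fall back to the socket or connect timeout. It silently selects an unbounded wait.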
Leasing the connection:
PoolingHttpClientConnectionManager.leaseConnection()
AbstractConnPool.get
@Override
public E get(final long timeout, final TimeUnit tunit) throws InterruptedException, ExecutionException, TimeoutException {
if (entry != null) {
return entry;
}
synchronized (this) {
try {
for (;;) {
// Block until a connection is obtained from the pool
final E leasedEntry = getPoolEntryBlocking(route, state, timeout, tunit, this);
if (validateAfterInactivity > 0) {
if (leasedEntry.getUpdated() + validateAfterInactivity <= System.currentTimeMillis()) {
if (!validate(leasedEntry)) {
leasedEntry.close();
release(leasedEntry, false);
continue;
}
}
}
entry = leasedEntry;
done = true;
onLease(entry);
if (callback != null) {
callback.completed(entry);
}
return entry;
}
} catch (IOException ex) {
done = true;
if (callback != null) {
callback.failed(ex);
}
throw new ExecutionException(ex);
}
}
}
};
AbstractConnPool.getPoolEntryBlocking
The name says it all: this is a method that blocks to obtain a pool resource.
Heads up, here comes the important part.
private E getPoolEntryBlocking(
final T route, final Object state,
final long timeout, final TimeUnit tunit,
final Future<E> future) throws IOException, InterruptedException, TimeoutException {
Date deadline = null;
if (timeout > 0) {
deadline = new Date (System.currentTimeMillis() + tunit.toMillis(timeout));
}
this.lock.lock();
try {
final RouteSpecificPool<T, C, E> pool = getPool(route);
E entry;
for (;;) {
Asserts.check(!this.isShutDown, "Connection pool shut down");
for (;;) {
entry = pool.getFree(state);
if (entry == null) {
break;
}
if (entry.isExpired(System.currentTimeMillis())) {
entry.close();
}
if (entry.isClosed()) {
this.available.remove(entry);
pool.free(entry, false);
} else {
break;
}
}
if (entry != null) {
this.available.remove(entry);
this.leased.add(entry);
onReuse(entry);
return entry;
}
// New connection is needed
final int maxPerRoute = getMax(route);
// Shrink the pool prior to allocating a new connection
final int excess = Math.max(0, pool.getAllocatedCount() + 1 - maxPerRoute);
if (excess > 0) {
for (int i = 0; i < excess; i++) {
final E lastUsed = pool.getLastUsed();
if (lastUsed == null) {
break;
}
lastUsed.close();
this.available.remove(lastUsed);
pool.remove(lastUsed);
}
}
if (pool.getAllocatedCount() < maxPerRoute) {
final int totalUsed = this.leased.size();
final int freeCapacity = Math.max(this.maxTotal - totalUsed, 0);
if (freeCapacity > 0) {
final int totalAvailable = this.available.size();
if (totalAvailable > freeCapacity - 1) {
if (!this.available.isEmpty()) {
final E lastUsed = this.available.removeLast();
lastUsed.close();
final RouteSpecificPool<T, C, E> otherpool = getPool(lastUsed.getRoute());
otherpool.remove(lastUsed);
}
}
final C conn = this.connFactory.create(route);
entry = pool.add(conn);
this.leased.add(entry);
return entry;
}
}
boolean success = false;
try {
if (future.isCancelled()) {
throw new InterruptedException("Operation interrupted");
}
pool.queue(future);
this.pending.add(future);
if (deadline != null) {
success = this.condition.awaitUntil(deadline);
} else {
this.condition.await();
success = true;
}
if (future.isCancelled()) {
throw new InterruptedException("Operation interrupted");
}
} finally {
// In case of 'success', we were woken up by the
// connection pool and should now have a connection
// waiting for us, or else we're shutting down.
// Just continue in the loop, both cases are checked.
pool.unqueue(future);
this.pending.remove(future);
}
// check for spurious wakeup vs. timeout
if (!success && (deadline != null && deadline.getTime() <= System.currentTimeMillis())) {
break;
}
}
throw new TimeoutException("Timeout waiting for connection");
} finally {
this.lock.unlock();
}
}
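This code is a bit long, so let's take the pool-leasing logic apart piece by piece:
1. The code starts by declaring a deadline and then checks timeout. Pay attention to this timeout: deadline is assigned only if it is greater than zero. If it is 0, deadline is never assigned, which means deadline stays null. An unset connectionRequestTimeout is -1 in RequestConfig, and MainClientExec (shown earlier) passes that on as 0, so a client that never configures it lands in exactly this case.
Date deadline = null;
if (timeout > 0) {
    // If the timeout is valid, set the deadline
    deadline = new Date(System.currentTimeMillis() + tunit.toMillis(timeout));
}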
2. Enter the locked section. pool.getFree fetches a pooled entry. If one is obtained and the connection has neither expired nor been closed, entry is returned directly.
Asserts.check(!this.isShutDown, "Connection pool shut down");
for (;;) {
    // Get a free entry from the pool
    entry = pool.getFree(state);
    if (entry == null) {
        break;
    }
    // Check whether the entry has expired
    if (entry.isExpired(System.currentTimeMillis())) {
        entry.close();
    }
    if (entry.isClosed()) {
        this.available.remove(entry);
        pool.free(entry, false);
    } else {
        break;
    }
}
if (entry != null) {
    this.available.remove(entry);
    this.leased.add(entry);
    onReuse(entry);
    return entry;
}
3. If nothing was obtained, execution continues with the code below.
4. Check whether the maximum pool size configured for the host (route) has been reached and whether a connection needs to be added. If so, the pool is shrunk before the new connection is added, and the new entry is then allocated and returned.
// New connection is needed: check whether a new connection has to be created
final int maxPerRoute = getMax(route);
// Shrink the pool prior to allocating a new connection
final int excess = Math.max(0, pool.getAllocatedCount() + 1 - maxPerRoute);
if (excess > 0) {
    for (int i = 0; i < excess; i++) {
        final E lastUsed = pool.getLastUsed();
        if (lastUsed == null) {
            break;
        }
        lastUsed.close();
        this.available.remove(lastUsed);
        pool.remove(lastUsed);
    }
}
if (pool.getAllocatedCount() < maxPerRoute) {
    final int totalUsed = this.leased.size();
    final int freeCapacity = Math.max(this.maxTotal - totalUsed, 0);
    if (freeCapacity > 0) {
        final int totalAvailable = this.available.size();
        if (totalAvailable > freeCapacity - 1) {
            if (!this.available.isEmpty()) {
                final E lastUsed = this.available.removeLast();
                lastUsed.close();
                final RouteSpecificPool<T, C, E> otherpool = getPool(lastUsed.getRoute());
                otherpool.remove(lastUsed);
            }
        }
        final C conn = this.connFactory.create(route);
        entry = pool.add(conn);
        this.leased.add(entry);
        return entry;
    }
}
5. If none of the above applies, the pool is actually used up and already at its maximum, so no resource can be obtained from it. All that is left is to wait...
6. While waiting, deadline is checked. If deadline is not null, the wait is bounded by it. If it is null, the wait is unbounded and lasts until a resource becomes available.
boolean success = false;
try {
    if (future.isCancelled()) {
        throw new InterruptedException("Operation interrupted");
    }
    pool.queue(future);
    this.pending.add(future);
    // Check whether a deadline was set
    if (deadline != null) {
        // If so, wait until the deadline at most
        success = this.condition.awaitUntil(deadline);
    } else {
        // If not, wait indefinitely; there is no timeout
        this.condition.await();
        success = true;
    }
    if (future.isCancelled()) {
        throw new InterruptedException("Operation interrupted");
    }
} finally {
    // In case of 'success', we were woken up by the
    // connection pool and should now have a connection
    // waiting for us, or else we're shutting down.
    // Just continue in the loop, both cases are checked.
    pool.unqueue(future);
    this.pending.remove(future);
}
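The walkthrough above boils down to one branch: with a deadline the waiter gives up, without one it parks until somebody releases a connection. The following is a stripped-down sketch of that branch using only java.util.concurrent. BlockingLeaseSketch and its methods are hypothetical names, and it counts free slots instead of tracking real connections, so treat it as a model of the waiting behaviour rather than of AbstractConnPool itself.

```java
import java.util.Date;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the wait in getPoolEntryBlocking; counts slots, not connections.
final class BlockingLeaseSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();
    private int free;

    BlockingLeaseSketch(int size) {
        this.free = size;
    }

    // timeoutMillis <= 0 means "no deadline", mirroring the deadline == null case.
    void lease(long timeoutMillis) throws InterruptedException, TimeoutException {
        final Date deadline = timeoutMillis > 0
                ? new Date(System.currentTimeMillis() + timeoutMillis)
                : null;
        lock.lock();
        try {
            while (free == 0) {
                if (deadline == null) {
                    released.await();                     // unbounded wait
                } else if (!released.awaitUntil(deadline)) {
                    throw new TimeoutException("Timeout waiting for connection");
                }
            }
            free--;
        } finally {
            lock.unlock();
        }
    }

    void release() {
        lock.lock();
        try {
            free++;
            released.signal();
        } finally {
            lock.unlock();
        }
    }

    // Leases the only slot twice on one thread; true if the second lease timed out.
    static boolean timesOutOnExhaustedPool(long timeoutMillis) {
        final BlockingLeaseSketch pool = new BlockingLeaseSketch(1);
        try {
            pool.lease(timeoutMillis);
            pool.lease(timeoutMillis);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Same scenario without a deadline; true if the thread is still parked after observeMillis.
    static boolean stillBlockedAfter(long observeMillis) {
        final BlockingLeaseSketch pool = new BlockingLeaseSketch(1);
        final Thread waiter = new Thread(() -> {
            try {
                pool.lease(0);
                pool.lease(0);
            } catch (InterruptedException | TimeoutException e) {
                // Interrupted by the observer below; the demo thread simply ends.
            }
        });
        waiter.setDaemon(true);
        waiter.start();
        try {
            waiter.join(observeMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        final boolean blocked = waiter.isAlive();
        waiter.interrupt();
        return blocked;
    }
}
```

Calling timesOutOnExhaustedPool(100) fails fast with a timeout, while stillBlockedAfter(200) reports a thread that would have stayed parked for as long as nobody released a slot.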
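Summary
With the analysis done, the picture is clear.
1. One backend service was effectively down because of a third-party call it makes, so every request the gateway sent to it timed out.
2. Because that service receives a particularly large number of requests, the connection pool the gateway allots to it was exhausted and no connection could be obtained, so request threads kept piling up in the gateway.
3. The number of gateway threads tied up on that service kept growing, until other services could no longer work properly either.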
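The three points above can be replayed in miniature. The sketch below is a simplification under stated assumptions, not a model of Zuul: GatewayStarvationSketch is a hypothetical class, a fixed pool of four worker threads stands in for the gateway's request threads, and a Semaphore with one permit that is never released stands in for the dead backend's connection pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical miniature of the incident; worker and pool sizes are made up.
final class GatewayStarvationSketch {

    // Sends 4 requests to the dead route, then 4 to a healthy one, and returns how
    // many healthy requests were served within observeMillis.
    // leaseTimeoutMillis <= 0 means the pool wait has no timeout.
    static long fastRequestsServed(long leaseTimeoutMillis, long observeMillis) {
        final int workerCount = 4;
        final ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        final Semaphore slowRoutePool = new Semaphore(1);
        slowRoutePool.acquireUninterruptibly(); // the only connection is stuck on the dead backend
        final CountDownLatch fastDone = new CountDownLatch(workerCount);
        try {
            for (int i = 0; i < workerCount; i++) {
                workers.execute(() -> {
                    try {
                        if (leaseTimeoutMillis > 0) {
                            // Bounded wait: give up and free the worker thread.
                            if (slowRoutePool.tryAcquire(leaseTimeoutMillis, TimeUnit.MILLISECONDS)) {
                                slowRoutePool.release();
                            }
                        } else {
                            // Unbounded wait: the worker thread is lost to this route.
                            slowRoutePool.acquire();
                            slowRoutePool.release();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            for (int i = 0; i < workerCount; i++) {
                workers.execute(fastDone::countDown); // a request to a healthy service
            }
            fastDone.await(observeMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            workers.shutdownNow(); // interrupts any worker still parked on the pool
        }
        return workerCount - fastDone.getCount();
    }

    public static void main(String[] args) {
        System.out.println("no lease timeout: " + fastRequestsServed(0, 300));
        System.out.println("50 ms lease timeout: " + fastRequestsServed(50, 2000));
    }
}
```

Without a lease timeout the healthy requests never get a worker. With one, the workers come back after the timeout and the healthy requests go through.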
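Fix
It is actually simple: add a timeout.
This timeout is the timeout for waiting on a pool resource.
In Zuul, override SimpleHostRoutingFilter and override the creation of the HttpClient, setting ConnectionRequestTimeout in the RequestConfig: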
protected CloseableHttpClient newClient() {
    // Note: setConnectionRequestTimeout takes milliseconds, so this fallback is 60 ms
    if (connectionRequestTimeout == null || connectionRequestTimeout <= 0) {
        connectionRequestTimeout = 60;
    }
    final RequestConfig requestConfig = RequestConfig.custom()
            // Socket (read) timeout
            .setSocketTimeout(SOCKET_TIMEOUT.get())
            // Connect timeout
            .setConnectTimeout(CONNECTION_TIMEOUT.get())
            // Timeout for waiting on a connection from the pool
            .setConnectionRequestTimeout(connectionRequestTimeout)
            .setCookieSpec(CookieSpecs.IGNORE_COOKIES).build();
    HttpClientBuilder httpClientBuilder = HttpClients.custom();
    if (!this.sslHostnameValidationEnabled) {
        httpClientBuilder.setSSLHostnameVerifier(NoopHostnameVerifier.INSTANCE);
    }
    return httpClientBuilder.setConnectionManager(newConnectionManager())
            .disableContentCompression()
            .useSystemProperties().setDefaultRequestConfig(requestConfig)
            .setRetryHandler(new DefaultHttpRequestRetryHandler(0, false))
            .setRedirectStrategy(new RedirectStrategy() {
                @Override
                public boolean isRedirected(HttpRequest request,
                        HttpResponse response, HttpContext context)
                        throws ProtocolException {
                    return false;
                }

                @Override
                public HttpUriRequest getRedirect(HttpRequest request,
                        HttpResponse response, HttpContext context)
                        throws ProtocolException {
                    return null;
                }
            }).build();
}