K8s watch长连接中Okhttp的坑

2023-05-29 本文已影响0人 Teddy_b

背景

一个HA的K8s集群，部署了三节点的kube-apiserver，然后通过域名访问kube-apiserver；其中一个master挂了后，客户端的watch就再也收不到事件了，其他的list等查询、create操作一切正常

开始

客户端使用的是k8s官方的Java客户端库，简化后的代码示例：

ApiClient client = Config.fromToken("https://prod.domain.ss:6443", "token", false);

CoreV1Api coreV1Api = new CoreV1Api(client);

        SharedInformerFactory factory = new SharedInformerFactory(client);
        SharedIndexInformer<V1Pod> podInformer =
                factory.sharedIndexInformerFor(
                        (CallGeneratorParams params) -> coreV1Api.listPodForAllNamespacesCall(null, null, null, labelQuery, null,
                                null, params.resourceVersion, null, null, params.watch, null), V1Pod.class, V1PodList.class, 60 * 1000L);

        factory.startAllRegisteredInformers();

其实和官方的Informer使用示例基本一致，但是在master节点挂掉后Informer就一直收不到事件了，表现很奇怪，只要重启Informer即可恢复。为了搞清楚为什么Informer会收不到事件，做了一系列尝试去复现这个问题；

探索

添加额外的日志

K8s的官方Java客户端中主要执行list+watch逻辑的是ReflectorRunnable，因此我们参考官方的实现，重新实现一个自己的ReflectorRunnable，并加上一些异常日志方便我们排查问题

public class MyReflectorRunnable <
        ApiType extends KubernetesObject, ApiListType extends KubernetesListObject>
        implements Runnable {
        
        public void run() {
                  ...
                  if ((t instanceof RuntimeException)
                            && t.getMessage() != null
                            && t.getMessage().contains("IO Exception during hasNext")) {
                        t.printStackTrace();
                        log.info("{}#Read timeout retry list and watch", this.apiTypeClass);
                        // IO timeout should be taken as a normal case
                        return;
                    }
                  ...
        }
}

扩展kube-apiserver

由于只能在测试环境进行复现，而我们的测试环境是单点的kube-apiserver，为了能够模拟线上的多节点kube-apiserver，然后通过域名访问，首先在测试集群中快速复制出了另一个kube-apiserver，参考上一篇（https://www.jianshu.com/p/c683ccbfa30f）；两个kube-apiserver，我们在测试环境也通过域名去访问

由于稍后需要对kube-apiserver不断的对kube-apiserver进行重启测试，我们首先需要修改下kube-apiserver的启动配置，关闭kube-apiserver的自动重启

root@bigo:~# cat /lib/systemd/system/kube-apiserver.service 
[Unit]
Description=Kubernetes API Server
Documentation=https://kubernetes.io/docs/concepts/overview/components/#kube-apiserver https://kubernetes.io/docs/reference/generated/kube-apiserver/
After=network.target

[Service]
EnvironmentFile=-/etc/kubernetes/apiserver
ExecStart=/usr/bin/kube-apiserver \
            $KUBE_API_ARGS \
            $KUBE_LOGTOSTDERR \
            $KUBE_LOG_LEVEL \
            $KUBE_ETCD_SERVERS \
            $KUBE_API_ADDRESS \
            $KUBE_API_PORT \
            $KUBELET_PORT \
            $KUBE_ALLOW_PRIV \
            $KUBE_SERVICE_ADDRESSES \
            $KUBE_ADMISSION_CONTROL 
Restart=no   # 把这个改为no，这样systemd才不会再我们kill掉kube-apiserver的时候自动重新拉起
Type=notify
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

测试

测试代码：

CoreV1Api coreV1Api = new CoreV1Api(client);

        SharedInformerFactory factory = new SharedInformerFactory(client);
        SharedIndexInformer<V1Pod> podInformer =
                    factory.sharedIndexInformerFor(
                            (CallGeneratorParams params) -> coreV1Api.listPodForAllNamespacesCall(null, null, null, labelQuery, null,
                                    null, params.resourceVersion, null, params.timeoutSeconds, params.watch, null), V1Pod.class, V1PodList.class, 60 * 1000L);
       
        factory.startAllRegisteredInformers();


        ScheduledExecutorService checkExecutor =
                Executors.newSingleThreadScheduledExecutor(
                        Threads.threadFactory("controller-resync-" + "test" + "-%d"));
        try {
            checkExecutor.scheduleWithFixedDelay(() -> {
                List<V1Pod> podIndex = podInformer.getIndexer().byIndex("podIndex", "demo-a-null");
                log.info("Check size================================================:{}", podIndex.size());
                for (V1Pod pod : podIndex) {
                    log.info("Check name================================================:{}", pod.getMetadata().getName());
                }
            }
                    , 0L, 2000, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException e) {
            // submitting reflector list-watching job can fail due to concurrent invocation of
            // `shutdown`. handling exception with a warning then return.
            log.warn("reflector list-watching job exiting because the thread-pool is shutting down");
            return;
        }

测试代码使用的都是我们加了异常日志后的那些类，然后加上了一个定时执行用于探测指定deployment的pod数量及其名称。因为我们会在kill掉kube-apiserver后，手动删除这个deployment的pod，然后让这个定时任务检测pod数量和名称是否会变化，如果正常的变化了，说明Informer仍在正常工作，如果在我们删除pod后，定时任务打印的却没变化，说明Informer已经卡死了，没有收到后续的事件通知。

kill & kill -9

首先我们尝试正常kill掉kube-apiserver，kill后的日志输出

image.png

可以看到kill掉kube-apiserver后，Informer正常的抛出了异常，并且开始重新的list+watch，很遗憾，没能复现。

直接重启节点

第二次尝试直接重启节点，节点重启后的日志和上面kill的一致，Informer也是正常的抛出了异常，并且开始重新的list+watch

添加http请求详细日志

HttpLoggingInterceptor loggingInterceptor = new HttpLoggingInterceptor();
        loggingInterceptor.setLevel(HttpLoggingInterceptor.Level.HEADERS);
        client.setHttpClient(client.getHttpClient().newBuilder().addNetworkInterceptor(loggingInterceptor).build());

k8s官方的Java库使用的是okhttp客户端，首先修改下okhttp的日志级别，打印出HTTP请求的详细信息

image.png

h2对应的是http2，由于我window本机的jdk版本为1.8.0_291，默认使用的就是http2协议，我们尝试将协议改为http1.1

List<Protocol> protocols = new ArrayList<>();
        protocols.add(Protocol.HTTP_1_1);
        client.setHttpClient(client.getHttpClient().newBuilder()
                .protocols(protocols).build());

直接重启节点，发现这次Informer就没法收到后续的watch事件了，至此复现了线上的场景

原因分析

登录线上服务器，发现线上的jdk版本是1.8.0_121，这个版本的jdk默认使用的是http1.1协议，okhttp长连接默认是没有开启tcp keepalive的，因此在服务器节点直接挂掉后，就会出现这种死连接的情况；因此，总结其原因是http1.1协议+没开启tcp keepalive，出现了死连接，导致watch没法收到后续的事件。

解决方案

升级jdk的小版本，打破原因中的第一项，强制使用http2协议
启用tcp keepalive，我们可以直接使用curl 来进行试验；由于操作系统默认的tcp keepalive时间是7200s（我这里为了测试改成了72s），所以不太建议依赖tcp keepalive来规避这个问题，线上故障两个小时，可能黄花菜都凉了~

# curl 发起watch请求
curl -k -H 'Authorazation: Bearer token' https://test.domain.ss:6443/api/v1/pods?watch=true


# 查看连接状况
# 可以看到curl的时候默认就是开启了tcp keepalive
netstat -anpo|grep curl
tcp     0      0     source_ip:22345     target_ip:6443   ESTABLISHED 51730/cur7      keepalive (1.02/0/1)

# 查看系统tcp keepalive内核参数
# 在72s后，再次netstat查看的时候连接已经被正常释放
root@bigo:# sysct1 -a lgrep keepalive
net.ipv4.tcp_keepalive_intv7  =  5
net.ipv4 .tcp_keepalive_probes = 1
net.ipv4.tcp_keepalive_time = 72

另外一种方式是将长连接改为websocket应该也是可行的，websocket的心跳机制应该可以避免死连接的出现，待验证