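Troubleshooting a kube-vip public IP that is unreachable about 1 time in 3
- Check from inside the cluster and look at the three backend pods behind the svc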
[root@pro-k8s-master-1 ~]# kubectl get po -A -o wide | grep ingress
kube-system traefik-ingress-controller-bb4fd888c-6cdlz 1/1 Running 0 4m23s 10.120.37.211 inner-worker-3 <none> <none>
kube-system traefik-ingress-controller-bb4fd888c-7jddp 1/1 Running 0 3m39s 10.120.37.212 inner-worker-2 <none> <none>
kube-system traefik-ingress-controller-bb4fd888c-pmrnt 1/1 Running 0 4m57s 10.120.37.210 inner-worker-1
[root@pro-k8s-master-1 ~]# ping 10.120.37.211
PING 10.120.37.211 (10.120.37.211) 56(84) bytes of data.
64 bytes from 10.120.37.211: icmp_seq=1 ttl=64 time=52.5 ms
^C
--- 10.120.37.211 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 52.527/52.527/52.527/0.000 ms
[root@pro-k8s-master-1 ~]# ping 10.120.37.212
PING 10.120.37.212 (10.120.37.212) 56(84) bytes of data.
64 bytes from 10.120.37.212: icmp_seq=1 ttl=64 time=2.33 ms
^C
--- 10.120.37.212 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.330/2.330/2.330/0.000 ms
[root@pro-k8s-master-1 ~]# ping 10.120.37.210
PING 10.120.37.210 (10.120.37.210) 56(84) bytes of data.
64 bytes from 10.120.37.210: icmp_seq=1 ttl=64 time=2.38 ms
[root@pro-k8s-master-3 ~]# curl 10.120.37.211
404 page not found
[root@pro-k8s-master-3 ~]# curl 10.120.37.212
404 page not found
[root@pro-k8s-master-3 ~]# curl 10.120.37.210
404 page not found
[root@pro-k8s-master-3 ~]# kubectl get svc -A -o wide | grep traefik-ingress-service
kube-system traefik-ingress-service LoadBalancer 10.98.104.134 10.120.47.203 80:31753/TCP,8080:31108/TCP,443:31925/TCP 380d app=traefik-ingress-lb
[root@pro-k8s-master-3 ~]#
[root@pro-k8s-master-3 ~]#
[root@pro-k8s-master-3 ~]# curl 10.98.104.134
404 page not found
[root@pro-k8s-master-3 ~]# curl 10.98.104.134
404 page not found
[root@pro-k8s-master-3 ~]# curl 10.98.104.134
404 page not found
[root@pro-k8s-master-3 ~]# curl 10.98.104.134
404 page not found
[root@pro-k8s-master-3 ~]# curl 10.98.104.134
404 page not found
[root@pro-k8s-master-3 ~]# curl 10.98.104.134
404 page not found
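# Inside the cluster everything works
- Check the path from outside the cluster into the cluster
ping works without any problem, which shows that the VIP sub-interface on eth0 is fine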
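ping only exercises ICMP, while the failure shows up on TCP port 80, so it helps to count how many HTTP attempts from an external client actually fail. A minimal sketch, assuming a POSIX shell on the client; the helper name `probe` is hypothetical and not part of the original notes:

```shell
# probe N CMD...: run CMD N times and print how many runs succeeded or failed.
# Written as a subshell function so its variables do not leak into the caller.
probe() (
  n=$1; shift
  ok=0; fail=0; i=0
  while [ "$i" -lt "$n" ]; do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    else
      fail=$((fail + 1))
    fi
    i=$((i + 1))
  done
  echo "ok=$ok fail=$fail"
)

# Example (run from a machine outside the cluster):
#   probe 30 curl -s -o /dev/null --connect-timeout 2 http://10.120.47.203/
```

curl exits with 0 on the expected `404 page not found` answer because `-f` is not used, so only connection-level failures are counted as `fail`.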
- Check whether the interface carrying 10.120.47.203 was initialized abnormally on any node
(py3env) [root@deployer env-inner-prod-on-prem]# ansible all -i inventory/env-inner-prod-on-prem/inventory.ini -m shell -a "ip a | grep 10.120.47.203"
pro-k8s-master-1 | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
pro-k8s-master-2 | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
pro-k8s-master-3 | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
inner-worker-2 | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global eth0 # the VIP sub-interface lives on this node; capture the incoming external packets here
inet 10.120.47.203/32 scope global kube-ipvs0
inner-prod-common-c6-4xl-asg-ofg-6ns-y5j-server-tv4 | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
inner-prod-common-c6-4xl-asg-ofg-ntl-bcr-server-x5k | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
inner-worker-3 | CHANGED | rc=0 >> # this dummy interface looks suspicious
inet 10.120.47.203/32 scope global kube-ipvs0
Dump was interrupted and may be inconsistent.
inner-prod-common-c6-4xl-asg-ofg-jkr-2mx-server-n45 | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
inner-prod-common-c6-4xl-asg-ofg-tyw-5lw-server-7dp | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
inner-prod-common-c6-4xl-asg-ofg-a2s-4jv-server-qcq | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
inner-worker-1 | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
inner-prod-common-c6-4xl-asg-ofg-hok-aok-server-ujn | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
inner-prod-common-c6-4xl-asg-ofg-qgh-63l-server-tci | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
cn-xm-logging-es-asg-3pm-qut-4xx | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
inner-prod-common-c6-4xl-asg-ofg-j7q-2mj-server-ytk | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
cn-xm-logging-es-asg-3pm-eik-cql | CHANGED | rc=0 >>
inet 10.120.47.203/32 scope global kube-ipvs0
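What this check is looking for: kube-proxy in IPVS mode binds the service address to the kube-ipvs0 dummy interface on every node, while eth0 should carry the VIP on exactly one node. A small sketch that makes this easier to read per node, assuming the one-line layout of `ip -o -4 addr show`; the helper name `ifaces_with_addr` is hypothetical:

```shell
# ifaces_with_addr IP: read `ip -o -4 addr show` output on stdin and print
# the name of every interface that carries IP (any prefix length).
ifaces_with_addr() {
  awk -v ip="$1" '
    $3 == "inet" {
      split($4, a, "/")          # $4 looks like 10.120.47.203/32
      if (a[1] == ip) print $2   # $2 is the interface name
    }'
}

# Example (on one node):
#   ip -o -4 addr show | ifaces_with_addr 10.120.47.203
```

On the node that currently holds the VIP this should list both eth0 and kube-ipvs0; on every other node only kube-ipvs0.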
- Capture packets and compare a successful request with a failed one
[root@inner-worker-2 ~]# tcpdump -i any host 10.60.22.36 and port 80 -netvv
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
In dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14018, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.57922 > 10.120.47.203.http: Flags [S], cksum 0x0c37 (correct), seq 3301729766, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
Out fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 52)
10.120.47.203.http > 10.60.22.36.57922: Flags [S.], cksum 0xf034 (correct), seq 897740622, ack 3301729767, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 9], length 0
In dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 196, id 14019, offset 0, flags [DF], proto TCP (6), length 40)
10.60.22.36.57922 > 10.120.47.203.http: Flags [.], cksum 0xa118 (correct), seq 1, ack 1, win 513, length 0
In dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 133: (tos 0x0, ttl 196, id 14020, offset 0, flags [DF], proto TCP (6), length 117)
10.60.22.36.57922 > 10.120.47.203.http: Flags [P.], cksum 0x0db7 (correct), seq 1:78, ack 1, win 513, length 77: HTTP, length: 77
GET / HTTP/1.1
Host: 10.120.47.203
User-Agent: curl/7.84.0
Accept: */*
Out fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 63, id 31000, offset 0, flags [DF], proto TCP (6), length 40)
10.120.47.203.http > 10.60.22.36.57922: Flags [.], cksum 0xa292 (correct), seq 1, ack 78, win 58, length 0
Out fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 232: (tos 0x0, ttl 63, id 31001, offset 0, flags [DF], proto TCP (6), length 216)
10.120.47.203.http > 10.60.22.36.57922: Flags [P.], cksum 0x676a (correct), seq 1:177, ack 78, win 58, length 176: HTTP, length: 176
HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Thu, 29 Sep 2022 02:14:32 GMT
Content-Length: 19
404 page not found
In dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 196, id 14021, offset 0, flags [DF], proto TCP (6), length 40)
10.60.22.36.57922 > 10.120.47.203.http: Flags [F.], cksum 0xa01b (correct), seq 78, ack 177, win 512, length 0
Out fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 63, id 31002, offset 0, flags [DF], proto TCP (6), length 40)
10.120.47.203.http > 10.60.22.36.57922: Flags [F.], cksum 0xa1e0 (correct), seq 177, ack 79, win 58, length 0
In dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 196, id 14022, offset 0, flags [DF], proto TCP (6), length 40)
10.60.22.36.57922 > 10.120.47.203.http: Flags [.], cksum 0xa01a (correct), seq 79, ack 178, win 512, length 0
# Path of a normal packet
client eip <--> lb eip <--> backend ip
traefik-ingress-controller-bb4fd888c-pmrnt 1/1 Running 0 4m57s 10.120.37.210 inner-worker-1
traefik-ingress-controller-bb4fd888c-6cdlz 1/1 Running 0 4m23s 10.120.37.211 inner-worker-3
traefik-ingress-controller-bb4fd888c-7jddp 1/1 Running 0 3m39s 10.120.37.212 inner-worker-2
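# Here the source address is not rewritten to the VIP; the packet is forwarded as is. The SYN-ACK leaves with the pod IP 10.120.37.212 as its source and the client answers with RST, so, because of this routing problem, no usable reply ever reaches the client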
In dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14033, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.47.203.http: Flags [S], cksum 0x04d9 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
Out fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14033, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.37.212.http: Flags [S], cksum 0x0ed0 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
P fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14033, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.37.212.http: Flags [S], cksum 0x0ed0 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
Out 00:00:00:c1:56:c1 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
10.120.37.212.http > 10.60.22.36.58181: Flags [S.], cksum 0x3a14 (correct), seq 2311377861, ack 2980740964, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 9], length 0
P dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 196, id 462, offset 0, flags [DF], proto TCP (6), length 40)
10.60.22.36.58181 > 10.120.37.212.http: Flags [R], cksum 0x1981 (correct), seq 2980740964, win 0, length 0
In dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14034, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.47.203.http: Flags [S], cksum 0x04d9 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
Out fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14034, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.37.212.http: Flags [S], cksum 0x0ed0 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
P fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14034, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.37.212.http: Flags [S], cksum 0x0ed0 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
Out 00:00:00:c1:56:c1 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
10.120.37.212.http > 10.60.22.36.58181: Flags [S.], cksum 0x50d2 (incorrect -> 0x2178), seq 2326981491, ack 2980740964, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 9], length 0
P dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 196, id 463, offset 0, flags [DF], proto TCP (6), length 40)
10.60.22.36.58181 > 10.120.37.212.http: Flags [R], cksum 0x1981 (correct), seq 2980740964, win 0, length 0
In dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14035, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.47.203.http: Flags [S], cksum 0x04d9 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
Out fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14035, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.37.212.http: Flags [S], cksum 0x0ed0 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
P fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14035, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.37.212.http: Flags [S], cksum 0x0ed0 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
Out 00:00:00:c1:56:c1 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
10.120.37.212.http > 10.60.22.36.58181: Flags [S.], cksum 0x50d2 (incorrect -> 0xca8d), seq 2358263936, ack 2980740964, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 9], length 0
P dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 196, id 464, offset 0, flags [DF], proto TCP (6), length 40)
10.60.22.36.58181 > 10.120.37.212.http: Flags [R], cksum 0x1981 (correct), seq 2980740964, win 0, length 0
In dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14036, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.47.203.http: Flags [S], cksum 0x04d9 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
Out fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14036, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.37.212.http: Flags [S], cksum 0x0ed0 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
P fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14036, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.37.212.http: Flags [S], cksum 0x0ed0 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
Out 00:00:00:c1:56:c1 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
10.120.37.212.http > 10.60.22.36.58181: Flags [S.], cksum 0x50d2 (incorrect -> 0x0cd8), seq 2420767356, ack 2980740964, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 9], length 0
P dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 196, id 465, offset 0, flags [DF], proto TCP (6), length 40)
10.60.22.36.58181 > 10.120.37.212.http: Flags [R], cksum 0x1981 (correct), seq 2980740964, win 0, length 0
In dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14037, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.47.203.http: Flags [S], cksum 0x04d9 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
Out fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14037, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.37.212.http: Flags [S], cksum 0x0ed0 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
P fa:16:3e:47:db:fc ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 196, id 14037, offset 0, flags [DF], proto TCP (6), length 52)
10.60.22.36.58181 > 10.120.37.212.http: Flags [S], cksum 0x0ed0 (correct), seq 2980740963, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
Out 00:00:00:c1:56:c1 ethertype IPv4 (0x0800), length 68: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
10.120.37.212.http > 10.60.22.36.58181: Flags [S.], cksum 0x50d2 (incorrect -> 0x67da), seq 2545784838, ack 2980740964, win 29200, options [mss 1460,nop,nop,sackOK,nop,wscale 9], length 0
P dc:ef:80:5a:44:13 ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 196, id 466, offset 0, flags [DF], proto TCP (6), length 40)
10.60.22.36.58181 > 10.120.37.212.http: Flags [R], cksum 0x1981 (correct), seq 2980740964, win 0, length 0
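The difference between the two captures comes down to the source address of the SYN-ACK: the VIP in the working case, the pod IP in the failing one. A rough filter for spotting this in a long capture, assuming the two-line `-v` layout shown above (the address line starts with `src.port > dst.port:`); the helper name `synack_sources_not_vip` is hypothetical:

```shell
# synack_sources_not_vip VIP: read tcpdump text on stdin and print the source
# IP of every SYN-ACK ("Flags [S.]") that does not come from VIP.
synack_sources_not_vip() {
  awk -v vip="$1" '
    /Flags \[S\.\]/ {
      src = $1
      sub(/\.[^.]*$/, "", src)   # drop the trailing ".port" (e.g. ".http")
      if (src != vip) print src
    }'
}

# Example (on the node that holds the VIP):
#   tcpdump -l -netvv -i any host 10.60.22.36 and port 80 | synack_sources_not_vip 10.120.47.203
```

Every line it prints is a reply that left the node without being translated back to the VIP.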
# The IPVS rules are fine
[root@inner-worker-2 ~]# ipvsadm -ln | grep -A 4 10.120.47.203:80
TCP 10.120.47.203:80 rr
-> 10.120.37.210:80 Masq 1 8 2
-> 10.120.37.211:80 Masq 1 5 1
-> 10.120.37.212:80 Masq 1 2 0
# And the problem cannot be reproduced from inside the cluster
[root@inner-worker-2 ~]# curl 10.120.47.203:80
404 page not found
[root@inner-worker-2 ~]# curl 10.120.47.203:80
404 page not found
[root@inner-worker-2 ~]# curl 10.120.47.203:80
404 page not found
[root@inner-worker-2 ~]# curl 10.120.47.203:80
404 page not found
[root@inner-worker-2 ~]# curl 10.120.47.203:80
404 page not found
[root@inner-worker-2 ~]# curl 10.120.47.203:80
404 page not found
[root@inner-worker-2 ~]# curl 10.120.47.203:80
404 page not found
[root@inner-worker-2 ~]# curl 10.120.47.203:80
404 page not found
[root@inner-worker-2 ~]# curl 10.120.47.203:80
404 page not found
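A connection-count problem was largely ruled out from several angles:
- This is a small cluster to begin with: there are fewer than 100 connections, while the limit is 1024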
[root@inner-worker-2 ~]# ulimit -n
1024
[root@inner-worker-2 ~]#
[root@inner-worker-2 ~]#
[root@inner-worker-2 ~]#
[root@inner-worker-2 ~]# netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
ESTABLISHED 17
TIME_WAIT 53
[https://www.jianshu.com/p/71d554222f9e](https://www.jianshu.com/p/71d554222f9e)
- Memory and CPU are also sufficient
[root@inner-worker-2 ~]# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 578352 255092 9095200 0 0 4 44 0 0 6 2 92 0 0
[root@inner-worker-2 ~]#
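- The essence of the problem is that only the last pod fails, and it fails 100% of the time, so a resource limit or shortage is unlikely to be the cause
- Later I looked into IPVS connection reuse without SNAT and found a bug that looks very similar
Reference: https://imroc.cc/kubernetes/networking/faq/ipvs-conn-reuse-mode.html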
# The system default is 1
# conn_reuse_mode = 0 enables IPVS connection reuse, 1 disables it.
# A bit counterintuitive? It is indeed rather controversial.
[root@inner-worker-2 ~]# sysctl -a | grep net.ipv4.vs.conn_reuse_mode
net.ipv4.vs.conn_reuse_mode = 1 # to work around the dropped SYN, this parameter can temporarily be set to 0
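Setting this kernel parameter to 1 means that IPVS does not reuse connections when forwarding: every new connection is rescheduled to an rs and gets a new ip_vs_conn. The implementation has a flaw, though: when a new connection arrives (a SYN packet) and its client ip:client port matches an old IPVS connection (in TIME_WAIT state) while conntrack is in use, the first SYN packet is dropped and the connection is only established after the retransmission (1s), which drastically reduces connection setup performance.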
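To see whether a client actually has old entries that a new SYN could collide with, the IPVS connection table can be filtered for TIME_WAIT entries. A sketch, assuming the usual column layout of `ipvsadm -Lnc` (pro, expire, state, source, virtual, destination); the helper name `ipvs_time_wait_for_client` is hypothetical:

```shell
# ipvs_time_wait_for_client IP: read `ipvsadm -Lnc` output on stdin and print
# "client:port -> real-server:port" for each TCP entry of that client that is
# still in TIME_WAIT.
ipvs_time_wait_for_client() {
  awk -v client="$1" '
    $1 == "TCP" && $3 == "TIME_WAIT" {
      split($4, a, ":")                 # $4 is the source ip:port
      if (a[1] == client) print $4, "->", $6
    }'
}

# Example (on the node that holds the VIP):
#   ipvsadm -Lnc | ipvs_time_wait_for_client 10.60.22.36
```

A new SYN from one of the listed client ports is the case described above.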
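The Kubernetes community also found this bug, so when kube-proxy runs in IPVS forwarding mode it sets conn_reuse_mode to 0 by default to work around it; see PR #71114 and issue #70747.
Even with the value set to 0 some problems remain, but they do not show up as long as the backend pods are healthy. In practice this can be handled by strengthening the health checks for the IPVS backends, and if kube-ovn is in use, its LB and LB health-check mechanism can be used to avoid the problem.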
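Problems caused by conn_reuse_mode=0
To avoid the performance problem caused by conn_reuse_mode=1, Kubernetes has kube-proxy set conn_reuse_mode to 0 at startup in IPVS mode, that is, it uses IPVS connection reuse. But IPVS connection reuse has two problems:
- As soon as a client ip:client port matches an ip_vs_conn (reuse happens), the packet is forwarded straight to the corresponding rs regardless of its current state, even when the rs has weight 0 (usually in TIME_WAIT state). An rs in TIME_WAIT is usually a Pod that was Terminating and has already been destroyed, so a connection forwarded to it is bound to fail.
- Under high concurrency there is heavy reuse: no rs is scheduled for the new connections, which are forwarded directly to the rs of the reused connection, so many new connections get "pinned" to a subset of the rs.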
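The symptoms seen in practice can take several forms:
- Connection errors during rolling updates. While the accessed service is being rolled, Pods are created and destroyed; when IPVS reuses a connection, traffic is forwarded to a Pod that has already been destroyed and the connection fails (no route to host).
- Unbalanced load during rolling updates. Because reused connections are not rescheduled, new connections also get "pinned" to certain Pods.
- Newly scaled-out Pods receive little traffic. For the same reason, many new connections stay "pinned" to the Pods that existed before the scale-out.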
Workarounds
Reference: https://imroc.cc/kubernetes/networking/faq/ipvs-conn-reuse-mode.html
One more point: every k8s cluster should have this enabled
sysctl -w net.bridge.bridge-nf-call-iptables=1
It is already enabled here, so this is not what is interfering:
https://imroc.cc/kubernetes/networking/faq/why-enable-bridge-nf-call-iptables.html
Original description of the bug fix:
Hello everyone:
We are very fortunate to tell you that this bug has been fixed by us and has been verified to work very well. The patch(ipvs: avoid drop first packet by reusing conntrack) is being submitted to the Linux kernel community. You can also apply this patch to your own kernel, and then only need to set net.ipv4.vs.conn_reuse_mode=1(default) and net.ipv4.vs.conn_reuse_old_conntrack=1(default). As the net.ipv4.vs.conn_reuse_old_conntrack sysctl switch is newly added. You can adapt the kube-proxy by judging whether there is net.ipv4.vs.conn_reuse_old_conntrack, if so, it means that the current kernel is the version that fixed this bug.
That Can solve the following problems:
- Rolling update, IPVS keeps scheduling traffic to the destroyed Pod
- Unbalanced IPVS traffic scheduling after scaled up or rolling update
- fix IPVS low throughput issue #71114
- One second connection delay in masque https://marc.info/?t=151683118100004&r=1&w=2
- IPVS low throughput #70747
- Apache Bench can fill up ipvs service proxy in seconds cloudnativelabs/kube-router#544
- Additional 1s latency in host -> service IP -> pod when upgrading from 1.15.3 -> 1.18.1 on RHEL 8.1 #90854
- kube-proxy ipvs conn_reuse_mode setting causes errors with high load from single client #81775
Thank you.
By Yang Yuxi (TencentCloudContainerTeam)
https://github.com/kubernetes/kubernetes/pull/71114
Final conclusion:
# Fixes the IPVS dropped-SYN problem with kube-vip on the CentOS 7 3.10 kernel
sysctl -w net.ipv4.vs.conn_reuse_mode=0
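`sysctl -w` only changes the running kernel, so the value is lost on reboot. A minimal sketch for persisting it, assuming a drop-in file under /etc/sysctl.d/ (the file name 99-ipvs-conn-reuse.conf and the helper `persist_sysctl` are hypothetical) and assuming the ip_vs module is already loaded when the sysctl files are applied, since the net.ipv4.vs.* keys only exist once it is:

```shell
# persist_sysctl KEY VALUE FILE: write a single "KEY = VALUE" line to FILE,
# replacing whatever the file contained before.
persist_sysctl() {
  printf '%s = %s\n' "$1" "$2" > "$3"
}

# Example (as root on each node):
#   persist_sysctl net.ipv4.vs.conn_reuse_mode 0 /etc/sysctl.d/99-ipvs-conn-reuse.conf
#   sysctl --system                           # reload all sysctl configuration files
#   sysctl -n net.ipv4.vs.conn_reuse_mode     # check the value now in effect
```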