Analysis: pod fails to access a svc from inside the pod
Pods cannot reach the svc.
Environment:
3 masters, 2 workers
Each node has two NICs
node eth0: the default route is on eth0; this is the k8s management network. Node-to-svc traffic, pod-to-svc traffic passing through the node, and the pod's reply packets back to the node all go through eth0.
pod eth1: pod-to-pod traffic goes through the gateway on eth1 (a quick route check is sketched below).
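A quick way to confirm on a node which interface each kind of traffic actually uses (a diagnostic sketch; the pod IP is one that appears later in this note):
# show the default route, i.e. the k8s management path described above
ip route show default
# show which device and gateway would be used to reach a remote pod IP
ip route get 172.33.2.17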
Observed behavior:
svc info
[root@(l2)k8s-master-1 ~]# kubectl get svc | grep srvclb-ngnx
srvclb-ngnx LoadBalancer 10.111.240.224 <pending> 80:31288/TCP 23h
[root@(l2)k8s-master-1 ~]# ipvsadm -ln | grep -A 2 10.111.240.224
TCP 10.111.240.224:80 rr
-> 172.33.1.255:80 Masq 1 0 0
-> 172.33.2.17:80 Masq 1 0 0
The backends are two nginx web pods.
[root@(l2)k8s-master-1 ~]#kubectl get pod -A -o wide| grep -E "172.33.1.255|172.33.2.17|172.33.2.4"
default loadbalancer-5554b69d95-clgjd 1/1 Running 0 22h 172.33.1.255 k8s-worker-3
default loadbalancer-5554b69d95-tt99x 1/1 Running 0 17h 172.33.2.17 k8s-worker-1
default sshd-k8s-master-1 1/1 Running 0 20h 172.33.2.4 k8s-master-1
sshd-k8s-master-1 is the test client.
Packet capture and analysis on the node hosting the client pod
# client side
## keep a telnet session running inside the pod
[root@sshd-k8s-master-1 /]# telnet 10.111.240.224 80
Trying 10.111.240.224...
# packet capture on the node
## MAC address mapping
[root@(l2)k8s-master-1 env-test]# ansible all -i inventory/inventory.ini -m shell -a "ip a | grep -i -C 2 -E '00:00:00:fa:f1:34|00:00:00:b2:8f:1b'"
k8s-master-1 | CHANGED | rc=0 >>
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
link/ether 00:00:00:b2:8f:1b brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:00:00:9c:1c:c7 brd ff:ff:ff:ff:ff:ff
--
valid_lft forever preferred_lft forever
10: ipvl_3@eth1: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default
link/ether 00:00:00:b2:8f:1b brd ff:ff:ff:ff:ff:ff
inet 172.33.192.10/32 scope host ipvl_3
valid_lft forever preferred_lft forever
k8s-worker-1 | CHANGED | rc=0 >>
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc fq_codel state UP group default qlen 1000
link/ether 00:00:00:fa:f1:34 brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:00:00:0e:78:83 brd ff:ff:ff:ff:ff:ff
--
valid_lft forever preferred_lft forever
10: ipvl_3@eth1: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default
link/ether 00:00:00:fa:f1:34 brd ff:ff:ff:ff:ff:ff
inet 172.33.192.15/32 scope host ipvl_3
valid_lft forever preferred_lft forever
tcpdump -i any host 172.33.2.4 or 10.111.240.224 or 172.33.1.255 or 172.33.2.17 -netvv
## outgoing: the pod sends to the svc cluster IP; the packet leaves via eth1
Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 76: (tos 0x10, ttl 64, id 20122, offset 0, flags [DF], proto TCP (6), length 60)
172.33.2.4.49454 > 10.111.240.224.http: Flags [S], cksum 0xa9a3 (incorrect -> 0x66d3), seq 2359839294, win 65280, options [mss 1360,sackOK,TS val 2450394791 ecr 0,nop,wscale 7], length 0
## odd: why does the backend pod's response show up here already? The client's packet toward the backend should have appeared first
In 00:00:00:fa:f1:34 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.33.2.17.http > 172.33.2.4.49454: Flags [S.], cksum 0x5c86 (incorrect -> 0x5944), seq 3684244595, ack 2359839295, win 64704, options [mss 1360,sackOK,TS val 1076700321 ecr 2450394791,nop,wscale 7], length 0
## after IPVS has translated the frontend (cluster IP) to the backend
#### only the MAC has changed
Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
172.33.2.4.49454 > 172.33.2.17.http: Flags [R], cksum 0xbb22 (correct), seq 2359839295, win 0, length 0
## outgoing
Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 76: (tos 0x10, ttl 64, id 20123, offset 0, flags [DF], proto TCP (6), length 60)
172.33.2.4.49454 > 10.111.240.224.http: Flags [S], cksum 0xa9a3 (incorrect -> 0x62df), seq 2359839294, win 65280, options [mss 1360,sackOK,TS val 2450395803 ecr 0,nop,wscale 7], length 0
#### why are the two pods communicating directly, without going through the cluster IP?
In 00:00:00:fa:f1:34 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.33.2.17.http > 172.33.2.4.49454: Flags [S.], cksum 0x5c86 (incorrect -> 0x29bf), seq 3700048671, ack 2359839295, win 64704, options [mss 1360,sackOK,TS val 1076701333 ecr 2450395803,nop,wscale 7], length 0
Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
172.33.2.4.49454 > 172.33.2.17.http: Flags [R], cksum 0xbb22 (correct), seq 2359839295, win 0, length 0
#### repeated attempt
Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 76: (tos 0x10, ttl 64, id 20124, offset 0, flags [DF], proto TCP (6), length 60)
172.33.2.4.49454 > 10.111.240.224.http: Flags [S], cksum 0xa9a3 (incorrect -> 0x5adf), seq 2359839294, win 65280, options [mss 1360,sackOK,TS val 2450397851 ecr 0,nop,wscale 7], length 0
In 00:00:00:fa:f1:34 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
172.33.2.17.http > 172.33.2.4.49454: Flags [S.], cksum 0x5c86 (incorrect -> 0xcc70), seq 3732049541, ack 2359839295, win 64704, options [mss 1360,sackOK,TS val 1076703381 ecr 2450397851,nop,wscale 7], length 0
Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 56: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 40)
172.33.2.4.49454 > 172.33.2.17.http: Flags [R], cksum 0xbb22 (correct), seq 2359839295, win 0, length 0
In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 172.33.2.4 tell 172.33.2.17, length 28
# the round-trip packets look fine, so why the ARP requests now?
Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 172.33.2.4 is-at 00:00:00:b2:8f:1b, length 28
Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 172.33.2.17 tell 172.33.2.4, length 28
Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 10.111.240.224 tell 172.33.2.4, length 28
In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 172.33.2.17 is-at 00:00:00:fa:f1:34, length 28
In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 10.111.240.224 is-at 00:00:00:fa:f1:34, length 28
Cause: kube-proxy does not have masquerade enabled. Without it, a packet sent by the pod and processed by IPVS is not masqueraded to eth0's IP and MAC; only the MAC is rewritten. Because the eth1 NIC cannot forward this traffic outward in ipvlan mode, the packet leaves via eth0, i.e. eth0 emits a packet whose source IP and MAC are not its own, which breaks the connection.
For nodes with multiple NICs (macvlan, ipvlan, kube-ovn scenarios) where eth0 is the k8s management NIC that svc traffic depends on, this mode must be enabled.
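Whether kube-proxy is actually masquerading can be inspected on a node through the NAT objects it maintains (a sketch; KUBE-POSTROUTING and KUBE-CLUSTER-IP are the names kube-proxy normally creates in IPVS mode):
# the SNAT/masquerade rule lives in kube-proxy's KUBE-POSTROUTING chain
iptables -t nat -nvL KUBE-POSTROUTING
# in IPVS mode the cluster IP:port tuples are tracked in the KUBE-CLUSTER-IP ipset
ipset list KUBE-CLUSTER-IP | head -n 20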
For comparison, after enabling masquerade:
[root@(l2)k8s-master-1 ~]# grep masquerade -r /etc/kubernetes/
/etc/kubernetes/kubeadm-config.yaml: masqueradeAll: True
After switching to full masquerade via kubespray, the existing LB service did not pick up the change, so the test cluster was simply rebuilt.
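For reference, a minimal sketch of where the setting lives in a kubeadm-style kube-proxy config (field path per the KubeProxyConfiguration API; the kubespray variable driving it is assumed to be kube_proxy_masquerade_all):
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs
iptables:
  masqueradeAll: true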
svc info
[root@(l2)k8s-master-1 ~]# kubectl get svc | grep srvclb-ngnx
default srvclb-ngnx LoadBalancer 10.105.106.250 172.32.1.6 80:32244/TCP 17m app=hello,tier=frontend
[root@(l2)k8s-master-1 ~]# ipvsadm -ln | grep -A 2 10.105.106.250
TCP 10.105.106.250:80 rr
-> 172.33.2.17:80 Masq 1 0 0
-> 172.33.2.18:80 Masq 1 0 0
The backends are two nginx web pods.
[root@(l2)k8s-master-1 ~]# kubectl get pod -A -o wide| grep -E "172.33.2.17|172.33.2.18|172.33.2.7"
default loadbalancer-5554b69d95-tp778 1/1 Running 47m 172.33.2.18 k8s-worker-1
default loadbalancer-5554b69d95-wsk8k 1/1 Running 47m 172.33.2.17 k8s-worker-3
default sshd-k8s-master-1 1/1 Running 88m 172.33.2.7 k8s-master-1
sshd-k8s-master-1 is the test client.
sh-4.2# telnet 10.105.106.250 80
Trying 10.105.106.250...
Connected to 10.105.106.250.
Escape character is '^]'.
^]
## packet capture on the node hosting the pod
Packets captured while telnet connects successfully:
[root@(l2)k8s-master-1 ~]# tcpdump -i any host 10.105.106.250 or 172.33.2.7 or 172.33.2.17 or 172.33.2.18 -netvv
Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 76: (tos 0x10, ttl 64, id 42606, offset 0, flags [DF], proto TCP (6), length 60)
172.33.2.7.53432 > 10.105.106.250.http: Flags [S], cksum 0x23ba (incorrect -> 0xfed3), seq 292962656, win 65280, options [mss 1360,sackOK,TS val 582082675 ecr 0,nop,wscale 7], length 0
In 00:00:00:fa:f1:34 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 62, id 0, offset 0, flags [DF], proto TCP (6), length 60)
10.105.106.250.http > 172.33.2.7.53432: Flags [S.], cksum 0xd8ba (correct), seq 4218169578, ack 292962657, win 64704, options [mss 1360,sackOK,TS val 2603772094 ecr 582082675,nop,wscale 7], length 0
Out 00:00:00:b2:8f:1b ethertype IPv4 (0x0800), length 68: (tos 0x10, ttl 64, id 42607, offset 0, flags [DF], proto TCP (6), length 52)
172.33.2.7.53432 > 10.105.106.250.http: Flags [.], cksum 0x23b2 (incorrect -> 0x01e5), seq 1, ack 1, win 510, options [nop,nop,TS val 582082676 ecr 2603772094], length 0
# ARP-related packets below
In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 172.33.2.7 tell 172.33.192.15, length 28
Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 172.33.2.7 is-at 00:00:00:b2:8f:1b, length 28
P 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 172.33.0.1 tell 172.33.2.18, length 28
Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 10.105.106.250 tell 172.33.2.7, length 28
In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 10.105.106.250 is-at 00:00:00:fa:f1:34, length 28
In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 172.33.2.7 tell 172.33.192.15, length 28
Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 172.33.2.7 is-at 00:00:00:b2:8f:1b, length 28
In 00:00:00:fa:f1:34 ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Request who-has 172.33.2.7 tell 172.33.192.15, length 28
Out 00:00:00:b2:8f:1b ethertype ARP (0x0806), length 44: Ethernet (len 6), IPv4 (len 4), Reply 172.33.2.7 is-at 00:00:00:b2:8f:1b, length 28
The primary issue, pods probabilistically failing to reach a svc, shows up in two cases:
- the svc backends are host-network pods, e.g. curl -k https://kubernetes:443/livez?verbose
- the svc backends are pods whose IPs are in the same subnet as the node's eth1, e.g. a self-built svc
Scenario 1: with kube-proxy full masquerade disabled, run curl -k https://kubernetes:443/livez?verbose from inside the pod.
It succeeds only occasionally and fails most of the time, while the same request from the node always works.
With the kubernetes svc having three backends, the success rate from inside the pod is 1/3, while from the node it is 100% (a small loop for measuring this is sketched below).
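A simple way to measure the rate (a sketch run from inside the test pod, assuming curl is available there; attempt count and timeout are arbitrary):
# count how many of 30 attempts get a response from the apiserver svc
ok=0
for i in $(seq 1 30); do
  curl -sk -m 2 https://kubernetes:443/livez?verbose >/dev/null && ok=$((ok+1))
done
echo "succeeded: $ok/30"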
Cause: there are two gateways and no masquerading. The pod's request has to reach the svc backend's node via eth0's gateway, but the reply is sent straight back to the pod. When a node replies to the pod, the gateway's forwarding is inconsistent: the packet appears to end up on a random node, so the pod only occasionally receives the reply.
Summary: fundamentally, node-to-pod traffic is forwarded across gateways, and that forwarding is unreliable.
After enabling full masquerade, curl -k https://kubernetes:443/livez?verbose from inside the pod succeeds 100% of the time. The only remaining issue is a slow first request; thanks to the localdns cache, subsequent requests are fast.
Precondition: with kube-proxy full masquerade enabled, run the tests for the second case above (the self-built svc).
Normally, right after a pod is created, whether the node can ping the pod is hit-or-miss, but the pod can always (100%) ping the node.
For a while after the pod has pinged the node, the node can also ping the pod 100% of the time; i.e. during that window the gateway knows where the pod is and forwards packets correctly.
Case 1: when the node cannot ping the pod, pod access to the svc is probabilistic, almost exactly like the behavior with kube-proxy full masquerade disabled.
With the custom svc having two backends, the success rate from inside the pod is 1/2, while from the node it is 0.
Cause:
Tracking the conntrack table shows:
When the pod pings the svc, no new conntrack entry is created, i.e. no connection is established via the cluster IP.
When the node pings the svc, a new conntrack entry is created; but since the node cannot ping the pod, access still fails.
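These conntrack observations can be reproduced with the conntrack CLI (a sketch; requires conntrack-tools, and the address is the custom svc's cluster IP from this test):
# list entries whose original destination is the cluster IP
conntrack -L -d 10.105.106.250
# or stream new entries as they are created while running the test
conntrack -E -d 10.105.106.250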
Case 2: when the node can ping the pod, pod access to the svc succeeds 100% of the time.
Keep the pods pinging the node.
(screenshot) At this point there are two ICMP conntrack entries: the svc's two backend pods keep pinging master1.
(screenshot) Meanwhile, accessing the cluster IP from master1 succeeds 100% of the time.
Wait for the node-side conntrack entries to expire, then test access from the pod again.
Test: pod access to the custom svc.
The success rate is consistently 1/2, yet no new conntrack entries appear at all; the pod's traffic to the svc apparently never goes through kube-proxy.
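Whether a request was actually handled by IPVS can also be checked from its connection table (a sketch; -c lists IPVS connection entries):
# list IPVS connection entries involving the custom svc's cluster IP
ipvsadm -lnc | grep 10.105.106.250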
Fix: remove the IP from eth1 on the node, and the problem disappears completely. Access from inside the pod succeeds 100% of the time, and node access now creates new conntrack entries.
Additional tests
If the custom svc has only one backend pod, pod access to the svc always succeeds.
Continuously tracking the ARP table shows the MAC for a given IP being refreshed at a visibly rapid pace, but without conflicts or mix-ups: each IP always maps to a single MAC.
(screenshot) Packet captures show that the requests triggering the ARP refresh are sent by the local eth1 NIC.
(screenshot) The ARP broadcasts originate from eth1.
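The churn can be followed with the neighbor table and a capture on eth1 (a sketch; the interface name is the one used in this environment):
# watch the neighbor entries on eth1 refresh in real time
watch -n 1 'ip neigh show dev eth1'
# capture the ARP requests that eth1 itself sends out
tcpdump -i eth1 -ne arp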
Because eth1 acts as the ipvlan master, it cannot reach the outside world or the local pods itself.
In other words, this NIC provides no connectivity at all, yet it still triggers ARP updates.
Fix: remove the IP and the routes from this NIC, effectively disabling it.
When the NIC is disabled by running ip addr flush dev eth1, all of eth1's ARP entries are cleared.
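Besides flushing the addresses, any routes still bound to eth1 should be removed as well (a sketch; the destination in the delete example is hypothetical and should be taken from the actual routing table):
# list routes still attached to eth1 before removing them
ip route show dev eth1
# example removal of a leftover route (destination shown is hypothetical)
ip route del 172.33.0.0/16 dev eth1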
Access from inside a pod to a svc with multiple backend pods also responds normally 100% of the time.
Reference: IPVS/LVS SNAT analysis https://blog.dianduidian.com/post/lvs-snat%E5%8E%9F%E7%90%86%E5%88%86%E6%9E%90/