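Docker study notes 5: flannel vxlan implementation of o
TL;DR The udp-based overlay performs poorly because every packet has to be copied back and forth between kernel space and user space, whereas the vxlan implementation runs entirely in kernel space. According to other write-ups, the VPC implementations of the mainstream cloud vendors are also based on vxlan-like technology, though they may offload the encapsulation and decapsulation work to hardware.
How vxlan works
I recommend first reading the article "vxlan 原理" (vxlan principles), which explains the topic in depth yet accessibly. vxlan needs a fairly recent kernel; this experiment runs on Ubuntu 18.04 with kernel 4.15, which is sufficient. In short: vxlan is a framework for overlaying virtualized layer 2 networks over layer 3 networks. A large virtual layer 2 network built on top of a layer 3 network is what we call vxlan, and many public cloud VPC technologies are similar. A few years ago 左耳朵耗子 published two articles, "关于阿里云经典网络的问题" (on the problems of Alibaba Cloud's classic network) and "科普一下公有云的网络" (a primer on public cloud networking), which sparked quite a debate.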
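[Figure: vxlan encapsulation]
The figure shows the packet structure used by vxlan. The white block, Original L2 Frame, is the ordinary network frame. The Outer MAC Header, Outer IP Header and UDP Header around it belong to the packet on the underlying physical network. From this we can see:
- vxlan sends the Original L2 Frame over udp, so the transport does not need to be reliable
- vxlan wastes some bandwidth: every packet, whatever its size, costs an extra 50 bytes
- the VNID in the vxlan header is 3 bytes, so 16777216 (2^24) network identifiers, i.e. tenants, can be distinguished, which is plenty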
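The two numbers in the list above can be checked with a little arithmetic. The following is a minimal sketch, assuming an IPv4 underlay without VLAN tags or IP options; the constant and function names are made up for illustration.

```python
# Per-packet overhead added by vxlan encapsulation over IPv4 (no VLAN tags).
OUTER_ETHERNET = 14  # outer MAC header
OUTER_IPV4 = 20      # outer IP header without options
OUTER_UDP = 8        # UDP header
VXLAN_HEADER = 8     # flags + reserved bits + 24-bit VNI + reserved bits

OVERHEAD = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER

VNI_BITS = 24                # the VNID field is 3 bytes wide
VNI_COUNT = 2 ** VNI_BITS    # number of distinct identifiers


def overhead_ratio(inner_frame_len: int) -> float:
    """Fraction of the bytes on the wire that are encapsulation overhead."""
    return OVERHEAD / (inner_frame_len + OVERHEAD)


print("overhead bytes:", OVERHEAD)
print("distinct VNI values:", VNI_COUNT)
for size in (98, 512, 1464):  # example inner frame sizes in bytes
    print(size, f"{overhead_ratio(size):.1%}")
```

The relative cost is highest for small frames, which is why the fixed 50 bytes matter most for chatty, small-packet traffic.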
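Test experiment
As mentioned in the previous post, Alibaba Cloud's network is itself a VPC, so the test cannot be run there. The experiment uses VirtualBox virtual machines instead.
Testing vxlan
1. Start etcd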
/usr/bin/etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379
Then configure flannel with 172.17.0.0/16, a large class B sized range, and set the backend type to vxlan
etcdctl set /coreos.com/network/config '{ "Network": "172.17.0.0/16", "Backend": {"Type": "vxlan"}}'
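To get a feel for what this config implies, here is a small sketch that splits the network into per-host subnets. It assumes each host lease is a /24, which is what the leases later in these notes look like; SUBNET_LEN is a name chosen for this example, not something read from flannel.

```python
import ipaddress
import json

# The config value written to etcd in the command above.
config = json.loads('{ "Network": "172.17.0.0/16", "Backend": {"Type": "vxlan"}}')

network = ipaddress.ip_network(config["Network"])

# Assumption: the config sets no subnet length, so each host gets a /24.
SUBNET_LEN = 24
host_subnets = list(network.subnets(new_prefix=SUBNET_LEN))

print("backend:", config["Backend"]["Type"])
print("host subnets available:", len(host_subnets))
print("addresses per host subnet:", host_subnets[0].num_addresses)
```

Under that assumption the /16 leaves room for 256 hosts with 256 addresses each.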
2. Start flanneld
./flanneld-amd64 -etcd-endpoints=http://192.168.43.161:2379 -etcd-prefix=/coreos.com/network -v=3 -etcd-username="" > /var/log/flanneld 2>&1 &
Check the flanneld startup log
root@ubuntu1:~# tail -f /var/log/flanneld
I1226 09:24:26.533665 2930 main.go:317] Wrote subnet file to /run/flannel/subnet.env
I1226 09:24:26.533673 2930 main.go:321] Running backend.
I1226 09:24:26.559376 2930 vxlan_network.go:60] watching for new subnet leases
I1226 09:24:26.559745 2930 main.go:429] Waiting for 22h59m59.973172363s to renew lease
I1226 09:24:26.569684 2930 iptables.go:145] Some iptables rules are missing; deleting and recreating rules
I1226 09:24:26.569705 2930 iptables.go:167] Deleting iptables rule: -s 172.17.0.0/16 -j ACCEPT
I1226 09:24:26.570685 2930 iptables.go:167] Deleting iptables rule: -d 172.17.0.0/16 -j ACCEPT
I1226 09:24:26.571397 2930 iptables.go:155] Adding iptables rule: -s 172.17.0.0/16 -j ACCEPT
I1226 09:24:26.572893 2930 iptables.go:155] Adding iptables rule: -d 172.17.0.0/16 -j ACCEPT
root@ubuntu1:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:50:03:fc brd ff:ff:ff:ff:ff:ff
inet 192.168.43.161/24 brd 192.168.43.255 scope global dynamic enp0s3
valid_lft 3270sec preferred_lft 3270sec
inet6 2409:8900:1d61:1ab8:a00:27ff:fe50:3fc/64 scope global dynamic mngtmpaddr noprefixroute
valid_lft 3265sec preferred_lft 3265sec
inet6 fe80::a00:27ff:fe50:3fc/64 scope link
valid_lft forever preferred_lft forever
3: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether d2:89:f1:a4:8a:27 brd ff:ff:ff:ff:ff:ff
inet 172.17.4.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::d089:f1ff:fea4:8a27/64 scope link
valid_lft forever preferred_lft forever
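Looking at the addresses, flannel.1 is in fact a vtep device and has its own MAC address, d2:89:f1:a4:8a:27. This is one of the differences from the udp-based overlay.
3. Start docker
As in the previous post, docker has to be started with the docker_opts generated from the flannel lease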
root@ubuntu1:~# ./mk-docker-opts.sh -i
root@ubuntu1:~# cat /run/docker_opts.env
DOCKER_OPT_BIP="--bip=172.17.4.1/24"
DOCKER_OPT_IPMASQ="--ip-masq=true"
DOCKER_OPT_MTU="--mtu=1450"
The docker bridge address is set to 172.17.4.1/24. Also note that the mtu is set to 1450. Why? Because vxlan consumes an extra 50 bytes per packet.
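The mapping from the flannel lease to these docker options can be sketched roughly as follows. This is not the real mk-docker-opts.sh, which is a shell script with more options. The contents of /run/flannel/subnet.env are not shown in these notes, so the sample below is an assumption reconstructed from the generated options, and parse_env and docker_opts are hypothetical helper names.

```python
# Assumed contents of /run/flannel/subnet.env on ubuntu1 (not shown in the
# notes; reconstructed from the generated docker options above).
SUBNET_ENV = """\
FLANNEL_NETWORK=172.17.0.0/16
FLANNEL_SUBNET=172.17.4.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=false
"""


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, ignoring blank lines and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key] = value
    return env


def docker_opts(env: dict) -> dict:
    """Derive dockerd flags from the flannel lease."""
    # If flannel does not masquerade, let docker do it, and vice versa.
    docker_masq = "false" if env["FLANNEL_IPMASQ"] == "true" else "true"
    return {
        "DOCKER_OPT_BIP": f"--bip={env['FLANNEL_SUBNET']}",
        "DOCKER_OPT_IPMASQ": f"--ip-masq={docker_masq}",
        "DOCKER_OPT_MTU": f"--mtu={env['FLANNEL_MTU']}",
    }


opts = docker_opts(parse_env(SUBNET_ENV))
for name, value in opts.items():
    print(f'{name}="{value}"')
```

The point is only that the bridge address and the mtu come straight from the lease, so docker0 and the containers stay inside the subnet and packet size that flannel expects.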
root@ubuntu1:~# cat /lib/systemd/system/docker.service
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
EnvironmentFile=/run/docker_opts.env
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --exec-opt native.cgroupdriver=systemd $DOCKER_OPT_BIP $DOCKER_OPT_IPMASQ $DOCKER_OPT_MTU
Then start docker and check the addresses
root@ubuntu1:~# systemctl start docker
root@ubuntu1:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 08:00:27:50:03:fc brd ff:ff:ff:ff:ff:ff
inet 192.168.43.161/24 brd 192.168.43.255 scope global dynamic enp0s3
valid_lft 2887sec preferred_lft 2887sec
inet6 2409:8900:1d61:1ab8:a00:27ff:fe50:3fc/64 scope global dynamic mngtmpaddr noprefixroute
valid_lft 3247sec preferred_lft 3247sec
inet6 fe80::a00:27ff:fe50:3fc/64 scope link
valid_lft forever preferred_lft forever
3: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether d2:89:f1:a4:8a:27 brd ff:ff:ff:ff:ff:ff
inet 172.17.4.0/32 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::d089:f1ff:fea4:8a27/64 scope link
valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:44:fa:4a:1a brd ff:ff:ff:ff:ff:ff
inet 172.17.4.1/24 brd 172.17.4.255 scope global docker0
valid_lft forever preferred_lft forever
Repeat the same steps on both test machines, then start a docker container on each host
root@ubuntu1:~# docker run -it myubuntu /bin/bash
root@afbc9a0329ec:/# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
5: eth0@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 02:42:ac:11:04:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.17.4.2/24 brd 172.17.4.255 scope global eth0
valid_lft forever preferred_lft forever
The container addresses are 172.17.4.2 and 172.17.22.2 respectively
4. Check the data in etcd
root@ubuntu1:~# etcdctl ls /coreos.com/network/subnets
/coreos.com/network/subnets/172.17.4.0-24
/coreos.com/network/subnets/172.17.22.0-24
root@ubuntu1:~# etcdctl get /coreos.com/network/subnets/172.17.4.0-24
{"PublicIP":"192.168.43.161","BackendType":"vxlan","BackendData":{"VtepMAC":"d2:89:f1:a4:8a:27"}}
root@ubuntu1:~# etcdctl get /coreos.com/network/subnets/172.17.22.0-24
{"PublicIP":"192.168.43.222","BackendType":"vxlan","BackendData":{"VtepMAC":"d6:d8:a7:a2:7f:4a"}}
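This is the key point: the MAC address of the flannel.1 interface on each host has already been reported to etcd
5. About iptables
docker enables NAT by default when it starts. It is not needed for traffic between containers on the overlay, so the nat table is simply flushed here (this removes every rule in that table)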
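The lease records are plain JSON, so it is easy to see what a host can learn from them. Below is a minimal sketch that turns the two records above into a table of remote vteps; remote_vteps is a name invented for this example and is not flannel's own code.

```python
import json

# Lease records copied from the etcdctl output above.
leases = {
    "172.17.4.0-24": '{"PublicIP":"192.168.43.161","BackendType":"vxlan","BackendData":{"VtepMAC":"d2:89:f1:a4:8a:27"}}',
    "172.17.22.0-24": '{"PublicIP":"192.168.43.222","BackendType":"vxlan","BackendData":{"VtepMAC":"d6:d8:a7:a2:7f:4a"}}',
}


def remote_vteps(leases: dict, local_ip: str) -> dict:
    """Map each remote subnet to (VTEP MAC, host IP), skipping our own lease."""
    result = {}
    for key, raw in leases.items():
        record = json.loads(raw)
        if record["PublicIP"] == local_ip:
            continue
        subnet = key.replace("-", "/")  # etcd key "172.17.22.0-24" -> CIDR
        result[subnet] = (record["BackendData"]["VtepMAC"], record["PublicIP"])
    return result


vteps = remote_vteps(leases, local_ip="192.168.43.161")
print(vteps)
```

Seen from ubuntu1, the only remote entry is 172.17.22.0/24, reachable through the vtep d6:d8:a7:a2:7f:4a on host 192.168.43.222.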
iptables -t nat -F
6. Ping between the containers
root@afbc9a0329ec:/# ping 172.17.22.2
PING 172.17.22.2 (172.17.22.2) 56(84) bytes of data.
64 bytes from 172.17.22.2: icmp_seq=1 ttl=62 time=0.509 ms
64 bytes from 172.17.22.2: icmp_seq=2 ttl=62 time=0.821 ms
64 bytes from 172.17.22.2: icmp_seq=3 ttl=62 time=0.619 ms
64 bytes from 172.17.22.2: icmp_seq=4 ttl=62 time=0.607 ms
--- 172.17.22.2 ping statistics ---
12 packets transmitted, 12 received, 0% packet loss, time 11250ms
rtt min/avg/max/mdev = 0.458/0.666/0.877/0.135 ms
Only one example is shown here: pinging 172.17.22.2 from 172.17.4.2
7. Packet capture
Capture packets on the vtep device and on the physical NIC enp0s3 on both hosts.
root@ubuntu1:~# tcpdump -n -e -v -i flannel.1
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
09:41:18.777756 d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 20925, offset 0, flags [DF], proto ICMP (1), length 84)
172.17.4.2 > 172.17.22.2: ICMP echo request, id 27, seq 1, length 64
09:41:18.778317 d6:d8:a7:a2:7f:4a > d2:89:f1:a4:8a:27, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 60869, offset 0, flags [none], proto ICMP (1), length 84)
172.17.22.2 > 172.17.4.2: ICMP echo reply, id 27, seq 1, length 64
09:41:19.784032 d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 21032, offset 0, flags [DF], proto ICMP (1), length 84)
172.17.4.2 > 172.17.22.2: ICMP echo request, id 27, seq 2, length 64
09:41:19.784695 d6:d8:a7:a2:7f:4a > d2:89:f1:a4:8a:27, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 61095, offset 0, flags [none], proto ICMP (1), length 84)
172.17.22.2 > 172.17.4.2: ICMP echo reply, id 27, seq 2, length 64
09:41:20.808877 d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 21260, offset 0, flags [DF], proto ICMP (1), length 84)
root@ubuntu2:~# tcpdump -n -e -v -i flannel.1
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
09:41:18.811134 d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 20925, offset 0, flags [DF], proto ICMP (1), length 84)
172.17.4.2 > 172.17.22.2: ICMP echo request, id 27, seq 1, length 64
09:41:18.811318 d6:d8:a7:a2:7f:4a > d2:89:f1:a4:8a:27, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 60869, offset 0, flags [none], proto ICMP (1), length 84)
172.17.22.2 > 172.17.4.2: ICMP echo reply, id 27, seq 1, length 64
09:41:19.817451 d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 21032, offset 0, flags [DF], proto ICMP (1), length 84)
172.17.4.2 > 172.17.22.2: ICMP echo request, id 27, seq 2, length 64
09:41:19.817528 d6:d8:a7:a2:7f:4a > d2:89:f1:a4:8a:27, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 61095, offset 0, flags [none], proto ICMP (1), length 84)
172.17.22.2 > 172.17.4.2: ICMP echo reply, id 27, seq 2, length 64
09:41:20.842282 d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 21260, offset 0, flags [DF], proto ICMP (1), length 84)
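As you can see, these look no different from ordinary ICMP packets. The important detail is that the MAC addresses come from etcd and are added automatically to the fdb (forwarding database)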
root@ubuntu1:~# bridge fdb show flannel.1
33:33:00:00:00:01 dev enp0s3 self permanent
01:00:5e:00:00:01 dev enp0s3 self permanent
33:33:ff:50:03:fc dev enp0s3 self permanent
01:80:c2:00:00:00 dev enp0s3 self permanent
01:80:c2:00:00:03 dev enp0s3 self permanent
01:80:c2:00:00:0e dev enp0s3 self permanent
d6:d8:a7:a2:7f:4a dev flannel.1 dst 192.168.43.222 self permanent
33:33:00:00:00:01 dev docker0 self permanent
01:00:5e:00:00:01 dev docker0 self permanent
33:33:ff:fa:4a:1a dev docker0 self permanent
02:42:44:fa:4a:1a dev docker0 master docker0 permanent
02:42:44:fa:4a:1a dev docker0 vlan 1 master docker0 permanent
b2:c5:d9:1d:db:a4 dev veth4ce4c9b vlan 1 master docker0 permanent
b2:c5:d9:1d:db:a4 dev veth4ce4c9b master docker0 permanent
33:33:00:00:00:01 dev veth4ce4c9b self permanent
01:00:5e:00:00:01 dev veth4ce4c9b self permanent
33:33:ff:1d:db:a4 dev veth4ce4c9b self permanent
root@ubuntu1:~# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.43.1 0.0.0.0 UG 100 0 0 enp0s3
172.17.4.0 0.0.0.0 255.255.255.0 U 0 0 0 docker0
172.17.22.0 172.17.22.0 255.255.255.0 UG 0 0 0 flannel.1
192.168.43.0 0.0.0.0 255.255.255.0 U 0 0 0 enp0s3
192.168.43.1 0.0.0.0 255.255.255.255 UH 100 0 0 enp0s3
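Put together, the route and the fdb entry shown above are enough to decide where an outgoing packet goes. The sketch below walks through that lookup in plain Python. The neighbor (ARP) entry for the gateway 172.17.22.0 is not shown in these notes, so it is an assumption inferred from the destination MAC in the flannel.1 capture, and resolve is a hypothetical helper, not a kernel or flannel API.

```python
import ipaddress

# Values taken from the route and fdb output above.
ROUTES = [
    # (destination, gateway, device)
    ("172.17.4.0/24", None, "docker0"),
    ("172.17.22.0/24", "172.17.22.0", "flannel.1"),
]
FDB = {"d6:d8:a7:a2:7f:4a": "192.168.43.222"}  # vtep MAC -> remote host IP

# Assumption: neighbor entry for the gateway, inferred from the capture.
NEIGH = {"172.17.22.0": "d6:d8:a7:a2:7f:4a"}


def resolve(dst_ip: str):
    """Return (device, inner destination MAC, outer destination IP)."""
    addr = ipaddress.ip_address(dst_ip)
    # First match is enough here because the two prefixes do not overlap.
    for dest, gateway, dev in ROUTES:
        if addr in ipaddress.ip_network(dest):
            if gateway is None:
                return dev, None, None  # local subnet, no encapsulation
            mac = NEIGH[gateway]
            return dev, mac, FDB[mac]
    return None  # not an overlay destination


print(resolve("172.17.22.2"))
print(resolve("172.17.4.2"))
```

For 172.17.22.2 the result is flannel.1, inner destination MAC d6:d8:a7:a2:7f:4a and outer destination 192.168.43.222, which matches what the captures show.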
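The host routing table also has an extra route through the flannel.1 vtep device. This is how flannel deals with L2 miss and L3 miss: the entries are set up in advance. Next, the capture on the physical NIC: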
root@ubuntu1:~# tcpdump -n -e -v -i enp0s3 -T vxlan
09:57:23.194886 08:00:27:50:03:fc > 08:00:27:c5:a1:4f, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 21548, offset 0, flags [none], proto UDP (17), length 134)
192.168.43.161.43077 > 192.168.43.222.8472: VXLAN, flags [I] (0x08), vni 1
d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 56153, offset 0, flags [DF], proto ICMP (1), length 84)
172.17.4.2 > 172.17.22.2: ICMP echo request, id 32, seq 1, length 64
09:57:23.195311 08:00:27:c5:a1:4f > 08:00:27:50:03:fc, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 38400, offset 0, flags [none], proto UDP (17), length 134)
192.168.43.222.57258 > 192.168.43.161.8472: VXLAN, flags [I] (0x08), vni 1
d6:d8:a7:a2:7f:4a > d2:89:f1:a4:8a:27, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 17834, offset 0, flags [none], proto ICMP (1), length 84)
172.17.22.2 > 172.17.4.2: ICMP echo reply, id 32, seq 1, length 64
root@ubuntu2:~# tcpdump -n -e -v -i enp0s3 -T vxlan
09:57:23.239296 08:00:27:c5:a1:4f > 08:00:27:50:03:fc, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 38400, offset 0, flags [none], proto UDP (17), length 134)
192.168.43.222.57258 > 192.168.43.161.8472: VXLAN, flags [I] (0x08), vni 1
d6:d8:a7:a2:7f:4a > d2:89:f1:a4:8a:27, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 17834, offset 0, flags [none], proto ICMP (1), length 84)
172.17.22.2 > 172.17.4.2: ICMP echo reply, id 32, seq 1, length 64
09:57:24.244397 08:00:27:50:03:fc > 08:00:27:c5:a1:4f, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 21710, offset 0, flags [none], proto UDP (17), length 134)
192.168.43.161.43077 > 192.168.43.222.8472: VXLAN, flags [I] (0x08), vni 1
d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 56326, offset 0, flags [DF], proto ICMP (1), length 84)
172.17.4.2 > 172.17.22.2: ICMP echo request, id 32, seq 2, length 64
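The capture on the physical NIC shows that the current vni value is 1, that the outer UDP destination port is 8472, and that the payload carried by vxlan is exactly the Original L2 Frame seen on flannel.1
Summary
Networking is still somewhat complicated. Next I will look at how other network solutions are implemented.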
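The "flags [I] (0x08), vni 1" part of the tcpdump output corresponds to the 8-byte vxlan header. The sketch below packs and unpacks such a header with the standard struct module and checks the frame lengths reported in the capture. The function names are made up for this example, and the layout assumed is the usual one: 8 flag bits, 24 reserved bits, 24-bit VNI, 8 reserved bits.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag shown by tcpdump


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte header: flags, 24 reserved bits, VNI, 8 reserved bits."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vxlan_header(header: bytes):
    """Return (flags, vni) from an 8-byte vxlan header."""
    word1, word2 = struct.unpack("!II", header[:8])
    return word1 >> 24, word2 >> 8


header = pack_vxlan_header(1)
flags, vni = unpack_vxlan_header(header)
print(header.hex(), hex(flags), vni)

# Lengths reported by tcpdump in the capture above.
INNER_FRAME = 98       # inner Ethernet frame
OUTER_IP_LENGTH = 134  # outer IP total length
OUTER_FRAME = 148      # outer Ethernet frame
assert OUTER_IP_LENGTH == 20 + 8 + len(header) + INNER_FRAME
assert OUTER_FRAME == 14 + OUTER_IP_LENGTH
```

The two assertions confirm the 50-byte overhead: 148 bytes on the physical NIC for a 98-byte frame on flannel.1.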