Docker study notes 5: flannel vxlan implementation of o

2019-12-27  董泽润

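TL;DR An overlay implemented over udp performs poorly because every packet has to be copied back and forth between kernel space and user space, whereas the vxlan implementation runs entirely in kernel space. Judging from what others have written, the major cloud vendors also build their vpc on vxlan-like technology, though they may have offloaded the encapsulation and decapsulation work to hardware.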

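How vxlan works

I recommend first reading this article on how vxlan works, which explains it clearly and accessibly. vxlan needs a reasonably recent kernel; this experiment runs on ubuntu 18.04 with kernel 4.15, which is sufficient. In short: vxlan is a framework for overlaying virtualized layer 2 networks over layer 3 networks. In other words, a virtual large layer 2 network built on top of a layer 3 network is called vxlan. Many public cloud vpc implementations use very similar technology. A few years ago 左耳朵耗子 wrote two articles, "关于阿里云经典网络的问题" and "科普一下公有云的网络", which stirred up quite a debate.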

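vxlan encapsulation
The figure above shows the vxlan encapsulation structure. The white block, Original L2 Frame, is the ordinary network frame; the outer layers (Outer Mac Header, Outer IP Header, UDP Header) belong to the network packet of the underlying physical hosts.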
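To make the layout a bit more concrete, here is a minimal Python sketch of the 8-byte vxlan header that sits between the outer UDP Header and the Original L2 Frame. It follows the field layout in RFC 7348 (a flags byte with the I bit 0x08, 24 reserved bits, a 24-bit VNI, 8 reserved bits). The function names `pack_vxlan_header` and `unpack_vni` are made up for this illustration and do not come from any library.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI (hypothetical helper)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits)
    # Word 2: VNI (24 bits) + reserved (8 bits)
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Return the VNI from an 8-byte VXLAN header, checking the I flag."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("I flag not set, VNI is not valid")
    return word2 >> 8


header = pack_vxlan_header(1)
print(header.hex())        # 0800000000000100
print(unpack_vni(header))  # 1
```

The whole header is only two 32-bit words, which is part of why the encapsulation can be done cheaply in the kernel or in hardware.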

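Test experiment

As mentioned in the previous post, Alibaba Cloud's network is itself a vpc, so it cannot be used for this test. The experiment uses VirtualBox virtual machines instead.

Testing vxlan

1. Start etcd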

/usr/bin/etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379

Then configure flannel with 172.17.0.0/16, a large class B sized network, and set the backend type to vxlan

etcdctl set /coreos.com/network/config '{ "Network": "172.17.0.0/16", "Backend": {"Type": "vxlan"}}'
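As a rough illustration of what this config implies, the sketch below parses the same JSON and splits the network into per-host blocks. It assumes flannel's default `SubnetLen` of 24, which is consistent with the /24 leases that appear in this experiment; the variable names are only for the example.

```python
import ipaddress
import json

config = json.loads('{ "Network": "172.17.0.0/16", "Backend": {"Type": "vxlan"}}')

network = ipaddress.ip_network(config["Network"])
subnet_len = config.get("SubnetLen", 24)  # assumption: flannel's default of /24

subnets = list(network.subnets(new_prefix=subnet_len))
print(len(subnets))               # 256 blocks of /24 fit in the /16
print(subnets[4], subnets[22])    # 172.17.4.0/24 172.17.22.0/24
print(config["Backend"]["Type"])  # vxlan
```

Each flanneld instance leases one of these blocks for the containers on its host.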

2. Start flanneld

./flanneld-amd64 -etcd-endpoints=http://192.168.43.161:2379 -etcd-prefix=/coreos.com/network -v=3 -etcd-username="" > /var/log/flanneld 2>&1 &

Check the flanneld startup log

root@ubuntu1:~# tail -f /var/log/flanneld
I1226 09:24:26.533665    2930 main.go:317] Wrote subnet file to /run/flannel/subnet.env
I1226 09:24:26.533673    2930 main.go:321] Running backend.
I1226 09:24:26.559376    2930 vxlan_network.go:60] watching for new subnet leases
I1226 09:24:26.559745    2930 main.go:429] Waiting for 22h59m59.973172363s to renew lease
I1226 09:24:26.569684    2930 iptables.go:145] Some iptables rules are missing; deleting and recreating rules
I1226 09:24:26.569705    2930 iptables.go:167] Deleting iptables rule: -s 172.17.0.0/16 -j ACCEPT
I1226 09:24:26.570685    2930 iptables.go:167] Deleting iptables rule: -d 172.17.0.0/16 -j ACCEPT
I1226 09:24:26.571397    2930 iptables.go:155] Adding iptables rule: -s 172.17.0.0/16 -j ACCEPT
I1226 09:24:26.572893    2930 iptables.go:155] Adding iptables rule: -d 172.17.0.0/16 -j ACCEPT
root@ubuntu1:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:50:03:fc brd ff:ff:ff:ff:ff:ff
    inet 192.168.43.161/24 brd 192.168.43.255 scope global dynamic enp0s3
       valid_lft 3270sec preferred_lft 3270sec
    inet6 2409:8900:1d61:1ab8:a00:27ff:fe50:3fc/64 scope global dynamic mngtmpaddr noprefixroute
       valid_lft 3265sec preferred_lft 3265sec
    inet6 fe80::a00:27ff:fe50:3fc/64 scope link
       valid_lft forever preferred_lft forever
3: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether d2:89:f1:a4:8a:27 brd ff:ff:ff:ff:ff:ff
    inet 172.17.4.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::d089:f1ff:fea4:8a27/64 scope link
       valid_lft forever preferred_lft forever

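Looking at the ip addresses, flannel.1 is actually a vtep device and it has its own mac address, d2:89:f1:a4:8a:27. This is one of the differences from the udp-based overlay.

3. Start docker

As in the previous post, docker has to be started with the docker_opts generated from flannel's output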

root@ubuntu1:~# ./mk-docker-opts.sh -i
root@ubuntu1:~# cat /run/docker_opts.env
DOCKER_OPT_BIP="--bip=172.17.4.1/24"
DOCKER_OPT_IPMASQ="--ip-masq=true"
DOCKER_OPT_MTU="--mtu=1450"

The docker bridge address is set to 172.17.4.1/24. Also note that the mtu is set to 1450. Why? Because vxlan encapsulation costs an extra 50 bytes.
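The 50 bytes can be accounted for header by header. The sketch below assumes an IPv4 underlay with no IP options and no VLAN tag on the inner frame; the constant names are only for illustration.

```python
PHYSICAL_MTU = 1500          # mtu of enp0s3 in the output above

OUTER_IPV4_HEADER = 20       # outer IP header, no options
OUTER_UDP_HEADER = 8
VXLAN_HEADER = 8
INNER_ETHERNET_HEADER = 14   # the inner frame's own MAC header travels as payload

overhead = (OUTER_IPV4_HEADER + OUTER_UDP_HEADER
            + VXLAN_HEADER + INNER_ETHERNET_HEADER)
overlay_mtu = PHYSICAL_MTU - overhead

print(overhead)      # 50
print(overlay_mtu)   # 1450
```

If the container mtu were left at 1500, full-size packets would no longer fit in a single frame on the physical NIC after encapsulation.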

root@ubuntu1:~# cat /lib/systemd/system/docker.service
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
EnvironmentFile=/run/docker_opts.env
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --exec-opt native.cgroupdriver=systemd $DOCKER_OPT_BIP $DOCKER_OPT_IPMASQ $DOCKER_OPT_MTU

Then start docker and check the ip addresses

root@ubuntu1:~# systemctl start docker
root@ubuntu1:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:50:03:fc brd ff:ff:ff:ff:ff:ff
    inet 192.168.43.161/24 brd 192.168.43.255 scope global dynamic enp0s3
       valid_lft 2887sec preferred_lft 2887sec
    inet6 2409:8900:1d61:1ab8:a00:27ff:fe50:3fc/64 scope global dynamic mngtmpaddr noprefixroute
       valid_lft 3247sec preferred_lft 3247sec
    inet6 fe80::a00:27ff:fe50:3fc/64 scope link
       valid_lft forever preferred_lft forever
3: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether d2:89:f1:a4:8a:27 brd ff:ff:ff:ff:ff:ff
    inet 172.17.4.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::d089:f1ff:fea4:8a27/64 scope link
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:44:fa:4a:1a brd ff:ff:ff:ff:ff:ff
    inet 172.17.4.1/24 brd 172.17.4.255 scope global docker0
       valid_lft forever preferred_lft forever

Perform the same steps on both test machines, then start a docker container on each host

root@ubuntu1:~# docker run -it myubuntu /bin/bash
root@afbc9a0329ec:/# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
5: eth0@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:04:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.17.4.2/24 brd 172.17.4.255 scope global eth0
       valid_lft forever preferred_lft forever

Check the container addresses: they are 172.17.4.2 and 172.17.22.2

4. Check the etcd data

root@ubuntu1:~# etcdctl ls /coreos.com/network/subnets
/coreos.com/network/subnets/172.17.4.0-24
/coreos.com/network/subnets/172.17.22.0-24
root@ubuntu1:~# etcdctl get /coreos.com/network/subnets/172.17.4.0-24
{"PublicIP":"192.168.43.161","BackendType":"vxlan","BackendData":{"VtepMAC":"d2:89:f1:a4:8a:27"}}
root@ubuntu1:~# etcdctl get /coreos.com/network/subnets/172.17.22.0-24
{"PublicIP":"192.168.43.222","BackendType":"vxlan","BackendData":{"VtepMAC":"d6:d8:a7:a2:7f:4a"}}

This is the key point: the mac address of the flannel.1 interface on each host has already been reported to etcd
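To show what another host can derive from these entries, here is a small Python sketch that turns each key/value pair into a subnet, a host IP, and a vtep mac. The data is copied from the etcdctl output above; `parse_lease` is a hypothetical helper written for this note, not part of flannel.

```python
import ipaddress
import json

# Key/value pairs copied from the etcdctl output above.
leases = {
    "/coreos.com/network/subnets/172.17.4.0-24":
        '{"PublicIP":"192.168.43.161","BackendType":"vxlan",'
        '"BackendData":{"VtepMAC":"d2:89:f1:a4:8a:27"}}',
    "/coreos.com/network/subnets/172.17.22.0-24":
        '{"PublicIP":"192.168.43.222","BackendType":"vxlan",'
        '"BackendData":{"VtepMAC":"d6:d8:a7:a2:7f:4a"}}',
}


def parse_lease(key: str, value: str) -> dict:
    """Turn one etcd entry into subnet, host IP, and VTEP MAC."""
    # The last path element encodes the CIDR with "-" in place of "/".
    subnet = ipaddress.ip_network(key.rsplit("/", 1)[1].replace("-", "/"))
    data = json.loads(value)
    return {
        "subnet": subnet,
        "public_ip": ipaddress.ip_address(data["PublicIP"]),
        "vtep_mac": data["BackendData"]["VtepMAC"],
    }


table = [parse_lease(k, v) for k, v in leases.items()]
for entry in table:
    print(entry["subnet"], entry["public_ip"], entry["vtep_mac"])
```

These three values per host are all that is needed to reach a remote subnet: which subnet it is, which mac to address the inner frame to, and which host IP to send the outer packet to.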

5. About iptables

docker enables nat by default after it starts. It is not actually needed here, so it can simply be turned off

iptables -t nat -F

6. Test ping between containers

root@afbc9a0329ec:/# ping 172.17.22.2
PING 172.17.22.2 (172.17.22.2) 56(84) bytes of data.
64 bytes from 172.17.22.2: icmp_seq=1 ttl=62 time=0.509 ms
64 bytes from 172.17.22.2: icmp_seq=2 ttl=62 time=0.821 ms
64 bytes from 172.17.22.2: icmp_seq=3 ttl=62 time=0.619 ms
64 bytes from 172.17.22.2: icmp_seq=4 ttl=62 time=0.607 ms
--- 172.17.22.2 ping statistics ---
12 packets transmitted, 12 received, 0% packet loss, time 11250ms
rtt min/avg/max/mdev = 0.458/0.666/0.877/0.135 ms

Only one example is shown here: pinging 172.17.22.2 from 172.17.4.2
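The ttl=62 in the replies fits a path with two routed hops, one on each host. The arithmetic below is a sketch that assumes the remote container sends with the usual Linux initial TTL of 64; that value is not shown in the output above.

```python
INITIAL_TTL = 64  # assumption: Linux default TTL for packets leaving the remote container

# Each host routes the packet between docker0 and flannel.1, costing one TTL each.
routed_hops = [
    "remote host (docker0 -> flannel.1)",
    "local host (flannel.1 -> docker0)",
]

ttl_at_receiver = INITIAL_TTL - len(routed_hops)
print(ttl_at_receiver)  # 62
```

The vxlan tunnel itself does not decrement the inner TTL, so the two containers appear to be only two hops apart regardless of the physical path between the hosts.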

7. Packet capture

Capture packets on both the vtep device and the physical NIC enp0s3, on each of the two hosts.

root@ubuntu1:~# tcpdump -n -e -v -i flannel.1
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
09:41:18.777756 d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 20925, offset 0, flags [DF], proto ICMP (1), length 84)
    172.17.4.2 > 172.17.22.2: ICMP echo request, id 27, seq 1, length 64
09:41:18.778317 d6:d8:a7:a2:7f:4a > d2:89:f1:a4:8a:27, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 60869, offset 0, flags [none], proto ICMP (1), length 84)
    172.17.22.2 > 172.17.4.2: ICMP echo reply, id 27, seq 1, length 64
09:41:19.784032 d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 21032, offset 0, flags [DF], proto ICMP (1), length 84)
    172.17.4.2 > 172.17.22.2: ICMP echo request, id 27, seq 2, length 64
09:41:19.784695 d6:d8:a7:a2:7f:4a > d2:89:f1:a4:8a:27, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 61095, offset 0, flags [none], proto ICMP (1), length 84)
    172.17.22.2 > 172.17.4.2: ICMP echo reply, id 27, seq 2, length 64
09:41:20.808877 d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 21260, offset 0, flags [DF], proto ICMP (1), length 84)
root@ubuntu2:~#  tcpdump -n -e -v -i flannel.1
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
09:41:18.811134 d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 20925, offset 0, flags [DF], proto ICMP (1), length 84)
    172.17.4.2 > 172.17.22.2: ICMP echo request, id 27, seq 1, length 64
09:41:18.811318 d6:d8:a7:a2:7f:4a > d2:89:f1:a4:8a:27, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 60869, offset 0, flags [none], proto ICMP (1), length 84)
    172.17.22.2 > 172.17.4.2: ICMP echo reply, id 27, seq 1, length 64
09:41:19.817451 d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 21032, offset 0, flags [DF], proto ICMP (1), length 84)
    172.17.4.2 > 172.17.22.2: ICMP echo request, id 27, seq 2, length 64
09:41:19.817528 d6:d8:a7:a2:7f:4a > d2:89:f1:a4:8a:27, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 61095, offset 0, flags [none], proto ICMP (1), length 84)
    172.17.22.2 > 172.17.4.2: ICMP echo reply, id 27, seq 2, length 64
09:41:20.842282 d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 21260, offset 0, flags [DF], proto ICMP (1), length 84)

These look no different from ordinary icmp packets, but there is one key point: the mac addresses were obtained from etcd and added automatically to the fdb (forwarding database)

root@ubuntu1:~# bridge fdb show flannel.1
33:33:00:00:00:01 dev enp0s3 self permanent
01:00:5e:00:00:01 dev enp0s3 self permanent
33:33:ff:50:03:fc dev enp0s3 self permanent
01:80:c2:00:00:00 dev enp0s3 self permanent
01:80:c2:00:00:03 dev enp0s3 self permanent
01:80:c2:00:00:0e dev enp0s3 self permanent
d6:d8:a7:a2:7f:4a dev flannel.1 dst 192.168.43.222 self permanent
33:33:00:00:00:01 dev docker0 self permanent
01:00:5e:00:00:01 dev docker0 self permanent
33:33:ff:fa:4a:1a dev docker0 self permanent
02:42:44:fa:4a:1a dev docker0 master docker0 permanent
02:42:44:fa:4a:1a dev docker0 vlan 1 master docker0 permanent
b2:c5:d9:1d:db:a4 dev veth4ce4c9b vlan 1 master docker0 permanent
b2:c5:d9:1d:db:a4 dev veth4ce4c9b master docker0 permanent
33:33:00:00:00:01 dev veth4ce4c9b self permanent
01:00:5e:00:00:01 dev veth4ce4c9b self permanent
33:33:ff:1d:db:a4 dev veth4ce4c9b self permanent
root@ubuntu1:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.43.1    0.0.0.0         UG    100    0        0 enp0s3
172.17.4.0      0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.17.22.0     172.17.22.0     255.255.255.0   UG    0      0        0 flannel.1
192.168.43.0    0.0.0.0         255.255.255.0   U     0      0        0 enp0s3
192.168.43.1    0.0.0.0         255.255.255.255 UH    100    0        0 enp0s3

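The host routing table also has an extra route through the flannel.1 vtep device. This is how flannel deals with L2 miss and L3 miss: it populates the entries ahead of time. Now look at the capture on the physical NIC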
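Before looking at that capture, the chain of lookups can be modeled in a few lines of Python. This is a simplified model built from the route and fdb outputs above, not flannel or kernel code. The neighbor (ARP) entry that maps the gateway 172.17.22.0 to the remote vtep mac is not shown in the outputs above and is an assumption here, although it is consistent with the destination mac seen on flannel.1.

```python
import ipaddress

# Overlay-related routes from `route -n`: (destination, gateway, device)
routes = [
    (ipaddress.ip_network("172.17.4.0/24"), None, "docker0"),
    (ipaddress.ip_network("172.17.22.0/24"), "172.17.22.0", "flannel.1"),
]
# Assumed neighbor (ARP) entry on flannel.1: gateway IP -> remote VTEP MAC
neighbors = {"172.17.22.0": "d6:d8:a7:a2:7f:4a"}
# FDB entry from `bridge fdb show`: VTEP MAC -> remote host IP
fdb = {"d6:d8:a7:a2:7f:4a": "192.168.43.222"}


def resolve(dst: str):
    """Return (device, inner dst MAC, outer dst IP) for an overlay destination."""
    addr = ipaddress.ip_address(dst)
    # 1. Longest-prefix route match picks the device and gateway.
    matches = [r for r in routes if addr in r[0]]
    if not matches:
        return None
    _, gateway, dev = max(matches, key=lambda r: r[0].prefixlen)
    if gateway is None:
        return (dev, None, None)  # local subnet, no encapsulation
    # 2. The neighbor table gives the inner destination MAC (remote VTEP).
    mac = neighbors[gateway]
    # 3. The FDB maps that MAC to the outer destination IP (remote host).
    return (dev, mac, fdb[mac])


print(resolve("172.17.22.2"))
# ('flannel.1', 'd6:d8:a7:a2:7f:4a', '192.168.43.222')
```

Because all three tables are filled in advance, the kernel never has to ask user space where a remote container lives.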

root@ubuntu1:~# tcpdump -n -e -v -i enp0s3 -T vxlan
09:57:23.194886 08:00:27:50:03:fc > 08:00:27:c5:a1:4f, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 21548, offset 0, flags [none], proto UDP (17), length 134)
    192.168.43.161.43077 > 192.168.43.222.8472: VXLAN, flags [I] (0x08), vni 1
d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 56153, offset 0, flags [DF], proto ICMP (1), length 84)
    172.17.4.2 > 172.17.22.2: ICMP echo request, id 32, seq 1, length 64
09:57:23.195311 08:00:27:c5:a1:4f > 08:00:27:50:03:fc, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 38400, offset 0, flags [none], proto UDP (17), length 134)
    192.168.43.222.57258 > 192.168.43.161.8472: VXLAN, flags [I] (0x08), vni 1
d6:d8:a7:a2:7f:4a > d2:89:f1:a4:8a:27, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 17834, offset 0, flags [none], proto ICMP (1), length 84)
    172.17.22.2 > 172.17.4.2: ICMP echo reply, id 32, seq 1, length 64
root@ubuntu2:~# tcpdump -n -e -v -i enp0s3 -T vxlan
09:57:23.239296 08:00:27:c5:a1:4f > 08:00:27:50:03:fc, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 38400, offset 0, flags [none], proto UDP (17), length 134)
    192.168.43.222.57258 > 192.168.43.161.8472: VXLAN, flags [I] (0x08), vni 1
d6:d8:a7:a2:7f:4a > d2:89:f1:a4:8a:27, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 17834, offset 0, flags [none], proto ICMP (1), length 84)
    172.17.22.2 > 172.17.4.2: ICMP echo reply, id 32, seq 1, length 64
09:57:24.244397 08:00:27:50:03:fc > 08:00:27:c5:a1:4f, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 21710, offset 0, flags [none], proto UDP (17), length 134)
    192.168.43.161.43077 > 192.168.43.222.8472: VXLAN, flags [I] (0x08), vni 1
d2:89:f1:a4:8a:27 > d6:d8:a7:a2:7f:4a, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 56326, offset 0, flags [DF], proto ICMP (1), length 84)
    172.17.4.2 > 172.17.22.2: ICMP echo request, id 32, seq 2, length 64

The capture on the physical NIC shows that the vni is currently 1, and the payload carried by vxlan is exactly the Original L2 Frame seen on flannel.1
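The lengths in the capture also agree with the encapsulation overhead. A small check, using numbers copied from the tcpdump output above (the header sizes assume an IPv4 underlay without VLAN tags):

```python
OUTER_FRAME_LEN = 148   # length reported for the frame on enp0s3
INNER_FRAME_LEN = 98    # length reported for the frame on flannel.1

OUTER_ETHERNET = 14
OUTER_IPV4 = 20
UDP = 8
VXLAN = 8

encap = OUTER_ETHERNET + OUTER_IPV4 + UDP + VXLAN
print(encap)                                        # 50
print(OUTER_FRAME_LEN - INNER_FRAME_LEN == encap)   # True

# The outer IP total length (134) is the outer frame minus its Ethernet header.
print(OUTER_FRAME_LEN - OUTER_ETHERNET)             # 134
```

The destination port 8472 in the capture is the Linux kernel's default vxlan port, which differs from the IANA-assigned port 4789.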

Summary

The networking is still somewhat complicated. Next I will look at how other network solutions are implemented.
