Severe packet loss during VM live migration in an OVN environment
2021-10-12
LC0127
Problem description
During VM live migration, a ping to the VM loses five or more packets.
[root@node-4 ~]# ping 172.47.0.21
PING 172.47.0.21 (172.47.0.21) 56(84) bytes of data.
......
64 bytes from 172.47.0.21: icmp_seq=82 ttl=63 time=0.307 ms
64 bytes from 172.47.0.21: icmp_seq=83 ttl=63 time=0.340 ms
64 bytes from 172.47.0.21: icmp_seq=89 ttl=63 time=1.22 ms
64 bytes from 172.47.0.21: icmp_seq=90 ttl=63 time=0.413 ms
......
--- 172.47.0.21 ping statistics ---
95 packets transmitted, 90 received, 5% packet loss, time 96206ms
rtt min/avg/max/mdev = 0.282/0.483/3.467/0.532 ms
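To see which replies were lost, the gaps in icmp_seq can be listed from the ping output. The following is a minimal sketch, not part of the original investigation; the helper name missing_icmp_seqs is hypothetical, and the sample lines are copied from the output above.

```python
import re


def missing_icmp_seqs(ping_output):
    """Return the icmp_seq values missing between the first and last reply."""
    seen = {int(s) for s in re.findall(r"icmp_seq=(\d+)", ping_output)}
    if not seen:
        return []
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]


# Reply lines taken from the ping output in this post.
sample = """\
64 bytes from 172.47.0.21: icmp_seq=82 ttl=63 time=0.307 ms
64 bytes from 172.47.0.21: icmp_seq=83 ttl=63 time=0.340 ms
64 bytes from 172.47.0.21: icmp_seq=89 ttl=63 time=1.22 ms
64 bytes from 172.47.0.21: icmp_seq=90 ttl=63 time=0.413 ms
"""

print(missing_icmp_seqs(sample))  # [84, 85, 86, 87, 88]
```

With ping's default one-second interval, five consecutive missing sequence numbers correspond to an outage of roughly five seconds.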
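Live migration process:
- Create the VM's tap device on the destination node
- Once the NIC is up, copy the memory of the VM process
- After the migration completes, the source node deletes the VM's tap device
- Call ovs-vsctl to delete the port record from the ovsdb database on the source node
- Call the neutron client to update the port's binding host to the destination node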
_post_live_migration
  |- post_live_migration_at_source
     |- unplug_vifs (delete the vif on the source node)
  |- post_live_migration_at_destination
     |- migrate_instance_finish
        |- _update_port_binding_for_instance (calls the neutron client to run a port update that changes binding:host_id)
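Analysis:
We set the vlog level of the ovn-controller binding module to debug and captured packets. The packet loss falls exactly in the window between the source node releasing the lport and the destination node successfully claiming it.
/* Source node */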
2021-10-11T08:00:40.691Z|28327|binding|INFO|Releasing lport 84282c8b-0002-47b1-a5c0-a7947cb795ca from this chassis.
/* Destination node */
2021-10-11T07:59:35.690Z|31303|binding|INFO|Not claiming lport 84282c8b-0002-47b1-a5c0-a7947cb795ca, chassis 63c4fb22-b817-4c84-9594-a8c554b8de46 requested-chassis node-3.domain.tld
2021-10-11T08:00:40.711Z|31304|binding|INFO|Not claiming lport 84282c8b-0002-47b1-a5c0-a7947cb795ca, chassis 63c4fb22-b817-4c84-9594-a8c554b8de46 requested-chassis node-3.domain.tld
2021-10-11T08:00:49.621Z|31305|binding|INFO|Claiming lport 84282c8b-0002-47b1-a5c0-a7947cb795ca for this chassis.
2021-10-11T08:00:49.621Z|31306|binding|INFO|84282c8b-0002-47b1-a5c0-a7947cb795ca: Claiming fa:16:3e:f8:c5:5d 192.168.222.112
[Figure: count_pkt_lose.png]
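The length of the window between release and claim can be computed from the two logs. This is a rough sketch under the assumption that both nodes' clocks are synchronized; the regular expression and the function names are illustrative and only cover the log lines quoted above.

```python
import re
from datetime import datetime

# Matches "Releasing lport <id>" / "Claiming lport <id>" binding log lines.
# "Not claiming lport" lines do not match because of the text after "INFO|".
LOG_RE = re.compile(
    r"^(\S+)Z\|\d+\|binding\|INFO\|(Releasing|Claiming) lport (\S+)"
)


def parse_event(line):
    """Return (timestamp, action, lport) or None for unrelated lines."""
    m = LOG_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
    return ts, m.group(2), m.group(3)


def blackout_seconds(source_lines, dest_lines, lport):
    """Seconds between the source release and the destination claim."""
    events = [parse_event(l) for l in list(source_lines) + list(dest_lines)]
    events = [e for e in events if e and e[2] == lport]
    released = min(ts for ts, action, _ in events if action == "Releasing")
    claimed = min(
        ts for ts, action, _ in events
        if action == "Claiming" and ts >= released
    )
    return (claimed - released).total_seconds()


LPORT = "84282c8b-0002-47b1-a5c0-a7947cb795ca"
source_log = [
    "2021-10-11T08:00:40.691Z|28327|binding|INFO|Releasing lport "
    + LPORT + " from this chassis.",
]
dest_log = [
    "2021-10-11T08:00:40.711Z|31304|binding|INFO|Not claiming lport "
    + LPORT + ", chassis 63c4fb22-b817-4c84-9594-a8c554b8de46 "
    "requested-chassis node-3.domain.tld",
    "2021-10-11T08:00:49.621Z|31305|binding|INFO|Claiming lport "
    + LPORT + " for this chassis.",
]

print(blackout_seconds(source_log, dest_log, LPORT))
```

For the log lines quoted above, the release is at 08:00:40.691 and the claim at 08:00:49.621, a gap of about 8.93 seconds.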
The part of the ovn-controller code that releases and claims an lport:
bool
binding_handle_ovs_interface_changes(struct binding_ctx_in *b_ctx_in,
                                     struct binding_ctx_out *b_ctx_out)
{
    ...
        const char *iface_id = smap_get(&iface_rec->external_ids, "iface-id");
        const char *old_iface_id = smap_get(b_ctx_out->local_iface_ids,
                                            iface_rec->name);
        const char *cleared_iface_id = NULL;
        if (!ovsrec_interface_is_deleted(iface_rec)) {
            int64_t ofport = iface_rec->n_ofport ? *iface_rec->ofport : 0;
            if (iface_id) {
                /* Check if iface_id is changed. If so we need to
                 * release the old port binding and associate this
                 * interface to new port binding. */
                if (old_iface_id && strcmp(iface_id, old_iface_id)) {
                    cleared_iface_id = old_iface_id;
                } else if (ofport <= 0) {
                    /* If ofport is <= 0, we need to release the iface if
                     * already claimed. */
                    cleared_iface_id = iface_id;
                }
            } else if (old_iface_id) {
                cleared_iface_id = old_iface_id;
            }
        } else {
            cleared_iface_id = iface_id;
        }
        if (cleared_iface_id) {
            handled = consider_iface_release(iface_rec, cleared_iface_id,
                                             b_ctx_in, b_ctx_out);
        }
Debugging the controller with gdb showed that ofport was -1. According to the code above, the lport is released whenever ofport <= 0.
[Figure: debug]
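The branch structure of the excerpt can be restated as a small Python model to make the ofport = -1 case explicit. This is only an illustrative port of the quoted C fragment, not OVN code; the function name iface_id_to_release and the example iface-id value are hypothetical.

```python
def iface_id_to_release(deleted, iface_id, old_iface_id, ofport):
    """Return the iface-id that would be released, or None.

    Mirrors the if/else chain of the quoted C excerpt:
    deleted      -> ovsrec_interface_is_deleted(iface_rec)
    iface_id     -> external_ids:iface-id of the Interface row (or None)
    old_iface_id -> iface-id previously recorded for this interface (or None)
    ofport       -> ofport of the Interface row (None is treated as 0)
    """
    if deleted:
        return iface_id
    ofport = 0 if ofport is None else ofport
    if iface_id is not None:
        if old_iface_id is not None and iface_id != old_iface_id:
            return old_iface_id  # iface-id changed: release the old binding
        if ofport <= 0:
            return iface_id      # no valid OpenFlow port: release
        return None
    return old_iface_id          # iface-id removed: release the old binding


PORT = "example-iface-id"  # hypothetical value

# The tap device disappeared but the Interface row still exists: ofport is -1.
print(iface_id_to_release(False, PORT, PORT, -1))
# A healthy interface with a valid ofport is not released.
print(iface_id_to_release(False, PORT, PORT, 1493))
```

The first call returns the iface-id, which is the release path observed in gdb; the second returns None.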
The check that decides whether an lport can be claimed:
static bool
can_bind_on_this_chassis(const struct sbrec_chassis *chassis_rec,
                         const char *requested_chassis)
{
    return !requested_chassis || !requested_chassis[0]
           || !strcmp(requested_chassis, chassis_rec->name)
           || !strcmp(requested_chassis, chassis_rec->hostname);
}
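The same check, rewritten as a Python sketch, shows why the destination logs "Not claiming lport" until requested-chassis is updated. The chassis name and the requested-chassis value come from the logs above; the destination hostname node-dst.domain.tld is a made-up placeholder, since the post does not state it.

```python
def can_bind_on_this_chassis(chassis_name, chassis_hostname, requested_chassis):
    """Python restatement of the quoted C check.

    A port may be bound when requested_chassis is unset or empty, or when it
    equals either the chassis name or the chassis hostname.
    """
    return (
        not requested_chassis
        or requested_chassis == chassis_name
        or requested_chassis == chassis_hostname
    )


DEST_NAME = "63c4fb22-b817-4c84-9594-a8c554b8de46"  # from the log above
DEST_HOSTNAME = "node-dst.domain.tld"               # hypothetical placeholder

# requested-chassis still points at node-3, so the destination cannot claim.
print(can_bind_on_this_chassis(DEST_NAME, DEST_HOSTNAME, "node-3.domain.tld"))
# Once requested-chassis points at the destination, it can claim.
print(can_bind_on_this_chassis(DEST_NAME, DEST_HOSTNAME, DEST_HOSTNAME))
```

Under these assumptions the first call returns False and the second returns True.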
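Trying to reproduce an Interface with ofport -1:
- Create a tap device
- Attach the tap device to br-int
- Delete the tap device with an iproute2 command
At this point the ofport field of the interface is -1: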
[root@node-1 ~]# ovs-vsctl list Interface | grep --color -C 10 lc-tap
...
error : "could not open network device lc-tap (No such device)"
external_ids : {}
...
name : lc-tap
ofport : -1
...
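The reproduction steps above could be scripted roughly as follows. The post does not give the exact commands, so the ip and ovs-vsctl invocations here are an assumption about what was run; only the names lc-tap and br-int come from the output above. The script defaults to a dry run that prints the commands, since actually running them needs root and a host with Open vSwitch.

```python
import subprocess


def repro_commands(bridge="br-int", dev="lc-tap"):
    """Commands assumed to reproduce an Interface row with ofport -1."""
    return [
        ["ip", "tuntap", "add", "dev", dev, "mode", "tap"],   # create tap
        ["ovs-vsctl", "add-port", bridge, dev],               # attach to bridge
        ["ip", "tuntap", "del", "dev", dev, "mode", "tap"],   # delete tap only
        ["ovs-vsctl", "get", "Interface", dev, "ofport"],     # inspect ofport
    ]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


run(repro_commands())  # dry run: only prints the four commands
```

The key point is that the tap device is removed from the kernel while its port record stays in ovsdb, which is the state the Interface output above shows.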
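Determining when the device is deleted
We added log statements before nova performs the unplug, ran a live migration, and compared the vswitchd and nova logs. vswitchd had already deleted the interface before nova performed the unplug. Note that the nova timestamps appear to be local time (UTC+8), while the vswitchd timestamps are UTC.
nova log: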
2021-10-27 16:07:01.368 28641 INFO nova.virt.libvirt.driver [req-8b6d0c19-8088-4b91-94d6-bc8e037ac010 cf354206167f49599583663544832c9b d988d53fd2a94686b0c56fc8576e727b - - -] Do unplug vif from post_live_migration_at_source
2021-10-27 16:07:01.369 28641 INFO nova.virt.libvirt.driver [req-8b6d0c19-8088-4b91-94d6-bc8e037ac010 cf354206167f49599583663544832c9b d988d53fd2a94686b0c56fc8576e727b - - -] Do unplug vif from unplug_vifs
2021-10-27 16:07:01.373 28641 INFO os_vif [req-8b6d0c19-8088-4b91-94d6-bc8e037ac010 cf354206167f49599583663544832c9b d988d53fd2a94686b0c56fc8576e727b - - -] Successfully unplugged vif VIFOpenVSwitch(active=False,address=fa:16:3e:b3:f9:b4,bridge_name='br-int',has_traffic_filtering=True,id=a88326d4-bca7-444c-8476-abcaddec9f12,network=Network(2c4dfab0-7362-4ad8-9a92-27cec0fe6c05),plugin='ovs',port_profile=VIFPortProfileBase,preserve_on_delete=False,vif_name='tapa88326d4-bc')
vswitchd log:
2021-10-27T08:06:43.183Z|08976|bridge|INFO|bridge br-int: deleted interface tapa88326d4-bc on port 1493
2021-10-27T08:06:43.188Z|08977|bridge|WARN|could not open network device tapa88326d4-bc (No such device)
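The two logs use different timestamp formats, so the ordering is easier to check after normalizing both to UTC. The sketch below assumes the nova log is written in local time at UTC+8, which the post does not state explicitly; the vswitchd timestamp carries a Z suffix and is treated as UTC.

```python
from datetime import datetime, timedelta, timezone

# Assumption: nova logs local time in UTC+8 (not stated in the post).
NOVA_TZ = timezone(timedelta(hours=8))


def nova_ts(text):
    """Parse a nova log timestamp and attach the assumed local time zone."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f").replace(tzinfo=NOVA_TZ)


def ovs_ts(text):
    """Parse an ovs-vswitchd log timestamp, which is UTC with a Z suffix."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )


deleted = ovs_ts("2021-10-27T08:06:43.183Z")   # vswitchd: deleted interface
unplug = nova_ts("2021-10-27 16:07:01.368")    # nova: about to unplug the vif

lead = (unplug - deleted).total_seconds()
print(f"interface deleted {lead:.3f}s before nova unplug")
```

Under that time zone assumption, the interface was deleted about 18.185 seconds before nova logged the unplug.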
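Finally, we confirmed with the compute team that qemu deletes the tap device on the source node before unplug vif is executed.
[Figure: sequence diagram of the live migration process and the packet loss]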