arp_cache: neighbor table overfl

2023-11-29  wwq2020

Background

Our production k8s cluster uses flannel as its CNI, and the apiserver restarts from time to time.

Troubleshooting

Check the apiserver logs from before the restart

kubectl logs -f -n kube-system {apiserver-pod-name}

The logs show a large number of timeouts just before shutdown

Check syslog
It contains many occurrences of neighbour: arp_cache: neighbor table overflow!
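
To get a rough idea of how often the message appears, the matching lines can be counted. A minimal sketch: the helper name count_neigh_overflow is made up for illustration, and the default path /var/log/syslog is an assumption (Debian/Ubuntu layout); other distributions may log to a different file or only to the journal.

```shell
# Count kernel "neighbor table overflow" messages in a log file.
# Hypothetical helper; the default path is an assumption and may differ per distro.
count_neigh_overflow() {
    log_file="${1:-/var/log/syslog}"
    # -c prints the number of matching lines, -F matches a fixed string
    grep -cF 'neighbor table overflow' "$log_file"
}
```

For example, count_neigh_overflow /var/log/syslog prints the number of matching lines in that file.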

Two important pieces of information from https://man7.org/linux/man-pages/man7/arp.7.html

       Entries which are marked as permanent are never deleted by the
       garbage-collector

       gc_thresh1 (since Linux 2.2)
              The minimum number of entries to keep in the ARP cache.
              The garbage collector will not run if there are fewer than
              this number of entries in the cache.  Defaults to 128.

       gc_thresh2 (since Linux 2.2)
              The soft maximum number of entries to keep in the ARP
              cache.  The garbage collector will allow the number of
              entries to exceed this for 5 seconds before collection
              will be performed.  Defaults to 512.

       gc_thresh3 (since Linux 2.2)
              The hard maximum number of entries to keep in the ARP
              cache.  The garbage collector will always run if there are
              more than this number of entries in the cache.  Defaults
              to 1024

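GC is triggered once the number of ARP cache entries has stayed above gc_thresh2 for more than 5 seconds.
GC is also triggered whenever the number of ARP cache entries exceeds gc_thresh3; if the count is still above gc_thresh3 after GC, the kernel reports arp_cache: neighbor table overflow!
However, flannel automatically configures its ARP entries as permanent, so GC cannot remove them. Once their count exceeds gc_thresh3, no new ARP cache entries can be added.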
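
The rules above can be sketched as a small shell model. This is a simplification for illustration, not the kernel's actual algorithm: it assumes, optimistically, that GC can evict every non-permanent entry. The function names count_permanent_neigh and neigh_gc_state and the state labels are made up for this sketch.

```shell
# Count PERMANENT entries in `ip neigh show` output read from stdin.
# Hypothetical helper for illustration.
count_permanent_neigh() {
    grep -cF 'PERMANENT'
}

# Simplified model of the thresholds quoted from arp(7).
# Arguments: total entries, permanent entries, gc_thresh1, gc_thresh2, gc_thresh3
neigh_gc_state() {
    total=$1; permanent=$2; t1=$3; t2=$4; t3=$5
    if [ "$total" -lt "$t1" ]; then
        echo "no-gc"            # fewer than gc_thresh1: the collector does not run
    elif [ "$total" -le "$t2" ]; then
        echo "below-soft-max"   # not above the soft maximum yet
    elif [ "$total" -le "$t3" ]; then
        echo "soft-gc"          # above gc_thresh2: collected after the 5 second grace period
    elif [ "$permanent" -gt "$t3" ]; then
        echo "overflow"         # permanent entries alone exceed gc_thresh3, GC cannot help
    else
        echo "forced-gc"        # above gc_thresh3, but non-permanent entries can still be evicted
    fi
}
```

On a node, ip -4 neigh show | count_permanent_neigh gives the permanent count to feed into the model, for example neigh_gc_state 1100 1050 128 512 1024 for a table dominated by permanent entries under the default thresholds.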

Solution

Raise the ARP cache GC thresholds by adding the following to /etc/sysctl.conf, for example

net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 16384
net.ipv4.neigh.default.gc_thresh3 = 32768

Then load the new settings

sysctl -p
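
To confirm that the new values are active, the thresholds can be read back from procfs. A minimal sketch: the helper name show_gc_thresh is made up for illustration, and the directory is a parameter only so the function can be pointed somewhere else for testing.

```shell
# Print the three gc_thresh values found under a neigh "default" directory.
# Hypothetical helper; defaults to the standard procfs location for IPv4.
show_gc_thresh() {
    dir="${1:-/proc/sys/net/ipv4/neigh/default}"
    for name in gc_thresh1 gc_thresh2 gc_thresh3; do
        printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
    done
}
```

Running show_gc_thresh with no arguments after sysctl -p should print the values configured in /etc/sysctl.conf.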