Kafka Bug Notes, Part 1

2022-06-25 Alen_ab56

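The Kafka cluster in our ELK stack, which the ops team maintains, recently had several incidents in which CPU usage "fell off a cliff" and the server became completely unavailable. Eventually file-service errors were reported and the services on the nodes failed in a cascade.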

The symptoms seen in monitoring were:

The logs showed no anomaly and no obvious cause.

The only thing that stood out was periodic ISR shrinking.

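We first suspected that heavy traffic was keeping replicas from syncing in time, but ops pointed out that the traffic pattern did not fit this hypothesis.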

The most recent time it happened, we asked ops to dump the thread stacks.

The dump contained the error above.

Clearly we had hit a deadlock.
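A thread dump is one way to spot a deadlock. The same check can be done from inside a JVM through the standard `ThreadMXBean` API. The sketch below is only an illustration: `DeadlockProbe` and `describeDeadlocks` are hypothetical names, not part of Kafka, and the probe inspects only the JVM it runs in (for a broker it would have to run inside the broker process or go through remote JMX).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not Kafka code): lists threads that are stuck in a lock cycle.
class DeadlockProbe {
    // Returns one line per deadlocked thread; the list is empty when no deadlock exists.
    static List<String> describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Covers both synchronized monitors and java.util.concurrent locks;
        // returns null when nothing is deadlocked.
        long[] ids = mx.findDeadlockedThreads();
        List<String> lines = new ArrayList<>();
        if (ids == null) {
            return lines;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info == null) {
                continue; // thread ended between the two calls
            }
            lines.add(info.getThreadName()
                    + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return lines;
    }
}
```

Calling `DeadlockProbe.describeDeadlocks()` periodically and alerting on a non-empty result would surface this kind of hang without waiting for someone to take a manual dump.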

The description in the official Kafka issue also states the consequences of this bug very clearly:

drop all connections, including replication to other brokers

This closely matches how our cluster behaved.

After analyzing the issue, we concluded that a deadlock had occurred between GroupCoordinator and DelayedProduce.

The PR that introduced the deadlock: https://github.com/apache/kafka/pull/3133/commits/f62ed9aecd0129ae2e6e9d68d921f2271a55f76a

The fix for the deadlock is to use a reentrant lock. The PR:

https://github.com/apache/kafka/pull/3956/commits/fd45ac3ec90c653a5cb5c6c3d5ae66e8bdde6296
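To illustrate why a reentrant lock helps, here is a minimal sketch of the general pattern, not Kafka's actual code. `TryLockSketch`, `maybeRun` and `demo` are hypothetical names. The idea shown is that a thread attempts the lock with `tryLock()` and skips the work when another thread holds it, instead of blocking while it may itself be holding a second lock. The thread that already owns the lock can still re-enter it.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch (not Kafka code): run an action only if the lock is available.
class TryLockSketch {
    // Never blocks: returns false immediately when another thread holds the lock.
    static boolean maybeRun(ReentrantLock lock, Runnable action) {
        if (!lock.tryLock()) {
            return false; // the lock holder is responsible for doing the work
        }
        try {
            action.run();
            return true;
        } finally {
            lock.unlock();
        }
    }

    // result[0]: attempt by the thread that already owns the lock (re-entry).
    // result[1]: attempt by a different thread while the lock is held.
    static boolean[] demo() {
        ReentrantLock lock = new ReentrantLock();
        boolean[] result = new boolean[2];
        lock.lock();
        try {
            result[0] = maybeRun(lock, () -> { });
            Thread other = new Thread(() -> result[1] = maybeRun(lock, () -> { }));
            other.start();
            try {
                other.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        } finally {
            lock.unlock();
        }
        return result;
    }
}
```

In `demo()`, the owning thread gets `true` and the second thread gets `false` right away. With a plain blocking `lock()` the second thread would wait, and if it were holding a lock the first thread needed, neither would make progress.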

The conditions for reproducing this deadlock, as described in the issue:
In our case, turns out there's a very chatty consumer calling commitAsync() after each message, about 30/s across about 8 consumer threads. (Found by heap dump of the deadlocked JVM -> inspect the GroupMetadata object.) Been running this way since 0.9.0. Less commits -> less likely but still possible.

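In other words, the trigger was a very chatty consumer calling commitAsync() after every message, about 30 commits per second across about 8 consumer threads.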
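Since the issue notes that fewer commits make the deadlock less likely (though still possible), one mitigation on the client side is to commit in batches rather than per message. The sketch below is an assumption-laden illustration with no Kafka dependency: `CommitThrottle` and its parameters are hypothetical, and the actual `consumer.commitAsync()` call is left to the caller. It reduces the odds only; it does not replace the broker-side fix.

```java
// Hypothetical helper (not part of the Kafka client): decides when a commit is due.
class CommitThrottle {
    private final int maxRecords;      // commit after this many processed records
    private final long maxIntervalMs;  // or after this much time since the last commit
    private int pending = 0;
    private long lastCommitMs;

    CommitThrottle(int maxRecords, long maxIntervalMs, long nowMs) {
        this.maxRecords = maxRecords;
        this.maxIntervalMs = maxIntervalMs;
        this.lastCommitMs = nowMs;
    }

    // Call once per processed record; returns true when the caller should commit now.
    boolean recordProcessed(long nowMs) {
        pending++;
        boolean due = pending >= maxRecords || nowMs - lastCommitMs >= maxIntervalMs;
        if (due) {
            pending = 0;
            lastCommitMs = nowMs;
        }
        return due;
    }
}
```

In a poll loop this would be used roughly as `if (throttle.recordProcessed(System.currentTimeMillis())) consumer.commitAsync();`, so that a consumer handling 30 messages per second with `maxRecords = 100` commits about once every three seconds instead of 30 times per second.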
