Kafka Bug Notes (1)
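The ELK Kafka cluster run by our ops team recently had several incidents in which CPU usage fell off a cliff: the server became completely unavailable, file-related errors were eventually reported, and services on the nodes failed in a cascade.
Monitoring showed exactly this pattern.
The logs showed no obvious anomaly or root cause.
The only thing that stood out was periodic ISR shrinking.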


The most recent time it happened, we asked ops to dump the thread stacks.

The dump made it obvious that we had hit a deadlock.
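
For reference, the JVM can report this kind of lock cycle itself. jstack typically prints a "Found one Java-level deadlock" section when it detects one, and the same check is available in-process through ThreadMXBean. The sketch below is a stand-alone demo, not Kafka code: DeadlockProbe and its two demo threads are made-up names, and the deadlock is created on purpose with two plain monitors taken in opposite order.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class DeadlockProbe {
    /** Asks the JVM for threads stuck in a lock cycle; empty array if there are none. */
    static ThreadInfo[] findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null means no deadlock was found
        return ids == null ? new ThreadInfo[0] : mx.getThreadInfo(ids, true, true);
    }

    /** Demo only: starts two daemon threads that take locks a and b in opposite order. */
    static void startDeadlockedPair() {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        startDaemon("demo-1", a, b, bothHoldFirstLock);
        startDaemon("demo-2", b, a, bothHoldFirstLock);
    }

    private static void startDaemon(String name, Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // never reached in this demo: the other thread holds `second`
                }
            }
        }, name);
        t.setDaemon(true); // daemon threads do not keep the JVM alive
        t.start();
    }

    public static void main(String[] args) throws InterruptedException {
        startDeadlockedPair();
        ThreadInfo[] stuck = new ThreadInfo[0];
        for (int i = 0; i < 50 && stuck.length == 0; i++) {
            Thread.sleep(100);
            stuck = findDeadlocks();
        }
        for (ThreadInfo info : stuck) {
            System.out.println(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
    }
}
```

In a real incident the broker is a separate process, so a thread dump taken with jstack (as ops did here) is the practical route; the in-process check is mainly useful for tests or a health endpoint.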
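The description in the official Kafka issue also states the consequence of this bug quite clearly:
drop all connections, including replication to other brokers
That matches the behavior of our cluster very closely.
After going through the issue, the conclusion is that a deadlock occurs between GroupCoordinator and DelayedProduce.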
The PR that introduced the deadlock: https://github.com/apache/kafka/pull/3133/commits/f62ed9aecd0129ae2e6e9d68d921f2271a55f76a
The fix for the deadlock uses a reentrant lock. The PR:
https://github.com/apache/kafka/pull/3956/commits/fd45ac3ec90c653a5cb5c6c3d5ae66e8bdde6296
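
To illustrate why a reentrant lock can help, here is a minimal sketch of one common pattern: take the first lock normally, but only try the second one and back off if it is busy, so two threads that take the locks in opposite order cannot block each other forever. This is an assumption about the general technique, not a copy of the linked PR; TryLockSketch, groupLock, operationLock and runWithBoth are hypothetical names.

```java
import java.util.concurrent.locks.ReentrantLock;

class TryLockSketch {
    // Hypothetical stand-ins for the two locks involved; not Kafka's real fields.
    static final ReentrantLock groupLock = new ReentrantLock();
    static final ReentrantLock operationLock = new ReentrantLock();

    /**
     * Runs the action while holding both locks, taking `first` and then `second`.
     * Returns false instead of blocking when `second` is held by another thread.
     */
    static boolean runWithBoth(ReentrantLock first, ReentrantLock second, Runnable action) {
        first.lock();
        try {
            if (!second.tryLock()) {
                return false; // back off; the caller may retry later
            }
            try {
                action.run();
                return true;
            } finally {
                second.unlock();
            }
        } finally {
            first.unlock();
        }
    }
}
```

Because ReentrantLock is reentrant, a thread that already holds both locks can call runWithBoth again (in either order) without blocking itself. The trade-off is that callers must handle the false return, for example by retrying on the next event.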
The condition for reproducing this deadlock, as reported in the issue:
In our case, turns out there's a very chatty consumer calling commitAsync() after each message, about 30/s across about 8 consumer threads. (Found by heap dump of the deadlocked JVM -> inspect the GroupMetadata object.) Been running this way since 0.9.0. Less commits -> less likely but still possible.
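In short: a very chatty consumer calls commitAsync() after every single message, at about 30 commits per second across roughly 8 consumer threads. Committing less often makes the deadlock less likely, but it can still happen.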
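
On the client side, the commit rate can be reduced by committing once every N records instead of after each one. The sketch below only shows the counting logic and is an assumption about how one might wire it up: CommitThrottle is a hypothetical helper, and the commit action is passed in as a Runnable (in a real consumer loop it could be something like `() -> consumer.commitAsync()`). As the quote says, this lowers the odds of hitting the deadlock but does not remove the broker-side bug.

```java
class CommitThrottle {
    private final int everyN;
    private final Runnable commitAction; // assumed to wrap the real commit call
    private int sinceLastCommit = 0;

    CommitThrottle(int everyN, Runnable commitAction) {
        if (everyN < 1) {
            throw new IllegalArgumentException("everyN must be >= 1");
        }
        this.everyN = everyN;
        this.commitAction = commitAction;
    }

    /** Call once per processed record; runs the commit action on every N-th call. */
    boolean onRecordProcessed() {
        sinceLastCommit++;
        if (sinceLastCommit < everyN) {
            return false;
        }
        sinceLastCommit = 0;
        commitAction.run();
        return true;
    }
}
```

A consumer using this would call onRecordProcessed() after handling each record and issue one final commit on shutdown. Fewer commits also mean more records may be reprocessed after a restart or rebalance, so N is a trade-off.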