How to handle Kafka partitions that report leader: none
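Today production raised the warning "1 partitions have leader brokers without a matching listener, including [xxx-topic-0]". Our Java services could read and write some Kafka messages normally but not others, so I suspected that some of the brokers were in trouble.
I looked at the topic through ZooKeeper and it seemed fine, but that turned out to be misleading.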

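After ruling out the application and other causes, I still believed Kafka itself was at fault, so I bypassed ZooKeeper and queried the brokers one by one. Two of them reported leader none when describing the topic.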
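
The per-broker check can be scripted. Below is a minimal sketch, not the exact commands I ran: it assumes a Kafka version whose kafka-topics.sh accepts --bootstrap-server, and the broker addresses in the example are placeholders. The helper names leaderless_partitions and check_brokers are made up for illustration.

```shell
# Read `kafka-topics.sh --describe` output on stdin and print
# "topic partition" for every partition whose leader is reported as none.
leaderless_partitions() {
  awk '{
    topic = ""; part = ""; leader = ""
    for (i = 1; i < NF; i++) {
      if ($i == "Topic:")     topic  = $(i + 1)
      if ($i == "Partition:") part   = $(i + 1)
      if ($i == "Leader:")    leader = $(i + 1)
    }
    # The topic summary line has no "Partition:" field, so it is skipped.
    if (part != "" && leader == "none") print topic, part
  }'
}

# Ask each broker directly for its view of the cluster metadata.
check_brokers() {
  for broker in "$@"; do
    echo "== $broker"
    kafka-topics.sh --bootstrap-server "$broker" --describe | leaderless_partitions
  done
}

# Example (placeholder addresses):
#   check_brokers broker1:9092 broker2:9092 broker3:9092
```

Querying one broker at a time matters here because each broker answers from its own metadata cache, so a broker with a stale view can show leader none while the others look healthy.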

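Brokers 1 and 3 were the unhealthy ones. I went through Kafka's server.log on each machine: machines 1 and 2 showed no errors, but the log on machine 3 did.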
[2021-07-05 14:37:54,516] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition xxxx-0 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-07-05 14:37:54,516] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition xxxx-0 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-07-05 14:37:55,518] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition xxxx-1 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-07-05 14:37:55,518] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition xxxx-0 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
[2021-07-05 14:37:55,518] INFO [ReplicaFetcher replicaId=3, leaderId=1, fetcherId=0] Retrying leaderEpoch request for partition xxxx-0 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
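
To see which partitions are affected and how often, the retries in the log lines above can be tallied per partition. This is a rough sketch that assumes the log line format shown above; epoch_errors_by_partition is a hypothetical helper name and the log path in the example is a placeholder.

```shell
# Read server.log lines on stdin and print "partition count" for every
# partition that logged an UNKNOWN_LEADER_EPOCH retry.
epoch_errors_by_partition() {
  grep 'UNKNOWN_LEADER_EPOCH' \
    | sed -n 's/.*for partition \([^ ]*\) as the leader.*/\1/p' \
    | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example (placeholder path):
#   epoch_errors_by_partition < /path/to/kafka/logs/server.log
```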
Searching Baidu for UNKNOWN_LEADER_EPOCH turned up nothing useful, so I read the description in the official API documentation:
The request contained a leader epoch which is larger than that on the broker that received the request. This can happen if the client observes a metadata update before it has been propagated to all brokers. Clients need not refresh metadata before retrying.
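Roughly speaking, the marker used for replication between leader and follower (the leader epoch) had gotten out of sync. Since the leader was currently on machine 2 and that broker was fully healthy, I decided to restart Kafka on machine 3.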
sudo systemctl restart kafka
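After the restart I watched server.log on machine 3: everything was normal and the errors had stopped.
I checked the output of kafka-topics.sh again and every partition showed Leader: 2. After adjusting the consumer offsets, the production MQ was back to normal.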
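
The final check can also be reduced to a one-line summary per leader. This is only a sketch under the same assumptions as before: kafka-topics.sh accepts --bootstrap-server, the broker address is a placeholder, and leader_counts is a hypothetical helper name.

```shell
# Read `kafka-topics.sh --describe` output on stdin and print how many
# partitions each leader (or "none") currently owns.
leader_counts() {
  awk '{
    part = ""; leader = ""
    for (i = 1; i < NF; i++) {
      if ($i == "Partition:") part   = $(i + 1)
      if ($i == "Leader:")    leader = $(i + 1)
    }
    if (part != "") count[leader]++
  }
  END { for (l in count) print "leader " l ": " count[l] " partition(s)" }' | sort
}

# Example (placeholder address):
#   kafka-topics.sh --bootstrap-server broker3:9092 --describe --topic xxx-topic | leader_counts
```

If any line starts with "leader none", that broker's view still has partitions without a leader.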