追踪Redis Sentinel的CPU占有率长期接近100%的

2017-07-01 本文已影响292人宅楠军

问题描述

近日，在生产环境上，一个使用了Jedis 操作Redis数据库的客户端应用，出现了如下异常：

15:43:55.102 ERROR [RpcWorkThread_7] cache.MstpMysqlCacheClient getInstance 168 - Cache 异常 
redis.clients.jedis.exceptions.JedisConnectionException: All sentinels down, cannot determine where is X.X.X.X-6379 master is running...
 at redis.clients.jedis.JedisSentinelPool.initSentinels(JedisSentinelPool.java:153) ~[jedis-2.6.2.jar:na]
 at redis.clients.jedis.JedisSentinelPool.<init>(JedisSentinelPool.java:69) ~[jedis-2.6.2.jar:na]
 at redis.clients.jedis.JedisSentinelPool.<init>(JedisSentinelPool.java:52) ~[jedis-2.6.2.jar:na]

省略

从异常可以明确，这是一个Jedis连接异常。All sentinels down? 马上去检查了所有Sentinel服务器，发现所有的Sentinel服务进程都在运行，没有Down掉。但是却出现了下面的问题：

image.png

可以发现每台Sentinel的Cpu占有率一直都是接近于100%（注意是接近，经常在99%上下），然后使用Redis-cli连接到Sentinel上进行状态查看，但是输入任何命令都会超时：

image.png

现象就这么简单，叙述完了，问题很诡异，现在开始进行问题的追踪和分析。

问题追踪

生产环境Redis使用的版本是2.8.21，Jedis是2.6.2。为了方便定位，我们需要浏览他们的源码。

寻找问题发生所在地

首先我们要了解目前生产环境的Jedis连接Redis集群的机制。目前的集群架构设计为：

image.png

如上图所示，客户端使用JedisClient（即JedisSentinelPool）连接到各个Sentinel，通过Sentinel操作相应的Redis主服务器。

首先，我们定位到异常所在函数：initSentinels。这段函数比较长，但很关键，能让我们清楚问题到底出现在Jedis和服务器的哪一端。

private HostAndPort initSentinels(Set<String> sentinels, final String masterName) {

    HostAndPort master = null;
    boolean sentinelAvailable = false;

    log.info("Trying to find master from available Sentinels...");

    for (String sentinel : sentinels) {
      final HostAndPort hap = toHostAndPort(Arrays.asList(sentinel.split(":")));

      log.fine("Connecting to Sentinel " + hap);

      Jedis jedis = null;
      try {
        jedis = new Jedis(hap.getHost(), hap.getPort());

        List<String> masterAddr = jedis.sentinelGetMasterAddrByName(masterName);

        // connected to sentinel...
        sentinelAvailable = true;

        if (masterAddr == null || masterAddr.size() != 2) {
          log.warning("Can not get master addr, master name: " + masterName + ". Sentinel: " + hap
              + ".");
          continue;
        }

        master = toHostAndPort(masterAddr);
        log.fine("Found Redis master at " + master);
        break;
      } catch (JedisException e) {
        // resolves #1036, it should handle JedisException there's another chance
        // of raising JedisDataException
        log.warning("Cannot get master address from sentinel running @ " + hap + ". Reason: " + e
            + ". Trying next one.");
      } finally {
        if (jedis != null) {
          jedis.close();
        }
      }
    }

    if (master == null) {
      if (sentinelAvailable) {
        // can connect to sentinel, but master name seems to not
        // monitored
        throw new JedisException("Can connect to sentinel, but " + masterName
            + " seems to be not monitored...");
      } else {
        throw new JedisConnectionException("All sentinels down, cannot determine where is "
            + masterName + " master is running...");
      }
    }

    log.info("Redis master running at " + master + ", starting Sentinel listeners...");

    for (String sentinel : sentinels) {
      final HostAndPort hap = toHostAndPort(Arrays.asList(sentinel.split(":")));
      MasterListener masterListener = new MasterListener(masterName, hap.getHost(), hap.getPort());
      // whether MasterListener threads are alive or not, process can be stopped
      masterListener.setDaemon(true);
      masterListeners.add(masterListener);
      masterListener.start();
    }

我们先了解initSentinels的作用：

首先，根据客户端提供的Sentinel集合（存储了所有Sentinel服务器的地址）进行遍历，每次遍历中：
- 构造一个Jedis实例与当前Sentinel建立连接。
- 通过调用sentinelGetMasterAddrByName向Sentinel取得Master的地址和端口。
以上遍历结束后，如果成功的获取所有Master的地址和端口话，则继续启动多个（与Sentinel服务器个数对应）MasterListener线程。
这些线程的作用是为了不断的订阅Sentinel的消息，以监听Master的主从切换。

为什么要监听主从切换呢，这涉及到了另一个函数initPool，篇幅限制，我们只要了解它的作用：
如果发生了主从切换，则重新初始化连接池，以改变连接池里的Jedis的句柄关联对象，以始终关联Master而不是Slave。所以，监听主从切换的目的是确保连接池给用户的Jedis连接句柄绝对指向Master，因为我们的集群只有Master接受读写。

现在，我们开始追踪问题。可以清楚的看到，问题出现在第50行“All sentinels down...”。程序运行到这里，说明前面的sentinel集合遍历结束了。并且，此时“master == null、sentinelAvailable==false”。而“sentinelAvailable==false”说明了程序还未执行到“sentinelAvailable = true;”就发生了异常，继而说明问题出现在了前面：

jedis = new Jedis(hap.getHost(), hap.getPort());
List<String> masterAddr = jedis.sentinelGetMasterAddrByName(masterName);

结合之前连接Sentinel出现的超时现象，猜测或许是因为无法与Sentinel建立正常的连接导致这里抛出异常。至此，可以明确这个问题与本地客户端没有多大关系，所以我们可以大胆的将注意力投入到Sentinel服服务器之上。

寻找问题函数

通过第一步，我们确定了问题出现在Sentinel服务器之上，继续追踪和分析。

根据我的经验，一个进程的CPU占有率一直接近或达到100%，大部分情况下都可以说明该处于一个无限或类无限循环中，并且循环间隔没有进行Sleep的调用，使得进程持续占有CPU时间片。这个猜测对吗？我也不确定。

对于主服务是单进程的Sentinel服务，如果出现死循环，对于外界的响应必定是超时的。所以，我们目前首先要定位这个假设的循环到底在哪里。

此时，需要祭出两个大杀器：GDB以及Strace，这两个工具我就不做过多介绍了。

首先，在使用GDB前，我们需要得到Sentinel进程ID，运行top：

image.png

然后，根据其进程ID得到其下所有线程的运行状况，运行top -H -p 5515：

image.png

可以发现，Sentinel总共有三个线程，5515线程CPU占用率异常，我们记住这个线程ID：5515。然后，使用GDB调试正在运行的Sentinel进程，执行 gdb icdn 5515：

image.png

然后，我们输入info thread以列出当前进程的所有线程信息：

image.png

上面，LWP即表示线程ID，所以，我们定位到了1号，继续执行thread 1表示切换到该进程：

image.png

接着输入bt，查看当前的函数栈帧信息：

image.png

观察栈帧信息，这一行很引人注目：

#1  0x00000000004194d1 in anetGenericAccept (err=0x6f9438 "accept: Too many open files", s=5, sa=0x7fff4851ab00, len=0x7fff4851ab8c) at anet.c:487

也就是说anetGenericAccept函数的一个err参数的值为“accept:Too many open files”。

另外，如果你多次运行gdb icdn 5515，有时候也会得到不同与上面的如下信息（每次attach不一样说明程序调用的不是同一个函数栈）：

image.png

可以看到，redisLograw的参数也包括了“Too many open files”这个错误。

综上，可以发现两类栈帧都出现了“Too many open files”错误，难道这些都是因为打开的文件数太多造成的？不急，我们先记住这个错误。

由前面可知，当前Sentinel进程的函数调用链是“main->aeMain->aeProcessEvents->acceptTcpHandler->anetTcpAccept->anetGenericAcept->accept”。马上打开Redis的源码，最终定位到这里：

static int anetGenericAccept(char *err, int s, struct sockaddr *sa, socklen_t *len) {
    int fd;
    while(1) {
        fd = accept(s,sa,len);
        if (fd == -1) {
            if (errno == EINTR)
                continue;
            else {
                anetSetError(err, "accept: %s", strerror(errno));
                return ANET_ERR;
            }
        }
        break;
    }
    return fd;
}

看到一个代码里的while(1)让我很是激动，所以我猜测是不是这里出现了问题，但只是猜测。此时，还需要使用另外一个杀器——strace，我们利用它对Sentinel使用的系统调用进行跟踪。执行strace -T -r -c -p 5515（需要等待一定时间后手动结束strace）：

image.png

由上图可以清楚的看到，在跟踪的这段时间内，accept、epoll_wait、open三个系统调用就占用了总计98%的时间，这当然是不正常的。另外，accetp函数在短短的时间内调用了2434680次，失败调用占据了100%，这说明了之前anetGenericAccept中，accept 返回值每次都是-1。

最开始猜测的是Sentinel在anetGenericAccept中陷入while(1)不能自拔，但发现我想错了，因为此时errno != EINTR，而应该是“Too many open files”（错误号我没查），整个函数返回的是ANET_ERR（-1）。

为了再次证实这个结论，决定使用strace打印当前Sentinel正在执行的函数，运行strace -t -p 5515：

image.png

果然，可以看到基本上都是这三个函数的执行，说明Sentinel正在被这三个函数所纠缠。但是这个如何与之前GDB调试的结果对应呢？很清楚，这里accept即为之前anetGenericAccept的accept调用。

而open函数，是否对应了redisLograw函数？查看了redisLograw的源码：

void redisLogRaw(int level, const char *msg) {
    const int syslogLevelMap[] = { LOG_DEBUG, LOG_INFO, LOG_NOTICE, LOG_WARNING };
    const char *c = ".-*#";
    FILE *fp;
    char buf[64];
    int rawmode = (level & REDIS_LOG_RAW);
    int log_to_stdout = server.logfile[0] == '\0';

    level &= 0xff; /* clear flags */
    if (level < server.verbosity) return;

    fp = log_to_stdout ? stdout : fopen(server.logfile,"a");
    if (!fp) return;

    if (rawmode) {
        fprintf(fp,"%s",msg);
    } else {
        int off;
        struct timeval tv;

        gettimeofday(&tv,NULL);
        off = strftime(buf,sizeof(buf),"%d %b %H:%M:%S.",localtime(&tv.tv_sec));
        snprintf(buf+off,sizeof(buf)-off,"%03d",(int)tv.tv_usec/1000);
        fprintf(fp,"[%d] %s %c %s\n",(int)getpid(),buf,c[level],msg);
    }
    fflush(fp);

    if (!log_to_stdout) fclose(fp);
    if (server.syslog_enabled) syslog(syslogLevelMap[level], "%s", msg);
}

原来，关键就在于fp = log_to_stdout ? stdout : fopen(server.logfile,”a”);这一句,下面请听解释：

fopen底层实际调用的是open系统调用：int open(const char *pathname, int flags, mode_t mode);。
通过查看fopen源码，模式a表示的即为O_WRONLY|O_CREAT|O_APPEND，默认权限为0666。
所以，最终体现为：open(“/usr/local/SINO/redis/log/sentinel.log”,O_WRONLY|O_CREAT|O_APPEND, 0666)

这样，我们可以得出这个结论：accept和fopen的所在函数anetGenericAccept与redisLogRaw的调用，导致CPU居高不下。

继续往下定位

前面GDB的调试的两类函数栈帧结果，发现调用栈中都存在acceptTcpHandler这个函数：

image.png

acceptTcpHandler的源码非常关键：

void acceptTcpHandler(aeEventLoop *el, int fd, void *privdata, int mask) {
    int cport, cfd, max = MAX_ACCEPTS_PER_CALL;
    char cip[REDIS_IP_STR_LEN];
    REDIS_NOTUSED(el);
    REDIS_NOTUSED(mask);
    REDIS_NOTUSED(privdata);

    while(max--) {
        // accept 客户端连接
        cfd = anetTcpAccept(server.neterr, fd, cip, sizeof(cip), &cport);
        if (cfd == ANET_ERR) {
            if (errno != EWOULDBLOCK)
                redisLog(REDIS_WARNING,
                    "Accepting client connection: %s", server.neterr);
            return;
        }
        redisLog(REDIS_VERBOSE,"Accepted %s:%d", cip, cport);
        // 为客户端创建客户端状态（redisClient）
        acceptCommonHandler(cfd,0);

根据redislogRaw记录的错误“Accepting client connection…”，可以发现，问题出现在这句：redisLog(REDIS_WARNING,”Accepting client connection: %s”, server.neterr);

原来，anetTcpAccept中的accept调用返回了ANET_ERR，并且errno 对应错误内容为“Too many open files”，所以acceptTcpHandler会马上返回。基本上可以确定，CPU占用率居高不下的原因是对acceptTcpHandler不停的进行调用，并且调用时间很短。

看到acceptTcpHandler这个函数，如果了解Redis事件处理机制的人应该会比较熟悉。对滴!这就是Redis事件处理机制中文件事件的处理函数之一，大概提一下：
Redis的事件包括时间和文件两种事件，其中文件事件使用IO多路复用监听多个套接字。

当套接字变得可读时，如客户端执行write、close或connect操作时，套接字产生AS_READBALE事件
当套接字变得可写时，如客户端执行read操作时，套接字产生AE_WRITABLE事件。

而这里的acceptTcpHandler，是专门针对connect操作的，是Redis的连接应答处理函数，用来对连接服务器监听套接字的客户端进行应答。

在initServer函数中可以看到，专门为连接事件注册了acceptTcpHandler事件处理函数(传入函数指针)。

void initServer() {
省略
  /* Create an event handler for accepting new connections in TCP and Unix
     * domain sockets. */
    for (j = 0; j < server.ipfd_count; j++) {
        if (aeCreateFileEvent(server.el, server.ipfd[j], AE_READABLE,
            acceptTcpHandler,NULL) == AE_ERR)
            {
                redisPanic(
                    "Unrecoverable error creating server.ipfd file event.");
            }
    }

省略
}

谁在不停的调用acceptTcpHandler

现在，我们定位到了CPU占有率持续增长的原因，是由于“Too many open files”导致acceptTcpHandler持续调用而产生的。归纳为两个问题：

为什么会产生too many open files
acceptTcpHandler为什么会持续调用

对于第一个问题，我们最后再做解释，我们首先分析第二个。

在前面的GDB调试的结果的图中，我们都发现了main->aeMain->aeProcessEvents这样一个调用链，查看源码，发现在aeMain中，这样调用aeProcessEvents：

void aeMain(aeEventLoop *eventLoop) {
    eventLoop->stop = 0;
    while (!eventLoop->stop) {
        if (eventLoop->beforesleep != NULL)
            eventLoop->beforesleep(eventLoop);
        aeProcessEvents(eventLoop, AE_ALL_EVENTS);
    }
}

aeMain是Redis事件处理的主循环，这里有个while()循环，我觉得似乎发现了什么。原来，这一句表示，只要eventLoop->stop == 0，循环会一直继续，aeProcessEvents会一直被调用。其中，eventLoop你可以理解为保存了所有事件的一个结构体，里面的每个事件都会绑定一个相应的事件处理函数，如对连接事件，会绑定acceptTcpHandler这个事件，具体可以查看源码。

进入aeProcessEvents中，我们可以发现其首先调用了aeApiPoll得到就绪事件的个数numevents，然后使用一个for循环去循环处理事件，直到numevents个数为0。

aeProcessEvents中主要通过fe->rfileProc(eventLoop,fd,fe->clientData,mask);来调用acceptTcpHandler以处理事件。为什么是rfileProc，回顾之前的Redis事件定义，因为连接事件是读事件。

aeProcessEvents源码如下：

int aeProcessEvents(aeEventLoop *eventLoop, int flags)
{
省略
        numevents = aeApiPoll(eventLoop, tvp);
        for (j = 0; j < numevents; j++) {
            aeFileEvent *fe = &eventLoop->events[eventLoop->fired[j].fd];

            int mask = eventLoop->fired[j].mask;
            int fd = eventLoop->fired[j].fd;
            int rfired = 0;

           /* note the fe->mask & mask & ... code: maybe an already processed
             * event removed an element that fired and we still didn't
             * processed, so we check if the event is still valid. */
            if (fe->mask & mask & AE_READABLE) {
                // rfired 
                rfired = 1;
                fe->rfileProc(eventLoop,fd,fe->clientData,mask);
            }
            if (fe->mask & mask & AE_WRITABLE) {
                if (!rfired || fe->wfileProc != fe->rfileProc)
                    fe->wfileProc(eventLoop,fd,fe->clientData,mask);
            }

            processed++;
        }

省略
}

所以，结合以上代码，我们终于可以得出结论：

由于aeMain中的while循环一直进行，导致aeProcessEvents会不断执行。
aeProcessEvents先调用aeApiPoll，aeApiPoll会返回就绪事件的个数numevents 。
aeProcessEvents在有限循环（numevents ）中调用fe->rfileProc(eventLoop,fd,fe->clientData,mask);，也就是调用了acceptTcpHandler去处理事件。
而acceptTcpHandler又因为出现了“Too many open files”问题快速返回，导致整个aeProcessEvents快速返回。
如此周而复始。

当然，有人会问，为什么CPU占有率不一直是100%，上面这个调用链即给出了答案，因为还有一丁点的时间片分给了这个调用链里的其他调用，而不只是accept、epoll_wait、open这三个系统调用。

至此，我们只剩下一个问题，为什么会出现“Too many open files”这个问题，造成前面acceptTcpHandler无法正确处理连接事件，这也是本次问题追踪的最终目标。下一篇文章将针对这个问题继续进行探讨，敬请期待。

第一次写真正意义上的技术博文，希望能得到大家的认可，如有错误，请不吝指出，非常感谢。我会继续努力，O(∩_∩)O哈哈~。

补充

aeApiPoll

aeApiPoll会调用epoll_wait，这也对应了前面strace展示的三个函数之一，但epoll_wait调用正常，如图：

image.png

aeApiPoll会获取可执行事件（其实是保存在了eventLoop->fired的fired数组中),最后返回就绪事件个数。源码如下：


static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
    aeApiState *state = eventLoop->apidata;
    int retval, numevents = 0;

    retval = epoll_wait(state->epfd,state->events,eventLoop->setsize,
            tvp ? (tvp->tv_sec*1000 + tvp->tv_usec/1000) : -1);

    if (retval > 0) {
        int j;

        numevents = retval;
        for (j = 0; j < numevents; j++) {
            int mask = 0;
            struct epoll_event *e = state->events+j;

            if (e->events & EPOLLIN) mask |= AE_READABLE;
            if (e->events & EPOLLOUT) mask |= AE_WRITABLE;
            if (e->events & EPOLLERR) mask |= AE_WRITABLE;
            if (e->events & EPOLLHUP) mask |= AE_WRITABLE;

            eventLoop->fired[j].fd = e->data.fd;
            eventLoop->fired[j].mask = mask;
        }
    }

    return numevents;
}

aeEventLoop

可以保存所有的事件及其对应的事件处理器，这是一个很重要的结构体。
aeApiPoll将事件保存到aeEventLoop 里面，然后aeProcessEvents调用aeEventLoop 的事件处理函数去处理事件。当然，事件处理函数一开始已经注册给aeEventLoop 了。

/* State of an event based program 
 */
typedef struct aeEventLoop {
    int maxfd;   /* highest file descriptor currently registered */
    int setsize; /* max number of file descriptors tracked */
    long long timeEventNextId;
    time_t lastTime;     /* Used to detect system clock skew */
    aeFileEvent *events; /* Registered events */
    aeFiredEvent *fired; /* Fired events */
    aeTimeEvent *timeEventHead;
    int stop;
    void *apidata; /* This is used for polling API specific data */
    aeBeforeSleepProc *beforesleep;
} aeEventLoop;

aeFileEvent 即表示了一个事件，其包括事件数据clientData及对应的处理函数rfileProc或wfileProc。

/* File event structure
 */
typedef struct aeFileEvent {
    int mask; /* one of AE_(READABLE|WRITABLE) */
    aeFileProc *rfileProc;
    aeFileProc *wfileProc;
    void *clientData;
} a

欢迎关注本人微信公众号：

爱你之心.jpg

追踪Redis Sentinel的CPU占有率长期接近100%的

问题描述

问题追踪

寻找问题发生所在地

寻找问题函数

继续往下定位

谁在不停的调用acceptTcpHandler

补充

aeApiPoll

aeEventLoop

猜你喜欢

热点阅读