MySQL-Innodb-Hazard Pointer的设计与实

2020-10-08  本文已影响0人  多血

前提

----fil_aio_wait
----buf_page_io_complete(static_cast<buf_page_t*>(message));
----buf_page_io_complete
--------buf_flush_write_complete
-------------buf_flush_remove
-----------------UT_LIST_REMOVE(buf_pool->flush_list, bpage);

以前的代码实现(版本5.7.1):

刷脏线程刷完一个数据页,就需要回到Flush List末尾,重新扫描新的可刷盘的脏页。

buf_do_flush_list_batch{
    do {
        /* Start from the end of the list looking for a suitable
        block to be flushed. */

        buf_flush_list_mutex_enter(buf_pool);

        /* We use len here because theoretically insertions can
        happen in the flush_list below while we are traversing
        it for a suitable candidate for flushing. We'd like to
        set a limit on how farther we are willing to traverse
        the list. */
        len = UT_LIST_GET_LEN(buf_pool->flush_list);
        bpage = UT_LIST_GET_LAST(buf_pool->flush_list);

        if (bpage) {
            ut_a(bpage->oldest_modification > 0);
        }

        if (!bpage || bpage->oldest_modification >= lsn_limit) {

            /* We have flushed enough */
            buf_flush_list_mutex_exit(buf_pool);
            break;
        }

        ut_a(bpage->oldest_modification > 0);

        ut_ad(bpage->in_flush_list);

        buf_flush_list_mutex_exit(buf_pool);

        /* The list may change during the flushing and we cannot
        safely preserve within this function a pointer to a
        block in the list! */
        while (bpage != NULL
               && len > 0
               && !buf_flush_page_and_try_neighbors(
                bpage, BUF_FLUSH_LIST, min_n, &count)) {

            ++scanned;
            buf_flush_list_mutex_enter(buf_pool);

            /* If we are here that means that buf_pool->mutex
             was not released in buf_flush_page_and_try_neighbors()
            above and this guarantees that bpage didn't get
            relocated since we released the flush_list
            mutex above. There is a chance, however, that
            the bpage got removed from flush_list (not
            currently possible because flush_list_remove()
            also obtains buf_pool mutex but that may change
            in future). To avoid this scenario we check
            the oldest_modification and if it is zero
            we start all over again. */
            if (bpage->oldest_modification == 0) {
                buf_flush_list_mutex_exit(buf_pool);
                break;
            }

            bpage = UT_LIST_GET_PREV(list, bpage);

            ut_ad(!bpage || bpage->in_flush_list);

            buf_flush_list_mutex_exit(buf_pool);

            --len;
        }

    } while (count < min_n && bpage != NULL && len > 0);
}

为什么这样设计?

因为刷脏线程刷的这个脏页之前的脏页可能被其他线程给刷走了,之前的脏页可能已经不在Flush list中。
造成这个现象的原因:

带来了什么后果

我们的某一个刷脏线程拿到队尾最后一个数据页,IO fixed,发送给IO线程,最后再从队尾扫描寻找可刷盘的脏页。在这次扫描中,它发现最后一个数据页(也就是刚刚发送到IO线程中的数据页)状态为IO fixed(磁盘很慢,还没处理完)所以不能刷,跳过,开始刷倒数第二个数据页,同样IO fixed,发送给IO线程,然后再次重新扫描Flush List。它又发现尾部的两个数据页都不能刷新(因为磁盘很慢,可能还没刷完),直到扫描到倒数第三个数据页。所以,存在一种极端的情况,如果磁盘比较缓慢,刷脏算法性能会从O(N)退化成O(N*N)。
https://dev.mysql.com/worklog/task/?id=7047

Hazard Pointer版本的代码实现(5.7.28)

static
ulint
buf_do_flush_list_batch(
    buf_pool_t*     buf_pool,
    ulint           min_n,
    lsn_t           lsn_limit)
{
    ulint       count = 0;
    ulint       scanned = 0;

    ut_ad(buf_pool_mutex_own(buf_pool));

    /* Start from the end of the list looking for a suitable
    block to be flushed. */
    buf_flush_list_mutex_enter(buf_pool);
    ulint len = UT_LIST_GET_LEN(buf_pool->flush_list);

    /* In order not to degenerate this scan to O(n*n) we attempt
    to preserve pointer of previous block in the flush list. To do
    so we declare it a hazard pointer. Any thread working on the
    flush list must check the hazard pointer and if it is removing
    the same block then it must reset it. */
    for (buf_page_t* bpage = UT_LIST_GET_LAST(buf_pool->flush_list);
         count < min_n && bpage != NULL && len > 0
         && bpage->oldest_modification < lsn_limit;
         bpage = buf_pool->flush_hp.get(),
         ++scanned) {

        buf_page_t* prev;

        ut_a(bpage->oldest_modification > 0);
        ut_ad(bpage->in_flush_list);

        prev = UT_LIST_GET_PREV(list, bpage);
        buf_pool->flush_hp.set(prev);
        buf_flush_list_mutex_exit(buf_pool);

#ifdef UNIV_DEBUG
        bool flushed =
#endif /* UNIV_DEBUG */
        buf_flush_page_and_try_neighbors(
            bpage, BUF_FLUSH_LIST, min_n, &count);

        buf_flush_list_mutex_enter(buf_pool);

        ut_ad(flushed || buf_pool->flush_hp.is_hp(prev));

        --len;
    }

    buf_pool->flush_hp.set(NULL);
    buf_flush_list_mutex_exit(buf_pool);

    if (scanned) {
        MONITOR_INC_VALUE_CUMULATIVE(
            MONITOR_FLUSH_BATCH_SCANNED,
            MONITOR_FLUSH_BATCH_SCANNED_NUM_CALL,
            MONITOR_FLUSH_BATCH_SCANNED_PER_CALL,
            scanned);
    }

    if (count) {
        MONITOR_INC_VALUE_CUMULATIVE(
            MONITOR_FLUSH_BATCH_TOTAL_PAGE,
            MONITOR_FLUSH_BATCH_COUNT,
            MONITOR_FLUSH_BATCH_PAGES,
            count);
    }

    ut_ad(buf_pool_mutex_own(buf_pool));

    return(count);
}

原理

要解决这个问题,最本质的方法就是当刷完一个脏页的时候不要每次都从队尾重新扫描。我们可以使用Hazard Pointer来解决,方法如下:遍历找到一个可刷盘的数据页,在锁释放之前,调整Hazard Pointer使之指向Flush List中下一个节点,注意一定要在持有锁的情况下修改。然后释放锁,进行刷盘,刷完盘后,重新获取锁,读取Hazard Pointer并设置下一个节点,然后释放锁,进行刷盘,如此重复。当这个线程在刷盘的时候,另外一个线程需要刷盘,也是通过Hazard Pointer来获取可靠的节点,并重置下一个有效的节点。通过这种机制,保证每次读到的Hazard Pointer是一个有效的Flush List节点,即使磁盘再慢,刷脏算法效率依然是O(N)。 这个解法同样可以用到LRU List驱逐算法上,提高驱逐的效率。
http://mysql.taobao.org/monthly/2017/05/01/

上一篇下一篇

猜你喜欢

热点阅读