MySQL-Innodb-Hazard Pointer的设计与实
前提
- flush_list是包含dirty page并按修改时间有序的链表,在刷脏时选择从链表的尾部进行遍历淘汰
- 异步IO,当异步写page完成后,io helper线程会调buf_flush_write_complete,将写入的Page从flush list上移除。
调用栈
----fil_aio_wait
----buf_page_io_complete(static_cast<buf_page_t*>(message));
----buf_page_io_complete
--------buf_flush_write_complete
-------------buf_flush_remove
-----------------UT_LIST_REMOVE(buf_pool->flush_list, bpage);
以前的代码实现(版本5.7.1):
刷脏线程刷完一个数据页,就需要回到Flush List末尾,重新扫描新的可刷盘的脏页。
buf_do_flush_list_batch{
do {
/* Start from the end of the list looking for a suitable
block to be flushed. */
buf_flush_list_mutex_enter(buf_pool);
/* We use len here because theoretically insertions can
happen in the flush_list below while we are traversing
it for a suitable candidate for flushing. We'd like to
set a limit on how farther we are willing to traverse
the list. */
len = UT_LIST_GET_LEN(buf_pool->flush_list);
bpage = UT_LIST_GET_LAST(buf_pool->flush_list);
if (bpage) {
ut_a(bpage->oldest_modification > 0);
}
if (!bpage || bpage->oldest_modification >= lsn_limit) {
/* We have flushed enough */
buf_flush_list_mutex_exit(buf_pool);
break;
}
ut_a(bpage->oldest_modification > 0);
ut_ad(bpage->in_flush_list);
buf_flush_list_mutex_exit(buf_pool);
/* The list may change during the flushing and we cannot
safely preserve within this function a pointer to a
block in the list! */
while (bpage != NULL
&& len > 0
&& !buf_flush_page_and_try_neighbors(
bpage, BUF_FLUSH_LIST, min_n, &count)) {
++scanned;
buf_flush_list_mutex_enter(buf_pool);
/* If we are here that means that buf_pool->mutex
was not released in buf_flush_page_and_try_neighbors()
above and this guarantees that bpage didn't get
relocated since we released the flush_list
mutex above. There is a chance, however, that
the bpage got removed from flush_list (not
currently possible because flush_list_remove()
also obtains buf_pool mutex but that may change
in future). To avoid this scenario we check
the oldest_modification and if it is zero
we start all over again. */
if (bpage->oldest_modification == 0) {
buf_flush_list_mutex_exit(buf_pool);
break;
}
bpage = UT_LIST_GET_PREV(list, bpage);
ut_ad(!bpage || bpage->in_flush_list);
buf_flush_list_mutex_exit(buf_pool);
--len;
}
} while (count < min_n && bpage != NULL && len > 0);
}
为什么这样设计?
因为刷脏线程刷的这个脏页之前的脏页可能被其他线程给刷走了,之前的脏页可能已经不在Flush list中。
造成这个现象的原因:
- 用户线程也会进行刷脏,虽然是从lru队列中找到的脏页,当脏页被flush到磁盘后,用户线程也会将写入的Page从flush list上移除。具体的调用栈参见:
https://www.jianshu.com/p/080056afa842 - 同时,为了减少锁占用的时间,InnoDB在进行写盘的时候都会把之前占用的锁给释放掉。
带来了什么后果
我们的某一个刷脏线程拿到队尾最后一个数据页,IO fixed,发送给IO线程,最后再从队尾扫描寻找可刷盘的脏页。在这次扫描中,它发现最后一个数据页(也就是刚刚发送到IO线程中的数据页)状态为IO fixed(磁盘很慢,还没处理完)所以不能刷,跳过,开始刷倒数第二个数据页,同样IO fixed,发送给IO线程,然后再次重新扫描Flush List。它又发现尾部的两个数据页都不能刷新(因为磁盘很慢,可能还没刷完),直到扫描到倒数第三个数据页。所以,存在一种极端的情况,如果磁盘比较缓慢,刷脏算法性能会从O(N)退化成O(N*N)。
https://dev.mysql.com/worklog/task/?id=7047
Hazard Pointer版本的代码实现(5.7.28)
static
ulint
buf_do_flush_list_batch(
buf_pool_t* buf_pool,
ulint min_n,
lsn_t lsn_limit)
{
ulint count = 0;
ulint scanned = 0;
ut_ad(buf_pool_mutex_own(buf_pool));
/* Start from the end of the list looking for a suitable
block to be flushed. */
buf_flush_list_mutex_enter(buf_pool);
ulint len = UT_LIST_GET_LEN(buf_pool->flush_list);
/* In order not to degenerate this scan to O(n*n) we attempt
to preserve pointer of previous block in the flush list. To do
so we declare it a hazard pointer. Any thread working on the
flush list must check the hazard pointer and if it is removing
the same block then it must reset it. */
for (buf_page_t* bpage = UT_LIST_GET_LAST(buf_pool->flush_list);
count < min_n && bpage != NULL && len > 0
&& bpage->oldest_modification < lsn_limit;
bpage = buf_pool->flush_hp.get(),
++scanned) {
buf_page_t* prev;
ut_a(bpage->oldest_modification > 0);
ut_ad(bpage->in_flush_list);
prev = UT_LIST_GET_PREV(list, bpage);
buf_pool->flush_hp.set(prev);
buf_flush_list_mutex_exit(buf_pool);
#ifdef UNIV_DEBUG
bool flushed =
#endif /* UNIV_DEBUG */
buf_flush_page_and_try_neighbors(
bpage, BUF_FLUSH_LIST, min_n, &count);
buf_flush_list_mutex_enter(buf_pool);
ut_ad(flushed || buf_pool->flush_hp.is_hp(prev));
--len;
}
buf_pool->flush_hp.set(NULL);
buf_flush_list_mutex_exit(buf_pool);
if (scanned) {
MONITOR_INC_VALUE_CUMULATIVE(
MONITOR_FLUSH_BATCH_SCANNED,
MONITOR_FLUSH_BATCH_SCANNED_NUM_CALL,
MONITOR_FLUSH_BATCH_SCANNED_PER_CALL,
scanned);
}
if (count) {
MONITOR_INC_VALUE_CUMULATIVE(
MONITOR_FLUSH_BATCH_TOTAL_PAGE,
MONITOR_FLUSH_BATCH_COUNT,
MONITOR_FLUSH_BATCH_PAGES,
count);
}
ut_ad(buf_pool_mutex_own(buf_pool));
return(count);
}
原理
要解决这个问题,最本质的方法就是当刷完一个脏页的时候不要每次都从队尾重新扫描。我们可以使用Hazard Pointer来解决,方法如下:遍历找到一个可刷盘的数据页,在锁释放之前,调整Hazard Pointer使之指向Flush List中下一个节点,注意一定要在持有锁的情况下修改。然后释放锁,进行刷盘,刷完盘后,重新获取锁,读取Hazard Pointer并设置下一个节点,然后释放锁,进行刷盘,如此重复。当这个线程在刷盘的时候,另外一个线程需要刷盘,也是通过Hazard Pointer来获取可靠的节点,并重置下一个有效的节点。通过这种机制,保证每次读到的Hazard Pointer是一个有效的Flush List节点,即使磁盘再慢,刷脏算法效率依然是O(N)。 这个解法同样可以用到LRU List驱逐算法上,提高驱逐的效率。
http://mysql.taobao.org/monthly/2017/05/01/