PostgreSQL 跟踪checkpointer出现死锁
2019-01-07 本文已影响1人
EthanHe
gdb跟踪checkpointer进程,出现死锁,Mark一下.
跟踪checkpointer进程,查看共享内存中的信(heckpointerShmem->requests)
(gdb) p CheckpointerShmem->requests[150]
...
$16 = {rnode = {spcNode = 1663, dbNode = 16402, relNode = 26185}, forknum = MAIN_FORKNUM, segno = 0}
(gdb) p CheckpointerShmem->requests[200]
Cannot access memory at address 0xf9fb18
...
然后,请求checkpoint的进程报错
testdb=# update t_wal_ckpt set c2 = 'C2#'||substr(c2,4,40);
UPDATE 8192
testdb=# checkpoint;
2019-01-07 12:30:32.114 CST [1418] PANIC: stuck spinlock detected at RequestCheckpoint, checkpointer.c:1050
2019-01-07 12:30:32.114 CST [1418] STATEMENT: checkpoint;
2019-01-07 12:30:37.081 CST [1390] PANIC: stuck spinlock detected at FirstCallSinceLastCheckpoint, checkpointer.c:1376
2019-01-07 12:30:38.610 CST [1370] LOG: background writer process (PID 1390) was terminated by signal 6: Aborted
2019-01-07 12:30:38.610 CST [1370] LOG: terminating any other active server processes
2019-01-07 12:30:38.611 CST [1392] WARNING: terminating connection because of crash of another server process
2019-01-07 12:30:38.611 CST [1392] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-01-07 12:30:38.611 CST [1392] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2019-01-07 12:30:38.613 CST [1558] WARNING: terminating connection because of crash of another server process
2019-01-07 12:30:38.613 CST [1558] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-01-07 12:30:38.613 CST [1558] HINT: In a moment you should be able to reconnect to the database and repeat your command.
PANIC: stuck spinlock detected at RequestCheckpoint, checkpointer.c:1050
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2019-01-07 12:30:54.041 CST [1560] FATAL: the database system is in recovery mode
Failed.
!>
尝试重新连接,发现DB已coredump.
[xdb@localhost ~]$
[xdb@localhost ~]$ psql -d testdb
2019-01-07 14:10:16.114 CST [1629] FATAL: the database system is in recovery mode
psql: FATAL: the database system is in recovery mode
执行恢复
[xdb@localhost ~]$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2019-01-07 14:11:50.821 CST [1632] FATAL: lock file "postmaster.pid" already exists
2019-01-07 14:11:50.821 CST [1632] HINT: Is another postmaster (PID 1370) running in data directory "/data/xdb/pg111db"?
stopped waiting
pg_ctl: could not start server
Examine the log output.
[xdb@localhost ~]$ find /data/xdb -name postmaster.pid
/data/xdb/pg111db/postmaster.pid
[xdb@localhost ~]$ rm -rf /data/xdb/pg111db/postmaster.pid
[xdb@localhost ~]$ pg_ctl start
waiting for server to start....2019-01-07 14:12:44.578 CST [1639] LOG: could not bind IPv6 address "::1": Address already in use
[xdb@localhost ~]$ ps -ef|grep postgres
xdb 1370 1 0 12:01 pts/0 00:00:02 /appdb/atlasdb/pg11.1/bin/postgres
xdb 1389 1370 0 12:01 ? 00:00:00 [postgres] <defunct>
xdb 1641 1332 0 14:12 pts/0 00:00:00 grep --color=auto postgres
[xdb@localhost ~]$ kill -9 1370
[xdb@localhost ~]$ pg_ctl start
waiting for server to start....2019-01-07 14:13:33.125 CST [1648] LOG: listening on IPv6 address "::1", port 5432
2019-01-07 14:13:33.125 CST [1648] LOG: listening on IPv4 address "127.0.0.1", port 5432
2019-01-07 14:13:33.142 CST [1648] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
.2019-01-07 14:13:34.361 CST [1649] LOG: database system was interrupted; last known up at 2019-01-07 12:26:22 CST
2019-01-07 14:13:34.818 CST [1649] LOG: database system was not properly shut down; automatic recovery in progress
2019-01-07 14:13:34.863 CST [1649] LOG: redo starts at 1/48F9ED08
.2019-01-07 14:13:35.467 CST [1649] LOG: invalid record length at 1/4914FF58: wanted 24, got 0
2019-01-07 14:13:35.467 CST [1649] LOG: redo done at 1/4914FF30
2019-01-07 14:13:35.467 CST [1649] LOG: last completed transaction was at log time 2019-01-07 12:28:37.521542+08
2019-01-07 14:13:35.977 CST [1648] LOG: database system is ready to accept connections
done
server started
经分析,是因为共享内存结构中的CheckpointerShmem->ckpt_lck导致的.
在跟踪checkpointer进程时,执行
SpinLockRelease(&CheckpointerShmem->ckpt_lck);
释放lock后,不再出现上述问题.