GOLDILOCKS 分布式数据库备份恢复实践

2020-10-22  本文已影响0人  maozicb

环境信息

数据库版本:

Release 20c 20.1.1 revision(31618)

集群信息

gSQL> select STATUS,LOCAL_MEMBER_NAME from x$instance;
STATUS LOCAL_MEMBER_NAME
------ -----------------
OPEN   G2N2             
OPEN   G1N2             
OPEN   G2N1             
OPEN   G1N1        

开启归档模式

查询集群所有节点是否开启了归档。如果没有开启归档将无法执行备份操作。开启归档的方法是在每个节点内把数据库启动到mount状态,执行alter database archivelog,然后再打开数据库。

写了一个简单脚本查询所有节点的归档信息

[goldilocks@gs05 ~]$ sh select.sh "select ARCHIVELOG_MODE from v\$archivelog;"
192.168.149.131

ARCHIVELOG_MODE
---------------
ARCHIVELOG   

192.168.149.132

ARCHIVELOG_MODE
---------------
ARCHIVELOG     

192.168.149.133

ARCHIVELOG_MODE
---------------
ARCHIVELOG     

192.168.149.134

ARCHIVELOG_MODE
---------------
ARCHIVELOG     

全量备份

备份期间没有数据变动

从最简单的情况开始,假设备份的时候数据库没有数据变动,这样直接复制整个数据库文件就可以实现备份目的。

开始备份

创建一个测试表,插入一行数据

gSQL> create table t1 (id int,time timestamp);
Table created.
gSQL> insert into t1 values (100,sysdate);
1 row created.
gSQL> commit;
Commit complete.
gSQL> select * from t1;
 ID TIME                      
--- --------------------------
100 2020-10-10 23:22:32.000000
1 row selected.

编写备份脚本

#!/bin/bash
source ~/.bash_profile
echo "alter database begin backup;"|gsql  --no-prompt

echo "begin backup"

ssh gs05 "source ~/.bash_profile;cp -r $GOLDILOCKS_DATA/* ~/backup" &
ssh gs06 "source ~/.bash_profile;cp -r $GOLDILOCKS_DATA/* ~/backup" &
ssh gs07 "source ~/.bash_profile;cp -r $GOLDILOCKS_DATA/* ~/backup" &
ssh gs08 "source ~/.bash_profile;cp -r $GOLDILOCKS_DATA/* ~/backup" &

wait
echo "alter database end  backup;"|gsql  --no-prompt

echo "end backup "

其中原理即是执行alter database begin backup 后拷贝数据库的所有的文件,包括数据文件,控制文件,以及wal路径下的其他文件。

执行完备份后,再插入一条测试数据

gSQL> insert into t1 values (200,sysdate);

1 row created.

gSQL> commit;

Commit complete.

gSQL> select * from t1;

 ID TIME                      
--- --------------------------
100 2020-10-10 23:22:32.000000
200 2020-10-11 10:53:14.000000

2 rows selected.

恢复

模拟集群故障。

删除集群所有节点数据文件,模拟最严重的情况。

[goldilocks@gs05 ~]$ run_cmd.sh "rm -r ~/goldilocks_data"

启动报错

[goldilocks@gs05 ~]$ gsql

 Copyright © 2010 SUNJESOFT Inc. All rights reserved.
 Release 20c 20.1.1 revision(31618)


Connected to an idle instance.

gSQL> startup

ERR-HY000(13025): Property file does not exist (/home/goldilocks/goldilocks_data/conf/goldilocks.properties.conf)

gSQL> 

开始恢复

将备份的数据库文件复制到原位置

run_cmd.sh "mkdir ~/goldilocks_data"
run_cmd.sh "cp -r  ~/backup/* ~/goldilocks_data"
[goldilocks@gs05 goldilocks_data]$ gsqlnet

 Copyright © 2010 SUNJESOFT Inc. All rights reserved.
 Release 20c 20.1.1 revision(31618)


Connected to an idle instance.

gSQL> \cstartup

Startup success

gSQL> select * from x$instance;

VERSION                            STARTUP_TIME               STATUS OS_USER_ID IS_CLUSTER LOCAL_GROUP_ID LOCAL_MEMBER_ID LOCAL_MEMBER_NAME LOCAL_MEMBER_POSITION
---------------------------------- -------------------------- ------ ---------- ---------- -------------- --------------- ----------------- ---------------------
Release 20c 20.1.1 revision(31618) 2020-10-11 11:06:37.650265 OPEN         1008 TRUE                    1               1 G1N1                                  0
Release 20c 20.1.1 revision(31618) 2020-10-11 11:06:38.239340 OPEN         1007 TRUE                    2               3 G2N2                                  2
Release 20c 20.1.1 revision(31618) 2020-10-11 11:06:38.238691 OPEN         1007 TRUE                    2               5 G2N1                                  1
Release 20c 20.1.1 revision(31618) 2020-10-11 11:06:38.234501 OPEN         1005 TRUE                    1               4 G1N2                                  3

4 rows selected.

gSQL> select * from t1;

 ID TIME                      
--- --------------------------
100 2020-10-10 23:22:32.000000

1 row selected.

可以看到备份前的数据是存在的,备份后的数据不存在。

备份时有数据变动

数据库支持online状态备份,备份期间也允许用户继续使用数据库。根据文档理解,在执行alter database begin backup 后,数据库脏页不会继续刷到数据文件中了。即使手动执行checkpoint,也不会刷脏页。但redo 日志会继续记录和归档。

[2020-10-12 14:07:14.444515 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] begin

[2020-10-12 14:07:14.445851 INSTANCE(GOLDILOCKS) THREAD(62993,140466467100416)] [INFORMATION]
[IO SLAVE] flush datafile ( tablespace : 0, datafile : 0 )

[2020-10-12 14:07:14.474179 INSTANCE(GOLDILOCKS) THREAD(62993,140466467100416)] [INFORMATION]
[IO SLAVE] flush datafile ( tablespace : 1, datafile : 0 )

[2020-10-12 14:07:14.474493 INSTANCE(GOLDILOCKS) THREAD(62993,140466467100416)] [INFORMATION]
[IO SLAVE] flush datafile ( tablespace : 2, datafile : 0 )

[2020-10-12 14:07:14.474706 INSTANCE(GOLDILOCKS) THREAD(62993,140466467100416)] [INFORMATION]
[IO SLAVE] flush datafile ( tablespace : 2, datafile : 1 )

[2020-10-12 14:07:14.475220 INSTANCE(GOLDILOCKS) THREAD(62993,140466467100416)] [INFORMATION]
[IO SLAVE] flush datafile ( tablespace : 5, datafile : 0 )

[2020-10-12 14:07:14.476225 INSTANCE(GOLDILOCKS) THREAD(62993,140466467100416)] [INFORMATION]
[IO SLAVE] flush datafile ( tablespace : 6, datafile : 0 )

[2020-10-12 14:07:14.485063 INSTANCE(GOLDILOCKS) THREAD(62993,140466448217856)] [INFORMATION]
[CHECKPOINT] flush buffer checkpoint list[0] - for checkpoint (1), system min flushed lsn(262521), min flushed lsn (262543), flushed page count(0)

[2020-10-12 14:07:14.495049 INSTANCE(GOLDILOCKS) THREAD(62993,140466485982976)] [INFORMATION]
[PAGE FLUSHER] flushed lsn(262543), flushed page count(2048)]

[2020-10-12 14:07:14.495178 INSTANCE(GOLDILOCKS) THREAD(62993,140466051852032)] [INFORMATION]
[ARCHIVING] stable lsn(262543)

[2020-10-12 14:07:14.547931 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] begin - checkpoint lid(15.2915.13), checkpoint lsn(262544), oldest lsn(262544)

[2020-10-12 14:07:14.548092 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] body - checkpoint lid(-1.0.0), checkpoint lsn(-1), active transaction count(0)

[2020-10-12 14:07:14.548115 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] end - checkpoint lid(15.2915.117), checkpoint lsn(262545)

[2020-10-12 14:07:14.548126 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] flush redo log

[2020-10-12 14:07:14.552784 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] save control file

[2020-10-12 14:07:14.565377 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] end

[2020-10-12 14:09:03.485033 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] begin

[2020-10-12 14:09:03.496058 INSTANCE(GOLDILOCKS) THREAD(62993,140466448217856)] [INFORMATION]
[CHECKPOINT] flush buffer checkpoint list[0] - for checkpoint (1), system min flushed lsn(262545), min flushed lsn (262547), flushed page count(0)

[2020-10-12 14:09:03.506912 INSTANCE(GOLDILOCKS) THREAD(62993,140466485982976)] [INFORMATION]
[PAGE FLUSHER] flushed lsn(262547), flushed page count(0)]

[2020-10-12 14:09:03.507426 INSTANCE(GOLDILOCKS) THREAD(62993,140466051852032)] [INFORMATION]
[ARCHIVING] stable lsn(262547)

[2020-10-12 14:09:03.532187 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] begin - checkpoint lid(15.2917.13), checkpoint lsn(262548), oldest lsn(262548)

[2020-10-12 14:09:03.532302 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] body - checkpoint lid(-1.0.0), checkpoint lsn(-1), active transaction count(0)

[2020-10-12 14:09:03.532318 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] end - checkpoint lid(15.2917.117), checkpoint lsn(262549)

[2020-10-12 14:09:03.532326 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] flush redo log

[2020-10-12 14:09:03.535894 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] save control file

[2020-10-12 14:09:03.589138 INSTANCE(GOLDILOCKS) THREAD(62993,140466936956672)] [INFORMATION]
[CHECKPOINT] end

模拟数据有变动

用一个简单脚本模拟一直有用户使用数据库

for i in {1..100}
do
 echo "insert into t1 values ($i,sysdate);"|gsql sys gliese --no-prompt
 sleep 1
done

在此期间备份数据库

[goldilocks@gs05 ~]$ sh backup.sh 

Database altered.

begin backup

Database altered.

end backup 

插入数据脚本执行完成后,t1表有100条数据

gSQL> select count(*) from t1;

COUNT(*)
--------
     100

1 row selected.

此次备份过程中一直有数据写入。整个备份包含开始备份前已经落盘的数据,和备份过程中新增的redo 日志数据,由于redo日志是实时增加的,所以不能保证每个节点 的redo备份文件完全一致,恢复的时候需要做一些处理。

开始恢复

run_cmd.sh "mv  ~/goldilocks_data/ ~/goldilocks_data_bak "
run_cmd.sh "cp -r  ~/backup/ ~/goldilocks_data"

每个节点启动监听后,用gsqlnet尝试直接启动集群

gSQL> cstartup
ERR-42000(16403): of the total '4' nodes, '1' nodes failed to join the global database
ERR-HY000(40061): currently connected node is inactive
Startup success

发现有一个节点没有加入到集群

gSQL> select * from x$instance;

ERR-HY000(16354): connection of member 'G2N1' is broken

g2n1 没有加入到集群

select a.member_name,a.member_id ,b.LOGICAL_CONNECTION,b.PHYSICAL_CONNECTION ,b.LOCAL_SCN from cluster_member@local a, x$cluster_member@local b where a.MEMBER_ID=b.MEMBER_ID order by a.GROUP_ID, 1;

MEMBER_NAME MEMBER_ID LOGICAL_CONNECTION PHYSICAL_CONNECTION LOCAL_SCN 
----------- --------- ------------------ ------------------- ----------
G1N1                1 ACTIVE             ACTIVE              853.26.489
G1N2                6 ACTIVE             ACTIVE              853.26.0  
G2N1                5 INACTIVE           INACTIVE            -1.-1.-1  
G2N2                3 ACTIVE             ACTIVE              853.36.0  

查看g2n1的状态,g2n1 是local open状态,确实没有加入到集群。

gSQL> select statement_view_scn() from dual;

STATEMENT_VIEW_SCN()
--------------------
853.24.1339         

1 row selected.

gSQL>  select * from x$instance@local;

VERSION                            STARTUP_TIME               STATUS     OS_USER_ID IS_CLUSTER LOCAL_GROUP_ID LOCAL_MEMBER_ID LOCAL_MEMBER_NAME LOCAL_MEMBER_POSITION
---------------------------------- -------------------------- ---------- ---------- ---------- -------------- --------------- ----------------- ---------------------
Release 20c 20.1.1 revision(31618) 2020-10-14 16:14:40.564452 LOCAL OPEN       1007 TRUE                    2               5 G2N1                                  1

1 row selected.

G2N1 节点的scn是 853.24.1339 ,同一个组内的G2N2节点853.36.1493 ,同组内的dcn不同,组内数据不一致,G2N2节点的数据多于G2N1节点,原因应该是复制redo 日志文件时,两个节点的redo文件内容不完全一致。下面我们dump一下redo日志 验证一下

[goldilocks@gs08 wal]$ gdump control control_1.ctl -s log

 Copyright © 2010 SUNJESOFT Inc. All rights reserved.
 Release 20c 20.1.1 revision(31618)

===========================================================
FILE: control_1.ctl
TYPE: CONTROLFILE
TIME: 2020-10-14 17:20:34.265815
===========================================================

 [LOG SECTION]
-----------------------------------------------------------
   DATABASE CREATION TIME : 2020-09-27 08:51:16.608945

  [CHECKPOINT]
    LID                               :  15,3253,13
    LSN                               :  272885
    RESETLOG LSN                      :  -1
    ARCHIVELOG MODE                   :  ARCHIVELOG
    LAST INACTIVATED LOGFILE SEQUENCE :  14

  [LOG STREAM]
    STATE          :  ACTIVE
    GROUP COUNT    :  4
    BLOCK SIZE     :  512
    FILE SEQUENCE  :  15

    [LOG GROUP #0]
      STATE         : INACTIVE
      SIZE          : 104857600
      MEMBER COUNT  : 1
      FILE SEQUENCE : 12
      PREV LAST LSN : 260402
      MEMBER #0     : (ACTIVE) "/home/goldilocks/goldilocks_data/wal/redo_0_0.log"

    [LOG GROUP #1]
      STATE         : INACTIVE
      SIZE          : 104857600
      MEMBER COUNT  : 1
      FILE SEQUENCE : 13
      PREV LAST LSN : 266354
      MEMBER #0     : (ACTIVE) "/home/goldilocks/goldilocks_data/wal/redo_1_0.log"

    [LOG GROUP #2]
      STATE         : INACTIVE
      SIZE          : 104857600
      MEMBER COUNT  : 1
      FILE SEQUENCE : 14
      PREV LAST LSN : 266357
      MEMBER #0     : (ACTIVE) "/home/goldilocks/goldilocks_data/wal/redo_2_0.log"

    [LOG GROUP #3]
      STATE         : CURRENT
      SIZE          : 104857600
      MEMBER COUNT  : 1
      FILE SEQUENCE : 15
      PREV LAST LSN : 266617
      MEMBER #0     : (ACTIVE) "/home/goldilocks/goldilocks_data/wal/redo_3_0.log"

===========================================================
TIME: 2020-10-14 17:20:34.287068
===========================================================

redo_3_0.log为current日志,checkpoint位置是272885

[goldilocks@gs08 wal]$ gdump log redo_3_0.log -n 272885

 Release 20c 20.1.1 revision(31618)

===========================================================
FILE: redo_3_0.log
TYPE: LOGFILE PAGE
TIME: 2020-10-14 17:22:07.267428
===========================================================

===========================================================
 [LOG FILE HEADER]
-----------------------------------------------------------
 LOG_GROUP_ID    : 3
 BLOCK_SIZE      : 512
 FILE_SIZE       : 104857600
 FILE_SEQUENCE   : 15
 PREV_LAST_LSN   : 266617
 CREATION TIME   : 2020-10-11 22:07:12.247585
 SIGNATURE       : 8C9D5600005B11EB8554BF163FAF3B79
===========================================================

[LOG #0] : BLOCK(3253), LSN(272885), SIZE(64), PIECE_COUNT(1), TRANS_ID(FFFFFFFFFFFFFFFE), TRANS_SEQ(1537), RID(0,-1,0)
[PIECE #0] : TYPE(CHKPT_BEGIN), TIME(2020-10-12 14:25:53.411774), SIZE(48), CLASS(RECOVERY), REDO_TYPE(CONTROL_FILE), PROPAGATE_LOG(NO), RID(0,-1,0)
 F529040000000000 5503000000000000 1000000000000000 C605000000000000    .)...... U....... ........ ........
 F305000019000000 BE26CD5E73B10500                                      ........ .&.^s...                  

[LOG #1] : BLOCK(3253), LSN(272886), SIZE(16), PIECE_COUNT(1), TRANS_ID(FFFFFFFFFFFFFFFE), TRANS_SEQ(0), RID(0,-1,0)
[PIECE #0] : TYPE(CHKPT_END), SIZE(0), CLASS(RECOVERY), REDO_TYPE(CONTROL_FILE), PROPAGATE_LOG(NO), RID(0,-1,0)

[LOG #2] : BLOCK(3254), LSN(272887), SIZE(561), PIECE_COUNT(7), TRANS_ID(3A80031003A), TRANS_SEQ(1537), RID(0,0,1591)
[PIECE #0] : TYPE(INIT_PAGE), SIZE(120), CLASS(PAGE_ACCESS), REDO_TYPE(PAGE), PROPAGATE_LOG(YES), RID(2,0,2790)
 0D00030004000000 5729040000000000 7D4AEDF240B00500 02000400C30A0000    ........ W)...... }J..@... ........
 0000000037060000 0000000000000000 0000000000000000 0000000000000000    ....7... ........ ........ ........
 0000000000000000 0000000000000000 0000000000000000 0000000000000000    ........ ........ ........ ........
 0000000000000000 0000000000000000 02001000E60A0000                     ........ ........ ........         
[PIECE #1] : TYPE(BITMAP_UPDATE_LEAF_STATUS), SIZE(52), CLASS(SEGMENT), REDO_TYPE(PAGE), PROPAGATE_LOG(YES), RID(2,4,2755)
 0400030000000000 0000000000000000 0000000000000000 0000000000000000    ........ ........ ........ ........

... 


[LOG #41] : BLOCK(3293), LSN(272926), SIZE(170), PIECE_COUNT(2), TRANS_ID(3A8006A003A), TRANS_SEQ(1556), RID(0,0,103)
[PIECE #0] : TYPE(SEGMENT_GLOBAL_SCN), SIZE(34), CLASS(SEGMENT), REDO_TYPE(MULTIPLE_PAGE), PROPAGATE_LOG(NO), RID(0,-1,0)
 5503000000000000 2400000000000000 C605000000000000 0100000000003706    U....... $....... ........ ......7.
 0000                                                                   ..                                 
[PIECE #1] : TYPE(COMMIT), TRANS_ID(3A8006A003A), TIME(2020-10-12 14:26:36.482417), SIZE(104), CLASS(TRANSACTION), REDO_TYPE(TRANSACTION), PROPAGATE_LOG(YES), RID(1,106,936)
 3A006A00A8030000 00000000FFFFFFFF 0400050300000000 5503000000000000    :.j..... ........ ........ U.......
 2400000000000000 C605000000000000 5503000000000000 2300000000000000    $....... ........ U....... #.......
 C605000000000000 1D2A040000000000 715B5E6173B10500 0606000019000000    ........ .*...... q[^as... ........
 0000000000000000                                                       ........                           


G2N2 检查点后有41行记录,多于G2N1的记录

接着我们验证一下最后一条commit对应的scn号

G2N1
[LOG #17] : BLOCK(3580), LSN(61540), SIZE(170), PIECE_COUNT(2), TRANS_ID(83D0018003A), TRANS_SEQ(753), RID(0,0,103)
[PIECE #0] : TYPE(SEGMENT_GLOBAL_SCN), SIZE(34), CLASS(SEGMENT), REDO_TYPE(MULTIPLE_PAGE), PROPAGATE_LOG(NO), RID(0,-1,0)
 5503000000000000 1800000000000000 2D05000000000000 010000000000BE06    U....... ........ -....... ........
 0000                                                                   ..                                 
[PIECE #1] : TYPE(COMMIT), TRANS_ID(83D0018003A), TIME(2020-10-12 14:26:07.269999), SIZE(104), CLASS(TRANSACTION), REDO_TYPE(TRANSACTION), PROPAGATE_LOG(YES), RID(1,24,2109)
 3A0018003D080000 00000000FFFFFFFF 0400ED0200000000 5503000000000000    :...=... ........ ........ U.......
 1800000000000000 2D05000000000000 5503000000000000 1700000000000000    ........ -....... U....... ........
 2D05000000000000 63F0000000000000 6F9CA05F73B10500 CF02000011000000    -....... c....... o.._s... ........
 0000000000000000  
 

 G2N2
 [LOG #41] : BLOCK(3293), LSN(272926), SIZE(170), PIECE_COUNT(2), TRANS_ID(3A8006A003A), TRANS_SEQ(1556), RID(0,0,103)
[PIECE #0] : TYPE(SEGMENT_GLOBAL_SCN), SIZE(34), CLASS(SEGMENT), REDO_TYPE(MULTIPLE_PAGE), PROPAGATE_LOG(NO), RID(0,-1,0)
 5503000000000000 2400000000000000 C605000000000000 0100000000003706    U....... $....... ........ ......7.
 0000                                                                   ..                                 
[PIECE #1] : TYPE(COMMIT), TRANS_ID(3A8006A003A), TIME(2020-10-12 14:26:36.482417), SIZE(104), CLASS(TRANSACTION), REDO_TYPE(TRANSACTION), PROPAGATE_LOG(YES), RID(1,106,936)
 3A006A00A8030000 00000000FFFFFFFF 0400050300000000 5503000000000000    :.j..... ........ ........ U.......
 2400000000000000 C605000000000000 5503000000000000 2300000000000000    $....... ........ U....... #.......
 C605000000000000 1D2A040000000000 715B5E6173B10500 0606000019000000    ........ .*...... q[^as... ........
 0000000000000000    
 
 commit redo record size 是104 ,第49-56 byte记录的是gcn(5503000000000000),后面每8个记录dcn(1700000000000000)和lcn(2D05000000000000)
由于x86架构cpu是内存排序是低位排在前面,5503 转换成10进制时应该0355对应的十进制为853,所以g2n1的gcn是853,dcn是23。g2n2的gcn
同样是853,但dcn是35,比g2n1多12个组内事务。也就是g2n2比g2n1多数据。
 

可以看出直接启动集群后,G2N1节点 没有顺利加入到集群是因为group 2 组内数据不一致,需要想办法达成一致才可以。

有两个思路,一、G2N2把比G2N1多的数据丢掉 ,二、G2N1补上缺少数据的数据。正常情况下,补全数据会更好些,毕竟数据是有价值的。本次实践的话,两种方案都尝试一下

丢弃数据恢复方法

将每个节点单独启动到local open状态,检查所有节点的gcn是否相同,组内的dcn是否相同。

有过前面的操作知道group1 组内的数据是一致的,group 2 组内数据需要做到一致,并且G2N2的数据比G2N1多,这样的话,将G2N2做不完全恢复到G2N1最后的redo 记录即可。

查找G2N2需要恢复到的lsn号,即和G2N1最后一个commit相同的gcn+dcn组合(5503000000000000 1700000000000000)

[goldilocks@gs08 wal]$ gdump log redo_3_0.log -n 272885|grep -B  5 "5503000000000000 1700000000000000"

...

[LOG #17] : BLOCK(3269), LSN(272902), SIZE(170), PIECE_COUNT(2), TRANS_ID(3A80046003A), TRANS_SEQ(1544), RID(0,0,103)
[PIECE #0] : TYPE(SEGMENT_GLOBAL_SCN), SIZE(34), CLASS(SEGMENT), REDO_TYPE(MULTIPLE_PAGE), PROPAGATE_LOG(NO), RID(0,-1,0)
 5503000000000000 1800000000000000 C605000000000000 0100000000003706    U....... ........ ........ ......7.
 0000                                                                   ..                                 
[PIECE #1] : TYPE(COMMIT), TRANS_ID(3A80046003A), TIME(2020-10-12 14:26:07.272916), SIZE(104), CLASS(TRANSACTION), REDO_TYPE(TRANSACTION), PROPAGATE_LOG(YES), RID(1,70,936)
 3A004600A8030000 00000000FFFFFFFF 0400ED0200000000 5503000000000000    :.F..... ........ ........ U.......
 1800000000000000 C605000000000000 5503000000000000 1700000000000000    ........ ........ U....... ........

查找出来后这一条是符合条件的,对应的LSN是272902,所以G2N2需要恢复到lsn 272902
[goldilocks@gs08 wal]$ gsql

 Copyright © 2010 SUNJESOFT Inc. All rights reserved.
 Release 20c 20.1.1 revision(31618)


Connected to an idle instance.

gSQL> startup mount

Startup success

gSQL> alter database recover until change 272902;

Database altered.

gSQL> alter system open local database  resetlogs;

System altered.

gSQL> select statement_view_scn() from dual;

STATEMENT_VIEW_SCN()
--------------------
853.24.1492         

1 row selected.

前面查到G2N1的scn是853.24.1339,此时group2 组的两个节点数据已经一致了。

此时4个节点是local open状态,且所有节点gcn是一致的,组内dcn也是一致的,open global database 应该会成功。

[goldilocks@gs05 ~]$ sh select.sh "select status from x\$instance@local;"|grep -v "^$"
192.168.149.131
STATUS    
----------
LOCAL OPEN
1 row selected.
192.168.149.132
STATUS    
----------
LOCAL OPEN
1 row selected.
192.168.149.133
STATUS    
----------
LOCAL OPEN
1 row selected.
192.168.149.134
STATUS    
----------
LOCAL OPEN
1 row selected.
[goldilocks@gs05 ~]$ sh select.sh "select statement_view_scn() from dual;"|grep -v "^$"
192.168.149.131
STATEMENT_VIEW_SCN()
--------------------
853.26.488          
1 row selected.
192.168.149.132
STATEMENT_VIEW_SCN()
--------------------
853.26.1066         
1 row selected.
192.168.149.133
STATEMENT_VIEW_SCN()
--------------------
853.24.1339         
1 row selected.
192.168.149.134
STATEMENT_VIEW_SCN()
--------------------
853.24.1492         
1 row selected.

g1n1 上执行

gSQL> alter system open global database;

System altered.

gSQL> select LOCAL_MEMBER_NAME,STATUS from x$instance;

LOCAL_MEMBER_NAME STATUS
----------------- ------
G1N1              OPEN  
G2N2              OPEN  
G2N1              OPEN  
G1N2              OPEN  

4 rows selected.

gSQL> select count(*) from t1;

COUNT(*)
--------
      50

1 row selected.

此时数据库集群所有节点都从备份中恢复到正常状态。

补全数据恢复方法

理论上将G2N2 当前的redo 日志文件多于G2N1部分追加到G2N1的redo 日志中即可完成两个节点数据一致性,但这个思路操作起来比较麻烦,我们采用另一种方式,重平衡数据的方式(rebalance database)

恢复环境到之前状态

先关闭数据库集群,然后清空当前数据文件,复制备份数据文件到数据目录
gSQL> \cshutdown
Shutdown success

run_cmd.sh "rm -r ~/goldilocks_data/*"
run_cmd.sh "cp -r ~/backup/*  ~/goldilocks_data/"

尝试一键启动

gSQL> \cstartup

ERR-42000(16403): of the total '4' nodes, '1' nodes failed to join the global database
Startup success

发现有节点没有加入到集群

gSQL> select a.member_name,a.member_id ,b.LOGICAL_CONNECTION,b.PHYSICAL_CONNECTION ,b.LOCAL_SCN from cluster_member@local a, x$cluster_member@local b where a.MEMBER_ID=b.MEMBER_ID order by a.GROUP_ID, 1;

MEMBER_NAME MEMBER_ID LOGICAL_CONNECTION PHYSICAL_CONNECTION LOCAL_SCN 
----------- --------- ------------------ ------------------- ----------
G1N1                1 ACTIVE             ACTIVE              853.26.489
G1N2                6 ACTIVE             ACTIVE              853.26.0  
G2N1                5 INACTIVE           INACTIVE            -1.-1.-1  
G2N2                3 ACTIVE             ACTIVE              853.36.0  

4 rows selected.

G2N1 没有加入到集群,尝试手动将G2N1加入到集群

[goldilocks@gs07 ~]$ gsql

 Copyright © 2010 SUNJESOFT Inc. All rights reserved.
 Release 20c 20.1.1 revision(31618)


Connected to GOLDILOCKS Database.

gSQL> select * from x$instance@local;

VERSION                            STARTUP_TIME               STATUS     OS_USER_ID IS_CLUSTER LOCAL_GROUP_ID LOCAL_MEMBER_ID LOCAL_MEMBER_NAME LOCAL_MEMBER_POSITION
---------------------------------- -------------------------- ---------- ---------- ---------- -------------- --------------- ----------------- ---------------------
Release 20c 20.1.1 revision(31618) 2020-10-15 21:25:14.099212 LOCAL OPEN       1007 TRUE                    2               5 G2N1                                  1

1 row selected.

gSQL> alter system join database;

ERR-42000(16405): of the total '6' tables in the database, '3' tables need to be rebalanced
System altered.

gSQL> alter database rebalance;

Database altered.

gSQL> select * from x$instance;

VERSION                            STARTUP_TIME               STATUS OS_USER_ID IS_CLUSTER LOCAL_GROUP_ID LOCAL_MEMBER_ID LOCAL_MEMBER_NAME LOCAL_MEMBER_POSITION
---------------------------------- -------------------------- ------ ---------- ---------- -------------- --------------- ----------------- ---------------------
Release 20c 20.1.1 revision(31618) 2020-10-15 21:25:14.099212 OPEN         1007 TRUE                    2               5 G2N1                                  1
Release 20c 20.1.1 revision(31618) 2020-10-15 21:25:14.100691 OPEN         1007 TRUE                    2               3 G2N2                                  2
Release 20c 20.1.1 revision(31618) 2020-10-15 21:25:14.099467 OPEN         1005 TRUE                    1               6 G1N2                                  3
Release 20c 20.1.1 revision(31618) 2020-10-15 21:25:13.453708 OPEN         1008 TRUE                    1               1 G1N1                                  0

4 rows selected.

gSQL> select count(*) from t1;

COUNT(*)
--------
      62

1 row selected.

到此集群所有节点状态恢复正常,且恢复出的数据多于上一种恢复方法。

小结

两种方法比较起来看,第二种可以恢复更多的数据,而且恢复操作更简易些。

增量备份(级别备份)

除了支持全量备份外,goldilocks还支持增量备份,0级备份到4级备份

和backup有关的一些参数

gSQL> select property_name,property_value,init_value,is_deprecated from v$PROPERTY where PROPERTY_NAME like '%BACKUP%';

PROPERTY_NAME                        PROPERTY_VALUE                          INIT_VALUE                              IS_DEPRECATED
------------------------------------ --------------------------------------- --------------------------------------- -------------
BACKUP_DIR                           /home/goldilocks/goldilocks_data/backup /home/goldilocks/goldilocks_data/backup TRUE         
DEFAULT_REMOVAL_OBSOLETE_BACKUP_LIST NO                                      NO                                      FALSE        
DEFAULT_REMOVAL_BACKUP_FILE          NO                                      NO                                      FALSE        
READABLE_BACKUP_DIR_COUNT            1                                       1                                       FALSE        
BACKUP_DIR_1                         /home/goldilocks/goldilocks_data/backup /home/goldilocks/goldilocks_data/backup FALSE        
BACKUP_DIR_2                         /home/goldilocks/goldilocks_data/backup /home/goldilocks/goldilocks_data/backup FALSE        
BACKUP_DIR_3                         /home/goldilocks/goldilocks_data/backup /home/goldilocks/goldilocks_data/backup FALSE        
BACKUP_DIR_4                         /home/goldilocks/goldilocks_data/backup /home/goldilocks/goldilocks_data/backup FALSE        
BACKUP_DIR_5                         /home/goldilocks/goldilocks_data/backup /home/goldilocks/goldilocks_data/backup FALSE        
BACKUP_DIR_6                         /home/goldilocks/goldilocks_data/backup /home/goldilocks/goldilocks_data/backup FALSE        
BACKUP_DIR_7                         /home/goldilocks/goldilocks_data/backup /home/goldilocks/goldilocks_data/backup FALSE        
BACKUP_DIR_8                         /home/goldilocks/goldilocks_data/backup /home/goldilocks/goldilocks_data/backup FALSE        
BACKUP_DIR_9                         /home/goldilocks/goldilocks_data/backup /home/goldilocks/goldilocks_data/backup FALSE        
BACKUP_DIR_10                        /home/goldilocks/goldilocks_data/backup /home/goldilocks/goldilocks_data/backup FALSE        
INCREMENTAL_BACKUP_SCAN_BUFFER_SIZE  32                                      32                                      FALSE   

设置增量备份存放的路径,BACKUP_DIR_1-BACKUP_DIR_10,(BACKUP_DIR 是废弃参数)。其中BACKUP_DIR_1不可以在session级别设置,system 级别设置需要重启数据库生效。默认的BACKUP_DIR_1在${GOLDILOCKS_DATA}/backup 路径下,正式系统通常需要单独设置备份路径。

alter system set BACKUP_DIR_1 ='/home/goldilocks/increment_backup' scope=file;
cshutdown;cstartup

gSQL> select property_name,property_value,init_value,is_deprecated from v$PROPERTY where PROPERTY_NAME = 'BACKUP_DIR_1';

PROPERTY_NAME PROPERTY_VALUE                    INIT_VALUE                        IS_DEPRECATED
------------- --------------------------------- --------------------------------- -------------
BACKUP_DIR_1  /home/goldilocks/increment_backup /home/goldilocks/increment_backup FALSE        

0级备份

最基础的级别备份,其他级别的备份依赖最基础的0级备份

gSQL> alter database backup incremental level 0 ;

Database altered.

gSQL> select * from v$INCREMENTAL_BACKUP;

BACKUP_NAME                      BACKUP_SCOPE INCREMENTAL_LEVEL INCREMENTAL_TYPE    LSN BEGIN_TIME                 COMPLETION_TIME           
-------------------------------- ------------ ----------------- ---------------- ------ -------------------------- --------------------------
databaseD20201016T110117L0S2.inc database                     0 N/A              263709 2020-10-16 11:01:17.014740 2020-10-16 11:01:20.208249
controlD20201016T110120L0S2.inc  control                      0 N/A              263709 2020-10-16 11:01:20.223898 2020-10-16 11:01:20.226171
2 rows selected.

查看一下具体生成的备份文件

[goldilocks@gs05 ~]$ run_cmd.sh "ls ~/increment_backup"
gs05
controlD20201016T110120L0S2.inc
databaseD20201016T110117L0S2.inc
----------------------------
gs06
controlD20201016T110118L0S1.inc
databaseD20201016T110116L0S1.inc
----------------------------
gs07
controlD20201016T110117L0S2.inc
databaseD20201016T110116L0S2.inc
----------------------------
gs08
controlD20201016T110121L0S2.inc
databaseD20201016T110116L0S2.inc
----------------------------

gdump工具查看一下备份文件的内容

[goldilocks@gs05 increment_backup]$ gdump backup databaseD20201016T110117L0S2.inc 

 Copyright © 2010 SUNJESOFT Inc. All rights reserved.
 Release 20c 20.1.1 revision(31618)

===========================================================
FILE: databaseD20201016T110117L0S2.inc
TYPE: INCREMENTAL BACKUP
TIME: 2020-10-16 11:17:13.673234
===========================================================

 INCREMENTAL BACKUP FILE HEADER 
----------------------------------------------------------------------------
  Backup object                                      : DATABASE
  Tablespace count                                   : 6
  Backup body size                                   : 185942016
  Last checkpoint lsn of previous incremental backup : 0
  Max page lsn of incremental backup                 : 263709
  Last checkpoint lsn of incremental backup          : 263709
  Last checkpoint lid of incremental backup          : (15, 3710, 13)
  Database signature                                 : 691AA23C005B11EB89DBCBC21F1732DE
----------------------------------------------------------------------------

 INCREMENTAL BACKUP FILE TAIL
----------------------------------------------------------------------------
  Tablespace id(0) : backup page count(13022), start offset(8192)
  Tablespace id(1) : backup page count(2122), start offset(106684416)
  Tablespace id(2) : backup page count(2501), start offset(124067840)
  Tablespace id(4) : backup page count(17), start offset(144556032)
  Tablespace id(5) : backup page count(389), start offset(144695296)
  Tablespace id(6) : backup page count(4647), start offset(147881984)

===========================================================
TIME: 2020-10-16 11:17:13.673502

可以看到,备份文件里包含表空间的一些信息和checkpoint的信息,没看到commit.log ,location file的信息。

恢复

场景一:单个数据文件损坏

模拟故障

单个节点的数据文件损坏,不影响集群的整体使用,使用备份将单个节点恢复即可。

挑一个节点模拟故障,删除某一个数据文件。然后模拟用户继续使用数据库

gSQL> create table t2 (id int,name varchar(100)) sharding by hash (id ) shard count 2;

Table created.

gSQL> insert into t2 values (1,'after rm data file'),(2,'after rm data file');

1 row created.

gSQL> commit;

Commit complete.

数据库集群是可以正常使用的,删除数据文件的节点也没有报错,因为数据都存在内存中,没有落盘的动作,数据库发现不了问题,接着我们手动触发检查点,出现报错,(其他正常的节点checkpoint没有问题)

gSQL> alter system checkpoint;

ERR-HY000(14106): tablespace (MEM_DATA_TBS) is taken offline as the result of a write error
ERR-HY000(11040): No such object (/home/goldilocks/goldilocks_data/db/system_data.dbf) : stfOpen() returned errno(2)
System altered.

gSQL> select IS_ONLINE from v$tablespace where TBS_NAME='MEM_DATA_TBS';

IS_ONLINE
---------
FALSE    

gSQL> select count(*) from t1@local;

ERR-42000(14041): cannot access the OFFLINE tablespace 'MEM_DATA_TBS'

表空间已经处于offline状态,查询该节点数据出现错误。

恢复

启动问题节点到mount状态

gSQL> shutdown 

Shutdown success

gSQL> startup mount

Startup success

然后正常节点模拟继续使用数据库

gSQL> insert into t2 values (3,'during g2n2 recovery'),(4,'during g2n2 recovery');

1 row created.

gSQL> commit;

Commit complete.

单个数据文件损坏,还原数据文件对应的表空间即可

g2n2> alter database restore tablespace mem_data_tbs;

Database altered.
2020-10-19 09:45:27.604274 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
[RESTORE] begin (TABLESPACE, -1)

[2020-10-19 09:45:27.612729 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
    [RESTORE] recreate datafile ( /home/goldilocks/goldilocks_data/db/system_data.dbf )

[2020-10-19 09:45:27.612763 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
[DATABASE FILE MANAGER] remove database file (/home/goldilocks/goldilocks_data/db/system_data.dbf) - success

[2020-10-19 09:45:27.612773 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
[DATABASE FILE MANAGER] register database file (/home/goldilocks/goldilocks_data/db/system_data.dbf) - success

[2020-10-19 09:45:27.728515 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
    [RESTORE] recreate end

[2020-10-19 09:45:27.733664 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
[STARTUP-SM] LOAD DATAFILES

[2020-10-19 09:45:27.733726 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
.... datafile '/home/goldilocks/goldilocks_data/db/system_dict.dbf' assigned to PARALLEL_IO_GROUP_1 

[2020-10-19 09:45:27.733741 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
.... datafile '/home/goldilocks/goldilocks_data/db/system_undo.dbf' assigned to PARALLEL_IO_GROUP_1 

[2020-10-19 09:45:27.734024 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
.... datafile '/home/goldilocks/goldilocks_data/db/system_aux.dbf' assigned to PARALLEL_IO_GROUP_1 

[2020-10-19 09:45:27.734269 INSTANCE(GOLDILOCKS) THREAD(19228,139887097394944)] [INFORMATION]
.... LOAD DATAFILE(/home/goldilocks/goldilocks_data/db/system_dict.dbf)

[2020-10-19 09:45:28.275547 INSTANCE(GOLDILOCKS) THREAD(19228,139887097394944)] [INFORMATION]
.... LOAD COMPLETED - DATAFILE(/home/goldilocks/goldilocks_data/db/system_dict.dbf)

[2020-10-19 09:45:28.275651 INSTANCE(GOLDILOCKS) THREAD(19228,139887097394944)] [INFORMATION]
.... LOAD DATAFILE(/home/goldilocks/goldilocks_data/db/system_undo.dbf)

[2020-10-19 09:45:28.312210 INSTANCE(GOLDILOCKS) THREAD(19228,139887097394944)] [INFORMATION]
.... LOAD COMPLETED - DATAFILE(/home/goldilocks/goldilocks_data/db/system_undo.dbf)

[2020-10-19 09:45:28.312297 INSTANCE(GOLDILOCKS) THREAD(19228,139887097394944)] [INFORMATION]
.... LOAD DATAFILE(/home/goldilocks/goldilocks_data/db/system_aux.dbf)

[2020-10-19 09:45:28.520626 INSTANCE(GOLDILOCKS) THREAD(19228,139887097394944)] [INFORMATION]
.... LOAD COMPLETED - DATAFILE(/home/goldilocks/goldilocks_data/db/system_aux.dbf)

[2020-10-19 09:45:28.522207 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
[STARTUP-SM] REFINE TABLESPACE AND DATAFILE

[2020-10-19 09:45:28.637237 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
.... datafile '/home/goldilocks/goldilocks_data/db/system_data.dbf' assigned to PARALLEL_IO_GROUP_1 

[2020-10-19 09:45:28.637435 INSTANCE(GOLDILOCKS) THREAD(19228,139887097394944)] [INFORMATION]
.... LOAD DATAFILE(/home/goldilocks/goldilocks_data/db/system_data.dbf)

[2020-10-19 09:45:29.103868 INSTANCE(GOLDILOCKS) THREAD(19228,139887097394944)] [INFORMATION]
.... LOAD COMPLETED - DATAFILE(/home/goldilocks/goldilocks_data/db/system_data.dbf)

[2020-10-19 09:45:29.131161 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
    [RESTORE] merge datafile begin ( backup file - /home/goldilocks/increment_backup/databaseD20201016T234607L0S2.inc )

[2020-10-19 09:45:29.133506 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
    [RESTORE] merge datafile end

[2020-10-19 09:45:29.177209 INSTANCE(GOLDILOCKS) THREAD(19228,139887116277504)] [INFORMATION]
[PAGE FLUSHER] FLUSHED_LSN(-1)]

[2020-10-19 09:45:29.178274 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
[RESTORE] end

[2020-10-19 09:45:29.180503 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
[EVENT] alter database restore : SUCCESS

resotre tablespace 应该就是从level 0 备份中拿取对应的备份文件,restore 到备份时的状态。

  控制文件中system_data.dbf 对应的checkpoint是37384
 
  NAME          :  MEM_DATA_TBS
   ATTRIBUTES    :  MEMORY | PERSISTENT | DATA
   STATE         :  CREATED
   LOGGING STATE :  LOGGING
   ONLINE STATE  :  OFFLINE
   EXTENT_SIZE   :  32
   RELATION_ID   :  87467008983040

   [DATAFILE #0]
     SIZE      :  209715200
     AUTOEXTEN :  OFF
     NEXT      :  0
     MAXSIZE   :  209715200
     STATE     :  CREATED
     NAME      :  "/home/goldilocks/goldilocks_data/db/system_data.dbf"
     CHKPT LSN : 37384
     CHKPT LID : (1, 1960, 13)
查看resotre 后system_data.dbf对应的checkpoint 36395

[goldilocks@gs08 db]$ gdump data system_data.dbf -h

 Copyright © 2010 SUNJESOFT Inc. All rights reserved.
 Release 20c 20.1.1 revision(31618)

===========================================================
FILE: system_data.dbf
TYPE: DATAFILE HEADER
TIME: 2020-10-19 09:59:21.891048
===========================================================
  FILE                   : system_data.dbf
  Tablespace Physical Id : 2
  Datafile Id            : 0
  Last Checkpoint Lsn    : 36395
  Last Checkpoint Lid    : (0, 59107, 13)
  Creation TIME          : 2020-10-16 23:25:36.324841
  Database signature     : D5A450520FC311EBA5B049B54ADBD1B3

===========================================================
TIME: 2020-10-19 09:59:21.891889
===========================================================

restore 后需要recover ,recover 需要从36395开始恢复到最新的redo

g2n2> alter database recover tablespace mem_data_tbs;

Database altered.
[2020-10-19 10:08:07.879322 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
[STARTUP-SM] LOAD DATAFILES

[2020-10-19 10:08:07.880219 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
[STARTUP-SM] REFINE TABLESPACE AND DATAFILE

[2020-10-19 10:08:07.905576 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
.... datafile '/home/goldilocks/goldilocks_data/db/system_data.dbf' assigned to PARALLEL_IO_GROUP_1 

[2020-10-19 10:08:07.905694 INSTANCE(GOLDILOCKS) THREAD(19228,139887097394944)] [INFORMATION]
.... LOAD DATAFILE(/home/goldilocks/goldilocks_data/db/system_data.dbf)

[2020-10-19 10:08:08.012629 INSTANCE(GOLDILOCKS) THREAD(19228,139887097394944)] [INFORMATION]
.... LOAD COMPLETED - DATAFILE(/home/goldilocks/goldilocks_data/db/system_data.dbf)

[2020-10-19 10:08:08.058296 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
[RECOVERY MANAGER] recover TABLESPACE - begin (COMPLETE, LSN 0)

[2020-10-19 10:08:08.058600 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
    [RECOVERY MANAGER] analysis begin

[2020-10-19 10:08:08.164847 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
    [RECOVERY MANAGER] read checkpoint log - checkpoint lid(0.59107.13), oldest lsn(36395), local scn(903.0.1046), time(2020-10-16 23:46:07.893176), transaction sequence(532), Grid sequence(12884902414)

[2020-10-19 10:08:08.191845 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
    [RECOVERY MANAGER] analysis done

[2020-10-19 10:08:08.191909 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
    [RECOVERY MANAGER] ready to redo - start lid(0.59107.0), start lsn(36395)

[2020-10-19 10:08:08.199126 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
    [RECOVERY MANAGER] redo has performing - logfile(/home/goldilocks/goldilocks_data/archive_log/archive_0.log), lsn(36395)

[2020-10-19 10:08:08.205028 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
    [RECOVERY MANAGER] redo has performing - logfile(/home/goldilocks/goldilocks_data/wal/redo_1_0.log), lsn(36397)

[2020-10-19 10:08:08.209624 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
[RECOVERY MANAGER] recover TABLESPACE - end

[2020-10-19 10:08:08.252564 INSTANCE(GOLDILOCKS) THREAD(19228,139887116277504)] [INFORMATION]
[PAGE FLUSHER] FLUSHED_LSN(-1)]

[2020-10-19 10:08:08.260469 INSTANCE(GOLDILOCKS) THREAD(19228,139887742981952)] [INFORMATION]
[EVENT] alter database recover : SUCCESS

recoer 后开启数据库,将表空间修改为online

g2n2> alter system open local database;

System altered.

g2n2> select IS_ONLINE from v$tablespace where TBS_NAME='MEM_DATA_TBS';

IS_ONLINE
---------
FALSE    

1 row selected.

g2n2> alter tablespace MEM_DATA_TBS online;

Tablespace altered.

g2n2> select IS_ONLINE from v$tablespace where TBS_NAME='MEM_DATA_TBS';

IS_ONLINE
---------
TRUE     

1 row selected.

g2n2> select count(*) from t1@local;

COUNT(*)
--------
      24

1 row selected.

g2n2> select * from t2@local;

ID NAME              
-- ------------------
 1 after rm data file

1 row selected.

可以发现,t1的数据已经恢复出来了,t2 的数据有一条。为什么有一条呢,因为数据文件损坏,但redo日志已经记录了数据库操作,前面建t2表时指定shard count 2,是配合后面同时插入两条测试数据时一定有一条数据落入到出故障的节点。

为什么只有一条数据呢,因为还没有加入到集群中,恢复期间集群的数据还没有同步过来。

此时g2n2还没加入到集群中

g2n2> select statement_view_scn() from dual;

STATEMENT_VIEW_SCN()
--------------------
923.0.1101 

g1n1> select statement_view_scn() from dual;

STATEMENT_VIEW_SCN()
--------------------
925.1.590  

g2n2 已经比集群落后了2个全局事务

将g2n2 加入到集群

g2n2> alter system join database;

ERR-42000(16405): of the total '11' tables in the database, '1' tables need to be rebalanced
System altered.

g2n2> alter database rebalance;

Database altered.

g2n2> select * from t2@local order by 1;

ID NAME                
-- --------------------
 1 after rm data file  
 3 during g2n2 recovery

2 rows selected.

g2n2> select * from t2 order by 1;

ID NAME                
-- --------------------
 1 after rm data file  
 2 after rm data file  
 3 during g2n2 recovery
 4 during g2n2 recovery

g2n2> select statement_view_scn() from dual;

STATEMENT_VIEW_SCN()
--------------------
934.0.1101          

1 row selected.

集群其他节点的scn
g1n1> select statement_view_scn() from dual;

STATEMENT_VIEW_SCN()
--------------------
934.0.590           

1 row selected.

加入集群后,新的数据通过rebalance 同步过来了。

场景:控制文件损坏

控制文件默认是两个副本,存放在相同的路径下$GOLDILOCK_DATA/wal,放在相同路径下冗余效果并不好,可以设置放在不同的路径下

 alter system set  CONTROL_FILE_1 ='/home/goldilocks/goldilocks_data/backup/control_1.ctl' scope =file;

关闭数据库,复制文件,然后重启数据库即可

 run_cmd.sh "cp ~/goldilocks_data/wal/control_1.ctl  ~/goldilocks_data/backup/control_1.ctl"
g1n1>  select PROPERTY_VALUE from v$PROPERTY where  property_name  like '%CONTROL_FILE_1%';

PROPERTY_VALUE                                       
-----------------------------------------------------
/home/goldilocks/goldilocks_data/backup/control_1.ctl

1 row selected.

模拟控制文件丢失

rm /home/goldilocks/goldilocks_data/backup/control_1.ctl

随机
g1n1> alter system checkpoint;

System altered.

缺少控制文件的节点直接宕了

[2020-10-19 14:51:34.611741 INSTANCE(GOLDILOCKS) THREAD(23155,139815192352512)] [INFORMATION]
[CHECKPOINT] begin

[2020-10-19 14:51:34.612304 INSTANCE(GOLDILOCKS) THREAD(23155,139814663874304)] [INFORMATION]
[IO SLAVE] flush datafile ( tablespace : 0, datafile : 0 )

[2020-10-19 14:51:34.640728 INSTANCE(GOLDILOCKS) THREAD(23155,139814663874304)] [INFORMATION]
[IO SLAVE] flush datafile ( tablespace : 1, datafile : 0 )

[2020-10-19 14:51:34.640876 INSTANCE(GOLDILOCKS) THREAD(23155,139814663874304)] [INFORMATION]
[IO SLAVE] flush datafile ( tablespace : 2, datafile : 0 )

[2020-10-19 14:51:34.641185 INSTANCE(GOLDILOCKS) THREAD(23155,139814663874304)] [INFORMATION]
[IO SLAVE] flush datafile ( tablespace : 5, datafile : 0 )

[2020-10-19 14:51:34.641577 INSTANCE(GOLDILOCKS) THREAD(23155,139814663874304)] [INFORMATION]
[IO SLAVE] flush datafile ( tablespace : 6, datafile : 0 )

[2020-10-19 14:51:34.646512 INSTANCE(GOLDILOCKS) THREAD(23155,139814644991744)] [INFORMATION]
[CHECKPOINT] flush buffer checkpoint list[0] - for checkpoint (1), system min flushed lsn(41666), min flushed lsn (41709), flushed page count(0)

[2020-10-19 14:51:34.656874 INSTANCE(GOLDILOCKS) THREAD(23155,139815154587392)] [INFORMATION]
[PAGE FLUSHER] flushed lsn(41709), flushed page count(1024)]

[2020-10-19 14:51:34.656972 INSTANCE(GOLDILOCKS) THREAD(23155,139814317840128)] [INFORMATION]
[ARCHIVING] stable lsn(41709)

[2020-10-19 14:51:34.667498 INSTANCE(GOLDILOCKS) THREAD(23155,139814317840128)] [FATAL]
[SYSTEM FATAL] CAUSE("can't save control file")(GOLDILOCKS) ERR-HY000(11040): No such object (/home/goldilocks/goldilocks_data/backup/control_1.ctl): stfOpen() returned errno(2)

========================================================
(GOLDILOCKS) CALL STACK [Release 20c 20.1.1 revision(31618)]
========================================================
gmaster(kngAddErrorCallStackUnsafe+0xa3) [0xb60353]
gmaster(knlLogCallStackUnsafe+0x2a) [0xb3a13a]
gmaster(knlSystemFatal+0x35) [0xb475c5]
gmaster(smfSaveCtrlFile+0x76) [0xa6e066]
gmaster(smrArchivelogEventHandler+0x38) [0x9c5448]
gmaster(knlExecuteEnvEvent+0xb0) [0xb46740]
gmaster(ztmtArchivelogThread+0x320) [0x520700]
/lib64/libpthread.so.0(+0x7dd5) [0x7f2950d02dd5]
/lib64/libc.so.6(clone+0x6d) [0x7f295082402d]
(GOLDILOCKS) ERR-HY000(11040): No such object (/home/goldilocks/goldilocks_data/backup/control_1.ctl): stfOpen() returned errno(2)

复制正常的控制文件到丢失的位置,重启数据库,然后加入到集群即可恢复。

如果两份控制文件都损坏,出现这种情况一般不是软件造成的了。人为误删除整个目录或者服务器文件系统损坏的可能性更大。这样的话,需要整个节点的恢复了,参考单个节点数据全部丢失情况

场景二:单个节点数据全部丢失

误删除整个数据库目录,服务器文件系统损坏,多块磁盘故障等会导致所有的数据丢失。

有两个思路

使用0级备份恢复

适用场景:备份比较新,恢复的代价小于重新加入集群的代价,且归档一直没有丢失。

模拟故障

删除掉整个数据文件夹

rm -rf ~/goldilocks_data 
g1n1> insert into t2 values (5,'after rm all g1n2 data file'),(6,'after rm all g1n2 data file');

2 rows created.

g1n1> commit;

Commit complete.
恢复
[goldilocks@gs06 ~]$ gsql

 Copyright © 2010 SUNJESOFT Inc. All rights reserved.
 Release 20c 20.1.1 revision(31618)

Connected to an idle instance.

g1n2> startup nomount;
ERR-HY000(13025): Property file does not exist (/home/goldilocks/goldilocks_data/conf/goldilocks.properties.conf)

没有参数文件,从其他正好节点复制conf 文件夹过来

g1n2> startup nomount;

ERR-HY000(13004): static shared memory segment is corrupted
ERR-HY000(11040): No such object (/home/goldilocks/goldilocks_data/trc/system.trc): stfOpen() returned errno(2)

在$GOLDILOCKS_DATA 路径下创建 db,wal,trc,backup,archive_log 文件夹
mkdir wal db trc backup archive_log

g1n2> startup nomount;

Startup success

g1n2> alter system mount database;

ERR-42000(14046): file does not exist - '/home/goldilocks/goldilocks_data/wal/control_0.ctl'

g1n2> alter database restore controlfile from '/home/goldilocks/increment_backup/controlD20201019T160304L0S4.inc';

Database altered.

g1n2> alter system mount database;

ERR-42000(14046): file does not exist - '/home/goldilocks/goldilocks_data/wal/commit.log'

commit log作用:MEM_TRANS_TBS 表空间用满时,将多出来的transaction record 存在commit log中。恢复过程中不需要考虑这个文件
,制作一个空文件即可
dd if=/dev/zero of=commit.log bs=1048576 count=100

g1n2> alter system mount database;

ERR-HY000(11040): No such object (/home/goldilocks/goldilocks_data/wal/location.ctl) : stfOpen() returned errno(2)
location.ctl 文件不存在,从其他节点复制一个

g1n2> alter system mount database;

System altered.

g1n2> alter database restore;

Database altered.

g1n2> alter database recover;

ERR-42000(14046): file does not exist - '/home/goldilocks/goldilocks_data/wal/redo_0_0.log'

查看当前日志文件
g1n2> select * from v$logfile;

GROUP_ID FILE_NAME                                         GROUP_STATE FILE_SEQ FILE_SIZE
-------- ------------------------------------------------- ----------- -------- ---------
       0 /home/goldilocks/goldilocks_data/wal/redo_0_0.log CURRENT            4 104857600
       1 /home/goldilocks/goldilocks_data/wal/redo_1_0.log INACTIVE           1 104857600
       2 /home/goldilocks/goldilocks_data/wal/redo_2_0.log INACTIVE           2 104857600
       3 /home/goldilocks/goldilocks_data/wal/redo_3_0.log INACTIVE           3 104857600

4 rows selected.

根据归档日志,恢复redo日志

g1n2> ! cp /home/goldilocks/goldilocks_data/archive_log/archive_4.log /home/goldilocks/goldilocks_data/wal/redo_0_0.log

g1n2> ! cp /home/goldilocks/goldilocks_data/archive_log/archive_1.log /home/goldilocks/goldilocks_data/wal/redo_1_0.log

g1n2> ! cp /home/goldilocks/goldilocks_data/archive_log/archive_2.log /home/goldilocks/goldilocks_data/wal/redo_2_0.log

g1n2> ! cp /home/goldilocks/goldilocks_data/archive_log/archive_3.log /home/goldilocks/goldilocks_data/wal/redo_3_0.log


g1n2> alter database begin incomplete recovery;

ERR-01000(14104): Warning: suggestion '/home/goldilocks/goldilocks_data/archive_log/archive_4.log'
ERR-01000(14103): Warning: media recovery needs a logfile including log (Lsn 45702)
Database altered.

g1n2> alter database  recover automatically;

ERR-01000(14104): Warning: suggestion '/home/goldilocks/goldilocks_data/archive_log/archive_5.log'
ERR-01000(14103): Warning: media recovery needs a logfile including log (Lsn 46421)
Database altered.

g1n2> alter database  end incomplete recovery;

Database altered.

g1n2> alter system open local database resetlogs;

System altered.

g1n2> alter system join database;

ERR-42000(16405): of the total '11' tables in the database, '10' tables need to be rebalanced
System altered.

g1n2> alter database rebalance;

Database altered.


g1n2> select * from t2 order by 1;

ID NAME                       
-- ---------------------------
 1 after rm data file         
 2 after rm data file         
 3 during g2n2 recovery       
 4 during g2n2 recovery       
 5 after rm all g1n2 data file
 6 after rm all g1n2 data file

6 rows selected.

恢复完成
g1n2> select a.member_name,a.member_id ,b.LOGICAL_CONNECTION,b.PHYSICAL_CONNECTION ,b.LOCAL_SCN from cluster_member@local a, x$cluster_member@local b where a.MEMBER_ID=b.MEMBER_ID order by a.GROUP_ID, 1;

MEMBER_NAME MEMBER_ID LOGICAL_CONNECTION PHYSICAL_CONNECTION LOCAL_SCN  
----------- --------- ------------------ ------------------- -----------
G1N1                1 ACTIVE             ACTIVE              1138.0.0   
G1N2                6 ACTIVE             ACTIVE              1138.0.1264
G2N1                5 ACTIVE             ACTIVE              1138.0.0   
G2N2                7 ACTIVE             ACTIVE              1138.0.0   

4 rows selected.

新成员加入集群

适用场景:备份和数据一起丢失了,没有途径恢复数据了。

g2n2 上删除数据库文件
rm archive_log/* backup/* db/* wal/*

查看集群状态

g1n1>  select a.member_name,a.member_id ,b.LOGICAL_CONNECTION,b.PHYSICAL_CONNECTION ,b.LOCAL_SCN from cluster_member@local a, x$cluster_member@local b where a.MEMBER_ID=b.MEMBER_ID order by a.GROUP_ID, 1;

MEMBER_NAME MEMBER_ID LOGICAL_CONNECTION PHYSICAL_CONNECTION LOCAL_SCN 
----------- --------- ------------------ ------------------- ----------
G1N1                1 ACTIVE             ACTIVE              1140.0.674
G1N2                6 INACTIVE           INACTIVE            -1.-1.-1  
G2N1                5 ACTIVE             ACTIVE              1140.0.0  
G2N2                7 ACTIVE             ACTIVE              1140.0.0 

g2n2 重新创建数据库

[goldilocks@gs06 goldilocks_data]$ gcreatedb --cluster --db_name='glodilocks' --timezone='+08:00' --character_set="UTF8" --char_length_units="OCTETS" --member='g1n2' --host=192.168.149.132 --port=10120

 Copyright © 2010 SUNJESOFT Inc. All rights reserved.
 Release 20c 20.1.1 revision(31618)

Database created

g1n1> alter database drop inactive cluster  members;

Database altered.

g1n1>  select a.member_name,a.member_id ,b.LOGICAL_CONNECTION,b.PHYSICAL_CONNECTION ,b.LOCAL_SCN from cluster_member@local a, x$cluster_member@local b where a.MEMBER_ID=b.MEMBER_ID order by a.GROUP_ID, 1;

MEMBER_NAME MEMBER_ID LOGICAL_CONNECTION PHYSICAL_CONNECTION LOCAL_SCN 
----------- --------- ------------------ ------------------- ----------
G1N1                1 ACTIVE             ACTIVE              1142.0.674
G2N1                5 ACTIVE             ACTIVE              1142.0.0  
G2N2                7 ACTIVE             ACTIVE              1142.0.0  

3 rows selected.

新节点
g1n2> startup

Startup success

g1n2> alter system open global database;

System altered.


将节点加入到集群
g1n1> alter cluster group g1 add cluster member g1n2 host '192.168.149.132' port 10120;

Cluster Group altered.


新节点需要rebalance 数据

g1n2> alter database rebalance;

Database altered.

g1n2> select * from t2@local order by 1;

ID NAME                       
-- ---------------------------
 2 after rm data file         
 4 during g2n2 recovery       
 6 after rm all g1n2 data file

3 rows selected.

g1n1>  select a.member_name,a.member_id ,b.LOGICAL_CONNECTION,b.PHYSICAL_CONNECTION ,b.LOCAL_SCN from cluster_member@local a, x$cluster_member@local b where a.MEMBER_ID=b.MEMBER_ID order by a.GROUP_ID, 1;

MEMBER_NAME MEMBER_ID LOGICAL_CONNECTION PHYSICAL_CONNECTION LOCAL_SCN 
----------- --------- ------------------ ------------------- ----------
G1N1                1 ACTIVE             ACTIVE              1155.0.675
G1N2                8 ACTIVE             ACTIVE              1155.0.0  
G2N1                5 ACTIVE             ACTIVE              1155.0.0  
G2N2                7 ACTIVE             ACTIVE              1155.0.0  

集群恢复正常。

上一篇下一篇

猜你喜欢

热点阅读