MySQL 5.7复制的一个小bug-XA事务
线上一个5.7从库复制中断:
mysql>show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.0.1
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000006
Read_Master_Log_Pos: 49461560
Relay_Log_File: db-music1-relay-bin.000002
Relay_Log_Pos: 379466837
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1399
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 3 failed executing transaction 'd199707a-5287-11e7-8f35-246e96729f10:1215544883' at master log mysql-bin.000001, end_log_pos 123. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos: 379466624
Relay_Log_Space: 2575974028
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1399
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 3 failed executing transaction 'd199707a-5287-11e7-8f35-246e96729f10:1215544883' at master log mysql-bin.000001, end_log_pos 123. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 244331
Master_UUID: ce522aa3-6398-11e7-945f-246e96729568
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 170708 13:22:50
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: ce522aa3-6398-11e7-945f-246e96729568:1,
d199707a-5287-11e7-8f35-246e96729f10:1215032565-1218510979
Executed_Gtid_Set: 30472597-527e-11e7-a93f-246e96729ec0:1-47,
d199707a-5287-11e7-8f35-246e96729f10:1-1215544882,
d23b9cfd-5287-11e7-a307-246e96729ff0:1-9
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
查询具体报错:
mysql>select * from performance_schema.replication_applier_status_by_worker \G
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 3
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: d199707a-5287-11e7-8f35-246e96729f10:1215544883
LAST_ERROR_NUMBER: 1399
LAST_ERROR_MESSAGE: Worker 3 failed executing transaction 'd199707a-5287-11e7-8f35-246e96729f10:1215544883' at master log mysql-bin.000001, end_log_pos 123; Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the IDLE state' on query. Default database: ''. Query: 'ROLLBACK'
LAST_ERROR_TIMESTAMP: 2017-07-08 13:22:50
1 rows in set (0.00 sec)
第一感觉很奇怪,为什么会rollback失败呢?于是根据gtid去对应的主库binlog去看了下,并没有任何rollback语句:
# at 379611076
#170707 20:28:38 server id 61481 end_log_pos 379611141 CRC32 0x65240fde GTID last_committed=512318 sequence_number=512319
SET @@SESSION.GTID_NEXT= 'd199707a-5287-11e7-8f35-246e96729f10:1215544883'/*!*/;
# at 379611141
#170707 20:28:38 server id 61481 end_log_pos 379611269 CRC32 0x19e0ec66 Query thread_id=758257 exec_time=59329 error_code=0
SET TIMESTAMP=1499430518/*!*/;
SET @@session.sql_mode=524288/*!*/;
XA START X'3230375f313533353431363835313835373138305f',X'353034',39187
/*!*/;
# at 379611269
#170707 20:28:38 server id 61481 end_log_pos 379611392 CRC32 0xd37354d4 Table_map: `XXXX`.`XXXX` mapped to number 123
# at 379611392
#170707 20:28:38 server id 61481 end_log_pos 379611910 CRC32 0xec28e0a5 Update_rows: table id 123 flags: STMT_END_F
### UPDATE XXXX.XXXX
### WHERE
### @1=473756624 /* LONGINT meta=0 nullable=0 is_null=0 */
### @2='挽情1111喜欢的音乐' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
### @3=338811606 /* LONGINT meta=0 nullable=0 is_null=0 */
### @4=1475019621016 /* LONGINT meta=0 nullable=1 is_null=0 */
### @5=1499148021048 /* LONGINT meta=0 nullable=1 is_null=0 */
### @6=0 /* LONGINT meta=0 nullable=1 is_null=0 */
### @7=114 /* INT meta=0 nullable=1 is_null=0 */
### SET
### @1=473756624 /* LONGINT meta=0 nullable=0 is_null=0 */
### @2='挽情1111喜欢的音乐' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */```
### @3=338811606 /* LONGINT meta=0 nullable=0 is_null=0 */
### @4=1475019621016 /* LONGINT meta=0 nullable=1 is_null=0 */
### @5=1499430518207 /* LONGINT meta=0 nullable=1 is_null=0 */
### @6=0 /* LONGINT meta=0 nullable=1 is_null=0 */
### @7=115 /* INT meta=0 nullable=1 is_null=0 */
# at 379611910```
#170707 20:28:38 server id 61481 end_log_pos 379612036 CRC32 0x8f8d4a35 Query thread_id=758257 exec_time=59329 error_code=0
SET TIMESTAMP=1499430518/*!*/;
SET @@session.sql_mode=1436549152/*!*/;
XA END X'3230375f313533353431363835313835373138305f',X'353034',39187
/*!*/;
# at 379612036
#170707 20:28:38 server id 61481 end_log_pos 379612096 CRC32 0xd4966095 XA PREPARE X'3230375f313533353431363835313835373138305f',X'353034',39187
XA PREPARE X'3230375f313533353431363835313835373138305f',X'353034',39187
/*!*/;
# at 379614541
#170707 20:28:38 server id 61481 end_log_pos 379614687 CRC32 0x5273d888 Query thread_id=758257 exec_time=59329 error_code=0
SET TIMESTAMP=1499430518/*!*/;
SET @@session.sql_mode=1436549152/*!*/;
XA COMMIT X'3230375f313533353431363835313835373138305f',X'353034',39187
/*!*/;
看下本地的relay log,找到这个事务的gtid
# at 379611289
#170707 20:28:38 server id 61481 end_log_pos 379611141 CRC32 0x65240fde GTID last_committed=512318 sequence_number=512319
SET @@SESSION.GTID_NEXT= 'd199707a-5287-11e7-8f35-246e96729f10:1215544883'/*!*/;
# at 379611354
#170707 20:28:38 server id 61481 end_log_pos 379611269 CRC32 0x19e0ec66 Query thread_id=758257 exec_time=59329 error_code=0
SET TIMESTAMP=1499430518/*!*/;
SET @@session.sql_mode=524288/*!*/;
XA START X'3230375f313533353431363835313835373138305f',X'353034',39187
/*!*/;
# at 379611482
#170707 20:28:38 server id 61481 end_log_pos 379611392 CRC32 0xd37354d4 Table_map: XXXX.XXXX mapped to number 123
# at 379611605
#170707 20:28:38 server id 61481 end_log_pos 379611910 CRC32 0xec28e0a5 Update_rows: table id 123 flags: STMT_END_F
BINLOG '
dn5fWRMp8AAAewAAAABpoBYAAHsAAAAAAAEADGRiX3BsYXlsaXN0MQAOTXVzaWNfUGxheWxpc3QA
KQgPCAgICAMDAwgPAwMIAw8IAQMPCAEPAQMIAQEDCAEBDwMIAQQDAwMIDf0C/QL9AgAMdwGAAQT6
////vwHUVHPT
dn5fWRgp8AAABgIAAAZroBYAAHsAAAAAAAEAKf///////////////wCECAAA/tDzPBwAAAAAGQDm
jL3mg4UxMTEx5Zac5qyi55qE6Z+z5LmQ1toxFAAAAACYQgVuVwEAADhtLwxdAQAAAAAAAAAAAABy
AAAAAAAAAAAAAADhyREAAABCAAAAAAAFAAAA3gv6G10BAAABAAAAgQEAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAHAAAAAAAAAAAAAAAAAAAAAAADhtLwxdAQAAAAAgAGIyYTcxOWE4NjdmZjNkNmJlZWI5
YTRlZjI1NGVjNDliAAAAAAAAAAAAAAAAAGYGrEQAAAAAAAAAAAAAAAAAAAAAAAAAAACECAAA/tDz
PBwAAAAAGQDmjL3mg4UxMTEx5Zac5qyi55qE6Z+z5LmQ1toxFAAAAACYQgVuVwEAAL/9BR1dAQAA
AAAAAAAAAABzAAAAAAAAAAAAAADhyREAAABCAAAAAAAFAAAAv/0FHV0BAAABAAAAgQEAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAHEAAAAAAAAAAAAAAAAAAAAAAL/9BR1dAQAAAAAgAGIyYTcxOWE4Njdm
ZjNkNmJlZWI5YTRlZjI1NGVjNDliAAAAAAAAAAAAAAAAAGYGrEQAAAAAAAAAAAAAAAAAAAAAAAAA
AKXgKOw=
'/*!*/;
### UPDATE XXXX.XXXX
### WHERE
### @1=473756624 /* LONGINT meta=0 nullable=0 is_null=0 */
### @2='挽情1111喜欢的音乐' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
### @3=338811606 /* LONGINT meta=0 nullable=0 is_null=0 */
### @4=1475019621016 /* LONGINT meta=0 nullable=1 is_null=0 */
### @5=1499148021048 /* LONGINT meta=0 nullable=1 is_null=0 */
### @6=0 /* LONGINT meta=0 nullable=1 is_null=0 */
### @7=114 /* INT meta=0 nullable=1 is_null=0 */
### SET
### @1=473756624 /* LONGINT meta=0 nullable=0 is_null=0 */
### @2='挽情1111喜欢的音乐' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
### @3=338811606 /* LONGINT meta=0 nullable=0 is_null=0 */
### @4=1475019621016 /* LONGINT meta=0 nullable=1 is_null=0 */
### @5=1499430518207 /* LONGINT meta=0 nullable=1 is_null=0 */
### @6=0 /* LONGINT meta=0 nullable=1 is_null=0 */
### @7=115 /* INT meta=0 nullable=1 is_null=0 */
# at 379612123
#170707 20:28:38 server id 61481 end_log_pos 379612036 CRC32 0x8f8d4a35 Query thread_id=758257 exec_time=59329 error_code=0
SET TIMESTAMP=1499430518/*!*/;
SET @@session.sql_mode=1436549152/*!*/;
XA END X'3230375f313533353431363835313835373138305f',X'353034',39187
/*!*/;
# at 379612249
#170708 13:18:53 server id 254331 end_log_pos 379612306 CRC32 0xf2c69619 Rotate to db-music1-relay-bin.000003 pos: 4
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
到这里,这个relay log日志文件结束了。很显然问题也找到了,就是执行
ROLLBACK /* added by mysqlbinlog */ /*!*/;
出错了。
首先 我们看到 这个rollback是MySQL自己加上去的,那么为什么要加呢?
mysql为了保证一个事务只在一个binlog里,所以当Binlog或者relay log发生截断时,最后一个事务要么commit,要么rollback,如果rollback,那么下一个binlog或者relay log会把这个事务重做一遍,保证这个事务不会丢。
由于xa事务无法直接rollback,而需要xa rollback ‘XXX’,所以复制就停了。
怎么修复?是不是直接跳过这个rollback就行了?
我们来试一下,跳过这个rollback:
STOP SLAVE;
SET GTID_NEXT="d199707a-5287-11e7-8f35-246e96729f10:1215544883";
BEGIN; COMMIT;
SET GTID_NEXT="AUTOMATIC";
START SLAVE;
这里GTID_NEXT值不能用show slave status的里executed值,得用具体报错的停止的gtid
但是,show slave status看到,还是有报错:
Error 'XAER_NOTA: Unknown XID' on query. Default database: 'XXXX'. Query: 'XA COMMIT X'3230375f313533353431363835313835373138305f',X'353034',39187'
为什么又报这个事务commit找不到XID呢?
之前说过,在relaylog截断的时候,如果事务没有commit,会自动在最后加rollback,在下一个relay log开始的时候重新做一次这个事务,按理说我们跳过这个rollback,在下个relaylog会被重做,为啥会在commit的时候找不到xid呢?
我们看到我们跳过的gtid原来就是重做这个事务到PREPARE阶段的gtid,原来,rollback是没有gtid的,所以我们实际上就是把这个事务到PREPARE阶段的gtid给跳过了,commit的时候肯定会找不到xid,接着怎么修复?
- 手动把这个事务执行一遍,手动commit。
- 指定 ‘/!/;’为结束符,每个binlog开头都有
BINLOG ' 8HFvWQ+8fQQAdwAAAHsAAAAAAAQANS43LjE4LTE1LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXwAEGggAAAAICAgCAAAACgoKKioAEjQA AUvdRu4= '/*!*/;
- 设置gtid_next模式
SET @@SESSION.GTID_NEXT= 'AUTOMATIC’ ;
- 开始事务
XA START X'3230375f313533353431363835313835373138305f',X'353034',39187 /*!*/;
- 事务内容
BINLOG ' dn5fWRMp8AAAewAAAABpoBYAAHsAAAAAAAEADGRiX3BsYXlsaXN0MQAOTXVzaWNfUGxheWxpc3QA KQgPCAgICAMDAwgPAwMIAw8IAQMPCAEPAQMIAQEDCAEBDwMIAQQDAwMIDf0C/QL9AgAMdwGAAQT6 ////vwHUVHPT dn5fWRgp8AAABgIAAAZroBYAAHsAAAAAAAEAKf///////////////wCECAAA/tDzPBwAAAAAGQDm jL3mg4UxMTEx5Zac5qyi55qE6Z+z5LmQ1toxFAAAAACYQgVuVwEAADhtLwxdAQAAAAAAAAAAAABy AAAAAAAAAAAAAADhyREAAABCAAAAAAAFAAAA3gv6G10BAAABAAAAgQEAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAHAAAAAAAAAAAAAAAAAAAAAAADhtLwxdAQAAAAAgAGIyYTcxOWE4NjdmZjNkNmJlZWI5 YTRlZjI1NGVjNDliAAAAAAAAAAAAAAAAAGYGrEQAAAAAAAAAAAAAAAAAAAAAAAAAAACECAAA/tDz PBwAAAAAGQDmjL3mg4UxMTEx5Zac5qyi55qE6Z+z5LmQ1toxFAAAAACYQgVuVwEAAL/9BR1dAQAA AAAAAAAAAABzAAAAAAAAAAAAAADhyREAAABCAAAAAAAFAAAAv/0FHV0BAAABAAAAgQEAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAHEAAAAAAAAAAAAAAAAAAAAAAL/9BR1dAQAAAAAgAGIyYTcxOWE4Njdm ZjNkNmJlZWI5YTRlZjI1NGVjNDliAAAAAAAAAAAAAAAAAGYGrEQAAAAAAAAAAAAAAAAAAAAAAAAA AKXgKOw= '/*!*/;
- 结束 & prepare事务
XA END X'3230375f313533353431363835313835373138305f',X'353034',39187 /*!*/; XA PREPARE X'3230375f313533353431363835313835373138305f',X'353034',39187 /*!*/;
- 提交事务
XA COMMIT X'3230375f313533353431363835313835373138305f',X'353034',39187 /*!*/;
- 指定 ‘/!/;’为结束符,每个binlog开头都有
- 跳过commit这个gtid
- start slave
这样,才算是完整的修复了这个主从复制,并且数据保持一致。
为什么这是5.7的一个bug呢?5.7之前的版本因为relaylog被截断并不会出现这个bug。
5.7对xa事务的binlog记录方式做了修改,把 xa start,xa end,xa prepare放到一个event里,xa commit又是另外一个event。而在之前的MySQL版本中,整个xa事务从start到commit都是在一个event中,所以其他版本并没有问题。
最后一个问题:5.7为啥要把xa事务拆成两个event?简单的讲是为了数据安全性。
5.5或者5.6假设下面一个场景:
- 主库执行 xa事务直到prepare
- kill主库。
- 拉起主库,再xa commit停机前的事务
主库提交成功,修改的数据也是没问题的,再去从库,会发现,从库的数据并没有变化。WHY?
MySQL在某个分布式事务prepare成功后宕机,宕机前操作该事务的连接并没有断开(如果在宕机前断开连接,事务会被MySQL自动回滚),这个时候已经prepare的事务并不会被回滚,所以在MySQL重新启动后,引擎层通过recover机制能恢复该事务。当然该事务的Binlog已经在宕机过程中被丢失,这个时候,如果去提交,则会造成主从数据的不一致, 即提交没有记录Binlog,从上丢失该条数据。
正因为5.7之前版本的xa事务存在这个bug,5.7后做了修复。从XA START到XA PREPARE之间的操作都被记录到了Master的Binlog中,然后通过复制关系传到了Slave上。也就是说5.7开始,MySQL对于XA事务,在prepare的时候就完成了写Binlog的操作,通过新增一种叫 XA_prepare_log_event 的event类型来实现。
其实 MySQL5.7在xa事务上远不止这个bug,后面再来慢慢总结。