MySQL 5.7复制的一个小bug-XA事务

2017-07-21  本文已影响0人  小卢二

线上一个5.7从库复制中断:

mysql>show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.0.1
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000006
          Read_Master_Log_Pos: 49461560
               Relay_Log_File: db-music1-relay-bin.000002
                Relay_Log_Pos: 379466837
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1399
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 3 failed executing transaction 'd199707a-5287-11e7-8f35-246e96729f10:1215544883' at master log mysql-bin.000001, end_log_pos 123. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 379466624
              Relay_Log_Space: 2575974028
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1399
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 3 failed executing transaction 'd199707a-5287-11e7-8f35-246e96729f10:1215544883' at master log mysql-bin.000001, end_log_pos 123. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 244331
                  Master_UUID: ce522aa3-6398-11e7-945f-246e96729568
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 170708 13:22:50
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: ce522aa3-6398-11e7-945f-246e96729568:1,
d199707a-5287-11e7-8f35-246e96729f10:1215032565-1218510979
            Executed_Gtid_Set: 30472597-527e-11e7-a93f-246e96729ec0:1-47,
d199707a-5287-11e7-8f35-246e96729f10:1-1215544882,
d23b9cfd-5287-11e7-a307-246e96729ff0:1-9
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

查询具体报错:

mysql>select * from performance_schema.replication_applier_status_by_worker \G
*************************** 1. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 3
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: d199707a-5287-11e7-8f35-246e96729f10:1215544883
    LAST_ERROR_NUMBER: 1399
   LAST_ERROR_MESSAGE: Worker 3 failed executing transaction 'd199707a-5287-11e7-8f35-246e96729f10:1215544883' at master log mysql-bin.000001, end_log_pos 123; Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  IDLE state' on query. Default database: ''. Query: 'ROLLBACK'
 LAST_ERROR_TIMESTAMP: 2017-07-08 13:22:50
1 rows in set (0.00 sec)

第一感觉很奇怪,为什么会rollback失败呢?于是根据gtid去对应的主库binlog去看了下,并没有任何rollback语句:

# at 379611076
#170707 20:28:38 server id 61481  end_log_pos 379611141 CRC32 0x65240fde        GTID    last_committed=512318   sequence_number=512319
SET @@SESSION.GTID_NEXT= 'd199707a-5287-11e7-8f35-246e96729f10:1215544883'/*!*/;
# at 379611141
#170707 20:28:38 server id 61481  end_log_pos 379611269 CRC32 0x19e0ec66        Query   thread_id=758257        exec_time=59329 error_code=0
SET TIMESTAMP=1499430518/*!*/;
SET @@session.sql_mode=524288/*!*/;
XA START X'3230375f313533353431363835313835373138305f',X'353034',39187
/*!*/;
# at 379611269
#170707 20:28:38 server id 61481  end_log_pos 379611392 CRC32 0xd37354d4        Table_map: `XXXX`.`XXXX` mapped to number 123
# at 379611392
#170707 20:28:38 server id 61481  end_log_pos 379611910 CRC32 0xec28e0a5        Update_rows: table id 123 flags: STMT_END_F
### UPDATE  XXXX.XXXX
### WHERE
###   @1=473756624 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @2='挽情1111喜欢的音乐' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
###   @3=338811606 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @4=1475019621016 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @5=1499148021048 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @6=0 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @7=114 /* INT meta=0 nullable=1 is_null=0 */
### SET
###   @1=473756624 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @2='挽情1111喜欢的音乐' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */```
###   @3=338811606 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @4=1475019621016 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @5=1499430518207 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @6=0 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @7=115 /* INT meta=0 nullable=1 is_null=0 */
# at 379611910```
#170707 20:28:38 server id 61481  end_log_pos 379612036 CRC32 0x8f8d4a35        Query   thread_id=758257        exec_time=59329 error_code=0
SET TIMESTAMP=1499430518/*!*/;
SET @@session.sql_mode=1436549152/*!*/;
XA END X'3230375f313533353431363835313835373138305f',X'353034',39187
/*!*/;
# at 379612036
#170707 20:28:38 server id 61481  end_log_pos 379612096 CRC32 0xd4966095        XA PREPARE X'3230375f313533353431363835313835373138305f',X'353034',39187
XA PREPARE X'3230375f313533353431363835313835373138305f',X'353034',39187
/*!*/;
# at 379614541
#170707 20:28:38 server id 61481  end_log_pos 379614687 CRC32 0x5273d888        Query   thread_id=758257        exec_time=59329 error_code=0
SET TIMESTAMP=1499430518/*!*/;
SET @@session.sql_mode=1436549152/*!*/;
XA COMMIT X'3230375f313533353431363835313835373138305f',X'353034',39187
/*!*/;

看下本地的relay log,找到这个事务的gtid

# at 379611289
#170707 20:28:38 server id 61481  end_log_pos 379611141 CRC32 0x65240fde        GTID    last_committed=512318   sequence_number=512319
SET @@SESSION.GTID_NEXT= 'd199707a-5287-11e7-8f35-246e96729f10:1215544883'/*!*/;
# at 379611354
#170707 20:28:38 server id 61481  end_log_pos 379611269 CRC32 0x19e0ec66        Query   thread_id=758257        exec_time=59329 error_code=0
SET TIMESTAMP=1499430518/*!*/;
SET @@session.sql_mode=524288/*!*/;
XA START X'3230375f313533353431363835313835373138305f',X'353034',39187
/*!*/;
# at 379611482
#170707 20:28:38 server id 61481  end_log_pos 379611392 CRC32 0xd37354d4        Table_map: XXXX.XXXX mapped to number 123
# at 379611605
#170707 20:28:38 server id 61481  end_log_pos 379611910 CRC32 0xec28e0a5        Update_rows: table id 123 flags: STMT_END_F

BINLOG '
dn5fWRMp8AAAewAAAABpoBYAAHsAAAAAAAEADGRiX3BsYXlsaXN0MQAOTXVzaWNfUGxheWxpc3QA
KQgPCAgICAMDAwgPAwMIAw8IAQMPCAEPAQMIAQEDCAEBDwMIAQQDAwMIDf0C/QL9AgAMdwGAAQT6
////vwHUVHPT
dn5fWRgp8AAABgIAAAZroBYAAHsAAAAAAAEAKf///////////////wCECAAA/tDzPBwAAAAAGQDm
jL3mg4UxMTEx5Zac5qyi55qE6Z+z5LmQ1toxFAAAAACYQgVuVwEAADhtLwxdAQAAAAAAAAAAAABy
AAAAAAAAAAAAAADhyREAAABCAAAAAAAFAAAA3gv6G10BAAABAAAAgQEAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAHAAAAAAAAAAAAAAAAAAAAAAADhtLwxdAQAAAAAgAGIyYTcxOWE4NjdmZjNkNmJlZWI5
YTRlZjI1NGVjNDliAAAAAAAAAAAAAAAAAGYGrEQAAAAAAAAAAAAAAAAAAAAAAAAAAACECAAA/tDz
PBwAAAAAGQDmjL3mg4UxMTEx5Zac5qyi55qE6Z+z5LmQ1toxFAAAAACYQgVuVwEAAL/9BR1dAQAA
AAAAAAAAAABzAAAAAAAAAAAAAADhyREAAABCAAAAAAAFAAAAv/0FHV0BAAABAAAAgQEAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAHEAAAAAAAAAAAAAAAAAAAAAAL/9BR1dAQAAAAAgAGIyYTcxOWE4Njdm
ZjNkNmJlZWI5YTRlZjI1NGVjNDliAAAAAAAAAAAAAAAAAGYGrEQAAAAAAAAAAAAAAAAAAAAAAAAA
AKXgKOw=
'/*!*/;
### UPDATE  XXXX.XXXX
### WHERE
###   @1=473756624 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @2='挽情1111喜欢的音乐' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
###   @3=338811606 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @4=1475019621016 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @5=1499148021048 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @6=0 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @7=114 /* INT meta=0 nullable=1 is_null=0 */
### SET
###   @1=473756624 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @2='挽情1111喜欢的音乐' /* VARSTRING(765) meta=765 nullable=1 is_null=0 */
###   @3=338811606 /* LONGINT meta=0 nullable=0 is_null=0 */
###   @4=1475019621016 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @5=1499430518207 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @6=0 /* LONGINT meta=0 nullable=1 is_null=0 */
###   @7=115 /* INT meta=0 nullable=1 is_null=0 */
# at 379612123
#170707 20:28:38 server id 61481  end_log_pos 379612036 CRC32 0x8f8d4a35        Query   thread_id=758257        exec_time=59329 error_code=0
SET TIMESTAMP=1499430518/*!*/;
SET @@session.sql_mode=1436549152/*!*/;
XA END X'3230375f313533353431363835313835373138305f',X'353034',39187
/*!*/;
# at 379612249
#170708 13:18:53 server id 254331  end_log_pos 379612306 CRC32 0xf2c69619       Rotate to db-music1-relay-bin.000003  pos: 4
ROLLBACK /* added by mysqlbinlog */ /*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

到这里,这个relay log日志文件结束了。很显然问题也找到了,就是执行

ROLLBACK /* added by mysqlbinlog */ /*!*/;

出错了。

首先 我们看到 这个rollback是MySQL自己加上去的,那么为什么要加呢?
mysql为了保证一个事务只在一个binlog里,所以当Binlog或者relay log发生截断时,最后一个事务要么commit,要么rollback,如果rollback,那么下一个binlog或者relay log会把这个事务重做一遍,保证这个事务不会丢。
由于xa事务无法直接rollback,而需要xa rollback ‘XXX’,所以复制就停了。
怎么修复?是不是直接跳过这个rollback就行了?
我们来试一下,跳过这个rollback:

STOP SLAVE;
SET GTID_NEXT="d199707a-5287-11e7-8f35-246e96729f10:1215544883";
BEGIN; COMMIT;
SET GTID_NEXT="AUTOMATIC";
START SLAVE;

这里GTID_NEXT值不能用show slave status的里executed值,得用具体报错的停止的gtid
但是,show slave status看到,还是有报错:

Error 'XAER_NOTA: Unknown XID' on query. Default database: 'XXXX'. Query: 'XA COMMIT X'3230375f313533353431363835313835373138305f',X'353034',39187'

为什么又报这个事务commit找不到XID呢?
之前说过,在relaylog截断的时候,如果事务没有commit,会自动在最后加rollback,在下一个relay log开始的时候重新做一次这个事务,按理说我们跳过这个rollback,在下个relaylog会被重做,为啥会在commit的时候找不到xid呢?
我们看到我们跳过的gtid原来就是重做这个事务到PREPARE阶段的gtid,原来,rollback是没有gtid的,所以我们实际上就是把这个事务到PREPARE阶段的gtid给跳过了,commit的时候肯定会找不到xid,接着怎么修复?

为什么这是5.7的一个bug呢?5.7之前的版本因为relaylog被截断并不会出现这个bug。
5.7对xa事务的binlog记录方式做了修改,把 xa start,xa end,xa prepare放到一个event里,xa commit又是另外一个event。而在之前的MySQL版本中,整个xa事务从start到commit都是在一个event中,所以其他版本并没有问题。
最后一个问题:5.7为啥要把xa事务拆成两个event?简单的讲是为了数据安全性。
5.5或者5.6假设下面一个场景:

MySQL在某个分布式事务prepare成功后宕机,宕机前操作该事务的连接并没有断开(如果在宕机前断开连接,事务会被MySQL自动回滚),这个时候已经prepare的事务并不会被回滚,所以在MySQL重新启动后,引擎层通过recover机制能恢复该事务。当然该事务的Binlog已经在宕机过程中被丢失,这个时候,如果去提交,则会造成主从数据的不一致, 即提交没有记录Binlog,从上丢失该条数据。

正因为5.7之前版本的xa事务存在这个bug,5.7后做了修复。从XA START到XA PREPARE之间的操作都被记录到了Master的Binlog中,然后通过复制关系传到了Slave上。也就是说5.7开始,MySQL对于XA事务,在prepare的时候就完成了写Binlog的操作,通过新增一种叫 XA_prepare_log_event 的event类型来实现。

其实 MySQL5.7在xa事务上远不止这个bug,后面再来慢慢总结。

上一篇下一篇

猜你喜欢

热点阅读