quartz-scheduler causes an InnoDB deadlock

2016-08-16 williamlee

Our project recently started using spring + quartz to run batch jobs. I happened to notice an InnoDB deadlock in the logs, and tracking it down took a lot of effort.

First, the exception:

2016-08-15 14:51:06.136 [QuartzScheduler_scheduler-WilliamLee1471243617837_ClusterManager] ERROR jdbc.sqlonly - 109. PreparedStatement.executeUpdate() INSERT INTO QRTZ_SCHEDULER_STATE (SCHED_NAME, INSTANCE_NAME, LAST_CHECKIN_TIME, CHECKIN_INTERVAL) VALUES('scheduler', 'WilliamLee1471243617837', 1471243866096, 10000)

com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
---
blablabla
---
    at org.quartz.impl.jdbcjobstore.JobStoreSupport.clusterCheckIn(JobStoreSupport.java:3401) [quartz-2.2.1.jar:na]
    at org.quartz.impl.jdbcjobstore.JobStoreSupport.doCheckin(JobStoreSupport.java:3253) [quartz-2.2.1.jar:na]
    at org.quartz.impl.jdbcjobstore.JobStoreSupport$ClusterManager.manage(JobStoreSupport.java:3858) [quartz-2.2.1.jar:na]
    at org.quartz.impl.jdbcjobstore.JobStoreSupport$ClusterManager.run(JobStoreSupport.java:3895) [quartz-2.2.1.jar:na]

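In cluster mode, quartz-scheduler nodes coordinate through the database. We use MySQL, whose default transaction isolation level is REPEATABLE READ (RR). Inside the framework, the scheduler's cluster state is maintained by ClusterManager, and the stack trace shows that the deadlock happened while it was executing JobStoreSupport.clusterCheckIn().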

So what does clusterCheckIn() do?
Let's start one level up, with doCheckin(). The key code:

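// Abridged excerpt from JobStoreSupport.doCheckin() (quartz 2.2.1).
Connection conn = getNonManagedTXConnection();
if (!firstCheckIn) {
    // Every check-in after the first: refresh our state row and commit right away.
    failedRecords = clusterCheckIn(conn);
    commitConnection(conn);
}
// On the first check-in, clusterCheckIn() runs here instead.
failedRecords = (firstCheckIn) ? clusterCheckIn(conn) : findFailedInstances(conn);
clusterRecover(conn, failedRecords);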

It first obtains a Connection with autocommit=false, then calls clusterCheckIn(), which returns failedRecords; finally, clusterRecover() deletes those failedRecords. Next, let's look at clusterCheckIn().

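// Record the check-in time, then try to refresh this instance's state row.
lastCheckin = System.currentTimeMillis();
if (getDelegate().updateSchedulerState(conn, getInstanceId(), lastCheckin) == 0) {
    // No row was updated, so this instance has no state row yet: insert one.
    getDelegate().insertSchedulerState(conn, getInstanceId(), lastCheckin, getClusterCheckinInterval());
}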

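It runs an UPDATE first, and if that affects 0 rows, it INSERTs the record. This is where the deadlock happens.
While debugging, I first issued the UPDATE statement, but it matched no row. Under the RR isolation level this takes a gap lock, which prevents other transactions from inserting into that gap. Meanwhile another thread (for example, another node) runs the same code: it issues its own UPDATE first and takes a gap lock too, because gap locks are compatible with each other. From this point on, whichever thread issues its INSERT will block, since an insert into the gap has to wait for the gap locks held by other transactions. When the second thread issues its INSERT, the deadlock occurs: transaction 1 is waiting for transaction 2 to release its gap lock so that it can finish its insert, and transaction 2 is waiting for transaction 1 to release its gap lock so that it can finish its insert.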
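The interleaving described above can be replayed with a toy model. This is only a sketch: it is not InnoDB or Quartz code, it models a single gap, and it encodes just the two rules from the paragraph above (gap locks never block each other; an insert into the gap waits for gap locks held by other transactions). The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model of the lock interaction; all names here are hypothetical.
public class GapLockDeadlockSketch {
    // Transactions currently holding a gap lock on the single gap we model.
    private final Set<String> gapHolders = new LinkedHashSet<>();
    // Wait-for graph: blocked transaction -> transactions it is waiting on.
    private final Map<String, Set<String>> waitsFor = new HashMap<>();

    // An UPDATE that matches no row takes a gap lock; gap locks are mutually compatible.
    public void updateMatchingNoRow(String tx) {
        gapHolders.add(tx);
    }

    // An INSERT into the gap waits for gap locks held by OTHER transactions.
    // Returns true if the insert can proceed immediately, false if it blocks.
    public boolean insertIntoGap(String tx) {
        Set<String> blockers = new LinkedHashSet<>(gapHolders);
        blockers.remove(tx);
        if (blockers.isEmpty()) {
            return true;
        }
        waitsFor.put(tx, blockers);
        return false;
    }

    // True if, following wait-for edges, tx ends up waiting on itself.
    public boolean isDeadlocked(String tx) {
        Deque<String> pending =
                new ArrayDeque<>(waitsFor.getOrDefault(tx, Collections.<String>emptySet()));
        Set<String> seen = new HashSet<>();
        while (!pending.isEmpty()) {
            String cur = pending.pop();
            if (cur.equals(tx)) {
                return true;
            }
            if (seen.add(cur)) {
                pending.addAll(waitsFor.getOrDefault(cur, Collections.<String>emptySet()));
            }
        }
        return false;
    }

    public static void main(String[] args) {
        GapLockDeadlockSketch db = new GapLockDeadlockSketch();
        db.updateMatchingNoRow("tx1");                   // tx1: UPDATE hits nothing, gap lock
        db.updateMatchingNoRow("tx2");                   // tx2: same, both gap locks coexist
        System.out.println(db.insertIntoGap("tx1"));     // false: blocked by tx2's gap lock
        System.out.println(db.isDeadlocked("tx1"));      // false: tx2 is not waiting yet
        System.out.println(db.insertIntoGap("tx2"));     // false: blocked by tx1's gap lock
        System.out.println(db.isDeadlocked("tx2"));      // true: tx1 and tx2 wait on each other
    }
}
```

The first INSERT only blocks; the cycle appears when the second transaction also tries to insert, which matches the order of events seen while stepping through the code in the debugger.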
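Personally, I think the quartz framework has not really solved this problem; it is just unlikely to show up under normal conditions. This bug was reported against 1.8.4 (bug link). Looking at the source, the fix was to add a commit partway through execution.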

failedRecords = clusterCheckIn(conn);
commitConnection(conn);

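Solution
Change the database's default isolation level to READ COMMITTED (RC). Under RC, InnoDB does not take gap locks for ordinary searches and index scans, so this problem does not occur.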
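If changing the server-wide default is not an option, the isolation level can also be set per connection through JDBC. The sketch below assumes you already have a java.sql.Connection to work with; the class and method names are hypothetical, and how to apply this to the connections Quartz obtains from its data source is not covered here.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Minimal sketch; class and method names are hypothetical.
public class ReadCommittedSketch {
    // Switches one JDBC connection to READ COMMITTED and returns the level
    // it had before, so the caller can restore it later if needed.
    public static int useReadCommitted(Connection conn) throws SQLException {
        int previous = conn.getTransactionIsolation();
        conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
        return previous;
    }
}
```

This only affects the connection it is called on, which makes it a narrower change than altering the database default.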
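Also, in our case the cause was debugging in the development environment: it blocked all threads, which raised the probability of a deadlock. After we changed the debug suspend setting to thread, it never happened again.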

The above is a summary from a rough reading of the quartz-scheduler source. If anything is wrong, corrections are welcome.
