WatchDog运行原理
概述
Android系统中,有硬件WatchDog用于定时检测关键硬件是否征程工作,类似地,在framework层有一个软件WatchDog用于定期检测关键系统服务是否发生死锁事件.WatchDog功能是主要分析系统核心服务和重要线程是否处于Blocked状态.
- 监视reboot广播;
- 监视mMonnitors关键系统服务是否死锁.
启动流程
startOtherServices
[-> SystemServer.java]
private void startOtherServices() {
...
//创建watchdog【见小节2.2】
final Watchdogwatchdog = Watchdog.getInstance();
//注册reboot广播【见小节2.3】
watchdog.init(context, mActivityManagerService);
...
mSystemServiceManager.startBootPhase(SystemService.PHASE_LOCK_SETTINGS_READY);
...
mActivityManagerService.systemReady(new Runnable() {
@Override
public void run() {
mSystemServiceManager.startBootPhase(
SystemService.PHASE_ACTIVITY_MANAGER_READY);
...
// watchdog启动【见小节3.1】
Watchdog.getInstance().start();
mSystemServiceManager.startBootPhase(
SystemService.PHASE_THIRD_PARTY_APPS_CAN_START);
}
}
}
getInstance
[-> Watchdog.java]
public static WatchdoggetInstance() {
if (sWatchdog == null) {
//单例模式,创建实例对象【见小节2.2.1 】
sWatchdog = new Watchdog();
}
return sWatchdog;
}
创建Watchdog
[->Watchdog.java]
public class Watchdog extends Thread {
...
private Watchdog() {
super("watchdog");
//【见小节2.2.2 】
mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
"foreground thread", DEFAULT_TIMEOUT);
mHandlerCheckers.add(mMonitorChecker);
mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
"main thread", DEFAULT_TIMEOUT));
mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
"ui thread", DEFAULT_TIMEOUT));
mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
"i/o thread", DEFAULT_TIMEOUT));
mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
"display thread", DEFAULT_TIMEOUT));
//【见小节2.2.3】
addMonitor(new BinderThreadMonitor());
}
}
Watchdog继承于Thread,创建的线程名为"watchdog".mHandlerCheckers是记录着所有的HandlerChecker对象的列表.
WatchDog监控的线程有:
线程名 | 对应handler | 含义 |
---|---|---|
main thread | new Handler(MainLooper) | 当前主线程 |
android.fg | FgThread.getHandler | 前台线程 |
android.ui | UiThread.getHandler | UI线程 |
android.io | IoThread.getHandler | IO线程 |
android.display | DisplayThread.getHandler | Display线程 |
DEFALT_TIMEOUT默认为60s,调试时才为10s方便找出潜在的ANR问题.
HandlerChecker
[-> Watchdog.java]
public final class HandlerChecker implements Runnable {
...
HandlerChecker(Handlerhandler, String name, long waitMaxMillis) {
mHandler = handler;
mName = name;
mWaitMax = waitMaxMillis;
mCompleted = true;
}
}
mMonitors记录所有Watchdog目前正在监控的服务.
监控Binder线程
在创建Watchdog时,通过addMonitor(new BinderThreadMonitor())来监控BInder线程,这里拆分成两步:
- addMonitor
- new BinderThreadMonitor
addMonitor
public class Watchdog extends Thread {
public void addMonitor(Monitormonitor) {
synchronized (this) {
if (isAlive()) {
throw new RuntimeException("Monitors can't be added once the Watchdog is running");
}
//此处mMonitorChecker数据类型为HandlerChecker
mMonitorChecker.addMonitor(monitor);
}
}
public final class HandlerChecker implements Runnable {
private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
public void addMonitor(Monitormonitor) {
//将上面的BinderThreadMonitor添加到mMonitors队列
mMonitors.add(monitor);
}
...
}
}
将monitor添加到HandlerChecker的成员变量mMonitors列表中.
BinderThreadMonitor
private static final class BinderThreadMonitorimplements Watchdog.Monitor {
public void monitor() {
Binder.blockUntilThreadAvailable();
}
}
blockUntilThreadAvaiable最终调用的IPCThreadState,等待有空闲的binder线程
void IPCThreadState::blockUntilThreadAvailable()
{
pthread_mutex_lock(&mProcess->mThreadCountLock);
while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
//等待正在执行的binder线程小于进程最大binder线程上限(16个)
pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
}
pthread_mutex_unlock(&mProcess->mThreadCountLock);
}
可见addMonitor(new BinderThreadMOnitor())是将Binder线程添加到android.fg线程handler(mMonitorChecher)来检查是否工作正常.
Monitor
public class Watchdog extends Thread {
public interface Monitor {
void monitor();
}
}
能够被Watchdog监控的服务系统都实现了Watchdog.Monitor接口,实现该接口类:
- AcitvityManagerService
- PowerManagerService
- WindowMangerService
- InputManagerService
- NetworkManagermentService
- MountService
- NativeDaemonConnector
- BinderThreadMonitor
- MediaProjectionMangerService
- MediaRouterService
- MediaSessionService
init
[-> Watchdog.java]
public void init(Contextcontext, ActivityManagerServiceactivity) {
mResolver = context.getContentResolver();
mActivity = activity;
//注册reboot广播接收者【见小节2.3.1】
context.registerReceiver(new RebootRequestReceiver(),
new IntentFilter(Intent.ACTION_REBOOT),
android.Manifest.permission.REBOOT, null);
}
RebootRequestReceiver
[-> Watchdog.java]
final class RebootRequestReceiver extends BroadcastReceiver {
@Override
public void onReceive(Context c, Intentintent) {
if (intent.getIntExtra("nowait", 0) != 0) {
//【见小节2.3.2】
rebootSystem("Received ACTION_REBOOT broadcast");
return;
}
Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
}
}
rebootSystem
[->Watchdog.java]
void rebootSystem(String reason) {
Slog.i(TAG, "Rebooting system because: " + reason);
IPowerManagerpms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
try {
//通过PowerManager执行reboot操作
pms.reboot(false, reason, false);
} catch (RemoteExceptionex) {
}
}
最终通过PowerManagerService来完成重启操作,具体的重启流程后续会单独讲述.
小节
获取watchdog实例对象,并注册reboot广播
- mHandlerCheckers记录所有的HandlerChecker对象的列表,包括foreground,main,ui,i/o,display线程的handler;
- mMonitors记录所有Watchdog目前正在监控Monitor,此处为BinderThreadMonitor;
- 注册reboot广播,最终是通过PowerMangerService来完成;
- 系统每间隔一分钟执行一侧monitor操作,当系统hang时间超过1分钟则执行run()方法;
下面来看看当系统hang超过1分钟时,进入watchdog.run()的过程:
Watch.dog
run
public void run() {
boolean waitedHalf = false;
while (true) {
final ArrayList<HandlerChecker> blockedCheckers;
final String subject;
final boolean allowRestart;
int debuggerWasConnected = 0;
synchronized (this) {
//timeout=30s
long timeout = CHECK_INTERVAL;
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerCheckerhc = mHandlerCheckers.get(i);
//【见小节3.2】
hc.scheduleCheckLocked();
}
if (debuggerWasConnected > 0) {
debuggerWasConnected--;
}
long start = SystemClock.uptimeMillis();
//等待30s
while (timeout > 0) {
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
try {
wait(timeout);
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
//获取等待状态【见小节3.3】
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
waitedHalf = false;
continue;
} else if (waitState == WAITING) {
continue;
} else if (waitState == WAITED_HALF) {
if (!waitedHalf) {
//第一次进入等待时间过半的状态
ArrayList<Integer> pids = new ArrayList<Integer>();
pids.add(Process.myPid());
//则输出栈信息【见小节3.4】
ActivityManagerService.dumpStackTraces(true, pids, null, null,
NATIVE_STACKS_OF_INTEREST);
waitedHalf = true;
}
continue;
}
//获取被阻塞的checkers
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;
}
EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
ArrayList<Integer> pids = new ArrayList<Integer>();
pids.add(Process.myPid());
if (mPhonePid > 0) pids.add(mPhonePid);
//waitedHalf=true,则追加输出栈信息【见小节3.4】
final Filestack = ActivityManagerService.dumpStackTraces(
!waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
//系统已被阻塞1分钟,也不在乎多等待2s来确保stack trace信息输出
SystemClock.sleep(2000);
if (RECORD_KERNEL_THREADS) {
//输出kernel栈信息【见小节3.5】
dumpKernelStackTraces();
}
//触发kernel来dump所有阻塞线程【见小节3.6】
doSysRq('l');
//输出dropbox信息【见小节3.7】
ThreaddropboxThread = new Thread("watchdogWriteToDropbox") {
public void run() {
mActivity.addErrorToDropBox(
"watchdog", null, "system_server", null, null,
subject, null, stack, null);
}
};
dropboxThread.start();
try {
//等待dropbox线程工作2s
dropboxThread.join(2000);
} catch (InterruptedExceptionignored) {}
IActivityControllercontroller;
synchronized (this) {
controller = mController;
}
if (controller != null) {
//将阻塞状态报告给activity controller,
try {
Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
//返回值为1表示继续等待,-1表示杀死系统
int res = controller.systemNotResponding(subject);
if (res >= 0) {
waitedHalf = false; //继续等待
continue;
}
} catch (RemoteException e) {
}
}
//当debugger没有attach时,才杀死进程
if (Debug.isDebuggerConnected()) {
debuggerWasConnected = 2;
}
if (debuggerWasConnected >= 2) {
Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
} else if (debuggerWasConnected > 0) {
Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
} else if (!allowRestart) {
Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
} else {
Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
//遍历输出阻塞线程的栈信息
for (int i=0; i<blockedCheckers.size(); i++) {
Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
StackTraceElement[] stackTrace
= blockedCheckers.get(i).getThread().getStackTrace();
for (StackTraceElementelement: stackTrace) {
Slog.w(TAG, " at " + element);
}
}
Slog.w(TAG, "*** GOODBYE!");
//杀死进程system_server【见小节3.8】
Process.killProcess(Process.myPid());
System.exit(10);
}
waitedHalf = false;
}
}
scheduleCheckLocked
public final class HandlerChecker implements Runnable {
...
public void scheduleCheckLocked() {
if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
mCompleted = true;
return;
}
if (!mCompleted) {
return; //有一个check正在处理中,则无需重复发送
}
mCompleted = false;
mCurrentMonitor = null;
mStartTime = SystemClock.uptimeMillis();
//发送消息,插入消息队列最开头【见3.2.1】
mHandler.postAtFrontOfQueue(this);
}
}
postAtFrountOfQueue(this),该方法输入参数为Runnable对象,根据消息机制HandlerCHecker中的run方法.
HandlerCheckerRun.run
public final class HandlerChecker implements Runnable {
public void run() {
final int size = mMonitors.size();
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
//回调具体服务的monitor方法
mCurrentMonitor.monitor();
}
synchronized (Watchdog.this) {
mCompleted = true;
mCurrentMonitor = null;
}
}
}
回调的方法,例如BinderTreadMonitor.monitor
evaluaterCHeckerCompletionLOcked
private int evaluateCheckerCompletionLocked() {
int state = COMPLETED;
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerCheckerhc = mHandlerCheckers.get(i);
//【见小节3.3.1】
state = Math.max(state, hc.getCompletionStateLocked());
}
return state;
}
获取mHandlerCheckers列表中等待状态值最大的state
getCompletionStateLocked
public int getCompletionStateLocked() {
if (mCompleted) {
return COMPLETED;
} else {
long latency = SystemClock.uptimeMillis() - mStartTime;
if (latency < mWaitMax/2) {
return WAITING;
} else if (latency < mWaitMax) {
return WAITED_HALF;
}
}
return OVERDUE;
}
- COMPLETED = 0 : 等待完成 ;
- WAITING = 1 : 等待时间小于DEFAULT_TIMEOUT的一半,即30s;
- WAITED_HALF = 2 : 等待时间处于30s-60s之间;
- OVERDUE = 3 : 等待时间大于或者等于60s.
AMS.dumpStackTraces
public static FiledumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
ProcessCpuTrackerprocessCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
//默认为 data/anr/traces.txt
String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
if (tracesPath == null || tracesPath.length() == 0) {
return null;
}
FiletracesFile = new File(tracesPath);
try {
//当clearTraces,则删除已存在的traces文件
if (clearTraces && tracesFile.exists()) tracesFile.delete();
//创建traces文件
tracesFile.createNewFile();
// -rw-rw-rw-
FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1);
} catch (IOException e) {
return null;
}
//输出trace内容
dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
return tracesFile;
}
关于trace内容,这里就不细说,直接说说结论:
1调用Process.sendSignal()向目标进程发送信号SIGNAL_QUIT;
2分别调用backtrace.dump_backtrace(),输出/system/bin/mediaserver, /system/bin/sdcard/, /sysatem/bin/surfacelinger这三个进程的backtrace;
3统计CPU使用率;
4调用Process.sendSignal()向其他进程发送信号SIGNAL_QUIT.
dumpKernelStackTrace
private FiledumpKernelStackTraces() {
// 路径为data/anr/traces.txt
String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
if (tracesPath == null || tracesPath.length() == 0) {
return null;
}
native_dumpKernelStacks(tracesPath);
return new File(tracesPath);
}
native_dumpKernelStacks调用到android_server_Watchdog.dumpKernelStacks
doSysRq
private void doSysRq(char c) {
try {
FileWritersysrq_trigger = new FileWriter("/proc/sysrq-trigger");
sysrq_trigger.write(c);
sysrq_trigger.close();
} catch (IOException e) {
Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
}
}
通过向节点/proc/sysrq-trigger写入字符,触发kernel来dump所有阻塞线程,输出所有CPU的backtrace到kernel log.
KillProcess
当杀死system_server进程,从而导致zygote进程自杀,进入触发init执行重启Zygote进程,这便出现了手机framwork重启现象.
小节
watchDog在check过程中出现了阻塞一分钟的情况,则会输出:
- AMS.dumpStackTraces
- kill -3
- backtrace.dump_backtrace()
- dumpKernelStackTrace,输出kernel栈信息
- dropBox,输出文件到/data/system/dropbox