WatchDog运行原理

2018-03-19  本文已影响63人  小庄bb

概述

Android系统中,有硬件WatchDog用于定时检测关键硬件是否征程工作,类似地,在framework层有一个软件WatchDog用于定期检测关键系统服务是否发生死锁事件.WatchDog功能是主要分析系统核心服务和重要线程是否处于Blocked状态.

启动流程

startOtherServices

[-> SystemServer.java]

private void startOtherServices() {
    ...
    //创建watchdog【见小节2.2】
    final Watchdogwatchdog = Watchdog.getInstance();
    //注册reboot广播【见小节2.3】
    watchdog.init(context, mActivityManagerService);
    ...
    mSystemServiceManager.startBootPhase(SystemService.PHASE_LOCK_SETTINGS_READY);
    ...
    mActivityManagerService.systemReady(new Runnable() {
      @Override
      public void run() {
          mSystemServiceManager.startBootPhase(
                  SystemService.PHASE_ACTIVITY_MANAGER_READY);
          ...
          // watchdog启动【见小节3.1】
          Watchdog.getInstance().start();
          mSystemServiceManager.startBootPhase(
                  SystemService.PHASE_THIRD_PARTY_APPS_CAN_START);
        }
    }
}

getInstance

[-> Watchdog.java]

public static WatchdoggetInstance() {
    if (sWatchdog == null) {
        //单例模式,创建实例对象【见小节2.2.1 】
        sWatchdog = new Watchdog();
    }
    return sWatchdog;
}

创建Watchdog

[->Watchdog.java]

public class Watchdog extends Thread {
    ...
 
    private Watchdog() {
        super("watchdog");
        //【见小节2.2.2 】
        mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
                "foreground thread", DEFAULT_TIMEOUT);
        mHandlerCheckers.add(mMonitorChecker);
        mHandlerCheckers.add(new HandlerChecker(new Handler(Looper.getMainLooper()),
                "main thread", DEFAULT_TIMEOUT));
        mHandlerCheckers.add(new HandlerChecker(UiThread.getHandler(),
                "ui thread", DEFAULT_TIMEOUT));
        mHandlerCheckers.add(new HandlerChecker(IoThread.getHandler(),
                "i/o thread", DEFAULT_TIMEOUT));
        mHandlerCheckers.add(new HandlerChecker(DisplayThread.getHandler(),
                "display thread", DEFAULT_TIMEOUT));
        //【见小节2.2.3】
        addMonitor(new BinderThreadMonitor());
    }
}

Watchdog继承于Thread,创建的线程名为"watchdog".mHandlerCheckers是记录着所有的HandlerChecker对象的列表.

WatchDog监控的线程有:

线程名 对应handler 含义
main thread new Handler(MainLooper) 当前主线程
android.fg FgThread.getHandler 前台线程
android.ui UiThread.getHandler UI线程
android.io IoThread.getHandler IO线程
android.display DisplayThread.getHandler Display线程

DEFALT_TIMEOUT默认为60s,调试时才为10s方便找出潜在的ANR问题.

HandlerChecker

[-> Watchdog.java]

public final class HandlerChecker implements Runnable {
    ...
    HandlerChecker(Handlerhandler, String name, long waitMaxMillis) {
        mHandler = handler;
        mName = name;
        mWaitMax = waitMaxMillis;
        mCompleted = true;
    }
}

mMonitors记录所有Watchdog目前正在监控的服务.

监控Binder线程

在创建Watchdog时,通过addMonitor(new BinderThreadMonitor())来监控BInder线程,这里拆分成两步:

addMonitor
public class Watchdog extends Thread {
    public void addMonitor(Monitormonitor) {
        synchronized (this) {
            if (isAlive()) {
                throw new RuntimeException("Monitors can't be added once the Watchdog is running");
            }
            //此处mMonitorChecker数据类型为HandlerChecker
            mMonitorChecker.addMonitor(monitor);
        }
    }
 
    public final class HandlerChecker implements Runnable {
        private final ArrayList<Monitor> mMonitors = new ArrayList<Monitor>();
 
        public void addMonitor(Monitormonitor) {
            //将上面的BinderThreadMonitor添加到mMonitors队列
            mMonitors.add(monitor);
        }
        ...
    }
}

将monitor添加到HandlerChecker的成员变量mMonitors列表中.

BinderThreadMonitor
private static final class BinderThreadMonitorimplements Watchdog.Monitor {
    public void monitor() {
        Binder.blockUntilThreadAvailable();
    }
}

blockUntilThreadAvaiable最终调用的IPCThreadState,等待有空闲的binder线程

void IPCThreadState::blockUntilThreadAvailable()
{
    pthread_mutex_lock(&mProcess->mThreadCountLock);
    while (mProcess->mExecutingThreadsCount >= mProcess->mMaxThreads) {
        //等待正在执行的binder线程小于进程最大binder线程上限(16个)
        pthread_cond_wait(&mProcess->mThreadCountDecrement, &mProcess->mThreadCountLock);
    }
    pthread_mutex_unlock(&mProcess->mThreadCountLock);
}

可见addMonitor(new BinderThreadMOnitor())是将Binder线程添加到android.fg线程handler(mMonitorChecher)来检查是否工作正常.

Monitor

public class Watchdog extends Thread {
    public interface Monitor {
        void monitor();
    }
}

能够被Watchdog监控的服务系统都实现了Watchdog.Monitor接口,实现该接口类:

init

[-> Watchdog.java]

public void init(Contextcontext, ActivityManagerServiceactivity) {
    mResolver = context.getContentResolver();
    mActivity = activity;
    //注册reboot广播接收者【见小节2.3.1】
    context.registerReceiver(new RebootRequestReceiver(),
            new IntentFilter(Intent.ACTION_REBOOT),
            android.Manifest.permission.REBOOT, null);
}

RebootRequestReceiver

[-> Watchdog.java]

final class RebootRequestReceiver extends BroadcastReceiver {
    @Override
    public void onReceive(Context c, Intentintent) {
        if (intent.getIntExtra("nowait", 0) != 0) {
            //【见小节2.3.2】
            rebootSystem("Received ACTION_REBOOT broadcast");
            return;
        }
        Slog.w(TAG, "Unsupported ACTION_REBOOT broadcast: " + intent);
    }
}

rebootSystem

[->Watchdog.java]

void rebootSystem(String reason) {
    Slog.i(TAG, "Rebooting system because: " + reason);
    IPowerManagerpms = (IPowerManager)ServiceManager.getService(Context.POWER_SERVICE);
    try {
        //通过PowerManager执行reboot操作
        pms.reboot(false, reason, false);
    } catch (RemoteExceptionex) {
    }
}

最终通过PowerManagerService来完成重启操作,具体的重启流程后续会单独讲述.

小节

获取watchdog实例对象,并注册reboot广播

Watch.dog

run

public void run() {
    boolean waitedHalf = false;
    while (true) {
        final ArrayList<HandlerChecker> blockedCheckers;
        final String subject;
        final boolean allowRestart;
        int debuggerWasConnected = 0;
        synchronized (this) {
            //timeout=30s
            long timeout = CHECK_INTERVAL;
            for (int i=0; i<mHandlerCheckers.size(); i++) {
                HandlerCheckerhc = mHandlerCheckers.get(i);
                //【见小节3.2】
                hc.scheduleCheckLocked();
            }
 
            if (debuggerWasConnected > 0) {
                debuggerWasConnected--;
            }
 
            long start = SystemClock.uptimeMillis();
            //等待30s
            while (timeout > 0) {
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                try {
                    wait(timeout);
                } catch (InterruptedException e) {
                    Log.wtf(TAG, e);
                }
                if (Debug.isDebuggerConnected()) {
                    debuggerWasConnected = 2;
                }
                timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
            }
            //获取等待状态【见小节3.3】
            final int waitState = evaluateCheckerCompletionLocked();
            if (waitState == COMPLETED) {
                waitedHalf = false;
                continue;
            } else if (waitState == WAITING) {
                continue;
            } else if (waitState == WAITED_HALF) {
                if (!waitedHalf) {
                    //第一次进入等待时间过半的状态
                    ArrayList<Integer> pids = new ArrayList<Integer>();
                    pids.add(Process.myPid());
                    //则输出栈信息【见小节3.4】
                    ActivityManagerService.dumpStackTraces(true, pids, null, null,
                            NATIVE_STACKS_OF_INTEREST);
                    waitedHalf = true;
                }
                continue;
            }
            //获取被阻塞的checkers
            blockedCheckers = getBlockedCheckersLocked();
            subject = describeCheckersLocked(blockedCheckers);
            allowRestart = mAllowRestart;
        }
 
        EventLog.writeEvent(EventLogTags.WATCHDOG, subject);
 
        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        if (mPhonePid > 0) pids.add(mPhonePid);
        //waitedHalf=true,则追加输出栈信息【见小节3.4】
        final Filestack = ActivityManagerService.dumpStackTraces(
                !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
        //系统已被阻塞1分钟,也不在乎多等待2s来确保stack trace信息输出
        SystemClock.sleep(2000);
 
        if (RECORD_KERNEL_THREADS) {
            //输出kernel栈信息【见小节3.5】
            dumpKernelStackTraces();
        }
 
        //触发kernel来dump所有阻塞线程【见小节3.6】
        doSysRq('l');
        //输出dropbox信息【见小节3.7】
        ThreaddropboxThread = new Thread("watchdogWriteToDropbox") {
                public void run() {
                    mActivity.addErrorToDropBox(
                            "watchdog", null, "system_server", null, null,
                            subject, null, stack, null);
                }
            };
        dropboxThread.start();
        try {
            //等待dropbox线程工作2s
            dropboxThread.join(2000);
        } catch (InterruptedExceptionignored) {}
 
        IActivityControllercontroller;
        synchronized (this) {
            controller = mController;
        }
        if (controller != null) {
            //将阻塞状态报告给activity controller,
            try {
                Binder.setDumpDisabled("Service dumps disabled due to hung system process.");
                //返回值为1表示继续等待,-1表示杀死系统
                int res = controller.systemNotResponding(subject);
                if (res >= 0) {
                    waitedHalf = false; //继续等待
                    continue;
                }
            } catch (RemoteException e) {
            }
        }
 
        //当debugger没有attach时,才杀死进程
        if (Debug.isDebuggerConnected()) {
            debuggerWasConnected = 2;
        }
        if (debuggerWasConnected >= 2) {
            Slog.w(TAG, "Debugger connected: Watchdog is *not* killing the system process");
        } else if (debuggerWasConnected > 0) {
            Slog.w(TAG, "Debugger was connected: Watchdog is *not* killing the system process");
        } else if (!allowRestart) {
            Slog.w(TAG, "Restart not allowed: Watchdog is *not* killing the system process");
        } else {
            Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
            //遍历输出阻塞线程的栈信息
            for (int i=0; i<blockedCheckers.size(); i++) {
                Slog.w(TAG, blockedCheckers.get(i).getName() + " stack trace:");
                StackTraceElement[] stackTrace
                        = blockedCheckers.get(i).getThread().getStackTrace();
                for (StackTraceElementelement: stackTrace) {
                    Slog.w(TAG, "    at " + element);
                }
            }
            Slog.w(TAG, "*** GOODBYE!");
            //杀死进程system_server【见小节3.8】
            Process.killProcess(Process.myPid());
            System.exit(10);
        }
 
        waitedHalf = false;
    }
}

scheduleCheckLocked

public final class HandlerChecker implements Runnable {
    ...
    public void scheduleCheckLocked() {
        if (mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling()) {
            mCompleted = true;
            return;
        }
 
        if (!mCompleted) {
            return; //有一个check正在处理中,则无需重复发送
        }
 
        mCompleted = false;
        mCurrentMonitor = null;
        mStartTime = SystemClock.uptimeMillis();
        //发送消息,插入消息队列最开头【见3.2.1】
        mHandler.postAtFrontOfQueue(this);
    }
}

postAtFrountOfQueue(this),该方法输入参数为Runnable对象,根据消息机制HandlerCHecker中的run方法.

HandlerCheckerRun.run

public final class HandlerChecker implements Runnable {
    public void run() {
        final int size = mMonitors.size();
        for (int i = 0 ; i < size ; i++) {
            synchronized (Watchdog.this) {
                mCurrentMonitor = mMonitors.get(i);
            }
            //回调具体服务的monitor方法
            mCurrentMonitor.monitor();
        }
 
        synchronized (Watchdog.this) {
            mCompleted = true;
            mCurrentMonitor = null;
        }
    }
}

回调的方法,例如BinderTreadMonitor.monitor

evaluaterCHeckerCompletionLOcked

private int evaluateCheckerCompletionLocked() {
    int state = COMPLETED;
    for (int i=0; i<mHandlerCheckers.size(); i++) {
        HandlerCheckerhc = mHandlerCheckers.get(i);
        //【见小节3.3.1】
        state = Math.max(state, hc.getCompletionStateLocked());
    }
    return state;
}

获取mHandlerCheckers列表中等待状态值最大的state

getCompletionStateLocked

public int getCompletionStateLocked() {
    if (mCompleted) {
        return COMPLETED;
    } else {
        long latency = SystemClock.uptimeMillis() - mStartTime;
        if (latency < mWaitMax/2) {
            return WAITING;
        } else if (latency < mWaitMax) {
            return WAITED_HALF;
        }
    }
    return OVERDUE;
}

AMS.dumpStackTraces

public static FiledumpStackTraces(boolean clearTraces, ArrayList<Integer> firstPids,
        ProcessCpuTrackerprocessCpuTracker, SparseArray<Boolean> lastPids, String[] nativeProcs) {
    //默认为 data/anr/traces.txt
    String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
    if (tracesPath == null || tracesPath.length() == 0) {
        return null;
    }
 
    FiletracesFile = new File(tracesPath);
    try {
        //当clearTraces,则删除已存在的traces文件
        if (clearTraces && tracesFile.exists()) tracesFile.delete();
        //创建traces文件
        tracesFile.createNewFile();
        // -rw-rw-rw-
        FileUtils.setPermissions(tracesFile.getPath(), 0666, -1, -1);
    } catch (IOException e) {
        return null;
    }
    //输出trace内容
    dumpStackTraces(tracesPath, firstPids, processCpuTracker, lastPids, nativeProcs);
    return tracesFile;
}

关于trace内容,这里就不细说,直接说说结论:
1调用Process.sendSignal()向目标进程发送信号SIGNAL_QUIT;
2分别调用backtrace.dump_backtrace(),输出/system/bin/mediaserver, /system/bin/sdcard/, /sysatem/bin/surfacelinger这三个进程的backtrace;
3统计CPU使用率;
4调用Process.sendSignal()向其他进程发送信号SIGNAL_QUIT.

dumpKernelStackTrace

private FiledumpKernelStackTraces() {
    // 路径为data/anr/traces.txt
    String tracesPath = SystemProperties.get("dalvik.vm.stack-trace-file", null);
    if (tracesPath == null || tracesPath.length() == 0) {
        return null;
    }
 
    native_dumpKernelStacks(tracesPath);
    return new File(tracesPath);
}

native_dumpKernelStacks调用到android_server_Watchdog.dumpKernelStacks

doSysRq

private void doSysRq(char c) {
    try {
        FileWritersysrq_trigger = new FileWriter("/proc/sysrq-trigger");
        sysrq_trigger.write(c);
        sysrq_trigger.close();
    } catch (IOException e) {
        Slog.w(TAG, "Failed to write to /proc/sysrq-trigger", e);
    }
}

通过向节点/proc/sysrq-trigger写入字符,触发kernel来dump所有阻塞线程,输出所有CPU的backtrace到kernel log.

KillProcess

当杀死system_server进程,从而导致zygote进程自杀,进入触发init执行重启Zygote进程,这便出现了手机framwork重启现象.

小节

watchDog在check过程中出现了阻塞一分钟的情况,则会输出:

  1. AMS.dumpStackTraces
  1. dumpKernelStackTrace,输出kernel栈信息
  2. dropBox,输出文件到/data/system/dropbox

文章来自:
http://www.qingpingshan.com/rjbc/az/147237.html

上一篇下一篇

猜你喜欢

热点阅读