How Android ANRs Are Triggered

2018-11-27 CyanStone

Overview

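ANR stands for Application Not Responding. In Android, ActivityManagerService (AMS) and WindowManagerService (WMS) monitor how long an app takes to respond. Before the main thread of the app process handles certain events, a Handler owned by AMS, BroadcastQueue, or a related component sends a delayed message to a Looper in the system process. If the event finishes within the delay, the delayed message is removed from the MessageQueue. If it does not, the message stays in the queue, the Looper eventually dequeues and handles it, and that triggers the ANR. This is the basic mechanism behind ANRs.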
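The mechanism described above can be modeled in plain Java. The following is only a sketch under stated assumptions: it uses a ScheduledExecutorService in place of the system-side Handler and Looper, and the class AnrWatchdogSketch and its methods are hypothetical names, not framework code.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the ANR timeout pattern; not framework code.
final class AnrWatchdogSketch {

    /**
     * Runs the given work guarded by a delayed "timeout message".
     * Returns true if the timeout fired before the work finished.
     */
    static boolean runGuarded(Runnable work, long timeoutMs) {
        // Stands in for the system-side Handler/Looper thread.
        ScheduledExecutorService watchdog = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean anrRaised = new AtomicBoolean(false);
        try {
            // Step 1: post the delayed message before the event is handled.
            ScheduledFuture<?> timeout = watchdog.schedule(
                    () -> anrRaised.set(true), timeoutMs, TimeUnit.MILLISECONDS);

            // The calling thread plays the application's main thread.
            work.run();

            // Step 2: the event finished, so remove the pending message.
            // If it already ran, cancel() changes nothing and the flag stays set.
            timeout.cancel(false);
            return anrRaised.get();
        } finally {
            watchdog.shutdownNow();
        }
    }

    /** Sleeps without forcing callers to handle InterruptedException. */
    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        boolean fast = runGuarded(() -> { }, 500);
        boolean slow = runGuarded(() -> sleepQuietly(300), 50);
        System.out.println("fast work raised ANR: " + fast);
        System.out.println("slow work raised ANR: " + slow);
    }
}
```

The important design point is that the timer lives on a different thread from the one doing the work, so a blocked worker cannot prevent the timeout from firing.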

Conditions That Trigger an ANR

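InputDispatching Timeout: a touch or key event receives no response within 5 seconds;
BroadcastQueue Timeout: a BroadcastReceiver's onReceive() does not finish within 10 seconds for a foreground broadcast, or within 60 seconds for a background broadcast;
Service Timeout: a foreground Service does not finish executing within 20 seconds, or a background Service within 200 seconds;
ContentProvider Timeout: a ContentProvider's publish step does not finish within 10 seconds.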
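For quick reference, the limits listed above can be collected in one place. This is a hypothetical lookup table: the enum AnrTimeoutSketch and its constant names are illustrative and do not exist in the framework; only the durations come from the list.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical lookup table; the enum and its names are illustrative only.
enum AnrTimeoutSketch {
    INPUT_DISPATCHING(5),
    BROADCAST_FOREGROUND(10),
    BROADCAST_BACKGROUND(60),
    SERVICE_FOREGROUND(20),
    SERVICE_BACKGROUND(200),
    CONTENT_PROVIDER_PUBLISH(10);

    private final long seconds;

    AnrTimeoutSketch(long seconds) {
        this.seconds = seconds;
    }

    /** The limit in milliseconds. */
    long millis() {
        return TimeUnit.SECONDS.toMillis(seconds);
    }

    /** True if an event that has been running for elapsedMs is past this limit. */
    boolean isExceededBy(long elapsedMs) {
        return elapsedMs > millis();
    }

    public static void main(String[] args) {
        for (AnrTimeoutSketch t : values()) {
            System.out.println(t + " = " + t.millis() + " ms");
        }
    }
}
```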

Source Code Analysis (Based on Android 8.0)

1. Service

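ActiveServices is an object managed by AMS. It is responsible for starting, stopping, and binding Services. The full Service startup flow is not covered here; instead, we look at realStartServiceLocked, the method that actually starts the Service after ContextImpl.startService() is called: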

private final void realStartServiceLocked(ServiceRecord r, ProcessRecord app,
    boolean execInFg) throws RemoteException {
    ...
    // Posts the delayed timeout message
    bumpServiceExecutingLocked(r, execInFg, "create");
    ...
    // Creates the Service and calls its onCreate(); not analyzed further here
    app.thread.scheduleCreateService(r, r.serviceInfo, 
        mAm.compatibilityInfoForPackageLocked(r.serviceInfo.applicationInfo),
        app.repProcState);
   ...
}

Next, look at bumpServiceExecutingLocked:

private final void bumpServiceExecutingLocked(ServiceRecord r, boolean fg, String why) {
  ...
  scheduleServiceTimeoutLocked(r.app);
  ...
}

void scheduleServiceTimeoutLocked(ProcessRecord proc) {
    if (proc.executingServices.size() == 0 || proc.thread == null) {
        return;
    }
    Message msg = mAm.mHandler.obtainMessage(
            ActivityManagerService.SERVICE_TIMEOUT_MSG);
    msg.obj = proc;
    // execServicesFg indicates whether the Service must run in the foreground
    mAm.mHandler.sendMessageDelayed(msg,
            proc.execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT);
}

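As shown, mAm.mHandler sends a delayed message (whose what value is ActivityManagerService.SERVICE_TIMEOUT_MSG) to the MessageQueue of the thread that the AMS handler runs on. The delay depends on whether the Service is executing in the foreground: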

// Defined in ActivityManagerService: the what value of the Service timeout message
static final int SERVICE_TIMEOUT_MSG = 12;

// ActiveServices.java
// How long we wait for a service to finish executing.
// Foreground Service: the timeout is 20 seconds
static final int SERVICE_TIMEOUT = 20*1000;
// How long we wait for a service to finish executing.
// Background Service: 10 times the foreground timeout, i.e. 200 seconds
static final int SERVICE_BACKGROUND_TIMEOUT = SERVICE_TIMEOUT * 10;

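This posts the delayed message to the MessageQueue of the AMS handler thread and starts the Service. If the Service finishes before the delay elapses, the queued message must be removed. Where does that happen? Following the call chain leads to handleCreateService in ActivityThread: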

private void handleCreateService(CreateServiceData data) {
    ...
    Service service = null;
    java.lang.ClassLoader cl = loadedApk.getClassLoader();
    // Create the Service instance via reflection
    service = (Service) cl.loadClass(data.info.name).newInstance();
    ...
    // Call the Service's onCreate()
    service.onCreate();
    ...
    // Tell AMS over Binder that the Service has finished executing
    ActivityManager.getService().serviceDoneExecuting(data.token, SERVICE_DONE_EXECUTING_ANON, 0, 0);
   ...
}

After the Service's onCreate() returns, serviceDoneExecuting in AMS is called over Binder to report that the Service has started. Here is serviceDoneExecuting in AMS:

public void serviceDoneExecuting(IBinder token, int type, int startId, int res) {
    synchronized(this) {
        ...
        mServices.serviceDoneExecutingLocked((ServiceRecord) token, type, startId, res);
    }
}

serviceDoneExecuting in AMS delegates directly to serviceDoneExecutingLocked in ActiveServices:

void serviceDoneExecutingLocked(ServiceRecord r, int type, int startId, int res) {
    ...
    serviceDoneExecutingLocked(r, inDestroying, inDestroying);
    ...
}

private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying,
      boolean finishing) {
    ...
    // The delayed timeout message is removed here
    mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);
    ...
}

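That covers how the delayed message is added and removed. What happens if the Service does not finish within the delay? Readers familiar with Android's asynchronous message mechanism will know to look at how mAm.mHandler handles SERVICE_TIMEOUT_MSG. mAm.mHandler is an instance of MainHandler, an inner class defined in AMS: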

//ActivityManagerService.java
final class MainHandler extends Handler {
    public MainHandler(Looper looper) {
        super(looper, null, true);
    }
    @Override 
    public void handleMessage(Message msg) {
        switch (msg.what) {
            case SERVICE_TIMEOUT_MSG: {
                mServices.serviceTimeout((ProcessRecord) msg.obj);
            } break;
        }
    }
}

The timeout is then handed back to the ActiveServices object:

 void serviceTimeout(ProcessRecord proc) {
    ...
    if (anrMessage != null) {
         mAm.mAppErrors.appNotResponding(proc, null, null, false, anrMessage);
    }
}

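Finally, the AppErrors object reports the ANR to the user. The method that actually carries out the ANR is not analyzed here.
That completes the source analysis of the Service ANR. The flow is: 1. a delayed message is added before the event runs; 2. the delayed message is removed once the event finishes; 3. if the event does not finish within the delay, the delayed message is handled and an ANR occurs.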
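The three steps can be traced with a small single-threaded model that uses a virtual clock instead of real time. Everything here is an assumption made for illustration: TimeoutQueueSketch is a hypothetical class that only imitates the sendMessageDelayed and removeMessages(what, obj) calls seen in the excerpts, and it matches obj by identity.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, single-threaded model of a delayed message queue with a virtual clock.
final class TimeoutQueueSketch {
    // Same value as SERVICE_TIMEOUT_MSG in the ActivityManagerService excerpt.
    static final int SERVICE_TIMEOUT_MSG = 12;

    private static final class Msg {
        final int what;
        final Object obj;
        final long when;

        Msg(int what, Object obj, long when) {
            this.what = what;
            this.obj = obj;
            this.when = when;
        }
    }

    private final List<Msg> pending = new ArrayList<>();
    private long now = 0;

    /** Objects whose timeout message was handled, i.e. the modeled ANRs. */
    final List<Object> timedOut = new ArrayList<>();

    /** Step 1: queue a timeout message that becomes due after delayMs. */
    void sendMessageDelayed(int what, Object obj, long delayMs) {
        pending.add(new Msg(what, obj, now + delayMs));
    }

    /** Step 2: drop pending messages that match both what and obj (by identity). */
    void removeMessages(int what, Object obj) {
        pending.removeIf(m -> m.what == what && m.obj == obj);
    }

    /** Step 3: advance the virtual clock and handle every message that is now due. */
    void advance(long ms) {
        now += ms;
        Iterator<Msg> it = pending.iterator();
        while (it.hasNext()) {
            Msg m = it.next();
            if (m.when <= now) {
                it.remove();
                timedOut.add(m.obj);
            }
        }
    }

    public static void main(String[] args) {
        TimeoutQueueSketch queue = new TimeoutQueueSketch();
        String procA = "procA";
        String procB = "procB";
        queue.sendMessageDelayed(SERVICE_TIMEOUT_MSG, procA, 20_000);
        queue.sendMessageDelayed(SERVICE_TIMEOUT_MSG, procB, 20_000);

        queue.advance(5_000);
        queue.removeMessages(SERVICE_TIMEOUT_MSG, procA); // procA finished in time

        queue.advance(15_000);                            // procB is still executing
        System.out.println("timed out: " + queue.timedOut); // prints "timed out: [procB]"
    }
}
```

Keying the removal on both the message type and the process object is what lets one process finish without cancelling the timeout of another.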

2. BroadcastReceiver

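The full flow of registering and receiving broadcasts is not analyzed here. What matters is that receivers are registered with AMS, and when AMS receives a broadcast, the method that ultimately processes it is processNextBroadcast in BroadcastQueue: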

final void processNextBroadcast(boolean fromMsg) {
   ...
        do {
            r = mOrderedBroadcasts.get(0);
            // Number of receivers for this broadcast
            int numReceivers = (r.receivers != null) ? r.receivers.size() : 0;
            if (mService.mProcessesReady && r.dispatchTime > 0) {
                long now = SystemClock.uptimeMillis();
                if ((numReceivers > 0) &&
                        (now > r.dispatchTime + (2*mTimeoutPeriod*numReceivers))) {
                    // The broadcast has exceeded its overall time limit, so force it to finish
                    broadcastTimeoutLocked(false);
                    ...
                }
            }
            if (r.receivers == null || r.nextReceiver >= numReceivers
                    || r.resultAbort || forceReceive) {
                if (r.resultTo != null) {
                    // Deliver the broadcast to the result receiver (r.resultTo)
                    performReceiveLocked(r.callerApp, r.resultTo,
                        new Intent(r.intent), r.resultCode,
                        r.resultData, r.resultExtras, false, false, r.userId);
                    r.resultTo = null;
                }
                // Finished, so cancel the pending timeout
                cancelBroadcastTimeoutLocked();
                ...
                mOrderedBroadcasts.remove(0);
               ...
            }
        } while (r == null);
        ...

        // Get the next receiver of the ordered broadcast and record its start time
        r.receiverTime = SystemClock.uptimeMillis();
        if (!mPendingBroadcastTimeoutMessage) {
            long timeoutTime = r.receiverTime + mTimeoutPeriod;
            // Post the delayed message; the delay is mTimeoutPeriod
            setBroadcastTimeoutLocked(timeoutTime);
        }
        ...
}

As the code shows, setBroadcastTimeoutLocked posts the delayed message, and the message is removed once all registered receivers have finished. Here are setBroadcastTimeoutLocked and cancelBroadcastTimeoutLocked:

final void setBroadcastTimeoutLocked(long timeoutTime) {
    if (!mPendingBroadcastTimeoutMessage) {
        Message msg = mHandler.obtainMessage(BROADCAST_TIMEOUT_MSG, this);
        mHandler.sendMessageAtTime(msg, timeoutTime);
        mPendingBroadcastTimeoutMessage = true;
    }
}

final void cancelBroadcastTimeoutLocked() {
    if (mPendingBroadcastTimeoutMessage) {
        mHandler.removeMessages(BROADCAST_TIMEOUT_MSG, this);
        mPendingBroadcastTimeoutMessage = false;
    }
}

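These two methods simply add and remove the message. timeoutTime is computed as r.receiverTime + mTimeoutPeriod, where receiverTime is the current value of SystemClock.uptimeMillis() and mTimeoutPeriod is passed in when the BroadcastQueue is constructed. The BroadcastQueue instances are created in AMS: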

// ActivityManagerService.java
// Timeout for foreground broadcasts
static final int BROADCAST_FG_TIMEOUT = 10*1000;
// Timeout for background broadcasts
static final int BROADCAST_BG_TIMEOUT = 60*1000;

// Foreground broadcast queue
BroadcastQueue mFgBroadcastQueue;
// Background broadcast queue
BroadcastQueue mBgBroadcastQueue;

public ActivityManagerService(Context systemContext) {
    mFgBroadcastQueue = new BroadcastQueue(this, mHandler,
                "foreground", BROADCAST_FG_TIMEOUT, false);
    mBgBroadcastQueue = new BroadcastQueue(this, mHandler,
                "background", BROADCAST_BG_TIMEOUT, true);
}

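AMS therefore maintains a foreground broadcast queue and a background broadcast queue, with timeouts of 10 seconds and 60 seconds respectively. Next, look at how the timeout message is handled. The mHandler that sends the message is an instance of BroadcastHandler, an inner class of BroadcastQueue: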

 final BroadcastHandler mHandler;
 private final class BroadcastHandler extends Handler {
    public BroadcastHandler(Looper looper) {
        super(looper, null, true);
    }
    @Override
    public void handleMessage(Message msg) {
        switch (msg.what) {
            case BROADCAST_INTENT_MSG: {
                if (DEBUG_BROADCAST) Slog.v(
                        TAG_BROADCAST, "Received BROADCAST_INTENT_MSG");
                processNextBroadcast(true);
            } break;
            case BROADCAST_TIMEOUT_MSG: {
                synchronized (mService) {
                    broadcastTimeoutLocked(true);
                }
            } break;
        }
    }
}

On timeout, broadcastTimeoutLocked runs and triggers the ANR.

final void broadcastTimeoutLocked(boolean fromMsg) {
    ...
    if (anrMessage != null) {
        // Post the ANR to the handler since we do not want to process ANRs while
        // potentially holding our lock.
        mHandler.post(new AppNotResponding(app, anrMessage));
    }
}

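That covers the source code behind Broadcast ANRs. The flow is the same as for Service: 1. a delayed message is added before the event runs; 2. the delayed message is removed once the event finishes; 3. if the event does not finish within the delay, the delayed message is handled and an ANR occurs.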
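The processNextBroadcast excerpt contains two separate time checks: a per-receiver deadline, r.receiverTime + mTimeoutPeriod, and an overall limit for the broadcast, r.dispatchTime + 2*mTimeoutPeriod*numReceivers. A worked example may make the difference concrete. The helper class BroadcastDeadlineSketch and the sample start time of 1000 ms are assumptions made for illustration; only the two formulas and the timeout constants come from the excerpts.

```java
// Hypothetical helper that mirrors the two time checks in the processNextBroadcast excerpt.
final class BroadcastDeadlineSketch {
    static final long BROADCAST_FG_TIMEOUT = 10 * 1000;
    static final long BROADCAST_BG_TIMEOUT = 60 * 1000;

    /** Deadline for a single receiver that started at receiverTime. */
    static long receiverDeadline(long receiverTime, long timeoutPeriod) {
        return receiverTime + timeoutPeriod;
    }

    /** Overall limit for an ordered broadcast dispatched at dispatchTime. */
    static long broadcastDeadline(long dispatchTime, long timeoutPeriod, int numReceivers) {
        return dispatchTime + 2 * timeoutPeriod * numReceivers;
    }

    public static void main(String[] args) {
        long start = 1000; // sample uptime in milliseconds
        int receivers = 3;

        // Foreground queue: each receiver gets 10 s, the broadcast 2 * 10 s * 3 = 60 s.
        System.out.println(receiverDeadline(start, BROADCAST_FG_TIMEOUT));             // 11000
        System.out.println(broadcastDeadline(start, BROADCAST_FG_TIMEOUT, receivers)); // 61000

        // Background queue: each receiver gets 60 s, the broadcast 2 * 60 s * 3 = 360 s.
        System.out.println(receiverDeadline(start, BROADCAST_BG_TIMEOUT));             // 61000
        System.out.println(broadcastDeadline(start, BROADCAST_BG_TIMEOUT, receivers)); // 361000
    }
}
```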

3. ContentProvider

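A ContentProvider timeout is triggered when AMS.MainHandler, which runs on the ActivityManager thread, receives a CONTENT_PROVIDER_PUBLISH_TIMEOUT_MSG message. The logic mirrors that of Service and BroadcastReceiver, so the source is not analyzed here; interested readers can look it up themselves.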

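Similarly, when AMS starts an Activity, it applies comparable timeouts to starting and pausing the Activities involved. The pause timeout is 500 milliseconds, so avoid time-consuming work in onPause and move it to onStop instead, since the timeouts for onStop and onDestroy are both 10 seconds.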

    // How long we wait until giving up on the last activity to pause.  This
    // is short because it directly impacts the responsiveness of starting the
    // next activity.
    private static final int PAUSE_TIMEOUT = 500;

    // How long we wait for the activity to tell us it has stopped before
    // giving up.  This is a good amount of time because we really need this
    // from the application in order to get its saved state.
    private static final int STOP_TIMEOUT = 10 * 1000;

    // How long we wait until giving up on an activity telling us it has
    // finished destroying itself.
    private static final int DESTROY_TIMEOUT = 10 * 1000;

How to Avoid ANRs

The ANR mechanisms in Android all come down to monitoring whether the main thread is blocked. To avoid ANRs, keep the following in mind:

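Avoid time-consuming operations on the main thread.
If a Service, BroadcastReceiver, or ContentProvider needs to do time-consuming work, run it asynchronously with a suitable multithreading technique.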
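One way to follow this advice is to run the slow work on a worker thread and hand only the result back to the main thread. The sketch below is plain Java rather than Android code: a single-thread executor named main-stand-in plays the role of the main thread, and OffloadSketch, newMainStandIn, and loadAsync are hypothetical names. On Android the hand-back would typically go through a Handler bound to the main Looper.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Plain-Java sketch: a single-thread executor stands in for the Android main thread.
final class OffloadSketch {

    /** Creates the executor that plays the role of the main thread. */
    static ExecutorService newMainStandIn() {
        return Executors.newSingleThreadExecutor(r -> new Thread(r, "main-stand-in"));
    }

    /** Runs slow work on the worker and delivers its result on mainThread. */
    static CompletableFuture<String> loadAsync(
            ExecutorService worker, ExecutorService mainThread, long workMs) {
        return CompletableFuture
                .supplyAsync(() -> {
                    sleepQuietly(workMs); // simulated blocking I/O
                    return "result";
                }, worker)
                .thenApplyAsync(
                        value -> value + " on " + Thread.currentThread().getName(),
                        mainThread);
    }

    /** Sleeps without forcing callers to handle InterruptedException. */
    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        ExecutorService mainThread = newMainStandIn();
        try {
            CompletableFuture<String> pending = loadAsync(worker, mainThread, 300);
            // A short task still runs promptly on the "main thread" while the work is in flight.
            CompletableFuture.runAsync(() -> { }, mainThread).join();
            System.out.println("main stayed responsive: " + !pending.isDone());
            System.out.println(pending.join());
        } finally {
            worker.shutdown();
            mainThread.shutdown();
        }
    }
}
```

Because the main-thread stand-in only ever runs the short delivery step, any timeout that watches it would see it return quickly regardless of how long the work takes.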

References

Android ANR：原理分析及解决办法 (Android ANR: Principles and Solutions)