稳定性专题 -- FileDescriptor LEAK

2019-05-21 本文已影响0人 lynn_doo

Flyme Stability debug

概述

FD 即 file descriptor,Linux限制了每个进程打开fd(file descriptor)的最大数量，如果超过这个阈值则会打开文件失败。当文件句柄分配失败的模块抛出异常时就有可能造成应用或者系统崩溃，这是系统中非常常见的一种稳定性问题。本文主要讲述：

如何知道程序的崩溃是由fd leak引起的
Flyme的稳定性团队针对该问题增加了哪些调试feature
如何定位fdleak的root cause
案例分享

FD Leak

fd leak引起的error有两种:

#define ENFILE      23  /* Too many open files in system */
#define EMFILE      24  /* Too many open files */

我们这里讨论的是 EMFILE，即进程的fd资源限制。单个进程可以打开的最大fd个数可以通过命令

 $cat /proc/[pid]/limits 
或者  
$ulimit -n

也可以通过代码rlimit = android.system.Os.getrlimit(OsConstants.RLIMIT_NOFILE); 或者 sysconf(_SC_OPEN_MAX)。
通过以下命令可以查看当前fd:

m1882:/ # ls -l /proc/3302/fd/                                                 
total 0
lrwx------ 1 root root 64 2018-01-01 00:03 0 -> /dev/null
lrwx------ 1 root root 64 2018-01-01 00:03 1 -> /dev/null
lrwx------ 1 root root 64 2018-01-01 00:03 10 -> socket:[58872]
l-wx------ 1 root root 64 2018-01-01 00:03 100 -> pipe:[63567]
lrwx------ 1 root root 64 2018-01-01 00:03 101 -> socket:[55277]
lr-x------ 1 root root 64 2018-01-01 00:03 102 -> pipe:[55278]
l-wx------ 1 root root 64 2018-01-01 00:03 103 -> pipe:[55278]
lrwx------ 1 root root 64 2018-01-01 00:03 104 -> socket:[63568]

m1882:/ # ls -l /proc/3302/fd/* | wc -l                                        
309

这里注意lsof的使用:
lsof -p [pid] : lsof的结果包含了并非以fd形式打开的文件，比如用mmap方式访问文件（FD一栏显示为mem），实际并不占用fd。

m1882:/ # lsof -p 3302                                                         
COM...      PID       USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
netmgrd    3302      radio  cwd       DIR                0,2       800          2 /
netmgrd    3302      radio  rtd       DIR                0,2       800          2 /
netmgrd    3302      radio  txt       REG              253,0   1535320        250 /vendor/bin/netmgrd
netmgrd    3302      radio  mem       REG              253,0   1535320        250 /vendor/bin/netmgrd
netmgrd    3302      radio  mem       CHR              10,52                23594 /dev/hwbinder
netmgrd    3302      radio  mem       REG              253,1     15176       3220 /system/lib64/libnetd_client.so
netmgrd    3302      radio  mem       REG              253,1     27776       3365 /system/lib64/libvendorconn.so
...
netmgrd    3302      radio   20u     sock                          0t0      62631 socket:[62631]
netmgrd    3302      radio   21r     FIFO                0,9       0t0      62632 pipe:[62632]
netmgrd    3302      radio   22w     FIFO                0,9       0t0      62632 pipe:[62632]
netmgrd    3302      radio   23u     sock                          0t0      59001 socket:[59001]
netmgrd    3302      radio   24u     sock                          0t0      59002 socket:[59002]

FD 具体编号的如20u, 21u的才真正占用fd, 而fd为mem则是使用mmap映射的，包括so文件。

如何定位是fdleak的问题

当watchdog fd monitor未打开

通常来讲我们能直接从日志中发现 java crash或native crash，并通过调用栈来定位问题。但是fd 引起的崩溃有时并没有那么明显，因为android中部分代码在文件打开失败而抛出的异常中并没有打印error no。比如下面这样(bug#843217)：

1386  1467 E AndroidRuntime: *** FATAL EXCEPTION IN SYSTEM PROCESS: android.fg
1386  1467 E AndroidRuntime: java.lang.RuntimeException: failed to set system property
1386  1467 E AndroidRuntime:     at android.os.SystemProperties.native_set(Native Method)
1386  1467 E AndroidRuntime:     at android.os.SystemProperties.set(SystemProperties.java:155)
1386  1467 E AndroidRuntime:     at ...ager$UsbHandler.setUsbConfig(UsbDeviceManager.java:752)

static void SystemProperties_set(JNIEnv *env, jobject clazz,
                                      jstring keyJ, jstring valJ)
{
...
   err = property_set(key, val);
    if (err < 0) {
        jniThrowException(env, "java/lang/RuntimeException",
                          "failed to set system property");
    }
...
}

但是大部分情况来说，当fd打开失败时通常会打印errno = 24，比如:

10-17 14:05:02.497  1386  1627 E InputTransport: channel '2272391 Splash Screen com.hexin.plat.android' ~ Could not create socket pair.  errno=24

虽然没有引起崩溃，但是能知道此时1386进程fd已达到上线，property_set需要分配文件句柄所以此时err为24.
上述是java层的崩溃，native堆栈道理相同，也就是说当crash信息中不能直接发现是EMFILE的时候，我们仍然可以在日志中搜索errno为24这样的关键信息并结合crash stack信息来定位是否fd leak引起的崩溃。

当打开 watchdog fd monitor

另外一种情况是我们打开了fd monitor，这是android O在Watchdog.java中新增功能监听systemserver进程fd的使用情况,仅在Build.IS_DEBUGGABLE打开的时候enable(AOSP 上，Flyme上做了优化)。参考代码Watchdog.java 中的OpenFdMonitor类实现，基本原理是：
在watchdog中每30秒监测/proc/self/fd/[rlimit-FD_HIGH_WATER_MARK]是否存在，如果存在则认为dump fd信息并重启systemserver。

打开 fd monitor功能时，watchdog会打印以下日志：

1384  2044 W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Open FD high water mark reached
1384  2044 W Watchdog: *** GOODBYE!

同时将"/system/bin/lsof -p [pid]"命令的结果保存在"/data/anr/anr_fd_XXX"中。
以上是aosp代码，但是这在实际工作中仍然不能快速定位fd问题：
1 fd monitor 仅在DEBUGGABLE版本上才打开，且system/bin/lsof也仅在DEBUGGABLE版本上才被build，而稳定性大多数场景都是使用user版本
2 aosp仅仅保存了fd info，根据它可以定位fd类型，比如ashmem， pipe或者socket等，但是使用这些FD的场景众多，无法进一步分析。
所以针对以上弊端，我们做了一些优化。

Flyme定制功能

fd enable的属性控制
ro.flyme.published = false OR
ro.monkey = true

public static OpenFdMonitor create() {
    boolean isPublishedVersion = SystemProperties.getBoolean("ro.flyme.published", false);
    boolean isMonkey = SystemProperties.getBoolean("ro.monkey", false);
    if (isPublishedVersion && !isMonkey) {
        return null;
    }
    ... ... 
}

优化fd判断条件
由于aosp代码是判断水位的file是否存在，但是当有程序快速分配释放fd时仍然满足/proc/self/fd/[mark]存在但fd总数较小的情况，所以与上当fd总数>900的情况

        private int fdNumber() {
            final File fdFile = new File("/proc/self/fd/");
            int num = 0;
            if (fdFile.exists()) {
                File[] fdFileList = fdFile.listFiles();
                for (File fs : fdFileList) {
                    num++;
                }
            }

            return num;
        }

结合ErrorMonitor抓取更多的信息
由于aosp仅在DEBUGGABLE版本才build lsof, 而原生watchdog用的lsof捕捉fd信息，所以Flyme在user版本调用EM 最终通过ls -al /proc/1415/fd来获取fd信息。

public boolean monitor() {
    if (mFdHighWaterMark.exists() && fdNumber() > 900) {
        if (!Build.IS_DEBUGGABLE) {
            ActivityManagerService.dumpErrorInfo("system_server", Process.myPid(), 5, 12); // Flyme
            SystemClock.sleep(2000);
        } else {
            dumpOpenDescriptors();　　// AOSP
        }
        return true;
    }
    return false;
}

4 抓取HPROF
虽然有可能fd不是从java层分配的，hprof不能覆盖直接从native层打开fd的情况，但是我们仍然可以抓去hprof来定位java层打开fd过多的问题案例戳这里
代码

if ("1".equals(SystemProperties.get("persist.debug.wdt.hprof")) && (!SystemProperties.getBoolean("ro.flyme.published", false))) {
    Slog.e(TAG, "persist.debug.wdt.hprof == 1");
    throw new OutOfMemoryError("Do hprof dump before WDT");　//利用flyme的OomCrashHandler抓取hprof
}

总结
Flyme在aosp的FdMonitor的基础上修改了判断是否为fd leak的策略:增加fd 数量>900的判断，且使得user版本也可以使能，并结合Flyme的EM和OomCrashHandler抓取fdinfo,dump window input, hprof等日志。
您需要确保flyme代码合入，且user版本以下开个打开状态(debugable版本走aosp流程)：

ro.flyme.published = false OR　ro.monkey = true (使能watchdog监听功能)
persist.debug.wdt.hprof = true　（使能抓取hprof,　否则只抓取EM 日志）

生产的日志路径保存在：
errormonitor日志:/sdcard/Android/snapshot/systemFD_system_server_[time].tar.gz
hprof日志: /data/system/oom_hprof

FD leak detector(qualcomｍ)

高通也有自己的fd leak的监测工具，其基本原理是在malloc_debug中增加了记录每次打开关闭fd的调用栈。需要搭配脚本判断fd达到最大阈值时将保存的调用栈打印到日志中。我们通常用来监测Native层打开文件句柄引起的泄露，patch请戳这里。　　
保证代码合入leakdetect之后，调试fdleak的流程如下：　　

fd_leak_detector.png
　
脚本请戳
第二次发送sig28输出到日志fd堆栈示例如下：

12-11 17:40:19.693  1661  1661 E fd_leak_debug: receive signal
12-11 17:40:19.698  1661  8478 E fd_leak_debug: +++ system_server(1661) dumping leak information started. +++
12-11 17:40:19.698  1661  8478 E fd_leak_debug: { [  file socket   mmap] }
12-11 17:40:19.698  1661  8478 E fd_leak_debug: { [     4      1     78] }
12-11 17:40:19.699  1661  8478 E fd_leak_debug: Leak Type: file, count = 2, fd = 0, path = /dev/null
12-11 17:40:19.743  1661  8478 E fd_leak_debug:           #00  pc 0000000000000000  <unknown>
12-11 17:40:19.743  1661  8478 E fd_leak_debug: java stack start .........................
12-11 17:40:19.743  1661  8478 E fd_leak_debug:         Tid : 1687
12-11 17:40:19.743  1661  8478 E fd_leak_debug:           at java.lang.UNIXProcess.forkAndExec(Native method)
12-11 17:40:19.743  1661  8478 E fd_leak_debug:           at java.lang.UNIXProcess.<init>(UNIXProcess.java:133)
12-11 17:40:19.744  1661  8478 E fd_leak_debug:           at java.lang.ProcessImpl.start(ProcessImpl.java:128)
12-11 17:40:19.744  1661  8478 E fd_leak_debug:           at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
12-11 17:40:19.744  1661  8478 E fd_leak_debug:           at com.android.server.am.ActivityManagerService$23.run(ActivityManagerService.java:15747)
12-11 17:40:19.744  1661  8478 E fd_leak_debug:           at com.android.server.am.ActivityManagerService.addErrorToDropBox(ActivityManagerService.java:15807)
12-11 17:40:19.744  1661  8478 E fd_leak_debug:           at com.android.server.am.ActivityManagerService.handleApplicationWtfInner(ActivityManagerService.java:15469)
12-11 17:40:19.744  1661  8478 E fd_leak_debug:           at com.android.server.am.ActivityManagerService$22.run(ActivityManagerService.java:15439)
12-11 17:40:19.744  1661  8478 E fd_leak_debug:           at android.os.Handler.handleCallback(Handler.java:790)
12-11 17:40:19.745  1661  8478 E fd_leak_debug:           at android.os.Handler.dispatchMessage(Handler.java:99)
12-11 17:40:19.745  1661  8478 E fd_leak_debug:           at android.os.Looper.loop(Looper.java:164)
12-11 17:40:19.745  1661  8478 E fd_leak_debug:           at android.os.HandlerThread.run(HandlerThread.java:65)
12-11 17:40:19.745  1661  8478 E fd_leak_debug:           at com.android.server.ServiceThread.run(ServiceThread.java:46)
... ...

日志中的关键信息有两部分：
第一部分汇总数量: 如上面是一共打开了４个file 1个socket 78个mmap，由于mmap不占用FD所以我们只关注file和socket类型。(由于我是在重启之后发送signla28打开记录，所以在次之前包括systemserver重启过程中打开的fd未被记录，所以看到的数量是比较少的)。
第二部分是每一笔打开fd的调用堆栈，包括native和java堆栈。

案例分享

案例一　bug#834461

，通过ErrorMonitor保存的FD info我们知道该case泄露的类型是ashmem，定位的方法前面有讲到，此处不再赘述，本案例用来演示使用hprof查fd leak的方法。
根据经验ashmem在Binder传输数据时被普遍用到，尤其是RemoteView,　因此我们搜索bitmap的查看object数量：

1.png

从搜索结果看到RemoteViews$BitmapReflectionAction 与Bitmap的数量几乎匹配，与猜测的结果符合。
我们接着看下BitmapReflectionAction的实现

private class BitmapReflectionAction extends Action {
    int bitmapId;
    Bitmap bitmap;
    public void writeToParcel(Parcel dest, int flags) {
        dest.writeInt(BITMAP_REFLECTION_ACTION_TAG);
        dest.writeInt(bitmapId);
    }
    BitmapReflectionAction(int viewId, String methodName, Bitmap bitmap) {
        this.bitmap = bitmap;
        bitmapId = mBitmapCache.getBitmapId(bitmap);
    }
｝

BitmapReflectionAction继承Action，而action实现了Parcelable接口，也就是说需要实现writeToParcel接口以进行进程间数据传递。但是看writeToParcel只是写了很小的数据，并没有用到ashmem。但是我们注意到 bitmapId = mBitmapCache.getBitmapId(bitmap);这个的实现

private static class BitmapCache {
    ArrayList<Bitmap> mBitmaps;
    public Bitmap getBitmapForId(int id) {
        if (id == -1 || id >= mBitmaps.size()) {
            return null;
        } else {
        return mBitmaps.get(id);
        }
    }
    public void writeBitmapsToParcel(Parcel dest, int flags) {
        int count = mBitmaps.size();
        dest.writeInt(count);
        for (int i = 0; i < count; i++) {
           mBitmaps.get(i).writeToParcel(dest, flags);
        }
    }
}

getBitmapForId方法实际会把Bitmap对象存进mBitmaps，在binder通讯传递数据时把mBitmaps中的每一个对象都会writeToParcel，而它的实现就是通过 writeBlob 这种方式存入 Parcel 对象中，这样每个 bitmap 传输都要就要打开一个ashmem类型的fd，如果 mBitmaps 中 bitmap 数量很多的话，就会导致fd泄露。

接着看下BitmapCache对应的对象是什么？writeBitmapsToParcel又是怎么被调用呢？

public class RemoteViews implements Parcelable, Filter {   
    private BitmapCache mBitmapCache;
    private ArrayList<Action> mActions;
　public void setBitmap(int viewId, String methodName, Bitmap value) {
        addAction(new BitmapReflectionAction(viewId, methodName, value));
    }
    ... ...
    public void writeToParcel(Parcel dest, int flags) {
            if (mIsRoot) {
                mBitmapCache.writeBitmapsToParcel(dest, flags);
            }
}

在应用每调用setBitmap时，就会创建一个BitmapReflectionAction对象存在RemoteViews的mActions数组中的，同时在mBitmapCache的mBitmaps中查到是否已经存在，如果不存在则add 到mBitmaps。
当跨进程传递RemoteViews时就会调用writeToParcel进而遍历mBitmaps中的所有Bitmap通过ashmem进行传递。
所以，如果用户频繁调用setBitmap设置不同的bitmap实际上是Add，在跨进程传递RemoteView对象时有可能造成too many open files。

经过上面的源码解读，我们已经找到猜测的方向，就是会不会是某个RemoteViews对象中有包含很多个BitmapReflectionAction或者该对象的mBitmapCache中会有很多Bitmap呢？基于这个猜想我们继续分析我们的Hprof，　对BitmapReflectionAction 操作List objects => with incomming references

2.png
对于BitmapReflectionAction@0x1425e15这个对象里面有>=426个对象是很可疑的，　我们再看下它的引用关系List objects => with incomming references

3.png

我们找到了引用他的RemoteViews对象是0x12c80010，我们接着查看该对象的成员List objects => with outgoing references

4.png

5.png

果然，验证了我们的猜测该RemoteViews对象中的mBimapCache.mBitmaps保存了104个bitmap对象，通过查看RemoteViews.mApplication.processName知道这是腾讯的海豚智音。

总结: 通过本案例的分析步骤能够看出，虽然hprof能够查到某些FD泄露的Root Cause，但是，它也有很大的局限性，你要对此有一些经验以便能够准确的嗅到查找反向；要结合代码逻辑才能找到泄露的对象；hprof是虚拟机的堆栈镜像，如果native层直接打开FD在java层没有对应的对象我们仍然无法找到。所以我们还有第二种调试手段，那就是malloc_debug的fd detector，目前还没有找到比较好的素材，等找到之后再做补充。

869713

稳定性专题 -- FileDescriptor LEAK

Flyme Stability debug

概述

FD Leak

如何定位是fdleak的问题

当watchdog fd monitor未打开

当打开 watchdog fd monitor

Flyme定制功能

FD leak detector(qualcomｍ)

案例分享

案例一　bug#834461

猜你喜欢

热点阅读

稳定性专题 -- FileDescriptor LEAK

Flyme Stability debug

概述

FD Leak

如何定位是fdleak的问题

当watchdog fd monitor未打开

当打开 watchdog fd monitor

Flyme定制功能

FD leak detector(qualcomｍ)

案例分享

案例一 bug#834461

猜你喜欢

热点阅读

案例一　bug#834461