What You Can Still Do When Your App Crashes

2017-08-09, vedon_fu

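What can a program still do while it is crashing? I used to rely mostly on other people's blog posts and found that my understanding was still not deep enough, so I am writing down what I learned. If I got something wrong, please point it out 😁

Background

Source version: xnu-3789.51.2
Below is a sequence diagram I put together from the source: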

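Exception trigger: EXC_BAD_ACCESS. task_exception_notify kicks off the exception handling path.
Next comes the chain of exception calls.
- exception_triage() is the Mach exception handler; it is responsible for turning the exception into a Mach message.
- exception_triage() calls exception_deliver() to try to deliver the exception to the thread, then the task, and finally the host.
- It first tries the thread port, then the task port, and finally the host port (the default port).
- exception_deliver() calls mach_exception_raise to raise the exception.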
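The delivery order in the list above can be sketched in plain C. This is only a toy model: `deliver_in_order`, `exc_level` and the two sample handlers are hypothetical names made up for illustration and are not part of XNU. A NULL entry stands for "no port registered at this level".

```c
#include <stddef.h>

/* Hypothetical stand-ins for the three levels that exception_triage() tries. */
enum exc_level { LEVEL_THREAD, LEVEL_TASK, LEVEL_HOST, LEVEL_COUNT, LEVEL_NONE = -1 };

/* A handler returns 1 if it handled the exception, 0 to pass it on. */
typedef int (*exc_handler_fn)(int exception);

/*
 * Try each level in order (thread, task, host) and return the level whose
 * handler accepted the exception, or LEVEL_NONE if nobody did.
 */
enum exc_level deliver_in_order(exc_handler_fn const handlers[LEVEL_COUNT], int exception)
{
    for (int level = LEVEL_THREAD; level < LEVEL_COUNT; level++) {
        if (handlers[level] != NULL && handlers[level](exception))
            return (enum exc_level)level;
    }
    return LEVEL_NONE;
}

/* Two toy handlers for trying the function out. */
int accept_all(int exception) { (void)exception; return 1; }
int reject_all(int exception) { (void)exception; return 0; }
```

With no thread-level handler and an accepting task-level handler, the exception stops at the task level and never reaches the host.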
ux_exception catches the exception, converts it to a Unix signal, and delivers that signal to the faulting thread.
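A simplified sketch of that conversion, written so it builds without the Mach headers. The `SKETCH_` constants mirror the values in `<mach/exception_types.h>` and `<mach/kern_return.h>`; the function name is hypothetical, and the real ux_exception() also handles machine-dependent cases and EXC_SOFTWARE, which are left out here.

```c
#include <signal.h>

/* Values mirrored from the Mach headers, prefixed to avoid clashing with them. */
enum {
    SKETCH_EXC_BAD_ACCESS       = 1,
    SKETCH_EXC_BAD_INSTRUCTION  = 2,
    SKETCH_EXC_ARITHMETIC       = 3,
    SKETCH_EXC_BREAKPOINT       = 6,
    SKETCH_KERN_INVALID_ADDRESS = 1
};

/* Returns the Unix signal for a Mach exception, or 0 if this sketch has no mapping. */
int sketch_exception_to_signal(int exception, long long code)
{
    switch (exception) {
    case SKETCH_EXC_BAD_ACCESS:
        /* An unmapped address becomes SIGSEGV; other access faults become SIGBUS. */
        return (code == SKETCH_KERN_INVALID_ADDRESS) ? SIGSEGV : SIGBUS;
    case SKETCH_EXC_BAD_INSTRUCTION:
        return SIGILL;
    case SKETCH_EXC_ARITHMETIC:
        return SIGFPE;
    case SKETCH_EXC_BREAKPOINT:
        return SIGTRAP;
    default:
        return 0;
    }
}
```

This is why the same EXC_BAD_ACCESS can show up as either SIGSEGV or SIGBUS, depending on the exception code.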

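So how does ux_exception catch the exception? Let's look at the ux_exception source. The flow is summarized below:

When the first BSD process starts through bsdinit_task(), that function also calls ux_handler_init(), which sets up a Mach kernel thread running ux_handler. ux_handler in turn sets up a message loop that listens for exceptions.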

__attribute__((noreturn))
static void
ux_handler(void)
{
    task_t      self = current_task();
    mach_port_name_t    exc_port_name;
    mach_port_name_t    exc_set_name;
    /* self->kernel_vm_space = TRUE; */
    ux_handler_self = self;
    /*
     *  Allocate a port set that we will receive on.
     */
    if (mach_port_allocate(get_task_ipcspace(ux_handler_self), MACH_PORT_RIGHT_PORT_SET,  &exc_set_name) != MACH_MSG_SUCCESS)
        panic("ux_handler: port_set_allocate failed");
    /*
     *  Allocate an exception port and use object_copyin to
     *  translate it to the global name.  Put it into the set.
     */
    if (mach_port_allocate(get_task_ipcspace(ux_handler_self), MACH_PORT_RIGHT_RECEIVE, &exc_port_name) != MACH_MSG_SUCCESS)
        panic("ux_handler: port_allocate failed");
    if (mach_port_move_member(get_task_ipcspace(ux_handler_self),
                exc_port_name,  exc_set_name) != MACH_MSG_SUCCESS)
        panic("ux_handler: port_set_add failed");

    if (ipc_object_copyin(get_task_ipcspace(self), exc_port_name,
            MACH_MSG_TYPE_MAKE_SEND, 
            (void *) &ux_exception_port) != MACH_MSG_SUCCESS)
        panic("ux_handler: object_copyin(ux_exception_port) failed");

    proc_list_lock();
    thread_wakeup(&ux_exception_port);
    proc_list_unlock();

    /* Message handling loop. */
    for (;;) {
        struct rep_msg {
            mach_msg_header_t Head;
            NDR_record_t NDR;
            kern_return_t RetCode;
        } rep_msg;
        struct exc_msg {
            mach_msg_header_t Head;
            /* start of the kernel processed data */
            mach_msg_body_t msgh_body;
            mach_msg_port_descriptor_t thread;
            mach_msg_port_descriptor_t task;
            /* end of the kernel processed data */
            NDR_record_t NDR;
            exception_type_t exception;
            mach_msg_type_number_t codeCnt;
            mach_exception_data_t code;
            /* some times RCV_TO_LARGE probs */
            char pad[512];
        } exc_msg;
        mach_port_name_t    reply_port;
        kern_return_t    result;

        exc_msg.Head.msgh_local_port = CAST_MACH_NAME_TO_PORT(exc_set_name);
        exc_msg.Head.msgh_size = sizeof (exc_msg);
#if 0
        result = mach_msg_receive(&exc_msg.Head);
#else
        result = mach_msg_receive(&exc_msg.Head, MACH_RCV_MSG,
                     sizeof (exc_msg), exc_set_name,
                     MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL,
                     0);
#endif
        if (result == MACH_MSG_SUCCESS) {
            reply_port = CAST_MACH_PORT_TO_NAME(exc_msg.Head.msgh_remote_port);

            if (mach_exc_server(&exc_msg.Head, &rep_msg.Head)) {
                result = mach_msg_send(&rep_msg.Head, MACH_SEND_MSG,
                    sizeof (rep_msg),MACH_MSG_TIMEOUT_NONE,MACH_PORT_NULL);
                if (reply_port != 0 && result != MACH_MSG_SUCCESS)
                    mach_port_deallocate(get_task_ipcspace(ux_handler_self), reply_port);
            }

        }
        else if (result == MACH_RCV_TOO_LARGE)
            /* ignore oversized messages */;
        else
            panic("exception_handler");
    }
}

Focus on this part:

#if 0
    result = mach_msg_receive(&exc_msg.Head);
#else
    result = mach_msg_receive(&exc_msg.Head, MACH_RCV_MSG,
                 sizeof (exc_msg), exc_set_name,
                 MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL,
                 0);
#endif
    if (result == MACH_MSG_SUCCESS) {
        reply_port = CAST_MACH_PORT_TO_NAME(exc_msg.Head.msgh_remote_port);
        if (mach_exc_server(&exc_msg.Head, &rep_msg.Head)) {
            result = mach_msg_send(&rep_msg.Head, MACH_SEND_MSG,
                sizeof (rep_msg),MACH_MSG_TIMEOUT_NONE,MACH_PORT_NULL);
            if (reply_port != 0 && result != MACH_MSG_SUCCESS)
                mach_port_deallocate(get_task_ipcspace(ux_handler_self), reply_port);
        }
    }

When a message arrives on the port (mach_msg_receive), mach_exc_server is called, and it invokes one of these handlers:

catch_mach_exception_raise()
catch_mach_exception_raise_state()
catch_mach_exception_raise_state_identity()

Which one is called depends on the behavior.

# define EXCEPTION_DEFAULT  1  // Send a catch_exception_raise message including the identity.

# define EXCEPTION_STATE  2 // Send a catch_exception_raise_state message including the thread state.

# define EXCEPTION_STATE_IDENTITY    3 // Send a catch_exception_raise_state_identity message including the thread identity and state.
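The behavior-to-handler choice can be sketched as a small lookup. The `SKETCH_` constants repeat the values above plus the `MACH_EXCEPTION_CODES` flag (0x80000000) from `<mach/exception_types.h>`; the function itself is hypothetical and assumes the 64-bit code variant, which is the one mach_exc_server dispatches.

```c
#include <stddef.h>
#include <stdint.h>

/* Values mirrored from <mach/exception_types.h>, prefixed to avoid clashes. */
#define SKETCH_EXCEPTION_DEFAULT        1
#define SKETCH_EXCEPTION_STATE          2
#define SKETCH_EXCEPTION_STATE_IDENTITY 3
#define SKETCH_MACH_EXCEPTION_CODES     0x80000000u

/* Returns the name of the handler for a behavior, or NULL if it is unknown. */
const char *sketch_handler_for_behavior(uint32_t behavior)
{
    /* MACH_EXCEPTION_CODES is a flag OR-ed into the behavior; strip it first. */
    switch (behavior & ~SKETCH_MACH_EXCEPTION_CODES) {
    case SKETCH_EXCEPTION_DEFAULT:
        return "catch_mach_exception_raise";
    case SKETCH_EXCEPTION_STATE:
        return "catch_mach_exception_raise_state";
    case SKETCH_EXCEPTION_STATE_IDENTITY:
        return "catch_mach_exception_raise_state_identity";
    default:
        return NULL;
    }
}
```

The behavior is whatever was passed when the exception port was registered, so a port set up with `EXCEPTION_DEFAULT|MACH_EXCEPTION_CODES` ends up in catch_mach_exception_raise.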

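Also, Mach handling comes before the BSD layer: when an exception occurs and no Mach-level handler takes it, handling falls through to the BSD layer, which is the flow described above. So what does handling at the Mach layer look like?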

Mach Port

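When you debug with Xcode, the key component is lldb's debugserver. How does it get hold of the program's exceptions? Through inter-process communication, which here means Mach ports. Through a Mach port, debugserver can intercept the program's exceptions.

Below is debugserver's open-source code (see LLDB debugserver under Ref):

The highlighted part of the screenshot is the key piece. The comments there are clear, so I won't repeat them. Through m_exception_port, debugserver can intercept the program's exception messages.

Simplified, the steps for giving a program a port that receives exceptions are: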

mach_port_t server_port;
kern_return_t kr = mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &server_port);
assert(kr == KERN_SUCCESS);

// Add a send right under the same name. Both arguments are port names, not pointers.
kr = mach_port_insert_right(mach_task_self(), server_port, server_port, MACH_MSG_TYPE_MAKE_SEND);
assert(kr == KERN_SUCCESS);

// task is the target task port (mach_task_self() for the current process).
kr = task_set_exception_ports(task, EXC_MASK_BAD_ACCESS, server_port, EXCEPTION_DEFAULT|MACH_EXCEPTION_CODES, THREAD_STATE_NONE);
assert(kr == KERN_SUCCESS);

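The function documentation is worth a read; it explains things clearly.

If the port that the task uses to handle EXC_MASK_BAD_ACCESS is closed, does that mean the exception can no longer be received?

Answer: yes.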

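Now the main point: the crash has happened, so what can you still do?

- Save important information
- Force a recovery

Save important information
Do a final save of information before the process goes down; this is covered well in the material listed under Ref. What to save depends on your own needs, so let's focus on how to force the program back to life.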
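As a minimal sketch of the "save first" idea, assuming a plain POSIX environment: the file is opened before any crash, and the handler only calls write(), which is async-signal-safe. The names `install_crash_saver`, `save_on_crash` and the message text are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static int g_crash_fd = -1;   /* opened up front, never inside the handler */
static const char g_crash_note[] = "crashed: state saved\n";

/* Only async-signal-safe calls are used here. */
static void save_on_crash(int sig)
{
    (void)sig;
    if (g_crash_fd >= 0) {
        ssize_t written = write(g_crash_fd, g_crash_note, sizeof g_crash_note - 1);
        (void)written;
    }
}

/* Opens the log file and installs the handler; returns 0 on success, -1 on failure. */
int install_crash_saver(const char *path, int sig)
{
    struct sigaction sa;

    g_crash_fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (g_crash_fd < 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = save_on_crash;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}
```

A real crash reporter would usually restore the default action and re-raise the signal after saving, so that the process still terminates with the original signal.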

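Forced recovery

First, what is ucontext_t? Put simply, it is the execution context of a thread. The details are shown in the figure below:

With it you can inspect the thread's context at the moment of the crash. The key step in rescuing the program is adjusting the context of the crashing thread; changing the saved register values, for example, is allowed.

The demo below shows the process well (it must run on a real device). It has three main steps:

1. Install a handler for the crash signals
2. Trigger a crash
3. Modify the thread context captured at the crash, i.e. the ucontext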
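Step 1 can be sketched in plain C with an error check on sigaction(). The helper name `install_crash_handler` and the recording handler are hypothetical. One thing the check makes visible is that SIGKILL cannot be caught: sigaction() rejects it with EINVAL.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t g_last_signal = 0;

/* Three-argument handler form, selected by SA_SIGINFO. */
static void record_signal(int sig, siginfo_t *info, void *context)
{
    (void)info;
    (void)context;
    g_last_signal = sig;
}

/* Returns 0 on success, or the errno value reported by sigaction(). */
int install_crash_handler(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = record_signal;
    sigemptyset(&sa.sa_mask);
    if (sigaction(sig, &sa, NULL) != 0)
        return errno;
    return 0;
}
```

Calling `install_crash_handler(SIGSEGV)` and `install_crash_handler(SIGBUS)` succeeds, while `install_crash_handler(SIGKILL)` returns EINVAL.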

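To let the program keep running, you can set the pc of the current thread. Once the crash signal has been handled, the program continues executing. One caveat: the crash handler must not do too much, because the system considers this signal to still be in the middle of being handled. If another crash is triggered at that point, the handler is not entered again and the process is killed.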
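The "still being handled" state can be observed directly: while a handler installed without SA_NODEFER runs, its own signal is in the thread's blocked mask. A minimal POSIX sketch, with hypothetical function names, that checks this from inside the handler:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t g_blocked_in_handler = -1;

static void check_mask(int sig)
{
    sigset_t current;

    sigemptyset(&current);
    /* With a NULL "set" argument, sigprocmask only reports the current mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &current) == 0)
        g_blocked_in_handler = sigismember(&current, sig);
}

/* Returns 1 if sig is blocked while its own handler runs, 0 if not, -1 on error. */
int is_signal_blocked_in_its_handler(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = check_mask;   /* SA_NODEFER is deliberately not set */
    sigemptyset(&sa.sa_mask);
    if (sigaction(sig, &sa, NULL) != 0 || raise(sig) != 0)
        return -1;
    return g_blocked_in_handler;
}
```

A second fault of the same kind inside the handler therefore cannot be delivered to the handler, which is consistent with the process being killed instead.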

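ucontext->uc_mcontext->__ss.__pc = <address of the crash-handling function>;

The demo code assigns the value of the lr register directly to pc, so execution resumes at the instruction right after the faulting call. Note that the program state is unstable at this point: some register values are already polluted, and it may crash again at any moment.

* pc is the address of the instruction currently being executed
* lr holds the address of the next instruction to run after the function returns.
ucontext->uc_mcontext->__ss.__pc = ucontext->uc_mcontext->__ss.__lr;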
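Rewriting pc through `__ss.__lr` is specific to arm64 on Apple platforms. For comparison only, and not what this article's demo does, a portable way to get a similar "skip the faulting call" effect is sigsetjmp/siglongjmp. The sketch below uses hypothetical names and simulates the crash with raise(); the same stability caveat applies, since whatever the interrupted code left half-done stays half-done.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf g_recover_point;

static void jump_back(int sig)
{
    (void)sig;
    siglongjmp(g_recover_point, 1);   /* never returns to the faulting code */
}

/* Runs fn(); returns 0 if it finished, 1 if sig arrived and we recovered, -1 on setup failure. */
int run_guarded(void (*fn)(void), int sig)
{
    struct sigaction sa, old;
    int recovered;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = jump_back;
    sigemptyset(&sa.sa_mask);
    if (sigaction(sig, &sa, &old) != 0)
        return -1;

    /* Passing 1 saves the signal mask, so siglongjmp also unblocks sig again. */
    if (sigsetjmp(g_recover_point, 1) == 0) {
        fn();
        recovered = 0;
    } else {
        recovered = 1;
    }

    sigaction(sig, &old, NULL);       /* restore the previous handler */
    return recovered;
}

/* Sample callbacks: one simulates a crash, the other returns normally. */
void simulated_crash(void) { raise(SIGSEGV); }
void no_crash(void) {}
```

`run_guarded(simulated_crash, SIGSEGV)` returns 1 and can be called repeatedly, because the saved mask is restored on each jump.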

//
//  ViewController.m
//  UContext
//
//  Created by vedon on 04/08/2017.
//  Copyright © 2017 vedon. All rights reserved.
//

#import "ViewController.h"
#include <signal.h>
#include <mach/task.h>
#include <mach/mach_init.h>
#include <mach/mach_port.h>

void sig_handler(int sig, siginfo_t *info, void *context)
{
    ucontext_t *ucontext = context;
    NSMutableString *str = [NSMutableString stringWithFormat:@"Signal caught: %d \n",sig];
    
    [str appendString:[NSString stringWithFormat:@"pc 0x%llx\n", ucontext->uc_mcontext->__ss.__pc]];
    [str appendString:[NSString stringWithFormat:@"lr 0x%llx\n", ucontext->uc_mcontext->__ss.__lr]];
    [str appendString:[NSString stringWithFormat:@"fp 0x%llx\n", ucontext->uc_mcontext->__ss.__fp]];
    [str appendString:[NSString stringWithFormat:@"sp 0x%llx\n", ucontext->uc_mcontext->__ss.__sp]];
    [str appendString:[NSString stringWithFormat:@"uc_stack size 0x%lx\n", (unsigned long)ucontext->uc_stack.ss_size]];
    [str appendString:[NSString stringWithFormat:@"uc_stack ss_sp 0x%llx\n", (long long)ucontext->uc_stack.ss_sp]];
    
    
    if (ucontext->uc_link != NULL)
    {
        [str appendString:@"uc_link : \n"];
        [str appendString:[NSString stringWithFormat:@"pc 0x%llx\n", ucontext->uc_link->uc_mcontext->__ss.__pc]];
        [str appendString:[NSString stringWithFormat:@"lr 0x%llx\n", ucontext->uc_link->uc_mcontext->__ss.__lr]];
        [str appendString:[NSString stringWithFormat:@"fp 0x%llx\n", ucontext->uc_link->uc_mcontext->__ss.__fp]];
        [str appendString:[NSString stringWithFormat:@"sp 0x%llx\n", ucontext->uc_link->uc_mcontext->__ss.__sp]];
    }
    else
    {
        [str appendString:@"uc_link is null"];
    }
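    // NSLog and NSString are not async-signal-safe; they are used here for demonstration only.
    NSLog(@"%@",str);
    // Resume at the caller's return address so the faulting call is skipped.
    ucontext->uc_mcontext->__ss.__pc = ucontext->uc_mcontext->__ss.__lr;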
}


@interface ViewController ()

@end

@implementation ViewController

- (void)viewDidLoad {
    [super viewDidLoad];

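    // Set this task's exception port for EXC_BAD_ACCESS to null so that
    // debugserver no longer catches the exception. This is only here so that
    // sig_handler can be single-stepped in a debug session. Without it, the
    // only way to see sig_handler's output is the device log.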

    int ret = task_set_exception_ports(
                                       mach_task_self(),
                                       EXC_MASK_BAD_ACCESS,
                                       MACH_PORT_NULL,//m_exception_port,
                                       EXCEPTION_DEFAULT,
                                       0);
    if (ret == 0)NSLog(@"Disable lldb to catch exceptions");
    struct sigaction sa;
    memset(&sa, 0, sizeof(struct sigaction));
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = sig_handler;
    
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGINT, &sa, NULL);
    sigaction(SIGABRT, &sa, NULL);
    // SIGKILL cannot be caught; sigaction() fails with EINVAL for it, so it is not registered.
    sigaction(SIGBUS, &sa, NULL);
    
    [self invokeCrash];
    
    NSLog(@"Resume after crash");
}
- (void)invokeCrash
{
    void *a = calloc(1, sizeof(void *));
    NSLog(@"Crash Addr of a: 0x%llx", (long long)a);
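    // a points to zeroed, non-executable heap memory, so calling it faults
    // immediately; lr already holds the return address into this method.
    ((void(*)())a)();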
}

@end

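The exception port has been cleared, but that only stops handling at the Mach layer. An exception such as EXC_BAD_ACCESS is eventually converted to a BSD signal, so lldb also has to be told not to intercept the BSD signal.

The command below stops lldb from intercepting SIGBUS. With that, sig_handler can be debugged normally in a debug session.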

(lldb) pro handle SIGBUS -s false

Conclusion

For certain special crashes, for example some system bugs inside WebCore, this kind of extreme fix can be used.

😄😄😄😄😄😄
Also, I personally find that writing things down or sharing them greatly deepens one's understanding of the subject.

Ref：
Handling unhandled exceptions and signals
Kernel Architecture Overview
Apple Open Source
LLDB ---debugserver

The following is not part of the main topic and is kept only as a record.

@startuml
actor CPU
CPU -> exception : task_exception_notify
note left
The notification type may be:
EXC_CRASH/EXC_GUARD/
EXC_BAD_ACCESS, and so on.
If vm_fault calls
cs_invalid_page in kern_cs,
EXC_BAD_ACCESS is eventually
raised through threadsignal.
end note
exception -> exception : exception_triage
exception -> exception : exception_triage_thread
exception -> exception : exception_deliver
exception -> exception : mach_exception_raise
...Message Loop...
ux_exception -> ux_exception : catch_mach_exception_raise
note right
catch_mach_exception_raise is triggered after mach_exc_server is called
end note
ux_exception -> ux_exception : ux_exception
note right
Converts the exception to a Unix signal.
EXC_BAD_ACCESS: SIGSEGV/SIGBUS
EXC_BAD_INSTRUCTION: SIGILL
etc.
end note

ux_exception -> kern_sig : threadsignal
note right 
Send a signal caused by a trap to a specific thread.
end note
@enduml

@startuml
bsd_init -> bsd_init : bsdinit_task
bsd_init -> ux_exception : ux_handler_init
ux_exception -> ux_exception : ux_handler
activate ux_exception
note right
Message handling loop
end note
......
@enduml
