通过gdb调试崩溃的代码

2018-03-05  本文已影响365人  bigonelby

最近公司有个需求,就是做出一个框架,可以为camera出来的视频加上水印,并且进行编码。是一个很常见的需求,但是我想设计的通用性好一些。由于框架层完全是自己开发,很多零件需要自己实现,这篇文章记录了我在做模块调试的过程中,遇到的一次崩溃的bug,通过这个bug,我对引用又有进一步的认识。

背景

公司的需求是接收原始音视频流,并进行编码。因此最简单的解决方案,显然就是用ffmpeg了,但是为了做的通用性更好一些,我希望能设计一套通用的架子。具体框架的构思,接口的设计,将来有机会再和大家分享。这里分享一下自己在做模块测试的时候遇到的一个崩溃。

模块介绍

其实这个模块非常简单,非常基础。是什么模块呢?就是加锁的处理,也就是linux中的mutex。
类图设计是这样的


image.png

基本的思想是这样,在BaseLock中的构造函数中初始化pthread_mutex_t,析构函数中释放这个锁,在BaseLock中提供两个方法Lock和UnLock,分别实现获取锁和释放锁。AutoLock是一个helper类,通过构造函数,可以出入与之关联的BaseLock,并且在构造函数中Lock,析构函数中UnLock,这样,就可以依据对象的声明周期,自动的获取锁和释放锁。

代码如下

// Lock.h
#ifndef __LOCK__
#define __LOCK__

#include <pthread.h>

class BaseLock {
public:
    BaseLock();
    ~BaseLock();

    void Lock();
    void UnLock();
private:
    pthread_mutex_t mutex_;
};

class AutoLock {
public:
    AutoLock(BaseLock& lock);
    ~AutoLock();
private:
    BaseLock lock_;
};

#endif

其实现为

// Lock.cpp
#include "Lock.h"

BaseLock::BaseLock() {
    pthread_mutex_init(&mutex_, NULL);
}

BaseLock::~BaseLock() {
    pthread_mutex_destroy(&mutex_);
}

void BaseLock::Lock() {
    pthread_mutex_lock(&mutex_);
}

void BaseLock::UnLock() {
    pthread_mutex_unlock(&mutex_);
}

AutoLock::AutoLock(BaseLock& lock) {
    lock_ = lock;
    lock_.Lock();
}

AutoLock::~AutoLock() {
    lock_.UnLock();
}

这是我的第一版代码,没想到已经埋下了隐患。其实后面在调试构成中出现的bug,都是由这个模块造成的。其实本来的思想是想模拟Android中的自动锁,没想到还是有自己理解不深的地方。

这里有一个重点,就是如何向AutoLock类中,传入关联的BaseLock实体。显然,这里不能直接传入参数BaseLock lock,因为形参只是拷贝而已,因此我选择了引用。

做了简单的模块测试,并没有出现什么问题,放心的开始开发其他模块。
直到我开发到了ThreadLoop模块

问题显现

出现问题的时候,是我在测试ThreadLoop模块。这个模块,顾名思义,也是仿照Android中做的一个简易版。目的就是实现一个线程,不断的接收消息,处理消息。由于涉及到不同线程之间通信,所以不可避免的需要解决同步互斥问题。奇怪的是,开发完毕进行模块测试的时候,居然随机有死锁或者崩溃的现象!

整个相关框架的类图如下


image.png

简单的说,就是testThreadLoop这个测试函数,主要是为了测试ThreadLoop这个模块,我们可以通过SendData,向这个thread中传入数据,这里,传入的是Person。而MyThreadLoop,则是通过OnHandleNewData来处理参数。
两个线程之间的互斥同步,通过EventWithData实现。由于不同线程之间,生产者消费者速度可能不同,所以在EventWithData中,有一个成员datas_,其类型为std::queue,用来存储速度不同所产生的数据。

注意:为了方便绘制,我在绘制UML图的时候省略了参数和返回值,请不要被成员属性或方法后面的类型int所迷惑。

不幸的是,在测试过程中大部分运行是没有问题的,Person数据成功的通过Main线程,投递给了ThreadLoop线程。但是偶尔出现了死锁的现象!
这是为什么呢?

问题初步调试

log是这样的

[11:36:59  test -> newtest] $ make run
./test
03-05 11:37:00 25292 0x600000060 I RecordStream: main.cpp-742  main in
03-05 11:37:00 25292 0x600000060 I RecordStream: main.cpp-743  in main function
03-05 11:37:00 25292 0x600000060 I RecordStream: main.cpp-722  will send person0
03-05 11:37:00 25292 0x6000174c0 I RecordStream: Thread.cpp-46  Start Thread : TestThreadLoop
03-05 11:37:00 25292 0x600000060 I RecordStream: main.cpp-726  03-05 11:37:00 25292 0x6000174c0 I RecordStream: main.cpp-711 will send person1
Handle new data
03-05 11:37:00 25292 0x600000060 I RecordStream: main.cpp-730 03-05 11:37:00 25292 0x6000174c0 I RecordStream: main.cpp-700   will send person2Hello, my name is xiao-bai, i am 2 years old

03-05 11:37:00 25292 0x600000060 I RecordStream: main.cpp-734  will send person3
03-05 11:37:00 25292 0x600000060 W RecordStream: Event.cpp-53  Event data is already full, ignore

很明显,程序并没有正常退出,我们向thread中发送了person0和person1,线程仅仅接收到了person1就阻塞了。究竟是什么原因呢,让我们通过gdb来调试一下。

进入gdb,并且通过ps命令查找我们的测试程序

[11:24:08  test -> newtest] $ gdb test.exe
GNU gdb (GDB) (Cygwin 7.10.1-1) 7.10.1
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-cygwin".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from test.exe...done.
(gdb) !ps
      PID    PPID    PGID     WINPID   TTY         UID    STIME COMMAND
    30748       1   30748      30748  ?         197108 11:33:30 /usr/bin/mintty
    20080   30748   20080      34204  pty2      197108 11:33:31 /usr/bin/bash
    15196   27528   15196      27840  pty0      197108 09:38:54 /usr/bin/bash
    25292   18740   18740      23768  pty0      197108 11:36:59 /cygdrive/d/source/gitcenter/StreamRecord/app/src/main/jni/test/test
    27528       1   27528      27528  ?         197108 09:38:52 /usr/bin/mintty
    18740   15196   18740      34000  pty0      197108 11:36:59 /usr/bin/make
    32088   15196   32088      30968  pty0      197108 10:58:23 /cygdrive/c/Program Files/Notepad++/notepad++
    33924   33220   33220      25708  pty2      197108 11:37:16 /usr/bin/ps
    33220   20080   33220       2488  pty2      197108 11:34:07 /usr/bin/gdb
(gdb)

显然,我们的test程序,进程号是25292,attach一下

(gdb) attach 25292
Attaching to program `/cygdrive/d/source/gitcenter/StreamRecord/app/src/main/jni/test/test.exe', process 23768
[New Thread 23768.0x2e80]
[New Thread 23768.0x85cc]
[New Thread 23768.0x4350]
[New Thread 23768.0x8324]
[New Thread 23768.0x3984]
[New Thread 23768.0x8348]
[New Thread 23768.0x7774]
0x00007ffedbf68d61 in ntdll!DbgBreakPoint () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll

从attach的结果来看,应该有七个线程,我们通过gdb命令详细的查看一下

(gdb) info threads
  Id   Target Id         Frame
* 7    Thread 23768.0x7774 0x00007ffedbf68d61 in ntdll!DbgBreakPoint () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
  6    Thread 23768.0x8348 0x00007ffedbf65ef4 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
  5    Thread 23768.0x3984 0x00007ffedbf65464 in ntdll!ZwReadFile () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
  4    Thread 23768.0x8324 0x00007ffedbf68c34 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
  3    Thread 23768.0x4350 0x00007ffedbf68c34 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
  2    Thread 23768.0x85cc 0x00007ffedbf68c34 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
  1    Thread 23768.0x2e80 0x00007ffedbf65ef4 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll

我们看到,确实有七个线程,其中线程1和线程6都在阻塞状态,是不是这两个线程死锁了呢?
让我们进入到具体的线程看一看

(gdb) thread 6
[Switching to thread 6 (Thread 23768.0x8348)]
#0  0x00007ffedbf65ef4 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
(gdb) bt
#0  0x00007ffedbf65ef4 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ffed8d4dcb0 in WaitForMultipleObjectsEx () from /cygdrive/c/Windows/System32/KERNELBASE.dll
#2  0x00007ffed8d4dbae in WaitForMultipleObjects () from /cygdrive/c/Windows/System32/KERNELBASE.dll
#3  0x00000001800458ac in cygwait (object=<optimized out>, timeout=timeout@entry=0x0, mask=mask@entry=36)
    at /usr/src/debug/cygwin-2.8.0-1/winsup/cygwin/cygwait.cc:74
#4  0x000000018013e571 in pthread_mutex::lock (this=0x6000179b0) at /usr/src/debug/cygwin-2.8.0-1/winsup/cygwin/thread.cc:1778
#5  0x0000000180116d5b in _sigfe () at sigfe.s:35
#6  0x000000010041bdec in BaseLock::Lock (this=0xffdfcab0) at ../util/Lock.cpp:12
#7  0x000000010041be50 in AutoLock::AutoLock (this=0xffdfcab0, lock=...) at ../util/Lock.cpp:21
#8  0x000000010041b6ec in EventWithData::GetNextData (this=0x6000177d0) at ../util/Event.cpp:62
#9  0x000000010041b4d7 in EventWithData::WaitWithData (this=0x6000177d0, data=0xffdfcb48) at ../util/Event.cpp:40
#10 0x000000010041d46a in ThreadLoop::Run (this=0x6000177a0, threaddata=0x0) at ../util/ThreadLoop.cpp:29
#11 0x000000010041d06d in Thread::ThreadLoop (wrapperdata=0x600012c60) at ../util/Thread.cpp:50
#12 0x000000018013e969 in pthread::thread_init_wrapper (arg=0x6000174c0) at /usr/src/debug/cygwin-2.8.0-1/winsup/cygwin/thread.cc:1993
#13 0x00000001800bdf31 in pthread_wrapper (arg=<optimized out>) at /usr/src/debug/cygwin-2.8.0-1/winsup/cygwin/miscfuncs.cc:459
#14 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

我们看到了,通过ThreadLoop的主函数进入到Run中,并且在WaitWithData,在GetNextData的时候AutoLock锁住了,让我们看看这个GetNextData究竟在做什么,从堆栈上看是第8帧

#8  0x000000010041b6ec in EventWithData::GetNextData (this=0x6000177d0) at ../util/Event.cpp:62

因此我们可以通过gdb命令查看具体的情况

(gdb) frame 8
#8  0x000000010041b6ec in EventWithData::GetNextData (this=0x6000177d0) at ../util/Event.cpp:62
62          AutoLock autolock(baselock_);
(gdb) list
57          used_++;
58          event_.Notify();
59      }
60
61      void* EventWithData::GetNextData() {
62          AutoLock autolock(baselock_);
63          if (datas_.empty()) {
64              return NULL;
65          }
66          void* nextdata = datas_.front();

通过frame 8命令,我们跳转到了第8帧,通过list命令,我们可以看到附近的源码
从源码上看,是AutoLock autolock(baselock_)这行出问题了,我们看看究竟出现什么问题,看一看后几帧的情况

(gdb) down
#7  0x000000010041be50 in AutoLock::AutoLock (this=0xffdfcab0, lock=...) at ../util/Lock.cpp:21
21          lock_.Lock();
(gdb) list
16          pthread_mutex_unlock(&mutex_);
17      }
18
19      AutoLock::AutoLock(BaseLock& lock) {
20          lock_ = lock;
21          lock_.Lock();
22      }
23
24      AutoLock::~AutoLock() {
25          lock_.UnLock();
(gdb) down
#6  0x000000010041bdec in BaseLock::Lock (this=0xffdfcab0) at ../util/Lock.cpp:12
12          pthread_mutex_lock(&mutex_);
(gdb) list
7       BaseLock::~BaseLock() {
8           pthread_mutex_destroy(&mutex_);
9       }
10
11      void BaseLock::Lock() {
12          pthread_mutex_lock(&mutex_);
13      }
14
15      void BaseLock::UnLock() {
16          pthread_mutex_unlock(&mutex_);
(gdb)

其中命令down可以快速的切换frame,配合list看附近的源码,果然是pthread_mutex_lock出问题了
我们继续移动frame并查找线索

(gdb) down
#5  0x0000000180116d5b in _sigfe () at sigfe.s:35
35      sigfe.s: No such file or directory.
(gdb) down
#4  0x000000018013e571 in pthread_mutex::lock (this=0x6000179b0) at /usr/src/debug/cygwin-2.8.0-1/winsup/cygwin/thread.cc:1778
1778          cygwait (win32_obj_id, cw_infinite, cw_sig | cw_sig_restart);
(gdb) p *this
warning: can't find linker symbol for virtual table for `pthread_mutex' value
warning:   found `_gm_' instead
$1 = warning: can't find linker symbol for virtual table for `pthread_mutex' value
warning:   found `_gm_' instead
{<verifyable_object> = {_vptr.verifyable_object = 0x1802e4fa8 <_gm_+232>, magic = 2150518696}, next = 0x600018920,
  lock_counter = 1, win32_obj_id = 0x0, owner = 0x600000060, static _new_mutex = 0x1, static _unlocked_mutex = 0x2,
  static _destroyed_mutex = 0x3, recursion_counter = 1, condwaits = 0, type = 2, pshared = 0, static mutexes = {mx = {
      lock_counter = 0, win32_obj_id = 0xa0}, head = 0x6000189a0}, static mutex_initialization_lock = {lock_counter = 0,
    win32_obj_id = 0xd8}}
(gdb)

这里,用到了gdb的命令p,即print,在打印this对象的内容时,我们发现有下面的信息

  lock_counter = 1, win32_obj_id = 0x0, owner = 0x600000060, static _new_mutex = 0x1, static _unlocked_mutex = 0x2,

即owner为0x600000060也就是说,持有这个锁的线程是0x600000060
再重新看看输出的log

03-05 11:37:00 25292 0x600000060 I RecordStream: main.cpp-742  main in
03-05 11:37:00 25292 0x600000060 I RecordStream: main.cpp-743  in main function
03-05 11:37:00 25292 0x600000060 I RecordStream: main.cpp-722  will send person0
03-05 11:37:00 25292 0x6000174c0 I RecordStream: Thread.cpp-46  Start Thread : TestThreadLoop
03-05 11:37:00 25292 0x600000060 I RecordStream: main.cpp-726  03-05 11:37:00 25292 0x6000174c0 I RecordStream: main.cpp-711 will send person1
Handle new data
03-05 11:37:00 25292 0x600000060 I RecordStream: main.cpp-730 03-05 11:37:00 25292 0x6000174c0 I RecordStream: main.cpp-700   will send person2Hello, my name is xiao-bai, i am 2 years old

03-05 11:37:00 25292 0x600000060 I RecordStream: main.cpp-734  will send person3
03-05 11:37:00 25292 0x600000060 W RecordStream: Event.cpp-53  Event data is already full, ignore

线程0x600000060正是主线程,让我们在gdb中再看看这个线程的调用堆栈

(gdb) info threads
  Id   Target Id         Frame
  7    Thread 23768.0x7774 0x00007ffedbf68d61 in ntdll!DbgBreakPoint () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
* 6    Thread 23768.0x8348 0x00007ffedbf65ef4 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
  5    Thread 23768.0x3984 0x00007ffedbf65464 in ntdll!ZwReadFile () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
  4    Thread 23768.0x8324 0x00007ffedbf68c34 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
  3    Thread 23768.0x4350 0x00007ffedbf68c34 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
  2    Thread 23768.0x85cc 0x00007ffedbf68c34 in ntdll!ZwWaitForWorkViaWorkerFactory () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
  1    Thread 23768.0x2e80 0x00007ffedbf65ef4 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
(gdb) thread 1
[Switching to thread 1 (Thread 23768.0x2e80)]
#0  0x00007ffedbf65ef4 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
(gdb) bt
#0  0x00007ffedbf65ef4 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ffed8d4dcb0 in WaitForMultipleObjectsEx () from /cygdrive/c/Windows/System32/KERNELBASE.dll
#2  0x00007ffed8d4dbae in WaitForMultipleObjects () from /cygdrive/c/Windows/System32/KERNELBASE.dll
#3  0x00000001800458ac in cygwait (object=<optimized out>, timeout=timeout@entry=0x0, mask=mask@entry=37)
    at /usr/src/debug/cygwin-2.8.0-1/winsup/cygwin/cygwait.cc:74
#4  0x000000018014084d in pthread::join (thread=thread@entry=0xffffbb10, return_val=0x0)
    at /usr/src/debug/cygwin-2.8.0-1/winsup/cygwin/thread.cc:2434
#5  0x00000001800efd83 in pthread_join (thread=0x6000174c0, return_val=<optimized out>)
    at /usr/src/debug/cygwin-2.8.0-1/winsup/cygwin/pthread.cc:47
#6  0x0000000180116d5b in _sigfe () at sigfe.s:35
#7  0x000000010041cdb9 in Thread::Stop (this=0x6000177a0) at ../util/Thread.cpp:34
#8  0x00000001004176ef in testThreadLoop () at ../test/main.cpp:737
#9  0x0000000100417bbb in main (argc=1, argv=0xffffcc60) at ../test/main.cpp:744
(gdb)

我们发现主线程实际上也是在阻塞,而阻塞的地点是pthread::join,也就是在回收子线程时阻塞了,从函数堆栈上看,这个线程并没有持有锁啊!
多么奇怪的现象!

柳暗花明

通过上面的gdb排查,我基本确定了是锁出现了问题。但是通过gdb,却没有发现问题所在。因此我决定采取简单粗暴的方式,也就是在加锁的时候和释放锁的时候都增加打印,看看是否成对出现。
具体的过程不再展示了,但是结果是令人失望的。居然在阻塞的时候,锁的操作确实也是成对出现的!这就奇怪了,难道是cygwin出现了问题?

百思不得其解。后来再不断的复现问题的时候,居然程序还崩溃了!我的内心也是崩溃的啊!
看一看崩溃的log吧!

对于window的崩溃log,对应了一个stampdump文件,即test.exe.stampdump,遗憾的是在这个文件里,全是地址,而不是human-read的代码和行数。在网上查找了相关的资料,有的给出,通过objdump反汇编一下,然后看看地址对应的具体函数。

这个方法比较复杂,我再次用到了gdb这个利器,通过list指令就可以找到具体的函数和行数,具体的用法是这样的

list 目标地址

就会出现目标地址所对应的具体函数和行数

在一步一步看堆栈信息的时候,我发现了一个重大的bug,也就是在调用AutoLock的析构函数之后,居然还调用BaseLock的析构函数!这让我恍然大悟,终于发现问题所在了。原来之前传到AutoLock实例中的是BaseLock的引用,因此AutoLock的成员变量,实际上就是BaseLock本身了,因此在AutoLock析构的时候,也会释放其成员变量,因此BaseLock就被释放掉了,也就是mutex被destroy了。这样,当然会引起崩溃或者错乱的bug!

解决方案

知道问题的原因了,修改起来就容易了,就是把之前传入AutoLock中的BaseLock的引用,改为BaseLock的指针就可以了
通过这次惨重的教训,我们也可以体会出引用和指针的细微差别。

小结

本篇文章记录了我在调试一次崩溃问题的完整过程。不得不说,gdb在其中起到了核心的作用。用好了gdb,确实调试过程可以大大缩短!gdb非常强大,不但能调试运行时死锁,还可以调试崩溃。

所以,下次在遇到类似的问题的时候,先不要急于加log,先用利器gdb试试吧!

上一篇下一篇

猜你喜欢

热点阅读