The Debate Sparked by pthread_kill
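A piece of code I recently submitted for testing hit a strange bug: it kept crashing inside pthread_kill, but not on every run.
The calling logic is roughly as follows, using pthread_kill to check whether a thread is still running: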
pthread_kill(pthread_[i], 0)
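For reference, the signal-0 probe can be sketched as below. This is only an illustration, not the code from the project: the names gate, worker and probe_live_thread are made up, and the worker is deliberately parked on a mutex so that the thread is guaranteed to be alive and unjoined at the moment it is probed.

```c
#include <assert.h>
#include <pthread.h>
#include <signal.h>

/* Hypothetical gate: the creator holds it so the worker cannot exit early. */
static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&gate);      /* blocks until the creator releases the gate */
    pthread_mutex_unlock(&gate);
    return NULL;
}

/* Returns the result of pthread_kill(t, 0) for a thread that is known to be
   alive and not yet joined, or -1 if the thread could not be created. */
int probe_live_thread(void)
{
    pthread_t t;

    pthread_mutex_lock(&gate);      /* keep the worker parked */
    if (pthread_create(&t, NULL, worker, NULL) != 0) {
        pthread_mutex_unlock(&gate);
        return -1;
    }

    int rc = pthread_kill(t, 0);    /* signal 0: error checking only, nothing is delivered */

    pthread_mutex_unlock(&gate);    /* let the worker finish */
    int jrc = pthread_join(t, NULL);
    assert(jrc == 0);
    (void)jrc;
    /* From here on, t must not be passed to pthread_kill again. */
    return rc;
}
```

The probe is only well defined while the thread ID is still valid, which is why the sketch issues it before pthread_join and never afterwards.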
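In the final crash stack I checked the pthread_ pointer and i shown above; both looked normal and could be accessed.
The executable had been built correctly.
The overall logic of the code also looked fine, and every access to pthread_ that needed a lock was locked.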
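I was quite stumped. Luckily I asked a colleague for help in time, and he immediately pointed out that a pthread_t value is very likely converted to a pointer and dereferenced internally. If the memory at the address it holds has been freed, pthread_kill can crash, even though the memory in which the caller stores the pthread_t value itself has not been freed. The code did have a path that called pthread_try_join (presumably glibc's pthread_tryjoin_np) on a pthread_t and later called pthread_kill on the same value. Freed memory may still be accessible if nothing has reclaimed it yet, which explains why the crash did not reproduce every time. Very impressive.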
Here is a simple diagram to illustrate (figure: pthread_t).
With this lead, I did a quick check on the crash stack:
gdb>p *(pthread_ + i)
0x**********
gdb>x/10a 0x**********
can't access
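Bingo, that was pretty much the problem. Once it was found, the fix was fairly simple (stop calling pthread_kill on a pthread_t that has already been joined via pthread_try_join), and the retest passed~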
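One way to enforce that rule is to keep a validity flag next to each pthread_t. The following is a minimal sketch under several assumptions: struct thread_slot and the slot_* helpers are hypothetical names, a blocking pthread_join stands in for the non-blocking join used in the real code (a successful non-blocking join would clear the flag in the same way), and the caller is assumed to serialize access to a slot, as the original code did with its locks.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical bookkeeping record; not taken from the original code. */
struct thread_slot {
    pthread_t id;
    bool      valid;    /* true while id may still be passed to pthread_* calls */
};

static void *short_task(void *arg)
{
    return arg;         /* stand-in for real work */
}

int slot_start(struct thread_slot *s)
{
    assert(s != NULL);
    int rc = pthread_create(&s->id, NULL, short_task, NULL);
    s->valid = (rc == 0);           /* a failed create leaves no usable ID */
    return rc;
}

/* Join the thread and retire its ID on success. */
int slot_join(struct thread_slot *s)
{
    if (!s->valid)
        return ESRCH;
    int rc = pthread_join(s->id, NULL);
    if (rc == 0)
        s->valid = false;           /* the ID's lifetime has ended */
    return rc;
}

/* Probe with signal 0, but never hand a retired ID to pthread_kill. */
int slot_probe(const struct thread_slot *s)
{
    if (!s->valid)
        return ESRCH;               /* decided from our own flag, not by the library */
    return pthread_kill(s->id, 0);
}
```

After slot_join succeeds, slot_probe answers ESRCH from the flag alone, so the stale pthread_t is never dereferenced by the library.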
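With the problem solved, I had still wanted to write a test case of my own to verify the pthread_t behavior above, since the pthread_kill man page says that an invalid pthread_t should produce ESRCH rather than a crash. After some googling I found that many people have run into this. The answer to the Stack Overflow question "Segmentation fault caused by pthread_kill" explains it well and gives two related links, both quite interesting: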
1、pthread_kill() Segmentation Fault when TID is invalid
2、pthread_t and similar types (the blog content is pasted at the end of this post in case the link stops working)
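The first link, "pthread_kill() Segmentation Fault when TID is invalid", is a bug report that someone filed with the glibc NPTL developers about this issue. The author rejected the bug rather bluntly, and the two of them started arguing. Judging from the exchange alone, it does look like a bug: roughly, neither the documentation nor the standard says this can happen, so why does it crash? I then read the blog post the author wrote about this issue, "pthread_t and similar types", which explains from the POSIX design perspective why it works this way. It is also an interesting read.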
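My overall impression after reading is that designers and users look at this from different angles. From the user's point of view, an interface should behave as its documentation describes; it should not hide obscure logic, or it should at least describe the relevant scenarios clearly. From the designer's point of view, the goal is more flexibility and generality, without implementing special-case logic inside the interface. On the implementation I lean toward the designer's view, but I still do not agree with how the man page was written; it really was unclear. It has since been updated with a relevant note: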
http://man7.org/linux/man-pages/man3/pthread_kill.3.html
The glibc implementation returns this error in the cases where an invalid thread ID can be detected. But note also that POSIX says that an attempt to use a thread ID whose lifetime has ended produces undefined behavior, and an attempt to use an invalid thread ID in a call to pthread_kill() can, for example cause a segmentation fault.
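For pthread_kill, this means that ESRCH is returned when the implementation detects that the pthread_t is invalid, but it does not mean that every invalid pthread_t can be detected, because the standard places no explicit restriction on the type used to implement pthread_t. I looked up one version of glibc's pthread_kill implementation and found that, apart from a check guarded by DEBUGGING_P, it returns ESRCH only when tid <= 0. Exactly when tid <= 0 occurs is still to be investigated (for the differences between tid, pthread_t, pid and tgid, see "Difference between pid and tid"). Different implementations and versions may also behave differently, so from this angle, using pthread_kill to decide whether a thread is running seems pointless...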
#include <errno.h>
#include <signal.h>
#include <pthreadP.h>
#include <tls.h>
#include <sysdep.h>
#include <unistd.h>


int
__pthread_kill (pthread_t threadid, int signo)
{
  struct pthread *pd = (struct pthread *) threadid;

  /* Make sure the descriptor is valid. */
  if (DEBUGGING_P && INVALID_TD_P (pd))
    /* Not a valid thread handle. */
    return ESRCH;

  /* Force load of pd->tid into local variable or register. Otherwise
     if a thread exits between ESRCH test and tgkill, we might return
     EINVAL, because pd->tid would be cleared by the kernel. */
  pid_t tid = atomic_forced_read (pd->tid);
  if (__glibc_unlikely (tid <= 0))
    /* Not a valid thread handle. */
    return ESRCH;

  /* Disallow sending the signal we use for cancellation, timers,
     for the setxid implementation. */
  if (signo == SIGCANCEL || signo == SIGTIMER || signo == SIGSETXID)
    return EINVAL;

  /* We have a special syscall to do the work. */
  INTERNAL_SYSCALL_DECL (err);

  pid_t pid = __getpid ();

  int val = INTERNAL_SYSCALL_CALL (tgkill, err, pid, tid, signo);
  return (INTERNAL_SYSCALL_ERROR_P (val, err)
          ? INTERNAL_SYSCALL_ERRNO (val, err) : 0);
}
strong_alias (__pthread_kill, pthread_kill)
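The listing passes both a pid and a tid to tgkill. How the two relate can be seen with a small Linux-only sketch. It assumes the raw gettid system call is reachable through syscall(SYS_gettid); struct ids, kernel_tid and collect_ids are names invented for this illustration.

```c
#define _GNU_SOURCE             /* for the syscall() declaration */
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct ids {
    pid_t pid;  /* process ID (the thread group ID), shared by all threads */
    pid_t tid;  /* kernel thread ID, unique per thread */
};

static pid_t kernel_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

static void *record_ids(void *arg)
{
    struct ids *out = arg;
    out->pid = getpid();
    out->tid = kernel_tid();
    return NULL;
}

/* Records the IDs of the calling thread and of a freshly created thread.
   Returns 0 on success. */
int collect_ids(struct ids *caller_ids, struct ids *worker_ids)
{
    pthread_t t;

    assert(caller_ids != NULL && worker_ids != NULL);
    record_ids(caller_ids);
    if (pthread_create(&t, NULL, record_ids, worker_ids) != 0)
        return -1;
    return pthread_join(t, NULL);
}
```

When collect_ids is called from the main thread, the caller's tid equals its pid, while the worker reports the same pid but a different tid. The pthread_t handle is a third, library-level identifier and is not comparable with either number.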
end~
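Follow-up
Afterwards I read a few more of udrepper's blog posts and only then realized that udrepper is the author of the paper "How to Write Shared Libraries". I was stunned...
Bookmarking the blog address: https://www.akkadia.org/drepper/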
pthread_t and similar types
Constantly people complain that the runtime does not catch their mistakes. They are hiding behind this requirement in the POSIX specification (for pthread_join in this case, also applies to pthread_kill and similar functions):
The pthread_join() function shall fail if:
[...]
ESRCH No thread could be found corresponding to that specified by the given thread ID.
The glibc implementation follows this requirement to the letter. IFF we can detect that the thread descriptor is invalid we do return ESRCH.
But: the above does not mean that all uses of invalid thread descriptors must result in ESRCH errors. The reason is simple: the standard does not restrict the implementation in any way in the definition of the type pthread_t. It does not even have to be an arithmetic type. This means it is valid to use a pointer type and this is just what NPTL does.
Nobody argues that functions like strcpy should not dump a core in case the buffer is invalid. The same for pthread_attr_t references passed to pthread_attr_init etc. The use of pthread_t when defined as a pointer is no different. The only complication is in the understanding that pthread_t can be a pointer type. This is obvious for void* etc.
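One practical consequence of pthread_t being an opaque type is that portable code compares thread IDs with pthread_equal rather than with ==. The sketch below is only an illustration added to this post (struct check and the function names are hypothetical); it compares two IDs while both are still valid, because the creator is blocked in pthread_join during the comparison.

```c
#include <assert.h>
#include <pthread.h>

struct check {
    pthread_t creator;              /* ID of the thread that called pthread_create */
    int       worker_is_creator;    /* result of pthread_equal, filled in by the worker */
};

static void *compare_with_creator(void *arg)
{
    struct check *c = arg;

    assert(c != NULL);
    /* Both IDs are valid here: the creator is still alive, waiting in pthread_join. */
    c->worker_is_creator = pthread_equal(pthread_self(), c->creator);
    return NULL;
}

/* Returns 0 if the worker's ID differs from the creator's (the expected case),
   nonzero if pthread_equal reported them equal, and -1 on failure. */
int worker_matches_creator(void)
{
    struct check c = { .creator = pthread_self(), .worker_is_creator = -1 };
    pthread_t t;

    if (pthread_create(&t, NULL, compare_with_creator, &c) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    return c.worker_is_creator;
}
```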
In the POSIX committee we discussed several times changing the pthread_join and pthread_kill man pages. The ESRCH errors could be marked as may fail. But
1、this really is not necessary, see above.
2、it would mean we have to go through the entire specification and treat every other place where this is an issue the same way.
If somebody wants to do the work associated with the second step above and we have confidence in the results, we (= Austin Group) might make the change at some later date. But it is a rather high risk for no real gain. Programmers have to educate themselves anyway.
What remains is the question: how can programs avoid these mistakes? It is actually pretty simple: the program should make sure that no calls to pthread_kill, for instance, can happen when the thread is exiting. One way to solve this problem is:
1、Associate a variable running of some sort and a mutex with each thread.
2、In the function started by pthread_create (the thread function) set running to true.
3、Before returning from the thread function or calling pthread_exit or in a cancellation handler acquire the mutex, set running to false, unlock the mutex, and proceed.
4、Any thread trying to use pthread_kill etc first must get the mutex for the target thread, if running is true call pthread_kill, and finally unlock the mutex.
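The four steps above can be sketched in C roughly as follows. This is one possible reading of the recipe, not code from the blog: struct guarded_thread and the guarded_* functions are hypothetical, the condition variable and stop_requested flag exist only to keep the demo worker alive until asked to stop, and cancellation handlers are not covered.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>

struct guarded_thread {
    pthread_t       id;
    pthread_mutex_t lock;            /* step 1: mutex paired with the flag */
    pthread_cond_t  changed;         /* demo only: signals state changes */
    bool            running;         /* step 1: the "running" variable */
    bool            stop_requested;  /* demo only: asks the worker to finish */
};

static void *guarded_main(void *arg)
{
    struct guarded_thread *g = arg;

    pthread_mutex_lock(&g->lock);
    g->running = true;                              /* step 2 */
    pthread_cond_broadcast(&g->changed);
    while (!g->stop_requested)                      /* stand-in for real work */
        pthread_cond_wait(&g->changed, &g->lock);
    g->running = false;                             /* step 3, done under the lock */
    pthread_mutex_unlock(&g->lock);
    return NULL;
}

int guarded_start(struct guarded_thread *g)
{
    assert(g != NULL);
    g->running = false;
    g->stop_requested = false;
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->changed, NULL);
    return pthread_create(&g->id, NULL, guarded_main, g);
}

/* Demo helper: wait until the worker has set running.
   Must be called before guarded_stop. */
void guarded_wait_started(struct guarded_thread *g)
{
    pthread_mutex_lock(&g->lock);
    while (!g->running)
        pthread_cond_wait(&g->changed, &g->lock);
    pthread_mutex_unlock(&g->lock);
}

/* Step 4: take the mutex and signal only while running is true. */
int guarded_kill(struct guarded_thread *g, int signo)
{
    pthread_mutex_lock(&g->lock);
    int rc = g->running ? pthread_kill(g->id, signo) : ESRCH;
    pthread_mutex_unlock(&g->lock);
    return rc;
}

/* Ask the worker to finish, then join it. */
int guarded_stop(struct guarded_thread *g)
{
    pthread_mutex_lock(&g->lock);
    g->stop_requested = true;
    pthread_cond_broadcast(&g->changed);
    pthread_mutex_unlock(&g->lock);
    return pthread_join(g->id, NULL);
}
```

Because the worker can only clear running while holding the mutex, it cannot exit while guarded_kill is inside pthread_kill. Once running is false, guarded_kill reports ESRCH from the flag and never touches the thread ID.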
This ensures that no invalid descriptor is used. But I can already hear people complain:
This is too expensive!
That is ridiculous. The implementation would have to do something similar if it would try to catch bad thread descriptors. In fact, it would have to do more. What is important is to recognize that this price would have to be paid by every program, not just the buggy ones. This is wrong. Only those people who need this extra protection should pay the price.
But I don't have control over the code calling pthread_create!
Boo hoo, cry me a river. Don't expect sympathy for using proprietary software. I will never allow good free software to be shackled because of proprietary code. If you cannot get this changed in the code you pay good money for this just means it is time to find a new supplier or, even better, use free software.
In summary, this is entirely a problem of the programs which experience them. Existing Linux systems are proof that it is possible to write complex programs without requiring the implementation to help incompetent programmers. We will have a few more words in the next revision of the POSIX specification which talk about this issue. But I expect they will be ignored anyway and all focus remains on the shall fail errors of pthread_kill etc.