About sample size
December 16, 2019:
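Until now I only knew that there was a calculator for sample size (https://www.surveysystem.com/sscalc.htm#one).
Later, when I came across the article "Determining Sample Size for Research Activities", I learned that the calculator's method comes from that paper, so it does have a basis. The paper gives a table directly as a reference, which is rare. No matter how large the population size gets, the required sample size basically converges at a little over 380. A pleasant discovery. What I still want to know is how the formula in the article was derived (I could not download the paper it cites).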
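To see the convergence for myself, here is a minimal sketch of the formula as I understand it from that article: s = X²·N·P(1−P) / (d²(N−1) + X²·P(1−P)). The function name and the default values (X² = 3.841 for 95% confidence with 1 degree of freedom, P = 0.5, d = 0.05) are my own choices for illustration and should be checked against the paper's table.

```python
# Sketch of the sample-size formula from "Determining Sample Size for
# Research Activities". The name krejcie_morgan and the defaults are
# assumptions made for this note, not taken from any library.

def krejcie_morgan(population, chi_sq=3.841, p=0.5, d=0.05):
    """Return the required sample size for a finite population.

    chi_sq: chi-square value for 1 degree of freedom (3.841 ~ 95% confidence)
    p:      assumed population proportion (0.5 is the most conservative)
    d:      margin of error as a proportion (0.05 = plus or minus 5%)
    """
    numerator = chi_sq * population * p * (1 - p)
    denominator = d ** 2 * (population - 1) + chi_sq * p * (1 - p)
    return round(numerator / denominator)


if __name__ == "__main__":
    # Growing the population by orders of magnitude barely moves the result.
    for size in (100, 1_000, 10_000, 1_000_000):
        print(size, krejcie_morgan(size))
```

The d² term in the denominator grows with N while the numerator also grows with N, so for large N the ratio approaches X²·P(1−P)/d², which is about 384 with these defaults. That would explain the "380+" ceiling.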
ICSE 2019: Software Documentation Issues Unveiled
The sample sizes in this paper match what the calculator gives: https://www.surveysystem.com/sscalc.htm
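Also, this is the first time I realized that the confidence level and the confidence interval apparently do not have to add up to 100?
So can you just sample however you like..?
Need to check this.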
The original text reads:
2) Manual Classification of Documentation Issues: Once we collected the candidate artifacts, we manually analyzed a statistically significant sample ensuring a 99% confidence level ± 5%. This resulted in the selection of 665 artifacts for our manual analysis, out of the 805,939 artifacts collected from the four sources.
Since the number of collected artifacts is substantially different between the four sources (Table II), we decided to randomly select the 665 artifacts by considering these proportions. A simple proportional selection would basically discard SO and mailing lists from our study, since issues and pull requests account for over 90% of our dataset. Indeed, this would result in the selection of 311 pull requests, 326 issues, 24 SO discussions and 6 mailing list threads.
For this reason, we adopted the following sampling procedure: for SO and mailing lists, we targeted the analysis of 96 artifacts each, ensuring a 95% confidence level ± 10% within those two sources. For issues and pull requests, we adopted the proportional selection as explained above. This resulted in 829 artifacts to be manually analyzed (99% confidence ± 4.5%).
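As a first pass at the check, here is a rough sketch that tries to reproduce the numbers quoted above. It uses the standard sample-size formula for a proportion with a finite population correction. The function names are mine, and the z-values (2.58 for 99%, 1.96 for 95%) and p = 0.5 are assumptions about what the calculator and the paper used.

```python
import math

# Assumed z-values: 2.58 for a 99% confidence level, 1.96 for 95%.
# p = 0.5 is the conservative default. None of this is stated in the paper.

def sample_size(z, margin, population=None, p=0.5):
    """Sample size for a given confidence level (z) and margin of error."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2      # infinite-population size
    if population is None:
        return n0
    return n0 / (1 + (n0 - 1) / population)      # finite population correction


def margin_of_error(z, n, population, p=0.5):
    """Margin of error achieved by a sample of size n from a finite population."""
    fpc = (population - n) / (population - 1)    # finite population correction
    return z * math.sqrt(p * (1 - p) / n * fpc)


if __name__ == "__main__":
    total = 805_939
    print(round(sample_size(2.58, 0.05, total)))       # 99% level, +/- 5%
    print(round(sample_size(1.96, 0.10)))              # 95% level, +/- 10%
    print(round(margin_of_error(2.58, 829, total), 4)) # margin for 829 artifacts
```

Under these assumptions the three results come out at 665, 96, and roughly 0.045, which line up with the 665 artifacts, the 96 per source, and the ± 4.5% in the excerpt. It also suggests an answer to my question: the confidence level and the margin are two independent inputs, and the sample size follows from whichever pair is chosen, so there is no reason for them to sum to 100.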