
What to do with multi-batch WES data

2018-07-01, by 因地制宜的生信达人

Sequencing in multiple batches is often unavoidable. For example, the paper Biomed Res Int. 2014, doi: 10.1155/2014/319534, notes:

In large WES studies, some samples are occasionally sequenced twice or even more times due to a variety of reasons, for example, insufficient coverage in the first experiment, sample duplication, and the rest. It is challenging how to best utilize these duplicated exomes for SNP discovery and genotype calling, especially with batch effects taken into consideration.

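The authors happen to have exactly this kind of data: exome data from 92 subjects (51 cases and 41 controls) in the Shanghai Breast Cancer Study (SBCS) dataset. Libraries were prepared with the QIAmp DNA kit + Illumina TruSeq. After obtaining the fastq data, they ran the standard GATK pipeline and ended up with 184 BAM files, that is, two per subject.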

Three calling strategies can be compared, labelled Merge (M), High (H) and Low (L) in the results below.

SNP calling is also done with GATK, and the subsequent SNP filtering strategy is:

The final SNP counts are 46,860, 44,806, and 43,664 for the M, H, and L groups.
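To see how large the difference between the three groups really is, the counts can be turned into relative gains. This is only a minimal sketch using the three numbers quoted above; the dictionary and function names are made up for illustration and are not from the paper.

```python
# SNP counts per calling strategy, copied from the text above.
snp_counts = {"M": 46860, "H": 44806, "L": 43664}


def relative_gain(merged: int, single: int) -> float:
    """Return how many more SNPs `merged` has than `single`, in percent."""
    return 100.0 * (merged - single) / single


# Gain of the merged call set over each single-run call set.
gain_vs_h = relative_gain(snp_counts["M"], snp_counts["H"])
gain_vs_l = relative_gain(snp_counts["M"], snp_counts["L"])

print(f"M vs H: +{gain_vs_h:.2f}%")
print(f"M vs L: +{gain_vs_l:.2f}%")
```

By this calculation the merged call set is larger than either single-run call set by only a few percent.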

The comparison of the called SNPs is rather simple:

Sequencing data evaluation metrics

These include:

The authors did not upload the raw sequencing data. They only provide a few tables summarizing the sequencing and the analysis:

Table S1: Data production by 92 duplicated WES subjects.

Table S2: Number of variants observed across the on-target and off-target regions.


First, the sequencing details:

MQ < 10 (%)  MQ >= 20 (%)  Mapped bases (e9)  Mapping rate (%)  Target mapping (%)  Total reads (e6)  Group
10.71        85.75         1.83               98.26             49.14               45.18             L
11.08        86.41         2.24               99.1              49.27               55.01             H
10.62        86.74         2.63               98.93             53.49               59.8              L
10.57        86.68         3.53               98.86             53.87               79.93             H
9.88         87.53         2.5                98.95             54.97               55.71             L
9.86         87.49         3.61               98.91             55.37               79.82             H
10.59        85.86         1.82               98.26             49.33               44.76             L
10.92        86.59         2.25               99.11             49.48               54.89             H
10.13        87.37         2.11               99.03             52.87               48.31             L
10           87.35         3.01               98.91             53.38               68.63             H
10.95        86.49         2.15               99.02             53.39               48.78             L
10.82        86.51         3.04               98.95             53.87               68.62             H

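As the table shows, the amount of sequencing data is actually decent for both the L and the H group: every run listed has at least about 45 million reads and a mapping rate above 98%.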
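To put numbers on that impression, a small Python sketch can summarize the total read counts per group. The values are copied from the twelve rows shown above (not the full Table S1), and the variable and function names are illustrative only.

```python
from statistics import mean

# (total reads in millions, mapping rate in %, group), copied from the rows above.
rows = [
    (45.18, 98.26, "L"), (55.01, 99.10, "H"),
    (59.80, 98.93, "L"), (79.93, 98.86, "H"),
    (55.71, 98.95, "L"), (79.82, 98.91, "H"),
    (44.76, 98.26, "L"), (54.89, 99.11, "H"),
    (48.31, 99.03, "L"), (68.63, 98.91, "H"),
    (48.78, 99.02, "L"), (68.62, 98.95, "H"),
]


def summarize(rows, group):
    """Return (min, mean, max) of total reads for one group."""
    reads = [total for total, _, g in rows if g == group]
    return min(reads), mean(reads), max(reads)


for group in ("L", "H"):
    low, avg, high = summarize(rows, group)
    print(f"{group}: min={low:.2f}M mean={avg:.2f}M max={high:.2f}M")

# Lowest mapping rate across all listed runs.
lowest_mapping_rate = min(rate for _, rate, _ in rows)
print(f"lowest mapping rate: {lowest_mapping_rate:.2f}%")
```

For these rows, the H runs average roughly 68 million reads and the L runs roughly 50 million, so the two groups differ in depth but neither is obviously undersequenced.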

Next, the details of the SNPs found:

Sample index  # SNPs on-target  # SNPs off-target  # unique SNPs  Calling strategies *
1             46645             102680             24554          Merge
1             44377             75767              1310           High
1             42880             64470              1139           Low
2             47409             105395             18742          Merge
2             46259             85611              1445           High
2             44916             74509              1076           Low
3             47100             103724             20247          Merge
3             46087             82940              1424           High
3             44681             69129              1051           Low
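The per-sample difference between strategies can be read off the table directly, or computed with a short sketch like the one below. It uses only the three samples listed above; the data layout and the names `snps` and `on_target_gain` are assumptions made for illustration.

```python
# (on-target SNPs, off-target SNPs, unique SNPs) per sample and calling
# strategy, copied from the table above.
snps = {
    1: {"Merge": (46645, 102680, 24554), "High": (44377, 75767, 1310), "Low": (42880, 64470, 1139)},
    2: {"Merge": (47409, 105395, 18742), "High": (46259, 85611, 1445), "Low": (44916, 74509, 1076)},
    3: {"Merge": (47100, 103724, 20247), "High": (46087, 82940, 1424), "Low": (44681, 69129, 1051)},
}


def on_target_gain(sample: dict) -> tuple:
    """Extra on-target SNPs of Merge over High and over Low."""
    merged = sample["Merge"][0]
    return merged - sample["High"][0], merged - sample["Low"][0]


gains = {index: on_target_gain(sample) for index, sample in snps.items()}
for index, (vs_high, vs_low) in gains.items():
    print(f"sample {index}: Merge - High = {vs_high}, Merge - Low = {vs_low}")
```

In all three samples the merged call set has more on-target SNPs than either single run, with the gain over the High run ranging from about one to a little over two thousand SNPs.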

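Merging the L and H data of the same sample does indeed yield more SNPs. But isn't that conclusion easy to reach by reasoning alone? Why is an analysis like this needed to prove it?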
