生物信息数据科学

71. 《Bioinformatics Data Skills》

2021-10-07  本文已影响0人  DataScience

现在我们已经具备采用pysam模块统计比对信息的知识了,这里我们编写一个程序来统计BAM/SAM文件的信息并使用samtools flagstat进行验证。

NA12891_CEU_sample.bam文件下载

程序代码如下:

import pysam
import sys
from collections import Counter ## 1

if len(sys.argv) < 2: ## 2
    sys.exit("usage: alnsta.py bamfile")

bamfile = pysam.AlignmentFile(sys.argv[1]) ## 3
stat = Counter()
for read in bamfile:
    stat['total'] += 1
    stat['qcfail'] += int(read.is_qcfail) ## 4
    stat['paired'] += int(read.is_paired)
    stat['read1'] += int(read.is_read1)
    stat['read2'] += int(read.is_read2)

    if read.is_unmapped:
        stat['unmapped'] += 1
        continue ## 5
    
    stat['mapped'] += 1
    if read.mapping_quality <= 30:
        stat['mapping quality <= 30'] += 1
    stat["proper pair"] += int(read.is_proper_pair)

output_order = ("total", "mapped", "unmapped", "paired", "read1", "read2", "proper pair", "qcfail", "mapping quality <= 30") ## 6
for key in output_order:
    format = (key, stat[key], 100*float(stat[key]/stat['total']))
    sys.stdout.write(("%s: %d (%0.2f%%)\n") % format) ## 7

解释:

## 1,Counter对象在统计信息的时候很常用,类似于一个字典

## 2,简单判断是否存在输入文件名

## 3,pysam.AlignmentFile用于读取BAM/SAM文件,自动判断文件类型不需要指定

## 4,根据read的属性判断read是否符合条件,使用int函数将逻辑值转为0,1值

## 5,如果一个read没有比对到参考基因组上,跳过下面的内容统计

## 6,由于Counter对象存储的元素没有顺序,这里指定输出顺序

## 7,通过控制台以给定的格式顺序输出内容

该代码的运行结果为:

$ python3 alnsta.py NA12891_CEU_sample.bam

total: 636207 (100.00%)
mapped: 630875 (99.16%)
unmapped: 5332 (0.84%)
paired: 435106 (68.39%)
read1: 217619 (34.21%)
read2: 217487 (34.18%)
proper pair: 397247 (62.44%)
qcfail: 0 (0.00%)
mapping quality <= 30: 90982 (14.30%)

使用samtools flagstat进行检验:

$ samtools flagstat NA12891_CEU_sample.bam
636207 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
29826 + 0 duplicates
630875 + 0 mapped (99.16% : N/A)
435106 + 0 paired in sequencing
217619 + 0 read1
217487 + 0 read2
397247 + 0 properly paired (91.30% : N/A)
424442 + 0 with itself and mate mapped
5332 + 0 singletons (1.23% : N/A)
5383 + 0 with mate mapped to a different chr
2190 + 0 with mate mapped to a different chr (mapQ>=5)

可见我们的统计结果基本都是正确的,不过mapping quality <= 30数目没有统计到。我们可以这样验证:mapping quality <= 30数目 = 能够比对到参考基因组的read数目 - 比对质量大于30的read数目。

由于unmap的flag数字为4:

$ samtools flags UNMAP
0x4     4       UNMAP

所以mapped read数目为(注:这里-F表示取反):

$ samtools view -c -F 4 NA12891_CEU_sample.bam
630875

通过-q参数可以查看比对质量大于30的read数目:

$ samtools view -c -q 31 NA12891_CEU_sample.bam
539893

综合这两个信息,可知read比对质量小于等于30的read数目为:630875 - 539893 = 90982,与脚本的结果一致。

上一篇 下一篇

猜你喜欢

热点阅读