求 bam文件单个位点碱基比例
2019-07-01 本文已影响0人
caokai001
IGV 上看碱基分布
目的:给你一个
bam
文件,如何查看一个位点的碱基情况,直接IGV
就ok
,如果需要看多个点,就需要使用软件了。(一般情况vcfDP4
会告诉此位置ref base
和非ref base
碱基个数)
在群里面大佬提示下,samtools mpileup
和bam-readcount
可以实现统计碱基分布。
1.简单的安装
conda install -c bioconda bam-readcount
2.bam-readcount 帮助信息
[kcao@h1-lgl genome_index_bowtie2]$ bam-readcount -h
Usage: bam-readcount [OPTIONS] <bam_file> [region]
Generate metrics for bam_file at single nucleotide positions.
Example: bam-readcount -f ref.fa some.bam
Available options:
-h [ --help ] produce this message
-v [ --version ] output the version number
-q [ --min-mapping-quality ] arg (=0) minimum mapping quality of reads used
for counting.
-b [ --min-base-quality ] arg (=0) minimum base quality at a position to
use the read for counting.
-d [ --max-count ] arg (=10000000) max depth to avoid excessive memory
usage.
-l [ --site-list ] arg file containing a list of regions to
report readcounts within.
-f [ --reference-fasta ] arg reference sequence in the fasta format.
-D [ --print-individual-mapq ] arg report the mapping qualities as a comma
separated list.
-p [ --per-library ] report results by library.
-w [ --max-warnings ] arg maximum number of warnings of each type
to emit. -1 gives an unlimited number.
-i [ --insertion-centric ] generate indel centric readcounts.
Reads containing insertions will not be
included in per-base counts
注:-w 限制warnings 输出的条数,不然好多warning!
-p 碱基质量 1
-q 比对质量 20
3.测试效果
[kcao@h1-lgl dup]$ bam-readcount -w 1 -f /home/kcao/data/Genome/genome_index_bowtie2/Homo_sapiens.GRCh38.dna.chromosome.fa /home/kcao/test/HHHHHHHHHHHH/SRR6213130.uniq.bam 1:89085865-89085865|sed -n '$p' >tmpjjj
Minimum mapping quality is set to 0
WARNING: In read SRR6213130.15145147: Couldn't find single-end mapping quality. Check to see if the SM tag is in BAM.
The previous warning has been emitted 1 times and will be disabled.
输出结果:
1 89085886 G 43 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:43:60.00:37.51:2.79:33:10:0.43:0.01:15.51:33:0.63:101.00:0.56 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00
tips: 列名依次是,染色体,位置,参考碱基,深度,“=”(没理解等号含义),“A”,“C”,“G”,T“”,“N”
4 统计 A,G,C,T 碱基个数
[kcao@h1-lgl dup]$ less tmpjjj |awk 'BEGIN{FS=OFS="\t"}{split($6,A,":");split($7,C,":");split($8,G,":");split($9,T,":");print $1,$2,$3,$4,"A:"A[2],"C:"C[2],"G:"G[2],"T:"T[2]}'
1 89085886 G 43 A:0 C:0 G:43 T:0
5.使用 参数-l region_list 查看多个位点碱基分布
[kcao@h1-lgl dup]$ cat region_list
1 89085886 89085886
1 89085975 89085975
[kcao@h1-lgl dup]$ bam-readcount -w 1 -f /home/kcao/data/Genome/genome_index_bowtie2/Homo_sapiens.GRCh38.dna.chromosome.fa /home/kcao/test/HHHHHHHHHHHH/SRR6213130.uniq.bam -l region_list|awk 'BEGIN{FS=OFS="\t"}{split($6,A,":");split($7,C,":");split($8,G,":");split($9,T,":");print $1,$2,$3,$4,"A:"A[2],"C:"C[2],"G:"G[2],"T:"T[2]}' >test.txt
[kcao@h1-lgl dup]$ cat test.txt
1 89085886 G 43 A:0 C:0 G:43 T:0
1 89085975 C 64 A:0 C:64 G:0 T:0
-
有兴趣的可以试试samtools mpileup 结果~~
samtools mpileup结果 -
有时候,IGV reads 数目和bam-readcount 不对应,可能因为标记了dup 的缘故.