生物信息学生信工具

比对工具大揭秘

2018-07-01  本文已影响374人  因地制宜的生信达人

之所以有这个需求,是因为在做一个单细胞转录组测序数据的分析,发现一些非常诡异的比对情况。比如:尽管是全局比对率达到80%以上,但是过半数居然左右失衡,也就是说只有双端测序reads的其中一条成功比对了,另一条莫名其妙的比对失败。这种情况可以发生,只是超过50%就太夸张了,非常值得探究,究竟是reads本身的特性呢,还是比对工具的选择不够合适。

5个比对软件

我这里探究了hisat2,subjunc,star,bwa,bowtie2这5个比对工具,它们的安装方式文件不赘述了,使用代码如下:

hisat='/home/jianmingzeng/biosoft/HISAT/current/hisat2';
star="/home/jianmingzeng/biosoft/STAR/STAR-2.5.3a/bin/Linux_x86_64/STAR";
subjunc="/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/subjunc";
tophat2="/home/jianmingzeng/biosoft/TopHat/current/tophat2";
bowtie="/home/jianmingzeng/biosoft/bowtie/bowtie2-2.2.9/bowtie2";


hisat2_mm10_index='/home/jianmingzeng/reference/index/hisat/mm10/genome'; 
star_mm10_index='/home/jianmingzeng/reference/index/star/mm10/'; 
subjunc_mm10_index='/home/jianmingzeng/reference/index/subread/mm10'; 
bowtie2_mm10_index='/home/jianmingzeng/reference/index/bowtie/mm10';
bwa_mm10_index='/home/jianmingzeng/reference/index/bwa/mm10';

fq1=SC2-01_S1_L002_R1_001.fastq.gz
fq2=SC2-01_S1_L002_R2_001.fastq.gz
sample=SC2-01_S1_L002 

$hisat -p 5 -x $hisat2_mm10_index -1 $fq1 -2 $fq2 -S $sample.sam 2>$sample.hisat.log
samtools sort -O bam -@ 5  -o ${sample}_hisat.bam $sample.sam

$subjunc -T 5  -i $subjunc_mm10_index -r $fq1  -R $fq2 -o ${sample}_subjunc.bam
## 很有趣,虽然subjunc直接输出bam,但是是按照reads名字排序,不是按照基因组坐标排序。

bwa mem -t 5 -M  $bwa_mm10_index $fq1 $fq2 1>$sample.sam 2>/dev/null 
samtools sort -O bam -@ 5  -o ${sample}_bwa.bam $sample.sam

$bowtie -p 5 -x $bowtie2_mm10_index -1 $fq1  -2 $fq2 | samtools sort  -O bam  -@ 5 -o - >${sample}_bowtie.bam

## star软件载入参考基因组非常耗时,约10分钟,也比较耗费内存,但是比对非常快,5M的序列就两分钟即可
$star --runThreadN  5 --genomeDir $star_mm10_index --readFilesCommand zcat --readFilesIn  $fq1 $fq2 --outFileNamePrefix  ${sample}_star 
## --outSAMtype BAM  可以用这个参数设置直接输出排序好的bam文件
samtools sort -O bam -@ 5  -o ${sample}_star.bam ${sample}_starAligned.out.sam

把比对后的bam文件统一进行samtools flagstat统计,整理一下比对情况如下表:

bowtie2 bwa subjunc hisat star
4,457,852 5,028,723 4,457,852 4,636,906 3,732,166 in total (QC-passed reads + QC-failed reads)
0 570,871 0 179,054 327,880 secondary
0 0 0 0 0 supplementary
0 0 0 0 0 duplicates
2,846,896 4,898,291 3,694,805 3,418,483 3,732,166 mapped(比对率)
4,457,852 4,457,852 4,457,852 4,457,852 3,404,286 paired in sequencing
2,228,926 2,228,926 2,228,926 2,228,926 1,702,143 read1
2,228,926 2,228,926 2,228,926 2,228,926 1,702,143 read2
1,977,832 3,741,658 2,284,400(太低了) 3,014,912 3,404,286 properly paired(双端reads比对率)
2,568,660 4,320,816 3,531,396 3,150,876 3,404,286 with itself and mate mapped
278,236 6,604 163,409 88,553 0 singletons
65,730 102,160 41,864 8,860 0 with mate mapped to a different chr
25,582 77,062 41,864 6,474 0 with mate mapped to a different chr

让我更不能理解的是,这个是RNA-seq数据,可是居然用BWA可以成功比对率是最高的,虽然比对文件里面有着大量的H/S情况,但是不影响它仍然可以比对成功。高达97.41% 的比对率,还有83.93% 左右两端reads匹配的比对率,遥遥领先与其它比对工具。完全颠覆了我以前的想象,一直以为对转录组数据不能用bwa来比对。(大家可以思考一下其中的原理) 可以看到bwa默认是容许多比对情况的,一条reads可以输出多个比对记录。

## samtools flagstat SC2-01_S1_L002_bwa.bam
5028723 + 0 in total (QC-passed reads + QC-failed reads)
570871 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
4898291 + 0 mapped (97.41% : N/A)
4457852 + 0 paired in sequencing
2228926 + 0 read1
2228926 + 0 read2
3741658 + 0 properly paired (83.93% : N/A)
4320816 + 0 with itself and mate mapped
6604 + 0 singletons (0.15% : N/A)
102160 + 0 with mate mapped to a different chr
77062 + 0 with mate mapped to a different chr (mapQ>=5)

而subjunc的比对结果里面的双端reads比对率与全局reads的比对率始终是太诡异了,如下,明明还有82.88%的比对率,但是左右两端匹配的比对率下降到51.24% ,理论上我应该好好探究一下各种原因。不过,本文先放过这个细节。(欢迎大家留言提出自己的见解)

## samtools flagstat SC2-01_S1_L002_subjunc.bam
4457852 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
3694805 + 0 mapped (82.88% : N/A)
4457852 + 0 paired in sequencing
2228926 + 0 read1
2228926 + 0 read2
2284400 + 0 properly paired (51.24% : N/A)
3531396 + 0 with itself and mate mapped
163409 + 0 singletons (3.67% : N/A)
41864 + 0 with mate mapped to a different chr
41864 + 0 with mate mapped to a different chr (mapQ>=5)

而且这里star软件取了一个巧,只在bam文件里面输出那些成功比对的reads,这样比对率就都是100%啦,不过也可能是我对star软件的说明书没有吃透,参数不恰当。

而且我还应该再探究一下那些被subjunc比对不上的序列却能被bwa软件成功比对到参考基因组的是写什么情况。这个也以后再分享。

表达量探索

因为转录组测序最重要是得到表达量,我们并不关心这些reads具体比对到参考基因组是被切掉了还是跨越了内含子的成功比对 (如果要考虑转录本定量,或者可变剪切,就需要真正的跨越外显子的比对)

这里我们还是选用最开始便捷的featureCounts来进行基于基因的定量,代码如下;

featureCounts='/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/featureCounts';
mm10_gtf='/home/jianmingzeng/reference/gtf/gencode/gencode.vM12.annotation.gtf';
# Summarize paired-end reads and count fragments (instead of reads):
$featureCounts -T 5 -p -t exon -g gene_id \
-a $mm10_gtf -o counts.txt   *.bam

这里只有两个reads都成功比对到参考基因组才计数,所以数的是fragment数量,比对总结如下:

Status bowtie.bam bwa.bam hisat.bam star.bam subjunc.bam
Assigned 1,180,215 2,183,516 1,295,795 1,311,956 1,518,380
Unassigned_Unmapped 666,360 61,914 564,935 0 299,819
Unassigned_MappingQuality 0 0 0 0 0
Unassigned_Chimera 0 0 0 0 0
Unassigned_FragmentLength 0 0 0 0 0
Unassigned_Duplicate 0 0 0 0 0
Unassigned_MultiMapping 0 0 165,425 254,209 0
Unassigned_Secondary 0 0 0 0 0
Unassigned_Nonjunction 0 0 0 0 0
Unassigned_NoFeatures 282,738 408,715 228,180 232,006 307,639
Unassigned_Overlapping_Length 0 0 0 0 0
Unassigned_Ambiguity 99,613 145,544 73,526 67,912 103,088

可以看到bwa软件把几乎所有的RNA-seq测序的reads都成功的比对到了基因组,并且成功的注释到了基因,计数成为了基因的表达量。从这一点来说,BWA最大限度的保留了这个RNA-seq测序得到的基因表达信息,现在只需要探究它们的表达量的相关性,看看BWA那种截取reads的部分片段进行比对,是否会出现大幅度的基因定量的偏差。

简单的R代码探究如下:

a=read.table("tmp.txt",header=T) ## 49585 genes in gencode 
head(a)
cor(a) 
write.table(cor(a),file="tmp.sum",quote=F,sep="\t")
cor(a[order(apply(a,1,mad), decreasing = T)[1:50],])
cor(a[order(apply(a,1,mad), decreasing = T)[1:500],])
dim(a[rowSums(a)>10,]) ## 5823 expressed genes in single cells
cor(a[rowSums(a)>10,])

相关性结果如下:

bowtie.bam bwa.bam hisat.bam star.bam subjunc.bam
bowtie.bam 1 0.885440896 0.879128734 0.862654678 0.910767788
bwa.bam 0.885440896 1 0.921680804 0.923104636 0.954310013
hisat.bam 0.879128734 0.921680804 1 0.996418718 0.971852796
star.bam 0.862654678 0.923104636 0.996418718 1 0.970620816
subjunc.bam 0.910767788 0.954310013 0.971852796 0.970620816 1

发现只有bowtie2这个软件的比对跟其它工具得到的表达量相关性比较差。hisat,star,subjunc这3款软件都是设计来专门做转录组数据分析,所以它们之间的相关性非常高。

虽然相关性非常高,也还是有些基因区域被这5款软件计数后的表达量差异却非常大,如下:

ENSMUSG00000077463.1 chr1 24548841 24548977 - 137 0 0 0 0 0
ENSMUSG00000099997.1 chr1 24594087 24595423 - 1337 1 9 1 2 2
ENSMUSG00000100131.1 chr1 24611535 24612410 - 876 95 141 0 11 13
ENSMUSG00000067736.2 chr1 24612407 24612700 - 294 49 66 0 3 8
ENSMUSG00000101939.1 chr1 24612775 24613119 - 345 25 67 11 16 18
ENSMUSG00000101111.1 chr1 24613189 24613971 - 783 1581 2842 174 154 524
ENSMUSG00000100862.1 chr1 24613974 24614651 - 678 1088 1789 8 37 270
ENSMUSG00000102070.1 chr1 24614885 24615565 - 681 881 1468 10 60 276
ENSMUSG00000101249.1 chr1 24615706 24616197 - 492 1946 2883 34 101 664

上表中最后五列是不同比对得到的表达量,分别是bowie2,bwa,hisat2,star,subjunc。 可以看到对于这种单外显子基因来说,bwa和bowtie2比较类似,但是跟hisat2,star,subjunc这3款转录组比对工具完全不一样。真奇怪。

先用sickle把fastq过滤

过滤代码很简单

cat $config |while read id
do
    arr=($id)
    fq1=${arr[1]}
    fq2=${arr[2]}
    sample=${arr[0]}
    echo $sample
    ~/biosoft/sickle/sickle-master/sickle pe -q 30 -l 36 -g -f $fq1 -r $fq2 -t sanger -o $sample.clean.R1.fq.gz -p $sample.clean.R2.fq.gz -s $sample.trash.fq.gz  1 > $sample.log 2>&1
done

再用本文最开头的5个比对软件把它们再比对一次,总结比对情况,如下表:

bowtie2 bwa hisat2 star subjunc
3,618,770 35,014 3,775,457 2,925,028 3,618,770 in total (QC-passed reads + QC-failed reads)
0 416,244 156,687 277,104 0 secondary
0 0 0 0 0 supplementary
0 0 0 0 0 duplicates
2,367,830(65.43% 3,934,849(97.52% 2,853,784(84.90% 2,925,028 3,078,617(85.07% mapped
3,618,770 3,618,770 3,618,770 2,647,924 3,618,770 paired in sequencing
1,809,385 1,809,385 1,809,385 1,324,102 1,809,385 read1
1,809,385 1,809,385 1,809,385 1,323,822 1,809,385 read2
1,642,810(45.40% 3,072,416(84.90% 2,414,588(75.59% 2,645,342 1,941,150(53.64% properly paired (双端reads比对率)
2,138,294 3,516,004 2,637,368 2,645,342 2,991,132 with itself and mate mapped
229,536 2,601 59,729 2,582 87,485 singletons
53,050 62,750 10,862 0 34,144 with mate mapped to a different chr
19,185 52,158 6,479 0 34,144 with mate mapped to a different chr

可以看到之前的2.228M序列被过滤成了1.809M的序列,然后比对率都略微有提升。

同样的进行featureCounts定量,结果如下:

Status SC2-01_bowtie.bam SC2-01_bwa.bam SC2-01_hisat.bam SC2-01_star.bam SC2-01_subjunc.bam
Assigned 979,423 1,737,888 1,067,257 1,016,223 1,241,309
Unassigned_Unmapped 510,702 48,782 430,972 0 226,334
Unassigned_MappingQuality 0 0 0 0 0
Unassigned_Chimera 0 0 0 0 0
Unassigned_FragmentLength 0 0 0 0 0
Unassigned_Duplicate 0 0 0 0 0
Unassigned_MultiMapping 0 0 146,667 213,261 0
Unassigned_Secondary 0 0 0 0 0
Unassigned_Nonjunction 0 0 0 0 0
Unassigned_NoFeatures 239,597 325,603 194,491 182,369 259,568
Unassigned_Overlapping_Length 0 0 0 0 0
Unassigned_Ambiguity 79,663 113,301 59,189 52,063 82,174
上一篇 下一篇

猜你喜欢

热点阅读