2023-07-20 转录组数据比对-hisat2(需补充)
2023-08-07 本文已影响0人
麦冬花儿
软件下载与安装
# Installing HISAT2 (https://daehwankimlab.github.io/hisat2/download/)
#wget https://cloud.biohpc.swmed.edu/index.php/s/oTtGWbWjaxsQ2Ho/download -O ~/software/hisat2-2.2.1-Linux_x86_64.zip
unzip ~/software/hisat2-2.1.0-Linux_x86_64.zip -d /opt/biosoft/
echo 'PATH=$PATH:/opt/biosoft/hisat2-2.1.0/' >> ~/.bashrc
source ~/.bashrc
文件准备
train@MiWiFi-R3P-srv hisat2]$ ls
A.1.fastq A.2.fastq B.1.fastq B.2.fastq C.1.fastq C.2.fastq D.1.fastq D.2.fastq E.1.fastq E.2.fastq F.1.fastq F.2.fastq G.1.fastq G.2.fastq genome.fasta genome.gtf
genome.gtf#该文件包含详细的基因结构信息
#构建hisat2数据库文件
perl -e 'while (<>) { if (m/^#/) { print; next } $number ++; $num = "ms" . "0" x (7 - length($number)) . $number; s/\t\.\t/\t$num\t/; print; }' ~/00.incipient_data/data_for_variants_calling/variants.vcf > variants.vcf #可选
hisat2_extract_splice_sites.py genome.gtf > splitSites.txt #记录剪切位点信息,相当于intron信息
hisat2_extract_exons.py genome.gtf > exonSites.txt #记录外显子位点信息
hisat2_extract_snps_VCF.pl variants.vcf > variants.snp #记录snp位点信息
hisat2-build -p 8 --snp variants.snp --ss splitSites.txt --exon exonSites.txt genome.fasta genome #事实上以上文件都是可选项
hisat2 -x genome -p 8 --min-intronlen 20 --max-intronlen 4000 --dta --dta-cufflinks -1 A.1.fastq -2 A.2.fastq -S A.sam --new-summary --summary-file A.hisat2.summary" #参考基因组超过500M,建议将--max-intronlen 20000-100000之间,
也可以利用脚本计算intron长度
perl -e 'while ($_=<>{ @aa = split /t/,$_; push @intron_length, abs ($aa[1]) - $aa[2]) - })'
然后批量操作
for i in `ls *.1.fastq`
do
i=${i/.1.fastq/}
echo "hisat2 -x genome -p 8 --min-intronlen 20 --max-intronlen 4000 --dta --dta-cufflinks -1 $i.1.fastq -2 $i.2.fastq -S $i.sam --new-summary --summary-file $i.hisat2.summary --score-min L,-0.4,-0.4"
done > command.hisat2.list
sh command.hisat2.list # 可以使用ParaFly -c command.hisat2.list -CPU 8进行并行化计算
j结果如下
[train@MiWiFi-R3P-srv hisat2]$ ParaFly -c command.hisat2.list -CPU 8
Number of Commands: 7
HISAT2 summary stats:
Total pairs: 1287752
Aligned concordantly or discordantly 0 time: 29414 (2.28%)
Aligned concordantly 1 time: 997143 (77.43%)
Aligned concordantly >1 times: 257239 (19.98%)
Aligned discordantly 1 time: 3956 (0.31%)
Total unpaired reads: 58828
Aligned 0 time: 43806 (74.46%)
Aligned 1 time: 10321 (17.54%)
Aligned >1 times: 4701 (7.99%)
Overall alignment rate: 98.30%
HISAT2 summary stats:
Total pairs: 1417604
Aligned concordantly or discordantly 0 time: 376301 (26.54%)
Aligned concordantly 1 time: 213476 (15.06%)
Aligned concordantly >1 times: 827431 (58.37%)
Aligned discordantly 1 time: 396 (0.03%)
Total unpaired reads: 752602
Aligned 0 time: 742719 (98.69%)
Aligned 1 time: 1667 (0.22%)
Aligned >1 times: 8216 (1.09%)
Overall alignment rate: 73.80%
HISAT2 summary stats:
Total pairs: 1422090
Aligned concordantly or discordantly 0 time: 405241 (28.50%)
Aligned concordantly 1 time: 198821 (13.98%)
Aligned concordantly >1 times: 817669 (57.50%)
Aligned discordantly 1 time: 359 (0.03%)
Total unpaired reads: 810482
Aligned 0 time: 800919 (98.82%)
Aligned 1 time: 1552 (0.19%)
Aligned >1 times: 8011 (0.99%)
Overall alignment rate: 71.84%
HISAT2 summary stats:
Total pairs: 1423417
Aligned concordantly or discordantly 0 time: 399863 (28.09%)
Aligned concordantly 1 time: 203001 (14.26%)
Aligned concordantly >1 times: 820177 (57.62%)
Aligned discordantly 1 time: 376 (0.03%)
Total unpaired reads: 799726
Aligned 0 time: 790007 (98.78%)
Aligned 1 time: 1598 (0.20%)
Aligned >1 times: 8121 (1.02%)
Overall alignment rate: 72.25%
HISAT2 summary stats:
Total pairs: 1421496
Aligned concordantly or discordantly 0 time: 347667 (24.46%)
Aligned concordantly 1 time: 201568 (14.18%)
Aligned concordantly >1 times: 871884 (61.34%)
Aligned discordantly 1 time: 377 (0.03%)
Total unpaired reads: 695334
Aligned 0 time: 684191 (98.40%)
Aligned 1 time: 1814 (0.26%)
Aligned >1 times: 9329 (1.34%)
Overall alignment rate: 75.93%
HISAT2 summary stats:
Total pairs: 1369754
Aligned concordantly or discordantly 0 time: 21871 (1.60%)
Aligned concordantly 1 time: 468062 (34.17%)
Aligned concordantly >1 times: 878564 (64.14%)
Aligned discordantly 1 time: 1257 (0.09%)
Total unpaired reads: 43742
Aligned 0 time: 28517 (65.19%)
Aligned 1 time: 3481 (7.96%)
Aligned >1 times: 11744 (26.85%)
Overall alignment rate: 98.96%
HISAT2 summary stats:
Total pairs: 1378862
Aligned concordantly or discordantly 0 time: 19572 (1.42%)
Aligned concordantly 1 time: 459700 (33.34%)
Aligned concordantly >1 times: 898631 (65.17%)
Aligned discordantly 1 time: 959 (0.07%)
Total unpaired reads: 39144
Aligned 0 time: 27029 (69.05%)
Aligned 1 time: 2831 (7.23%)
Aligned >1 times: 9284 (23.72%)
Overall alignment rate: 99.02%
All commands completed successfully. :-)
[train@MiWiFi-R3P-srv hisat2]$ ls
A.1.fastq B.2.fastq C.hisat2.summary D.2.fastq E.hisat2.summary F.hisat2.summary genome.2.ht2 genome.7.ht2 G.sam
A.2.fastq B.hisat2.summary command.hisat2.list D.hisat2.summary E.sam F.sam genome.3.ht2 genome.8.ht2 splitSites.txt
A.hisat2.summary B.sam command.hisat2.list.completed D.sam exonSites.txt G.1.fastq genome.4.ht2 genome.fasta variants.snp
A.sam C.1.fastq C.sam E.1.fastq F.1.fastq G.2.fastq genome.5.ht2 genome.gtf variants.vcf
B.1.fastq C.2.fastq D.1.fastq E.2.fastq F.2.fastq genome.1.ht2 genome.6.ht2 G.hisat2.summary