通过简单数据熟悉Linux下生物信息学各种操作4
原地址
几点说明
1.非简单翻译,所有代码均可运行,为了辅助理解,基本每步代码都有结果,需要比较的进行了整合
2.原文中的软件都下载最新版本
3.原文中有少量代码是错误的,这里进行了修正
4.对于需要的一些知识背景,在这里进行了注释或链接到他人博客
一共4部分
通过简单数据熟悉Linux下生物信息学各种操作1
通过简单数据熟悉Linux下生物信息学各种操作2
通过简单数据熟悉Linux下生物信息学各种操作3
通过简单数据熟悉Linux下生物信息学各种操作4
20 Pileup和Coverage
上面已经得到的mutation files
需要用到前面的内容
20.1 安装bcftools suite以便处理vcf文件(variant call formats)
cd src
curl -OL http://sourceforge.net/projects/samtools/files/samtools/1.9/bcftools-1.9.tar.bz2
tar jxvf bcftools-1.9.tar.bz2
cd bcftools-1.9
make
ln -s ~/src/bcftools-1.9/bcftools ~/bin
每个run的平均coverage是什么
默认,samtools depth只输出非0 coverage的区域
samtools depth bwa.bam | awk '{ sum += $3 } END { print sum/NR } '
73.8133
以下内容部分和前面重复
20.2 计算genome大小
samtools view -h bow.bam |head| samtools view -h bow.bam |head|grep @SQ
@SQ SN:NC_002549 LN:18959
samtools depth bwa.bam | awk '{ sum += $3 } END { print sum/18959 } '
结果仍然是
73.7938
20.3 分别看有何无参考基因组的pileups
前已述及,不再展示图
samtools mpileup -Q 0 bwa.bam | more
samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam | more
20.4samtools tview(samtools文本基因组浏览器)
samtools tview bwa.bam
1 11 21 31 41 51 61 71
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
CACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC
CACACAAAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTG
CACAAAAAGAAAGAAGAATTTTTAGGATCTTTAGTGTGCGAATAACTATGAGGAATATTAATAATTTACCTCTC
AAAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC
AAAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC
AAGAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC
GAAAGAAGAATTTTTAGGATCTTTTGTGTGCGAATAAGTATGAGGAAGATTAATAATTTTCCTCTC
AGAAGAATTTTTAGGATCTTTTGTGTTCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC
AAGAATTTTTAGGATCTTTTGTGTGGGAATAACTATGAGGAAGATTCAAAATTTTCCTCTC
AGAATTTTTAGTATCTTTTGAGTGCGACTAACTATGAGGAAGATTAATAATTTTCCTCTC
AATTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCAC
AATTTTTAGGATCTTTTGTGTGCGAATCACTATGAGGAAGATTAATAATTTTCCTCTC
ATTATTAGGATCTTTTGTGTGCGAATAACTATGAGGACGATTAATAATTTTCCTCTC
TTTTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATACTTTTCCTCTC
TTTAGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC
TTAGGATCTTTTGTGTGCGAATAACTATGATGAAGATTAATAATTTTCCTCTC
AGGATCTTTTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC
TATTGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC
TTTTGTGTGCGAATAAGTATGAGGAAGATTAATAATTTTCCTCTC
TTTGTGTGCGAATAACTATGAGGAAGATTAATCATTTTCCTCTC
ATGTGTGCGAATAACTATGAGGAAGATTAATAATTTTCCTCTC
GTGCGAATAAGTATGCGGCAGATTAATAATTTTCCTCTC
TAACTATGAGGAAGATTAATAATTTTCCTCTC
TAACTATGAGGAAGATTAATAATTTTCCTCTC
CTATGAGGAAGATTAATAATTTTGCTCTC
ATGAGGAAGATTAATAATTTTCCTCTC
TGCGGAAGATTAATAATTTTCCTCTC
AGGAAGATAAATAATTTTCCTCTC
AGATTAATAATTTTCCTCTC
AGATTAATAATTTTCCTCTC
TTAATAATTTTCCTCTC
ATTTTCCTCTC
TTTTCCTCTC
TTTTCCTCTC
TTTCCTCTC
TCTC
CTG
C
20.5 对整个genome生成vcf
samtools mpileup -Q 0 -f ~/refs/852/NC.fa -uv bwa.bam | more
部分结果如下
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT liver
NC_002549 5 . C <*> 0 . DP=1;I16=1,0,0,0,17,289,0,0,60,3600,0,0,0,0,0,0;QS=1,0;MQ0F=0 PL 0,3,17
NC_002549 6 . A <*> 0 . DP=1;I16=1,0,0,0,17,289,0,0,60,3600,0,0,1,1,0,0;QS=1,0;MQ0F=0 PL 0,3,17
NC_002549 7 . C <*> 0 . DP=2;I16=2,0,0,0,34,578,0,0,120,7200,0,0,2,4,0,0;QS=1,0;MQ0F=0 PL 0,6,31
NC_002549 8 . A <*> 0 . DP=2;I16=2,0,0,0,34,578,0,0,120,7200,0,0,4,10,0,0;QS=1,0;MQ0F=0 PL 0,6,31
NC_002549 9 . C <*> 0 . DP=2;I16=2,0,0,0,34,578,0,0,120,7200,0,0,6,20,0,0;QS=1,0;MQ0F=0 PL 0,6,31
NC_002549 10 . A <*> 0 . DP=2;I16=2,0,0,0,34,578,0,0,120,7200,0,0,8,34,0,0;QS=1,0;MQ0F=0 PL 0,6,31
NC_002549 11 . A <*> 0 . DP=3;I16=3,0,0,0,51,867,0,0,180,10800,0,0,10,52,0,0;QS=1,0;MQ0F=0 PL 0,9,42
NC_002549 12 . A <*> 0 . DP=4;I16=4,0,0,0,68,1156,0,0,240,14400,0,0,13,75,0,0;QS=1,0;MQ0F=0 PL 0,12,50
NC_002549 13 . A <*> 0 . DP=5;I16=5,0,0,0,85,1445,0,0,300,18000,0,0,17,105,0,0;QS=1,0;MQ0F=0 PL 0,15,57
NC_002549 14 . A <*> 0 . DP=5;I16=5,0,0,0,85,1445,0,0,300,18000,0,0,22,144,0,0;QS=1,0;MQ0F=0 PL 0,15,57
NC_002549 15 . G <*> 0 . DP=6;I16=6,0,0,0,102,1734,0,0,360,21600,0,0,27,193,0,0;QS=1,0;MQ0F=0 PL 0,18,62
NC_002549 16 . A <*> 0 . DP=6;I16=6,0,0,0,102,1734,0,0,360,21600,0,0,33,253,0,0;QS=1,0;MQ0F=0 PL 0,18,62
NC_002549 17 . A <*> 0 . DP=6;I16=6,0,0,0,102,1734,0,0,360,21600,0,0,39,325,0,0;QS=1,0;MQ0F=0 PL 0,18,62
NC_002549 18 . A <*> 0 . DP=7;I16=7,0,0,0,119,2023,0,0,420,25200,0,0,45,409,0,0;QS=1,0;MQ0F=0 PL 0,21,66
NC_002549 19 . G <*> 0 . DP=7;I16=7,0,0,0,119,2023,0,0,420,25200,0,0,52,506,0,0;QS=1,0;MQ0F=0 PL 0,21,66
NC_002549 20 . A <*> 0 . DP=8;I16=8,0,0,0,136,2312,0,0,480,28800,0,0,59,617,0,0;QS=1,0;MQ0F=0 PL 0,24,69
NC_002549 21 . A <*> 0 . DP=9;I16=9,0,0,0,153,2601,0,0,540,32400,0,0,67,743,0,0;QS=1,0;MQ0F=0 PL 0,27,72
20.6 生成genotypes然后call variants
samtools mpileup -Q 0 -ugf ~/refs/852/NC.fa bwa.bam | bcftools call -vc > samtools.vcf
查看snp calls,知道snp calling的原理
这里原作者写了一个python脚本名字为snpcaller.py.但我还没找到
代码先放这里,但不影响后续分析。做个
记号
cat bwa.pileup | python snpcaller.py > diy.vcf
cat samtools.vcf |grep -v "##"|cut -f 1-5|head
#CHROM POS ID REF ALT
NC_002549 425 . G T,A
NC_002549 507 . G T
NC_002549 2846 . a aT
NC_002549 2847 . g gT
NC_002549 7894 . A C
NC_002549 9569 . G A
NC_002549 12101 . T A
NC_002549 12957 . G A
NC_002549 14086 . A G
20.7安装freebays以call variants
首先安装Freebays,mac需要另外一个cmake才可以对freebays进行编译,所以先安装cmake
brew install cmakebrew install cmake
安装freebays
关于freebays,可以看下下面的文章
In-depth-NGS-Data-Analysis-Course
Calling variants with freebayes
cd ~/src
git clone --recursive git://github.com/ekg/freebayes.git
cd freebayes
make
注意原文中说,这个地方有个bug,make第一次会报错,然后再make一次就OK。我这里还是显示如下错误
fatal error: 'lzma.h' file not found
通过如下方式解决
brew install xz
做链接
ln -sf ~/src/freebayes/bin/freebayes ~/bin
20.8用安装的FreeBayes call snps
freebayes -f ~/refs/852/NC.fa bwa.bam > freebayes.vcf
对结果进行可视化。关于vcf格式请参考这篇文章VCF格式
Lecture 22变异效应预测
安装bioawk
cd ~/src
git clone https://github.com/lh3/bioawk.git
cd bioawk
make
ln -sf ~/src/bioawk/bioawk ~/bincd ~/src
git clone https://github.com/lh3/bioawk.git
cd bioawk
make
ln -sf ~/src/bioawk/bioawk ~/bin
bioawk的manual在这里
man ~/src/bioawk/awk.1
bioawk能辨识生物信息学类型。
cat r1.fq | bioawk -c fastx ' { print $seq } ' | head -1
cat mutations.gff | bioawk -c gff ' { print $seqname, $start } ' | head -1
cat mutations.gff | bioawk -c gff ' { print $seqname, $end-$start+1 } ' | head -1
下载snpEff,注意原文中下载地址有错
curl -OL https://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip/download
可以向trimmomatic和readseq一样设置snpEff,现在用alias的方式设置
alias snpEff='java -jar ~/src/snpEff/snpEff.jar'