通过简单数据熟悉Linux下生物信息学各种操作3

2019-07-01 本文已影响0人 Y大宽

原地址
几点说明
1.非简单翻译，所有代码均可运行，为了辅助理解，基本每步代码都有结果，需要比较的进行了整合
2.原文中的软件都下载最新版本
3.原文中有少量代码是错误的，这里进行了修正
4.对于需要的一些知识背景，在这里进行了注释或链接到他人博客

一共4部分
通过简单数据熟悉Linux下生物信息学各种操作1
通过简单数据熟悉Linux下生物信息学各种操作2
通过简单数据熟悉Linux下生物信息学各种操作3
通过简单数据熟悉Linux下生物信息学各种操作4

15`awk`的简单使用

15.1提取Ebola的coding feature，genes和coding sequences

efetch -db nucleotide -id NC_002549.1 -format gb > NC.gb
 ~/bin/readseq -format=GFF -o NC.gff NC.gb

找到每个feature的长度

cat NC.gff |awk '{print $1,$2,$3}'|head -5

##gff-version 2 
# seqname source
NC_002549 - source
NC_002549 - 5'UTR
NC_002549 - gene

cat NC.gff|cut -f 1,2,3|head -5

##gff-version 2
# seqname   source  feature
NC_002549   -   source
NC_002549   -   5'UTR
NC_002549   -   gene

几乎等同于上个命令。
计算每个feature的长度

cat NC.gff | awk ' { print $3, $5-$4 + 1 } ' | head -5

 1
source 1
source 18959
5'UTR 55
gene 2971

仅提取CDS features

cat NC.gff|awk '$3=="CDS" {print $3,$5-$4+1,$9}'

CDS 2220 gene
CDS 1023 gene
CDS 981 gene
CDS 885 group
CDS 1146 group
CDS 1095 gene
CDS 884 group
CDS 10 group
CDS 867 gene
CDS 756 gene
CDS 6639 gene

计算所有gene的累积长度

cat NC.gff | awk '$3 =="gene" { len=$5-$4 + 1; size += len; print "Size:", size } '

size+=len代表size=size+len，也就是在第一个len的基础上依次递加。为了清楚表示，不用这个运算符，比对结果看

cat NC.gff | awk '$3 =="gene" { len=$5-$4 + 1; size += len; print "Size:", size } '

Size: 2971
Size: 4347
Size: 5852
Size: 8258
Size: 9711
Size: 11345
Size: 18127

cat NC.gff | awk '$3 =="gene" { print $3, $5-$4 + 1, $9 } '

gene 2971 gene
gene 1376 gene
gene 1505 gene
gene 2406 gene
gene 1453 gene
gene 1634 gene
gene 6782 gene

计算基因组有多大，有几种方式

samtools view -h results.bam |head -2

samtools view -h results.bam |head -2
@HD VN:1.6  SO:coordinate
@SQ SN:NC_002549.1  LN:18959

可见，大小是18959
基因占基因组的百分比
我的NC.gff文件出现问题了，重新做一个

readseq -format=GFF -o NC.gff NC.gb

原文的代码是

cat NC.gff | awk '$3 =="CDS" { len=$5-$4 + 1; size += len; perc=size/18959; print "Size:", size, perc } '

以下代码我稍作了改动，以更好的显示。

cat NC.gff |awk '$3=="CDS" {len=$5-$4+1; size+=len;perc=size*100/18959;print "Size:", size, perc,"%"}'

Size: 2220 11.7095 %
Size: 3243 17.1053 %
Size: 4224 22.2797 %
Size: 5109 26.9476 %
Size: 6255 32.9922 %
Size: 7350 38.7679 %
Size: 8234 43.4306 %
Size: 8244 43.4833 %
Size: 9111 48.0563 %
Size: 9867 52.0439 %
Size: 16506 87.0616 %

现在用genes和codings sequence做新的gff文件
首先，文件的首行定义file的类型

head -1 NC.gff  > NC-genes.gff
head -1 NC.gff  > NC-cds.gff

然后以不同的feature（这里是gene和cds）进行区分

cat NC.gff | awk -F '\t' -v OFS='\t'  '$3=="gene" { print $0 }' >> NC-genes.gff
cat NC.gff | awk -F '\t' -v OFS='\t' '$3=="CDS" { print $0 }'  >> NC-cds.gff

这里我试了下，不用-F '\t' -v OFS='\t'参数也一样
还需要细看，暂留个

记号

sam files是tab分隔的，可以awk处理
多少数据有超过50的coverage

samtools depth results.bam |awk '$3>50 {print $0}'|wc -l

有多少templates长度大于500bp，注意其长度可以为负

samtools view results.bam |awk '$9>500 {print $0}'|wc -l

分隔GFF文件获取gene name

18 比较比对工具

下载安装bowtie2

curl -OL https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.3.5.1/bowtie2-2.3.5.1-macos-x86_64.zip/download
unzip bowtie2-2.3.5.1-macos-x86_64.zip 
#做链接
ln -s ~/src/bowtie2-2.3.5.1-macos-x86_64/bowtie2 ~/bin/
ln -s ~/src/bowtie2-2.3.5.1-macos-x86_64/bowtie2-align-l ~/bin/
ln -s ~/src/bowtie2-2.3.5.1-macos-x86_64/bowtie2-build ~/bin/

为参考genome建立索引

bowtie2-build ~/refs/852/NC.fa NC.fa

格式化mutations，以便在浏览器展示，做成gff文件
先看一下mutation files

NC_002549.1 165 A   C   -
NC_002549.1 1164    T   K   +
NC_002549.1 2691    T   Y   +
NC_002549.1 4436    A   R   +
NC_002549.1 6620    C   G   -
NC_002549.1 7709    T   Y   +
NC_002549.1 10864   G   A   -
NC_002549.1 11216   TT  -   -
NC_002549.1 11490   G   R   +
NC_002549.1 11677   T   Y   +
NC_002549.1 12430   C   S   +
NC_002549.1 12887   C   T   -
NC_002549.1 13155   C   Y   +
NC_002549.1 13767   T   W   +
NC_002549.1 17580   G   R   +

cat mutations.txt | awk ' {print $1, "wgsim", "mutation", $2, $2, ".", "+",  ".", "." }' > mutations.gff

查看mutations.gff

NC_002549.1 wgsim mutation 165 165 . + . .
NC_002549.1 wgsim mutation 1164 1164 . + . .
NC_002549.1 wgsim mutation 2691 2691 . + . .
NC_002549.1 wgsim mutation 4436 4436 . + . .
NC_002549.1 wgsim mutation 6620 6620 . + . .
NC_002549.1 wgsim mutation 7709 7709 . + . .
NC_002549.1 wgsim mutation 10864 10864 . + . .
NC_002549.1 wgsim mutation 11216 11216 . + . .
NC_002549.1 wgsim mutation 11490 11490 . + . .
NC_002549.1 wgsim mutation 11677 11677 . + . .
NC_002549.1 wgsim mutation 12430 12430 . + . .
NC_002549.1 wgsim mutation 12887 12887 . + . .
NC_002549.1 wgsim mutation 13155 13155 . + . .
NC_002549.1 wgsim mutation 13767 13767 . + . .
NC_002549.1 wgsim mutation 17580 17580 . + . .

原文中接着进行比较，然后samtools view查看bwa和bowtie2产生的文件，可以直接看下面的最终脚本

下面是最终脚本（比较bwa和bowtie2的比对结果），注意索引不一样

READ1=read1.fq
READ2=read2.fq
#bwa的索引
REFS=~/refs/852/NC.fa
#bowtie2的索引
~/src/bowtie2-2.3.5.1-macos-x86_64/bowtie2-build ~/refs/852/NC.fa NC_bow.fa

从基因组模拟reads

wgsim -N 10000 -e 0.1 $REFS $READ1 $READ2 > mutations.txt

mutation file转变为gff文件

cat mutations.txt | awk -F '\t' ' BEGIN { OFS="\t"; print "##gff-version 2" } { print $1, "wgsim", "mutation", $2, $2, ".", "+",  ".", "name " $3 "/" $4 }' > mutations.gff

增加read group to the mapping

GROUP='@RG\tID:123\tSM:liver\tLB:library1'

分别运行bwa和bowtie2分别产生bwa.sam和bow.sam
bwa

bwa mem -R $GROUP $REFS $READ1 $READ2 > bwa.sam

bowtie2
一定记得index和bam的不一样

~/src/bowtie2-2.3.5.1-macos-x86_64/bowtie2 -x NC_bow.fa -1 $READ1 -2 $READ2 >bow.sam

10000 reads; of these:
  10000 (100.00%) were paired; of these:
    5107 (51.07%) aligned concordantly 0 times
    4893 (48.93%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    ----
    5107 pairs aligned concordantly 0 times; of these:
      4830 (94.58%) aligned discordantly 1 time
    ----
    277 pairs aligned 0 times concordantly or discordantly; of these:
      554 mates make up the pairs; of these:
        391 (70.58%) aligned 0 times
        163 (29.42%) aligned exactly 1 time
        0 (0.00%) aligned >1 times
98.05% overall alignment rate

我这个地方bwa比对没问题，但是一到bowtie2就报下面的错误

Saw ASCII character 10 but expeacted 33-based Phred qual. libc++abi.dylib: terminating with uncaught exception of type int

后来查看read1和read2的大小和序列，完全一致。问题不在这里
最后发现竟然是因为read1和read2没有写权限。增加后即可正常运行。

把sam文件转换为bam文件

samtools view -Sb bow.sam >tmp.bam
samtools sort tmp.bam bow.bam
samtools sort tmp.bam -o bow.bam

同样对bow.sam进行转换
以上可以写成脚本(原文此处有错)

for name in *.sam;
do
    samtools view -Sb $name > $name.tmp.bam
    samtools sort -f $name.tmp.bam -o $name.bam
done

建立索引

for name in *.bam
do
    samtools index $name
done

最后结果如下

-rw-r--r--  1 ucco  staff   768K  7  2 23:14 bow.bam
-rw-r--r--  1 ucco  staff   152B  7  2 23:22 bow.bam.bai
-rw-r--r--  1 ucco  staff   5.8M  7  2 23:06 bow.sam
-rw-r--r--  1 ucco  staff   741K  7  2 23:19 bwa.bam
-rw-r--r--  1 ucco  staff   152B  7  2 23:22 bwa.bam.bai
-rw-r--r--  1 ucco  staff   5.5M  7  2 23:18 bwa.sam

19 samtools faidx和pileups

用samtools对fasta文件进行索引

samtools faidx ~/refs/852/NC.fa

查询

samtools faidx ~/refs/852/NC.fa NC_002549:1-10
>NC_002549:1-10

cggacacaca

可以同时查询多个范围

samtools faidx ~/refs/852/NC.fa NC_002549:1-10 NC_002549:100-110

>NC_002549:1-10
cggacacaca
>NC_002549:100-110
tttaaattgaa

用samtools的mpileup命令看有参考基因组和没有参考基因组的差异
关于mpileup的用法见[samtools]mpileup命令简介

#没有参考基因组
samtools mpileup -Q 0 bwa.bam | less
#有参考基因组
samtools mpileup -f ~/refs/852/NC.fa -Q 0 bwa.bam | less

为了只管显示，结果放一起

左无右有

通过简单数据熟悉Linux下生物信息学各种操作3

15`awk`的简单使用

15.1提取Ebola的coding feature，genes和coding sequences

记号

分隔GFF文件获取gene name

18 比较比对工具

下面是最终脚本（比较bwa和bowtie2的比对结果），注意索引不一样

19 samtools faidx和pileups

猜你喜欢

热点阅读

通过简单数据熟悉Linux下生物信息学各种操作3

15awk的简单使用

15.1提取Ebola的coding feature，genes和coding sequences

记号

分隔GFF文件获取gene name

18 比较比对工具

下面是最终脚本（比较bwa和bowtie2的比对结果），注意索引不一样

19 samtools faidx和pileups

猜你喜欢

热点阅读

15`awk`的简单使用