Linux生信练习4-vcf

2020-08-11 本文已影响0人小贝学生信

作业原文:VCF格式文件的shell小练习 | 生信菜鸟团

原始数据准备

cd biosoft/bowtie2/bowtie2-2.4.1-linux-x86_64/example/reads/
../../bowtie2 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq | samtools sort -@ 5 -o tmp.bam - 
bcftools mpileup -f ../reference/lambda_virus.fa tmp.bam |bcftools call -vm > tmp.vcf

Q1

把突变记录的vcf文件区分成 INDEL和SNP条目

grep -v "^#" tmp.vcf | grep INDEL > tmp.indel.vcf
grep -v "^#" tmp.vcf | grep -v INDEL > tmp.snp.vcf

Q2

统计INDEL和SNP条目的各自的平均测序深度

cut -f 8 tmp.indel.vcf | cut -d";" -f 4 | less -SN
number=$(grep -c gi tmp.indel.vcf)
sum=$(cut -f 8 tmp.indel.vcf | cut -d";" -f 4 | cut -d"=" -f 2 | paste -s -d +|bc)
echo "$sum/$number" | bc 

cut -f 8 tmp.snp.vcf | cut -d";" -f 1 | less -SN
number=$(grep -c gi tmp.snp.vcf)
sum=$(cut -f 8 tmp.snp.vcf | cut -d";" -f 1 | cut -d"=" -f 2 | paste -s -d +|bc)
echo "$sum/$number" | bc

Q3

把INDEL条目再区分成insertion和deletion情况

head -1 tmp.indel.vcf | tr "\t" "\n" | cat -n
cat tmp.indel.vcf | awk 'length($4) > length($5){print}' > tmp.indel_del.vcf
cat tmp.indel.vcf | awk 'length($4) < length($5){print}' > tmp.indel_in.vcf

Q4

统计SNP条目的突变组合分布频率

cut -f 10 tmp.snp.vcf | cut -d":" -f 1 | sort | uniq -c

Q5

找到基因型不是 1/1 的条目，个数

grep -v "^#" tmp.vcf | grep -v 1/1 | wc

Q6

筛选测序深度大于20的条目

grep -E -o 'DP=[0-9]+' tmp.vcf | cut -d"=" -f 2 > DP_value
grep -v "^#" tmp.vcf | paste DP_value -  | awk '$1>20{print}' | cut -f 2- | less -SN
#或者
grep 'DP=[2-9][0-9]' tmp.vcf  #筛选大于20的巧妙思路

Q7

筛选变异位点质量值大于30的条目

cat tmp.vcf | awk '$6>30{print}' | less -SN

Q8

组合筛选变异位点质量值大于30并且深度大于20的条目

grep 'DP=[2-9][0-9]' tmp.vcf | awk '$6>30{print}' | less -SN

Q9

理解DP4=4,7,11,18 这样的字段，就是 Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases 计算每个变异位点的 AF

grep -E -o 'DP[0-9]=[0-9]+,[0-9]+,[0-9]+,[0-9]+' tmp.vcf > DP4_value
cut -d"=" -f 2 DP4_value | tr "," "\t" | awk '{print ($3+$4)/($1+$2+$3+$4)}'| wc
#注意awk后花括号两边的引号要用单引号，而不是双引号

Q10

在前面步骤的bam文件里面找到这个vcf文件的某一个突变位点的测序深度表明的那些reads，并且在IGV里面可视化bam和vcf定位到该变异位点。

暂时由于电脑原因，igv可视化一段时间回校后学习

head -1 tmp.snp.vcf | tr "\t" "\n" | less -SN
#序列的第1104个碱基，由C变成了A，DP=43
samtools view tmp.sorted.bam | awk '$4<=1104 && length($10)>= 1104-$4{print}' | wc
# 返回50行，应该是vcf根据质量过滤的一小部分。

Linux生信练习4-vcf

作业原文:VCF格式文件的shell小练习 | 生信菜鸟团

原始数据准备

Q1

Q2

Q3

Q4

Q5

Q6

Q7

Q8

Q9

Q10

猜你喜欢

热点阅读