Linux与生物信息Linux基因组组装

一些常用小脚本记录

2023-01-11  本文已影响0人  geneonto

记录下小编自己可能用到的小脚本

(1)fasta转phylip格式

用法:python fa2phy.py seq.fasta seq.phy

import sys
file = open(sys.argv[1],"r")
out = open(sys.argv[2],"w")
for line in file:
    line = line.strip()
    if line.startswith(">"):
        out.write(line[1:] + "\t")
    else:
        out.write(line + "\n")       
file.close()
out.close()

(2)fastq转fasta;根据ID提取fasta中的序列(需安装seqtk软件)

seqtk seq -a seq.fastq >seq.fasta
seqtk subseq seq.fasta ID.list >need.fasta

(3)从gff3文件中提取一系列需要的gff3信息

**用法:sh get-gff3.sh ID.list **

while read line
do
    `grep -w $line EVM.final.all.transcripts.gene.gff3 >>EVM.final.all.transcripts.gene.new.gff3`
done<$1

(4)根据某一文件某列ID从另一个文件中提取有该列ID的信息(使用最频繁)

awk 'NR==FNR{a[$1]=$0}NR!=FNR{if(a[$2]){print $0"\t"a[$2]}else{print $0}}'   file1 file2

file1文件如下:

EVM0003758  Cgyptg000001l_1G000010
EVM0027620  Cgyptg000001l_1G000020
EVM0001766  Cgyptg000001l_1G000030
EVM0031262  Cgyptg000001l_1G000040
EVM0027288  Cgyptg000001l_1G000050
EVM0007809  Cgyptg000001l_1G000060
EVM0018470  Cgyptg000001l_1G000070
EVM0030012  Cgyptg000001l_1G000080
EVM0007598  Cgyptg000001l_1G000090
EVM0016935  Cgyptg000001l_1G000100

file2文件如下:

Cgyptg000001l_1G000030
Cgyptg000001l_1G000050
Cgyptg000001l_1G000070
Cgyptg000001l_1G000080

脚本后

Cgyptg000001l_1G000030  EVM0001766  Cgyptg000001l_1G000030
Cgyptg000001l_1G000050  EVM0027288  Cgyptg000001l_1G000050
Cgyptg000001l_1G000070  EVM0018470  Cgyptg000001l_1G000070
Cgyptg000001l_1G000080  EVM0030012  Cgyptg000001l_1G000080

(5)统计一个文件中,每一行各个元素的个数(例如每行可能有A B C D四个元素)

**用法:python count.py file ID.list #ID.list为含有A B C D的文件list

import sys
from collections import Counter
file = open(sys.argv[1],"r")
file2 = open(sys.argv[2],"r")
out = open(sys.argv[3],"w")
for line in file:
    line = line.strip().split("\t")
    cou = Counter(line[2:])
    for line2 in file2:
        line2 = line2.strip()
        if line2 not in cou.keys():
            cou[line2] = "0"
        else:
            pass
    out.write(line[0] + "\t" + str(line[1]) + "\t" + str(cou["A"]) + "\t" + str(cou["B"]) + "\t" + str(cou["C"]) + "\t" + str(cou["D"]) + "\n")
file.close()
file2.close()
out.close()

(6)划窗计算pi跟theta w,分两步,首先vcftools划窗计算pi,然后用awk计算theta w

vcftools --vcf sample_snp.vcf --window-pi 10000 --out samlpe.windowed.pi
tail -n +2 samlpe.windowed.pi | awk -F$'\t' -v popn=5 'BEGIN{for(z=1; z < 2*popn; ++z) w_p+=1/z;}{w=$4/w_p/($3-$2+1); a_w += w; a_pi += $5}END{print a_w/NR, a_pi/NR;}' #这里的 popn为vcf文件中样本个数

(7)提取某fasta文件中某ID的某区间序列,例如提取染色体chr1的100-200位置序列

samtools faidx input.fa chr1:100-200 > chr1.fa

(8)提取vcf文件中某区间的vcf文件,例如提取vcf文件中,染色体Chr01 1-1000bp区间的vcf文件

vcftools --vcf vcf --chr Chr01 --from-bp 1 --to-bp 1000 --recode --out Chr01

(9)PDF、svg转png(需安装有convert)

convert  -resize 3600 -density 300 -quality 600 file.pdf file.png
convert  -resize 3600 -density 300 -quality 600 file.svg file.png

(10)mummer比对

nucmer -t 12 --mum -p ref_seq ref.genome.fa seq.genome.fa #比对
delta-filter -i 90 -l 10000  -r -q ref_seq.delta -1> ref_seq.filter.delta.10k #过滤
mummerplot --png  -p ref_seq.10k ref_seq.filter.delta.10k #作图
mummerplot -p ref_seq.filter.delta.10k -Q ref.ID -R seq.ID --png #按照选中的ID作图
show-coords ref_seq.filter.delta.10k >ref_seq.filter.delta.10k.coords #文件格式化
上一篇 下一篇

猜你喜欢

热点阅读