根据注释文件中基因位置提取序列
2022-07-31 本文已影响0人
生信小白杀手
在基本的生信分析中,我们有时候需要某些基因的序列,而下载的参考基因组包中,一般不会提供每个基因的序列。下面分享一个简单的根据gff或gtf注释文件提取目的基因序列的python脚本。
Exact_sq.py
#usage:python Exact_sq.py ref_gene.fa annotation.gff/gtf
import sys
ref_sq=sys.argv[1]
anno_gene=sys.argv[2]
with open(ref_sq,"r") as ref_file:
ref_sq_lib={}
for line in ref_file:
line=line.strip()
if line[0] == ">":
chr_id=line
ref_sq_lib[chr_id]=""
else:
ref_sq_lib[chr_id]=ref_sq_lib[chr_id]+line
out=open("Matched_sq.fa","a+")
with open(anno_gene,"r") as anno_file:
for info in anno_file:
info=info.strip()
chr=info.split("\t")[0]
star_pos=int(info.split("\t")[3])-1
term_pos=int(info.split("\t")[4])-1
id=info.split("\t")[8].split(";")[0].split(":")[1]
for ref_chr in ref_sq_lib.keys():
if " " in ref_chr:
ref_chr1=ref_chr.split(" ")[0].split(">")[1]
if chr == ref_chr1:
matched_sq=ref_sq_lib[ref_chr][star_pos:term_pos]
out.write(">"+id+"\n"+matched_sq+"\n")
else:
if ">"+chr == ref_chr:
matched_sq=ref_sq_lib[ref_chr][star_pos:term_pos]
out.write(">"+id+"\n"+matched_sq+"\n")
用法为python Exact_sq.py 参考基因组序列文件 目的基因的注释文件
最终生成一个Matched_sq.fa文件。