Extracting the Accession Number and Nucleotide Sequence from a GenBank File
2019-01-23
lizg
1. In the Nucleotide database of NCBI GenBank, search for gene: IL10, download the record in GenBank format, and save it as IL10_Genebank.gb:
LOCUS DQ977084 1925 bp DNA linear PRI 14-JUL-2016
DEFINITION Macaca nemestrina IL10 (IL10) gene, partial cds.
ACCESSION DQ977084
VERSION DQ977084.1
KEYWORDS .
SOURCE Macaca nemestrina (pig-tailed macaque)
ORGANISM Macaca nemestrina
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Cercopithecidae; Cercopithecinae; Macaca.
REFERENCE 1 (bases 1 to 1925)
AUTHORS Nickel,G.C., Tefft,D.L., Goglin,K. and Adams,M.D.
TITLE An empirical test for branch-specific positive selection
JOURNAL Genetics 179 (4), 2183-2193 (2008)
PUBMED 18689901
REFERENCE 2 (bases 1 to 1925)
AUTHORS Nickel,G.C., Tefft,D.L., Trevarthen,K., Funt,J. and Adams,M.D.
TITLE Positive Selection in Transcription Factor Genes on the Human
Lineage
JOURNAL Unpublished
REFERENCE 3 (bases 1 to 1925)
AUTHORS Nickel,G.C., Tefft,D.L., Trevarthen,K., Funt,J. and Adams,M.D.
TITLE Direct Submission
JOURNAL Submitted (31-AUG-2006) Dept. of Genetics, Case Western Reserve
University, 10900 Euclid Ave, Cleveland, OH 44106, USA
FEATURES Location/Qualifiers
source 1..1925
/organism="Macaca nemestrina"
/mol_type="genomic DNA"
/db_xref="taxon:9545"
gene <347..>1831
/gene="IL10"
mRNA <347..>511
/gene="IL10"
/product="IL10"
CDS 347..>511
/gene="IL10"
/codon_start=1
/product="IL10"
/protein_id="ABM88029.1"
/translation="MHSSALLCCLVLLTGVRASPGQGTQSENSCTRFPGNLPHMLRDL
RDAFSRVKTFF"
exon <347..511
/gene="IL10"
/number=1
gap 628..727
/estimated_length=unknown
mRNA join(<955..1020,1739..>1831)
/gene="IL10"
/product="IL10"
CDS join(<955..1020,1739..1831)
/gene="IL10"
/codon_start=1
/product="IL10"
/protein_id="ABM88030.1"
/translation="HRFLPCENKSKAVEQVKNAFSKLQEKGVYKAMSEFDIFINYIEA
YMTMKIQN"
exon 955..1020
/gene="IL10"
/number=4
gap 1293..1392
/estimated_length=unknown
exon 1739..>1831
/gene="IL10"
/number=5
ORIGIN
1 catgagctgt tctccccagg aaatcaactt tttttaattg agaagctaaa aaattattct
61 aagagaggta gcccatccta aaaatagctg tgcagaagtt catgttcaac caatcctttt
121 tgcttacgat gcaaaatttg aaaactaagt ttattagaga ggttagagaa ggaggagctc
181 taagcagaaa aaatcctgtg ccgggaaacc tgtgattgtg gctttttatg aatgaagagg
241 cctccctgag cttacaatat aaaaggggga cagagaggtg aaggtctaca catcaggggc
301 ttgctcttgc aaaaccaaac cacaagacag acttgcaaaa gaaggcatgc acagctcagc
361 actgctctgt tgcctagtcc tcctgactgg ggtgagggcc agcccaggcc agggcaccca
421 gtctgagaac agctgcaccc gcttcccagg caacctgcct cacatgcttc gagacctccg
481 agatgccttc agcagagtga agactttctt tgtgagtatg attccctcct gtgctttctc
541 tcttcctggg actgcctgaa ctaggcattt tcctggagct ataagaagaa ccctcctcct
601 gtgcctccac ttccatcccc aacacctnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
661 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
721 nnnnnnntcg gagtgggtcc tggagaaata cattttatct cccagggccg tggttcttct
781 ctgacctttg gatagttagt aagggtgaag cagggctcag ttctctctgg gagctgtgag
841 gcgaggcatt tggataaatc tagcaccctc atgatgccac cagcttgtcc cccaagtgtg
901 atggacatgg agctgggagc cgggatcacc aacactttct cttttcttcc acagcatcga
961 tttcttccct gtgaaaacaa aagcaaggcc gtggagcagg tgaagaatgc ctttagtaag
1021 gtgagcttgg atggtggcag agagggtctg cagagcacag cccatgccca ctccccaacc
1081 ccaaagcgtg gaaggtggtg aggactcagt aggccccatc cttcattgga aggagtgtgg
1141 gaacctgaca gatggtatga cctgctcagc cagtgaggag ctgccgcctt gattgtattt
1201 gttttctgtt aagtgtcttt gggggtttct aaatgactgc tcgctgcctt tgcaggcttg
1261 cgggttaggc tggccggcca gcctgtgaac acnnnnnnnn nnnnnnnnnn nnnnnnnnnn
1321 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
1381 nnnnnnnnnn nngctttcaa agtgcttcct ctaatgtctt ttcatcacac tctgcataat
1441 catcatgtga atacgtgacc tttaaaattg ttgaaaaggc atcattttga agacagcgct
1501 ttgcaaaatg aatgctccct ttgctaggca gtagccgtac ttcaggcctg gaggagatga
1561 aggtcaatgc actgcctttc ccaaggcagc tgggcctatc ctctggttca cttcccagcg
1621 tgagggagaa taagcagcct ctgcactcaa ggtcatgccc atccatgagc atgggaaagg
1681 ggagcctatt tcgtccccag aagggattta actgaatgtt tcttatctct ctgcacagct
1741 ccaagagaaa ggcgtctaca aagccatgag tgagtttgac atcttcatca actacataga
1801 agcctacatg acaatgaaga tacaaaactg agacatcagg gtggcgactc tatagactct
1861 aggacataaa ttggaggtct ccaaaatcag atccagggtt ctgggatacc tgacccagcc
1921 ccttg
//
2. Python script:
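# Extract the accession number and nucleotide sequence of the gene
flag = 0  # set to 1 once the ORIGIN line has been reached
# Read the GenBank file and write the FASTA file
with open('IL10_Genebank.gb', 'r') as input_file, open('IL10.fasta', 'w') as output_file:
    for line in input_file:
        if line[0:9] == 'ACCESSION':
            AC = line.split()[1].strip()
            output_file.write('>' + AC + '\n')
        elif line[0:6] == 'ORIGIN':
            flag = 1
        elif line[0:2] == '//':
            flag = 0  # end of record; '//' is a terminator, not sequence
        elif flag == 1:
            fields = line.split()  # split the line on whitespace into a list
            if fields != []:
                seq = ''.join(fields[1:])  # drop the position number and join the blocks into one string
                output_file.write(seq.upper() + '\n')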
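The script above assumes the file holds a single record. A download from NCBI can contain several records, each terminated by a `//` line, so a minimal sketch of the same idea wrapped in reusable functions might look like the following. The function names `genbank_to_fasta` and `format_fasta` and the `width` parameter are hypothetical, introduced here only for illustration; the parsing rules (ACCESSION line, ORIGIN line, leading position number) are the same ones the script relies on.

```python
def genbank_to_fasta(lines):
    """Yield (accession, sequence) for each record in an iterable of GenBank lines.

    Hypothetical helper: assumes every record has an ACCESSION line,
    an ORIGIN line, and ends with a '//' line.
    """
    accession = None
    chunks = []
    in_origin = False
    for line in lines:
        if line.startswith('ACCESSION'):
            accession = line.split()[1]
        elif line.startswith('ORIGIN'):
            in_origin = True
        elif line.startswith('//'):
            # End of record: emit it and reset the state for the next one
            if accession is not None:
                yield accession, ''.join(chunks).upper()
            accession, chunks, in_origin = None, [], False
        elif in_origin:
            fields = line.split()
            # fields[0] is the position number; the rest are blocks of bases
            chunks.extend(fields[1:])


def format_fasta(records, width=60):
    """Return FASTA text for (accession, sequence) pairs, wrapped at `width` bases."""
    out = []
    for accession, seq in records:
        out.append('>' + accession)
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
    return '\n'.join(out) + '\n' if out else ''


# Possible usage with the file names from this post:
# with open('IL10_Genebank.gb') as src, open('IL10.fasta', 'w') as dst:
#     dst.write(format_fasta(genbank_to_fasta(src)))
```

Collecting the blocks per record and wrapping them afterwards keeps the output line width independent of how the ORIGIN section happens to be laid out.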
3. Output
>DQ977084
CATGAGCTGTTCTCCCCAGGAAATCAACTTTTTTTAATTGAGAAGCTAAAAAATTATTCT
AAGAGAGGTAGCCCATCCTAAAAATAGCTGTGCAGAAGTTCATGTTCAACCAATCCTTTT
TGCTTACGATGCAAAATTTGAAAACTAAGTTTATTAGAGAGGTTAGAGAAGGAGGAGCTC
TAAGCAGAAAAAATCCTGTGCCGGGAAACCTGTGATTGTGGCTTTTTATGAATGAAGAGG
CCTCCCTGAGCTTACAATATAAAAGGGGGACAGAGAGGTGAAGGTCTACACATCAGGGGC
TTGCTCTTGCAAAACCAAACCACAAGACAGACTTGCAAAAGAAGGCATGCACAGCTCAGC
ACTGCTCTGTTGCCTAGTCCTCCTGACTGGGGTGAGGGCCAGCCCAGGCCAGGGCACCCA
GTCTGAGAACAGCTGCACCCGCTTCCCAGGCAACCTGCCTCACATGCTTCGAGACCTCCG
AGATGCCTTCAGCAGAGTGAAGACTTTCTTTGTGAGTATGATTCCCTCCTGTGCTTTCTC
TCTTCCTGGGACTGCCTGAACTAGGCATTTTCCTGGAGCTATAAGAAGAACCCTCCTCCT
GTGCCTCCACTTCCATCCCCAACACCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNTCGGAGTGGGTCCTGGAGAAATACATTTTATCTCCCAGGGCCGTGGTTCTTCT
CTGACCTTTGGATAGTTAGTAAGGGTGAAGCAGGGCTCAGTTCTCTCTGGGAGCTGTGAG
GCGAGGCATTTGGATAAATCTAGCACCCTCATGATGCCACCAGCTTGTCCCCCAAGTGTG
ATGGACATGGAGCTGGGAGCCGGGATCACCAACACTTTCTCTTTTCTTCCACAGCATCGA
TTTCTTCCCTGTGAAAACAAAAGCAAGGCCGTGGAGCAGGTGAAGAATGCCTTTAGTAAG
GTGAGCTTGGATGGTGGCAGAGAGGGTCTGCAGAGCACAGCCCATGCCCACTCCCCAACC
CCAAAGCGTGGAAGGTGGTGAGGACTCAGTAGGCCCCATCCTTCATTGGAAGGAGTGTGG
GAACCTGACAGATGGTATGACCTGCTCAGCCAGTGAGGAGCTGCCGCCTTGATTGTATTT
GTTTTCTGTTAAGTGTCTTTGGGGGTTTCTAAATGACTGCTCGCTGCCTTTGCAGGCTTG
CGGGTTAGGCTGGCCGGCCAGCCTGTGAACACNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNGCTTTCAAAGTGCTTCCTCTAATGTCTTTTCATCACACTCTGCATAAT
CATCATGTGAATACGTGACCTTTAAAATTGTTGAAAAGGCATCATTTTGAAGACAGCGCT
TTGCAAAATGAATGCTCCCTTTGCTAGGCAGTAGCCGTACTTCAGGCCTGGAGGAGATGA
AGGTCAATGCACTGCCTTTCCCAAGGCAGCTGGGCCTATCCTCTGGTTCACTTCCCAGCG
TGAGGGAGAATAAGCAGCCTCTGCACTCAAGGTCATGCCCATCCATGAGCATGGGAAAGG
GGAGCCTATTTCGTCCCCAGAAGGGATTTAACTGAATGTTTCTTATCTCTCTGCACAGCT
CCAAGAGAAAGGCGTCTACAAAGCCATGAGTGAGTTTGACATCTTCATCAACTACATAGA
AGCCTACATGACAATGAAGATACAAAACTGAGACATCAGGGTGGCGACTCTATAGACTCT
AGGACATAAATTGGAGGTCTCCAAAATCAGATCCAGGGTTCTGGGATACCTGACCCAGCC
CCTTG
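As a sanity check on the output above, the number of bases written to the FASTA file can be compared with the length stated on the LOCUS line (1925 bp for this record), and the number of N characters with the annotated gaps (628..727 and 1293..1392, which would suggest 200 N's). The sketch below is one way to do this; the helper names `declared_length` and `fasta_summary` are hypothetical, and it assumes a single-record FASTA file.

```python
def declared_length(locus_line):
    """Return the sequence length (in bp) stated on a GenBank LOCUS line."""
    fields = locus_line.split()
    # The length is the token immediately before 'bp'
    return int(fields[fields.index('bp') - 1])


def fasta_summary(fasta_lines):
    """Return (total_bases, n_count) for a single-record FASTA given as lines."""
    seq = ''.join(line.strip() for line in fasta_lines
                  if not line.startswith('>'))
    return len(seq), seq.count('N')


# Possible usage with the file names from this post:
# with open('IL10_Genebank.gb') as gb:
#     expected = declared_length(next(l for l in gb if l.startswith('LOCUS')))
# with open('IL10.fasta') as fa:
#     total, n_count = fasta_summary(fa)
# print(expected == total, n_count)
```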