Parsing GenBank files with Biopython to get taxonomic classification

2020-08-09  小明的数据分析笔记本

NCBI's mitochondrial genome database

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/

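The first few species I looked at all seem to be animals. GenBank-format files are provided here as well, so it should be possible to check in bulk whether this dataset contains any plant mitochondria.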

So how do we get a species' taxonomic classification from a GenBank file?
Biopython provides a parser for GenBank files.

Example GenBank file

LOCUS       NC_035240                114 bp    DNA     linear   PLN 14-JUL-2017
DEFINITION  Punica granatum chloroplast, complete genome.
ACCESSION   NC_035240 REGION: 70545..70658
VERSION     NC_035240.1
DBLINK      BioProject: PRJNA394497
KEYWORDS    RefSeq.
SOURCE      chloroplast Punica granatum (pomegranate)
  ORGANISM  Punica granatum
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
            Pentapetalae; rosids; malvids; Myrtales; Lythraceae; Punica.
REFERENCE   1  (bases 1 to 114)
  AUTHORS   Rabah,S.O., Lee,C., Hajrah,N.H., Makki,R.M., Alharby,H.F.,
            Alhebshi,A.M., Sabir,J.S.M., Sabir,M.J., Jansen,R.K. and
            Ruhlman,T.A.
  TITLE     Plastome sequencing of 10 non-model crop species reveals multiple
            inversions, gene transfers to the nucleus and a recent, large
            mitochondrial insertion in the tree species cashew (Anacardium,
            Anacardiaceae)
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 114)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (14-JUL-2017) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   3  (bases 1 to 114)
  AUTHORS   Rabah,S.O., Lee,C., Hajrah,N.H., Makki,R.M., Alharby,H.F.,
            Alhebshi,A.M., Sabir,J.S.M., Sabir,M.J., Jansen,R.K. and
            Ruhlman,T.A.
  TITLE     Direct Submission
  JOURNAL   Submitted (17-FEB-2017) Biological Sciences, King Abdulaziz
            University, P.O.Box 80141, Jeddah 21589, Saudi Arabia
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence is identical to KY635883.
            
            ##Assembly-Data-START##
            Assembly Method       :: Velvet v. 1.2.08
            Sequencing Technology :: Illumina
            ##Assembly-Data-END##
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..114
                     /organism="Punica granatum"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:22663"
     gene            1..114
                     /gene="petG"
                     /locus_tag="CGW82_pgp045"
                     /db_xref="GeneID:33351918"
     CDS             1..114
                     /gene="petG"
                     /locus_tag="CGW82_pgp045"
                     /codon_start=1
                     /transl_table=11
                     /product="cytochrome b6/f complex subunit V"
                     /protein_id="YP_009390828.1"
                     /db_xref="GeneID:33351918"
                     /translation="MIEVFLFGIVLGLIPITLAGLFVTAYLQYRRGDQLDF"
ORIGIN      
        1 atgattgaag tttttctatt tggaattgtc ttaggtctaa ttcctattac tttagctgga
       61 ttatttgtaa ctgcatattt acaatacaga cgtggtgatc agttggactt ttga
//

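Most of the content before the FEATURES Location/Qualifiers line is stored as a dictionary in the record's annotations attribute (the DEFINITION line, for example, goes to rec.description instead). To look at this part, a simple loop is enough: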

from Bio import SeqIO

for rec in SeqIO.parse('sequence.gb', 'gb'):  # 'gb' is an alias for 'genbank'
    print(rec.annotations)  # header fields parsed into a dict

The output is

{'molecule_type': 'DNA', 'topology': 'linear', 'data_file_division': 'PLN', 'date': '14-JUL-2017', 'accessions': ['NC_035240', 'REGION:', '70545..70658'], 'sequence_version': 1, 'keywords': ['RefSeq'], 'source': 'chloroplast Punica granatum (pomegranate)', 'organism': 'Punica granatum', 'taxonomy': ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'eudicotyledons', 'Gunneridae', 'Pentapetalae', 'rosids', 'malvids', 'Myrtales', 'Lythraceae', 'Punica'], 'references': [Reference(title='Plastome sequencing of 10 non-model crop species reveals multiple inversions, gene transfers to the nucleus and a recent, large mitochondrial insertion in the tree species cashew (Anacardium, Anacardiaceae)', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)], 'comment': 'PROVISIONAL REFSEQ: This record has not yet been subject to final\nNCBI review. The reference sequence is identical to KY635883.\nCOMPLETENESS: full length.', 'structured_comment': OrderedDict([('Assembly-Data', OrderedDict([('Assembly Method', 'Velvet v. 1.2.08'), ('Sequencing Technology', 'Illumina')]))])}

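The classification is stored under the key taxonomy, and its value is a list. To tell whether a species is a plant, it should be enough to check whether Viridiplantae is in that list.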
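
The taxonomy check described above could be sketched roughly like this. It is a minimal sketch, not a tested pipeline: the names is_plant and plant_records are my own for illustration, and I am assuming that the files downloaded from the FTP directory are GenBank flat files, possibly gzip-compressed.

```python
import gzip


def is_plant(annotations):
    """Return True if the record's lineage contains Viridiplantae."""
    # 'taxonomy' may be missing from some records, so default to an empty list.
    return "Viridiplantae" in annotations.get("taxonomy", [])


def plant_records(path):
    """Yield (record id, organism) for every plant record in a GenBank file."""
    # Imported here so that is_plant() stays usable without Biopython installed.
    from Bio import SeqIO

    # Assumption: files ending in .gz are gzip-compressed GenBank flat files.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for rec in SeqIO.parse(handle, "genbank"):
            if is_plant(rec.annotations):
                yield rec.id, rec.annotations.get("organism", "unknown")


# Annotations trimmed from the pomegranate record shown above.
example = {
    "organism": "Punica granatum",
    "taxonomy": ["Eukaryota", "Viridiplantae", "Streptophyta"],
}
print(is_plant(example))
```

For a downloaded file, something like `for rec_id, organism in plant_records('sequence.gb'): print(rec_id, organism)` would list the plant records. Note that Viridiplantae also covers green algae; to keep only land plants, checking for Embryophyta in the same list should work.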
