生信习题 | 一

2020-08-26  本文已影响0人  kkkkkkang

以下是分析和学习过程中遇到的值得记录一下的生信习题库

题目1:下载最新版的KEGG注释文本文件,编写脚本整理成kegg的pathway的ID与基因ID的对应格式。

#数据下载参考技能树教程http://www.bio-info-trainee.com/2550.html
我的解题思路:因为C开头的是KEGG pathway ID, D开头的是基因ID,C开头的第二列拿出来当作变量值赋给a,碰到D开头的第二列就把这俩一块输出,\t分隔,直到遇见下一个C开头的KEGG pathway ID,循环直到文件末
awk '{if ($0 ~ /^C/) a=$2; else if ($0 ~ /^D/) print a"\t"$2}' hsa00001.keg | less
00010   3101
00010   3098
00010   3099
00010   80201
00010   2645
00010   2821
00010   5213
00010   5214
00010   5211
00010   2203
00010   8789
00010   230
00010   226
00010   229
00010   7167
00010   2597
00010   26330

题目2: 解析go-basic.obo为tab键分隔的三列表

原始go-basic.obo是这样

format-version: 1.2
data-version: releases/2020-08-11
subsetdef: chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
subsetdef: goslim_aspergillus "Aspergillus GO slim"
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_chembl "ChEMBL protein targets summary"
subsetdef: goslim_drosophila "Drosophila GO slim"
subsetdef: goslim_flybase_ribbon "FlyBase Drosophila GO ribbon slim"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_metagenomics "Metagenomics GO slim"
subsetdef: goslim_mouse "Mouse GO slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_pombe "Fission yeast GO slim"
subsetdef: goslim_synapse "synapse GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
remark: cvs version: use data-version
remark: Includes Ontology(OntologyID(OntologyIRI(<http://purl.obolibrary.org/obo/go/never_in_taxon.owl>))) [Axioms: 18 Logical Axioms: 0]
ontology: go

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria
and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
name: reproduction
namespace: biological_process
alt_id: GO:0019952
alt_id: GO:0050876
def: "The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms." [GOC:go_curators, GOC:isa_complete, GOC:jl, ISBN:0198506732]
subset: goslim_agr
subset: goslim_chembl
subset: goslim_flybase_ribbon
subset: goslim_generic
subset: goslim_pir
subset: goslim_plant
synonym: "reproductive physiological process" EXACT []
xref: Wikipedia:Reproduction
is_a: GO:0008150 ! biological_process

[Term]
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome." [GOC:ai, GOC:vw
]
is_a: GO:0007005 ! mitochondrion organization

[Term]
id: GO:0000003

想要转换成这样

Go Decription Level
GO:0000001 mitochondrion inheritance biological_process
GO:0000002 mitochondrial genome maintenance biological_process
GO:0000003 reproduction biological_process
GO:0000005 obsolete ribosomal chaperone activity molecular_function
GO:0000006 high-affinity zinc transmembrane transporter activity molecular_function
GO:0000007 low-affinity zinc ion transmembrane transporter activity molecular_function
#!/usr/bin/env python
#-*-utf-8-*-
import argparse
parser = argparse.ArgumentParser(description = "\nThis python script is used to parse the go-basic.obo file into tab delimited table", add_help = False, usage = "\n python3 parse_go.py -i [go-basic.obo] -o [go.txt] \n python3 parse_go.py --input [go-basic.obo] --output [go.txt]")
required = parser.add_argument_group("Required options")
optional = parser.add_argument_group("Optional options")
required.add_argument("-i", "--input", metavar = "[go-basic.obo]", help = "input file format: go_obo", required = True)
required.add_argument("-o", "--output", metavar = "[stat.txt]", help = "output file format: tab delimited table", required = True)
optional.add_argument("-h", "--help", action = "help", help = "help.info")
args = parser.parse_args()
with open(args.input,"r") as fi:
    with open(args.output,"w") as fo:
        print("GO\tDescription\tLevel",file = fo)
        for line in fi:
            if line.startswith("id"):
                print(''.join(line.split(" ")[1]).strip(), file = fo, end = '\t')
            elif line.startswith("name:"):
                print(' '.join(line.split(" ")[1:]).strip(), file = fo, end = '\t')
            elif line.startswith("namespace"):
                print(line.split(" ")[1].strip(), file = fo)

统计人类外显子区域的长度

#!/usr/bin/env python3
#2020.8
#-*-utf-8-*-
import argparse
parser = argparse.ArgumentParser(description = '\nThis is a python3 script used to count the length of human genome extron', add_help = False, usage = '\npython3 extron_len.py -i [human_extron.txt]')
required = parser.add_argument_group('Required options')
optional = parser.add_argument_group('Optional options')
required.add_argument('-i', metavar = '[human_extron.txt]', help = 'input: extron can be downloaded at https://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/CCDS.current.txt', required = True)
optional.add_argument('-h','--help', action = 'help', help = 'help.info')
args = parser.parse_args()
extron = {}
with open(args.i,'r') as f:
    next(f) #跳过第一行
    for line in f:
        line = line.split('\t')
        if line[9] != '-':
            extron_cor = line[9].lstrip('[').rstrip(']').split(', ') 
        for i in range(len(extron_cor)):
            start = extron_cor[i].split('-')[0]
            extron[start] = extron_cor[i].split('-')[1]
length = 0
for i,j in extron.items():
    length += int(j.strip("'")) - int(i.strip("'")) + 1
print(length)
上一篇下一篇

猜你喜欢

热点阅读