爬取KEGG数据库-制作gmt文件
2020-04-05 本文已影响0人
Gaaac
分享这个python脚本的原因是:在做GSEA分析时,发现KEGG官网上有些基因集没有出现在c2.cp.kegg.v7.0.symbols.gmt文件中。(c2.cp.kegg.v7.0.symbols.gmt中只包含了186个基因集,而在KEGG官网上可以抓取337个基因集)
KEGG官网将基因集分为7个大类 1. Metabolism 2. Genetic Information Processing 3. Environmental Information Processing 4. Cellular Processes 5. Organismal Systems 6. Human Diseases 7. Drug Development 如果只对其中一种感兴趣,可以对脚本稍微改动,只提取感兴趣的部分。
import bs4
import requests
import json
import urllib.request
species = "hsa" #mouse:mmu, human:hsa
r = urllib.request.urlopen('https://www.genome.jp/kegg-bin/show_organism?menu_type=pathway_maps&org=hsa')
items = bs4.BeautifulSoup(r,features="lxml").find_all("b")
group=[x.findNext("ul") for x in items]
total_dict={}
for m in group:
for i in m.findAll("ul"):
group_key = i.previous_element.strip().strip()
total_dict[group_key] = []
for j in i findAll("a"):
pathway_id = j.previous_element.strip()
pathway_name = j.text
total_dict[group_key].append((pathway_id,pathway_name))
def get_genes(species,keggid):
url = "http://togows.dbcls.jp/entry/pathway/{}{}/genes.json".format(species,keggid)
r = requests.get(url)
data = json.loads(r.text)
try:
gene_names = [i.split(";")[0] for i in data[0].values()]
except:
print("error when get genes from {}{}".format(species,keggid))
return []
return gene_names
with open("Total_kegg.gmt","w") as fout:
for k, v in Total_dict.items():
for pid,pname in v:
pgenes = get_genes(species,pid)
fout.write("\t".join([pname,pid] + pgenes) + "\n")