爬取KEGG数据库-制作gmt文件

2020-04-05 本文已影响0人 Gaaac

分享这个python脚本的原因是：在做GSEA分析时，发现KEGG官网上有些基因集没有出现在c2.cp.kegg.v7.0.symbols.gmt文件中。(c2.cp.kegg.v7.0.symbols.gmt中只包含了186个基因集，而在KEGG官网上可以抓取337个基因集)
KEGG官网将基因集分为7个大类 1. Metabolism 2. Genetic Information Processing 3. Environmental Information Processing 4. Cellular Processes 5. Organismal Systems 6. Human Diseases 7. Drug Development 如果只对其中一种感兴趣，可以对脚本稍微改动，只提取感兴趣的部分。

import bs4
import requests
import json
import urllib.request

species = "hsa" #mouse:mmu, human:hsa
r = urllib.request.urlopen('https://www.genome.jp/kegg-bin/show_organism?menu_type=pathway_maps&org=hsa')
items = bs4.BeautifulSoup(r,features="lxml").find_all("b")
group=[x.findNext("ul") for x in items]
total_dict={}
for m in group:
  for i in m.findAll("ul"):
    group_key = i.previous_element.strip().strip()
    total_dict[group_key] = []
    for j in i findAll("a"):
      pathway_id = j.previous_element.strip()
      pathway_name = j.text
      total_dict[group_key].append((pathway_id,pathway_name))

def get_genes(species,keggid):
    url = "http://togows.dbcls.jp/entry/pathway/{}{}/genes.json".format(species,keggid)
    r = requests.get(url)
    data = json.loads(r.text)
    try:
      gene_names = [i.split(";")[0] for i in data[0].values()]
    except:
      print("error when get genes from {}{}".format(species,keggid))
      return []
    return gene_names

with open("Total_kegg.gmt","w") as fout:
  for k, v in Total_dict.items():
    for pid,pname in v:
      pgenes = get_genes(species,pid)
      fout.write("\t".join([pname,pid] + pgenes) + "\n")

爬取KEGG数据库-制作gmt文件

猜你喜欢

热点阅读