【康华同学】:优秀生物信息学博客生信注释和富集

15.GO和KEGG分析

2022-02-10  本文已影响0人  夏大希

非模式生物--大麦的GO和KEGG分析

分析思路
1.首先查看Annotationhub(https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html中是否存在其相应的注释信息;
2.如果在Annotationhub中不存在的,则需要自己用AnnotationForge构建(https://bioconductor.org/packages/release/bioc/html/AnnotationForge.html

参考:
测序了,然后呢(四)有了OrgDb才能进行富集分析https://www.jianshu.com/p/9c9e97167377

一、按照Annotationhub进行操作

安装必要的包

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("AnnotationHub")
由于R的版本是3.6 ;对应的包的要求R的版本是4.0,没有半大下载,只能通过下载tar.gz的包进行手动安装
失败了重新尝试其他方法

二、利用AnnotationForge 进行自行构建

安装eggNOG mapper

在服务器上操作
conda install eggnog-mapper
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - eggnog-mapper

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.


[root@175 u2]# anaconda search -t conda eggnog-mapper
Using Anaconda API: https://api.anaconda.org
Packages:
     Name                      |  Version | Package Types   | Platforms       | Builds
     ------------------------- |   ------ | --------------- | --------------- | ----------
     bioconda/eggnog-mapper    |    2.0.1 | conda           | linux-64, noarch, osx-64 | py27_2, py27_1, py27_0, py_0, py_1, py_3
                                          : Fast genome-wide functional annotation through orthology assignment.
Found 1 packages

Run 'anaconda show <USER/PACKAGE>' to get installation details
[root@175 u2]# anaconda show  bioconda/eggnog-mapper
Using Anaconda API: https://api.anaconda.org
Name:    eggnog-mapper
Summary: Fast genome-wide functional annotation through orthology assignment.
Access:  public
Package Types:  conda
Versions:
   + 1.0.0
   + 1.0.1
   + 1.0.2
   + 1.0.3
   + 2.0.0
   + 2.0.1

To install this package with conda run:
     conda install --channel https://conda.anaconda.org/bioconda eggnog-mapper
[root@175 u2]# conda install --channel https://conda.anaconda.org/bioconda eggnog-mapper
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: /
Found conflicts! Looking for incompatible packages.                                                                                                         failed

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - eggnog-mapper -> python[version='2.7.*|<3|>=2.7,<2.8.0a0']

Your python: python=3

If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.

发现是因为python的版本不匹配,利用wget

mkdir software/eggnog-mapper
cd software/eggnog-mapper
wget https://github.com/jhcepas/eggnog-mapper/archive/1.0.3.tar.gz
tar xvzf 1.0.3.tar.gz
cd eggnog-mapper-1.0.3

设置环境


参考
http://www.chenlianfu.com/?p=2804
https://www.jianshu.com/p/7a3b714e6ccf
https://www.jianshu.com/p/e0abf65ca1c5
https://www.jianshu.com/p/5d12f4d8c052



##2020.8.7 继续进行GO分析
通过查看自己的大麦数据,发现一个基因对应了好多的GO号,没有必要按照


image.png
awk -F'\t'  '{print $1,$2,$6,$7,$12, $13}' diamond.emapper.annotations > barley.emapper.annotations.txt
###加载管道的包
install.packages('magrittr')
library(magrittr)
 library(stringr)
 library(dplyr)

setwd(“”)
egg_f <- "diamond.emapper.annotations"
egg <- read.csv(egg_f, sep = "\t")
egg[egg==""]<-NA #这个代码来自花花的指导(将空行变成NA,方便下面的去除)
gene_info <- egg %>%
        dplyr::select(GID = query_name, GENENAME = `eggNOG.annot`) %>% na.omit()
# 先构建一个空的数据框(弄好大体的架构,表示其中要有GID =》query_name,GO =》GO号, EVIDENCE =》默认IDA)
# 关于IEA:就是一个标准,除了这个标准以外还有许多。IEA就是表示我们的注释是自动注释,无需人工检查http://wiki.geneontology.org/index.php/Inferred_from_Electronic_Annotation_(IEA)
# 两种情况下需要用IEA:1. manually constructed mappings between external classification systems and GO terms; 2.automatic transfer of annotation to orthologous gene products.
gene2go <- data.frame(GID = character(),
                         GO = character(),
                         EVIDENCE = character())

# 然后向其中填充:注意到有的query_name对应多个GO,因此我们以GO号为标准,每一行只能有一个GO号,但query_name和Evidence可以重复
for (row in 1:nrow(gterms)) {
        gene_terms <- str_split(gterms[row,"GO_terms"], ",", simplify = FALSE)[[1]]  
        gene_id <- gterms[row, "query_name"][[1]]
        tmp <- data_frame(GID = rep(gene_id, length(gene_terms)),
                              GO = gene_terms,
                              EVIDENCE = rep("IEA", length(gene_terms)))
        gene2go <- rbind(gene2go, tmp)
    }  






options(stringsAsFactors = F)
if(F){
    # 需要下载 json文件(这是是经常更新的)
    # https://www.genome.jp/kegg-bin/get_htext?ko00001
    # 代码来自:http://www.genek.tv/course/225/task/4861/show
    library(jsonlite)
    library(purrr)
    library(RCurl)
    
    update_kegg <- function(json = "/u2/new2020barley/barley_annotion/gene_annotation_morex_v2/mascher_IPK-GATERSLEBEN.DE/barley_annotation_morexv2/gene_annotation/ko00001.json") {
        pathway2name <- tibble(Pathway = character(), Name = character())
        ko2pathway <- tibble(Ko = character(), Pathway = character())
        
        kegg <- fromJSON(json)
        
        for (a in seq_along(kegg[["children"]][["children"]])) {
            A <- kegg[["children"]][["name"]][[a]]
            
            for (b in seq_along(kegg[["children"]][["children"]][[a]][["children"]])) {
                B <- kegg[["children"]][["children"]][[a]][["name"]][[b]] 
                
                for (c in seq_along(kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]])) {
                    pathway_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["name"]][[c]]
                    
                    pathway_id <- str_match(pathway_info, "ko[0-9]{5}")[1]
                    pathway_name <- str_replace(pathway_info, " \\[PATH:ko[0-9]{5}\\]", "") %>% str_replace("[0-9]{5} ", "")
                    pathway2name <- rbind(pathway2name, tibble(Pathway = pathway_id, Name = pathway_name))
                    
                    kos_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]][[c]][["name"]]
                    
                    kos <- str_match(kos_info, "K[0-9]*")[,1]
                    
                    ko2pathway <- rbind(ko2pathway, tibble(Ko = kos, Pathway = rep(pathway_id, length(kos))))
                }
            }
        }
        
        save(pathway2name, ko2pathway, file = "/u2/new2020barley/barley_annotion/gene_annotation_morex_v2/mascher_IPK-GATERSLEBEN.DE/barley_annotation_morexv2/gene_annotation/kegg_info.RData")
    }
    
    update_kegg(json = "/u2/new2020barley/barley_annotion/gene_annotation_morex_v2/mascher_IPK-GATERSLEBEN.DE/barley_annotation_morexv2/gene_annotation/ko00001.json")
    
}

options(stringsAsFactors = F)
if(F){
    # 需要下载 json文件(这是是经常更新的)
    # https://www.genome.jp/kegg-bin/get_htext?ko00001
    # 代码来自:http://www.genek.tv/course/225/task/4861/show
    library(jsonlite)
    library(purrr)
    library(RCurl)
    library("rjson")
    
    update_kegg <- function(json = "ko00001.json") {
        pathway2name <- tibble(Pathway = character(), Name = character())
        ko2pathway <- tibble(Ko = character(), Pathway = character())
        
        kegg <- fromJSON(file="ko00001.json")
        
        for (a in seq_along(kegg[["children"]][["children"]])) {
            A <- kegg[["children"]][["name"]][[a]]
            
            for (b in seq_along(kegg[["children"]][["children"]][[a]][["children"]])) {
                B <- kegg[["children"]][["children"]][[a]][["name"]][[b]] 
                
                for (c in seq_along(kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]])) {
                    pathway_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["name"]][[c]]
                    
                    pathway_id <- str_match(pathway_info, "ko[0-9]{5}")[1]
                    pathway_name <- str_replace(pathway_info, " \\[PATH:ko[0-9]{5}\\]", "") %>% str_replace("[0-9]{5} ", "")
                    pathway2name <- rbind(pathway2name, tibble(Pathway = pathway_id, Name = pathway_name))
                    
                    kos_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]][[c]][["name"]]
                    
                    kos <- str_match(kos_info, "K[0-9]*")[,1]
                    
                    ko2pathway <- rbind(ko2pathway, tibble(Ko = kos, Pathway = rep(pathway_id, length(kos))))
                }
            }
        }
        
        save(pathway2name, ko2pathway, file = "kegg_info.RData")
    }
    
    update_kegg(json = "ko00001.json")
    
}

得不到kegg_info.RData 这个结果
下一步没有办法load加载
尝试另一个方法
数据的格式要求是一个基因对应一个GO这样的格式首先把自己的数据修改格式
这个里面的问题是(a in seq_along(kegg[["children"]][["children"]]) 中的a没有数值,里面提取不出来,我也看不懂这个是什么意思,给作者发邮件作者也没有回,有他女朋友的微信,尝试询问他女朋友,她说是他写的我应该去问他,她不知道;嗯。。。。,所以就尝试用牛逼的Y叔的包clusterProfiler,这个包可能我用的不熟练,导致我只能进行GO分析,只能得到GO的description,而KEGG的ko好对应的description没有对应的关系。所以进行查找另外的方法得到KEGG的ko号对应的description。

看了一个简书的公众号,也是用的这个方法,他得到了kegg_info.RData这个结果,关注了他的公众号后后台问他要了kegg_info.RData


三、获得KEGG注释

参考:https://www.jianshu.com/p/6f9e6ed3400d KEGG pathway 注释整理
通过eggnog-mapperinterproscan两个软件(或数据库),可以获得KEGG ORTHOLOGY(KO)的注释,即基因或者转录本对应的K number, 具体参见两个软件的wiki.

获得KO与pathway的关系

进入KEGG官网,然后点击KEGG BRITE进入该数据库,在这个数据库中可以下载KEGG数据库中手工创建的层次结构文件(BRITE hierarchy files)。在这里,需要下载包含pathway和KO对应关系的文件,点击KEGG Orthology (KO)下载,这里下载json版本。

import json
import re

with open("ko00001.json") as f:
    ko_map_data = json.load(f)

with open("KEGG_pathway_ko.txt", "w") as oh:
    line = "level1_pathway_id\tlevel1_pathway_name\tlevel2_pathway_id\tlevel2_pathway_name"
    line += "\tlevel3_pathway_id\tlevel3_pathway_name\tko\tko_name\tko_des\tec\n"
    oh.write(line)
    for level1 in ko_map_data["children"]:
        m = re.match(r"(\S+)\s+([\S\w\s]+)", level1["name"])
        level1_pathway_id = m.groups()[0].strip()
        level1_pathway_name = m.groups()[1].strip()
        for level2 in level1["children"]:
            m = re.match(r"(\S+)\s+([\S\w\s]+)", level2["name"])
            level2_pathway_id = m.groups()[0].strip()
            level2_pathway_name = m.groups()[1].strip()
            for level3 in level2["children"]:
                m = re.match(r"(\S+)\s+([^\[]*)", level3["name"])
                level3_pathway_id = m.groups()[0].strip()
                level3_pathway_name = m.groups()[1].strip()
                if "children" in level3:
                    for ko in level3["children"]:
                        m = re.match(r"(\S+)\s+(\S+);\s+([^\[]+)\s*(\[EC:\S+(?:\s+[^\[\]]+)*\])*", ko["name"])
                        if m is not None:
                            ko_id = m.groups()[0].strip()
                            ko_name = m.groups()[1].strip()
                            ko_des = m.groups()[2].strip()
                            ec = m.groups()[3]
                            if ec==None:
                                ec = "-"
                        line = level1_pathway_id + "\t" + level1_pathway_name + "\t" + level2_pathway_id + "\t" + level2_pathway_name
                        line += "\t" + level3_pathway_id + "\t" + level3_pathway_name + "\t" + ko_id + "\t" + ko_name + "\t" + ko_des + "\t" + ec + "\n"
                        oh.write(line)

这会生成KEGG_pathway_ko.txt文件,随后对行去重。

import pandas as pd

data = pd.read_csv("KEGG_pathway_ko.txt", sep="\t",dtype=str)

data = data.drop_duplicates()

data.to_csv("KEGG_pathway_ko_uniq.txt", index=False, sep="\t")

最后得到KEGG_pathway_ko_uniq.txt文件,这个文件包含了KO和KEGG pathway的对应关系信息,也包含了pathway的级别分类(KEGG pathway分为3级),如下所示:

level1_pathway_id   level1_pathway_name level2_pathway_id   level2_pathway_name level3_pathway_id   level3_pathway_name ko  ko_name ko_des  ec
9100    Metabolism  9101    Carbohydrate metabolism 10  Glycolysis / Gluconeogenesis    K00844  HK  hexokinase  [EC:2.7.1.1]
9100    Metabolism  9101    Carbohydrate metabolism 10  Glycolysis / Gluconeogenesis    K12407  GCK glucokinase [EC:2.7.1.2]
9100    Metabolism  9101    Carbohydrate metabolism 10  Glycolysis / Gluconeogenesis    K00845  glk glucokinase [EC:2.7.1.2]

合并结果
现在是表格文件,和容易将上面多种对应关系合并起来,进行后续的分析,例如可以对KEGG的注释结果按照KEGG中通路类型或者不同的level进行分类汇总,又或者对特定的基因集进行KEGG pathway的富集分析等。

四、

安装必要的包

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("clusterProfiler")

###用大麦参考基因组的蛋白自己做的GO和KEGG文件得到的注释信息
setwd("E:\\mywork\\大麦转录组\\GO_result\\")
gene2go <- read.csv("barley_gene2go.csv",header = T)

dim(gene2go)
head(gene2go)
setwd("E:\\mywork\\大麦转录组\\WGCNA_result\\2020.8_WGCNA_result\\WGCN_result\\服务器\\model_result\\VRS网络图\\VRS_GO_result\\")
gene_list <- read.table("VRS4_GO_KEGG_inputdata.txt",header=T)

freq_d2 <- gene2go[which(gene2go$GID %in% gene_list$x),]
dim(freq_d2)
head(freq_d2)
gene_list2 <- freq_d2[,2]
length(gene_list2)
term2gene<-gene2go[,c(3,2)]
head(term2gene)
df<-enricher(gene=gene_list2,
             pvalueCutoff = 1,
             pAdjustMethod = "BH",
             TERM2GENE = term2gene)
### pvalueCutoff = 1 正常的是0.05



df<-as.data.frame(df)
df
df1<-go2term(df$ID)
dim(df1)
head(df1)
colnames(df1) <- c("ID","Term")
df3 <- merge(df1,df,by = 'ID')
dim(df3)


df2<-go2ont(df$ID)
dim(df2)
head(df2)
df3$Ont<-df2$Ontology

head(df3)
write.csv(df3,"VRS4_GO_result.csv")
df4<-df3[,c(2,11,6)]###提取了Term  Ont Pvalue这三列,也可以提取count这一列

head(df4)
library(ggplot2)
ggplot(df4,aes(x=Term,y=-log10(pvalue)))+
  geom_col(aes(fill=Ont))+
  coord_flip()+labs(x="")+
  theme_bw()














###KEGG分析
#rm(list=ls())
setwd("E:\\mywork\\大麦转录组\\GO_result\\")
gene2ko <- read.csv("barley_gene2ko.csv",header = T)
head(gene2ko)
dim(gene2ko)
#choose.files()
ko_information <-read.csv("E:\\mywork\\大麦转录组\\KEGG_result\\KEGG.information.csv",header=T)
head(ko_information)
ko_information2 <- ko_information[,c(2,7)]
head(ko_information2)
ko_information3 <- ko_information[,c(7,8)]

freq_d2 <- gene2ko[which(gene2ko$GID %in% gene_list$x),]
dim(freq_d2)
head(freq_d2)
gene_list2 <- freq_d2[,2]
length(gene_list2)
term2gene<-gene2ko[,c(3,2)]
head(term2gene)
freq_d3<-left_join(term2gene, ko_information2, by=c('Ko'='ID'))
head(freq_d3)
df<-enricher(gene=gene_list2,
             pvalueCutoff = 1,
             TERM2GENE =freq_d3[c('lev2_lev3_id', 'GID')],
             TERM2NAME=ko_information3)
##pvalueCutoff = 0.05 是正常的
head(df)
result <- as.data.frame(df@result)
setwd("E:\\mywork\\大麦转录组\\WGCNA_result\\2020.8_WGCNA_result\\WGCN_result\\服务器\\model_result\\VRS网络图\\VRS_GO_result\\")
write.csv(result,"VRS4_KEGG_result.csv")
head(result)
head(df)
barplot(df)

https://www.jianshu.com/p/49ceeae7fd95
https://www.jianshu.com/p/3a7149c1aee9
https://www.jianshu.com/p/4c2f2f578a43
https://www.jianshu.com/p/04cca6648439
https://www.bioinfo-scrounger.com/archives/512/
https://www.jianshu.com/p/9c9e97167377
https://www.jianshu.com/p/63c46de5f18b
https://www.jianshu.com/p/e646c0fa6443
https://www.jianshu.com/p/b4997c1165df
https://www.jianshu.com/p/bcdbf80701e2
https://zhuanlan.zhihu.com/p/43651419
https://blog.csdn.net/u012110870/article/details/102804318

https://www.jianshu.com/p/47b5ea646932
https://blog.csdn.net/weixin_43569478/article/details/83743975
https://blog.csdn.net/weixin_43569478/article/details/83744242
https://www.jianshu.com/p/22faee28c6a6

https://blog.csdn.net/weixin_41151172/article/details/80382567
https://zhuanlan.zhihu.com/p/78093534 根据GO/KEGG富集结果选择自己的研究方向
https://www.bioinfo-scrounger.com/archives/639/ 可视化kegg通路-pathview包 (重点看)

上一篇下一篇

猜你喜欢

热点阅读