15.GO和KEGG分析
非模式生物--大麦的GO和KEGG分析
分析思路
1.首先查看Annotationhub(https://bioconductor.org/packages/release/bioc/html/AnnotationHub.html中是否存在其相应的注释信息;
2.如果在Annotationhub中不存在的,则需要自己用AnnotationForge构建(https://bioconductor.org/packages/release/bioc/html/AnnotationForge.html
参考:
测序了,然后呢(四)有了OrgDb才能进行富集分析https://www.jianshu.com/p/9c9e97167377
一、按照Annotationhub进行操作
安装必要的包
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("AnnotationHub")
由于R的版本是3.6 ;对应的包的要求R的版本是4.0,没有半大下载,只能通过下载tar.gz的包进行手动安装
失败了重新尝试其他方法
二、利用AnnotationForge 进行自行构建
安装eggNOG mapper
在服务器上操作
conda install eggnog-mapper
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- eggnog-mapper
Current channels:
- https://repo.anaconda.com/pkgs/main/linux-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/linux-64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
[root@175 u2]# anaconda search -t conda eggnog-mapper
Using Anaconda API: https://api.anaconda.org
Packages:
Name | Version | Package Types | Platforms | Builds
------------------------- | ------ | --------------- | --------------- | ----------
bioconda/eggnog-mapper | 2.0.1 | conda | linux-64, noarch, osx-64 | py27_2, py27_1, py27_0, py_0, py_1, py_3
: Fast genome-wide functional annotation through orthology assignment.
Found 1 packages
Run 'anaconda show <USER/PACKAGE>' to get installation details
[root@175 u2]# anaconda show bioconda/eggnog-mapper
Using Anaconda API: https://api.anaconda.org
Name: eggnog-mapper
Summary: Fast genome-wide functional annotation through orthology assignment.
Access: public
Package Types: conda
Versions:
+ 1.0.0
+ 1.0.1
+ 1.0.2
+ 1.0.3
+ 2.0.0
+ 2.0.1
To install this package with conda run:
conda install --channel https://conda.anaconda.org/bioconda eggnog-mapper
[root@175 u2]# conda install --channel https://conda.anaconda.org/bioconda eggnog-mapper
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: /
Found conflicts! Looking for incompatible packages. failed
UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:
Specifications:
- eggnog-mapper -> python[version='2.7.*|<3|>=2.7,<2.8.0a0']
Your python: python=3
If python is on the left-most side of the chain, that's the version you've asked for.
When python appears to the right, that indicates that the thing on the left is somehow
not available for the python version you are constrained to. Note that conda will not
change your python version to a different minor version unless you explicitly specify
that.
发现是因为python的版本不匹配,利用wget
mkdir software/eggnog-mapper
cd software/eggnog-mapper
wget https://github.com/jhcepas/eggnog-mapper/archive/1.0.3.tar.gz
tar xvzf 1.0.3.tar.gz
cd eggnog-mapper-1.0.3
设置环境
参考
http://www.chenlianfu.com/?p=2804
https://www.jianshu.com/p/7a3b714e6ccf
https://www.jianshu.com/p/e0abf65ca1c5
https://www.jianshu.com/p/5d12f4d8c052
##2020.8.7 继续进行GO分析
通过查看自己的大麦数据,发现一个基因对应了好多的GO号,没有必要按照
image.png
awk -F'\t' '{print $1,$2,$6,$7,$12, $13}' diamond.emapper.annotations > barley.emapper.annotations.txt
###加载管道的包
install.packages('magrittr')
library(magrittr)
library(stringr)
library(dplyr)
setwd(“”)
egg_f <- "diamond.emapper.annotations"
egg <- read.csv(egg_f, sep = "\t")
egg[egg==""]<-NA #这个代码来自花花的指导(将空行变成NA,方便下面的去除)
gene_info <- egg %>%
dplyr::select(GID = query_name, GENENAME = `eggNOG.annot`) %>% na.omit()
# 先构建一个空的数据框(弄好大体的架构,表示其中要有GID =》query_name,GO =》GO号, EVIDENCE =》默认IDA)
# 关于IEA:就是一个标准,除了这个标准以外还有许多。IEA就是表示我们的注释是自动注释,无需人工检查http://wiki.geneontology.org/index.php/Inferred_from_Electronic_Annotation_(IEA)
# 两种情况下需要用IEA:1. manually constructed mappings between external classification systems and GO terms; 2.automatic transfer of annotation to orthologous gene products.
gene2go <- data.frame(GID = character(),
GO = character(),
EVIDENCE = character())
# 然后向其中填充:注意到有的query_name对应多个GO,因此我们以GO号为标准,每一行只能有一个GO号,但query_name和Evidence可以重复
for (row in 1:nrow(gterms)) {
gene_terms <- str_split(gterms[row,"GO_terms"], ",", simplify = FALSE)[[1]]
gene_id <- gterms[row, "query_name"][[1]]
tmp <- data_frame(GID = rep(gene_id, length(gene_terms)),
GO = gene_terms,
EVIDENCE = rep("IEA", length(gene_terms)))
gene2go <- rbind(gene2go, tmp)
}
options(stringsAsFactors = F)
if(F){
# 需要下载 json文件(这是是经常更新的)
# https://www.genome.jp/kegg-bin/get_htext?ko00001
# 代码来自:http://www.genek.tv/course/225/task/4861/show
library(jsonlite)
library(purrr)
library(RCurl)
update_kegg <- function(json = "/u2/new2020barley/barley_annotion/gene_annotation_morex_v2/mascher_IPK-GATERSLEBEN.DE/barley_annotation_morexv2/gene_annotation/ko00001.json") {
pathway2name <- tibble(Pathway = character(), Name = character())
ko2pathway <- tibble(Ko = character(), Pathway = character())
kegg <- fromJSON(json)
for (a in seq_along(kegg[["children"]][["children"]])) {
A <- kegg[["children"]][["name"]][[a]]
for (b in seq_along(kegg[["children"]][["children"]][[a]][["children"]])) {
B <- kegg[["children"]][["children"]][[a]][["name"]][[b]]
for (c in seq_along(kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]])) {
pathway_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["name"]][[c]]
pathway_id <- str_match(pathway_info, "ko[0-9]{5}")[1]
pathway_name <- str_replace(pathway_info, " \\[PATH:ko[0-9]{5}\\]", "") %>% str_replace("[0-9]{5} ", "")
pathway2name <- rbind(pathway2name, tibble(Pathway = pathway_id, Name = pathway_name))
kos_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]][[c]][["name"]]
kos <- str_match(kos_info, "K[0-9]*")[,1]
ko2pathway <- rbind(ko2pathway, tibble(Ko = kos, Pathway = rep(pathway_id, length(kos))))
}
}
}
save(pathway2name, ko2pathway, file = "/u2/new2020barley/barley_annotion/gene_annotation_morex_v2/mascher_IPK-GATERSLEBEN.DE/barley_annotation_morexv2/gene_annotation/kegg_info.RData")
}
update_kegg(json = "/u2/new2020barley/barley_annotion/gene_annotation_morex_v2/mascher_IPK-GATERSLEBEN.DE/barley_annotation_morexv2/gene_annotation/ko00001.json")
}
options(stringsAsFactors = F)
if(F){
# 需要下载 json文件(这是是经常更新的)
# https://www.genome.jp/kegg-bin/get_htext?ko00001
# 代码来自:http://www.genek.tv/course/225/task/4861/show
library(jsonlite)
library(purrr)
library(RCurl)
library("rjson")
update_kegg <- function(json = "ko00001.json") {
pathway2name <- tibble(Pathway = character(), Name = character())
ko2pathway <- tibble(Ko = character(), Pathway = character())
kegg <- fromJSON(file="ko00001.json")
for (a in seq_along(kegg[["children"]][["children"]])) {
A <- kegg[["children"]][["name"]][[a]]
for (b in seq_along(kegg[["children"]][["children"]][[a]][["children"]])) {
B <- kegg[["children"]][["children"]][[a]][["name"]][[b]]
for (c in seq_along(kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]])) {
pathway_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["name"]][[c]]
pathway_id <- str_match(pathway_info, "ko[0-9]{5}")[1]
pathway_name <- str_replace(pathway_info, " \\[PATH:ko[0-9]{5}\\]", "") %>% str_replace("[0-9]{5} ", "")
pathway2name <- rbind(pathway2name, tibble(Pathway = pathway_id, Name = pathway_name))
kos_info <- kegg[["children"]][["children"]][[a]][["children"]][[b]][["children"]][[c]][["name"]]
kos <- str_match(kos_info, "K[0-9]*")[,1]
ko2pathway <- rbind(ko2pathway, tibble(Ko = kos, Pathway = rep(pathway_id, length(kos))))
}
}
}
save(pathway2name, ko2pathway, file = "kegg_info.RData")
}
update_kegg(json = "ko00001.json")
}
得不到kegg_info.RData 这个结果
下一步没有办法load加载
尝试另一个方法
数据的格式要求是一个基因对应一个GO这样的格式首先把自己的数据修改格式
这个里面的问题是(a in seq_along(kegg[["children"]][["children"]]) 中的a没有数值,里面提取不出来,我也看不懂这个是什么意思,给作者发邮件作者也没有回,有他女朋友的微信,尝试询问他女朋友,她说是他写的我应该去问他,她不知道;嗯。。。。,所以就尝试用牛逼的Y叔的包clusterProfiler,这个包可能我用的不熟练,导致我只能进行GO分析,只能得到GO的description,而KEGG的ko好对应的description没有对应的关系。所以进行查找另外的方法得到KEGG的ko号对应的description。
看了一个简书的公众号,也是用的这个方法,他得到了kegg_info.RData这个结果,关注了他的公众号后后台问他要了kegg_info.RData
三、获得KEGG注释
参考:https://www.jianshu.com/p/6f9e6ed3400d KEGG pathway 注释整理
通过eggnog-mapper和interproscan两个软件(或数据库),可以获得KEGG ORTHOLOGY(KO)的注释,即基因或者转录本对应的K number, 具体参见两个软件的wiki.
获得KO与pathway的关系
进入KEGG官网,然后点击KEGG BRITE进入该数据库,在这个数据库中可以下载KEGG数据库中手工创建的层次结构文件(BRITE hierarchy files)。在这里,需要下载包含pathway和KO对应关系的文件,点击KEGG Orthology (KO)下载,这里下载json版本。
import json
import re
with open("ko00001.json") as f:
ko_map_data = json.load(f)
with open("KEGG_pathway_ko.txt", "w") as oh:
line = "level1_pathway_id\tlevel1_pathway_name\tlevel2_pathway_id\tlevel2_pathway_name"
line += "\tlevel3_pathway_id\tlevel3_pathway_name\tko\tko_name\tko_des\tec\n"
oh.write(line)
for level1 in ko_map_data["children"]:
m = re.match(r"(\S+)\s+([\S\w\s]+)", level1["name"])
level1_pathway_id = m.groups()[0].strip()
level1_pathway_name = m.groups()[1].strip()
for level2 in level1["children"]:
m = re.match(r"(\S+)\s+([\S\w\s]+)", level2["name"])
level2_pathway_id = m.groups()[0].strip()
level2_pathway_name = m.groups()[1].strip()
for level3 in level2["children"]:
m = re.match(r"(\S+)\s+([^\[]*)", level3["name"])
level3_pathway_id = m.groups()[0].strip()
level3_pathway_name = m.groups()[1].strip()
if "children" in level3:
for ko in level3["children"]:
m = re.match(r"(\S+)\s+(\S+);\s+([^\[]+)\s*(\[EC:\S+(?:\s+[^\[\]]+)*\])*", ko["name"])
if m is not None:
ko_id = m.groups()[0].strip()
ko_name = m.groups()[1].strip()
ko_des = m.groups()[2].strip()
ec = m.groups()[3]
if ec==None:
ec = "-"
line = level1_pathway_id + "\t" + level1_pathway_name + "\t" + level2_pathway_id + "\t" + level2_pathway_name
line += "\t" + level3_pathway_id + "\t" + level3_pathway_name + "\t" + ko_id + "\t" + ko_name + "\t" + ko_des + "\t" + ec + "\n"
oh.write(line)
这会生成KEGG_pathway_ko.txt文件,随后对行去重。
import pandas as pd
data = pd.read_csv("KEGG_pathway_ko.txt", sep="\t",dtype=str)
data = data.drop_duplicates()
data.to_csv("KEGG_pathway_ko_uniq.txt", index=False, sep="\t")
最后得到KEGG_pathway_ko_uniq.txt文件,这个文件包含了KO和KEGG pathway的对应关系信息,也包含了pathway的级别分类(KEGG pathway分为3级),如下所示:
level1_pathway_id level1_pathway_name level2_pathway_id level2_pathway_name level3_pathway_id level3_pathway_name ko ko_name ko_des ec
9100 Metabolism 9101 Carbohydrate metabolism 10 Glycolysis / Gluconeogenesis K00844 HK hexokinase [EC:2.7.1.1]
9100 Metabolism 9101 Carbohydrate metabolism 10 Glycolysis / Gluconeogenesis K12407 GCK glucokinase [EC:2.7.1.2]
9100 Metabolism 9101 Carbohydrate metabolism 10 Glycolysis / Gluconeogenesis K00845 glk glucokinase [EC:2.7.1.2]
合并结果
现在是表格文件,和容易将上面多种对应关系合并起来,进行后续的分析,例如可以对KEGG的注释结果按照KEGG中通路类型或者不同的level进行分类汇总,又或者对特定的基因集进行KEGG pathway的富集分析等。
四、
安装必要的包
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("clusterProfiler")
###用大麦参考基因组的蛋白自己做的GO和KEGG文件得到的注释信息
setwd("E:\\mywork\\大麦转录组\\GO_result\\")
gene2go <- read.csv("barley_gene2go.csv",header = T)
dim(gene2go)
head(gene2go)
setwd("E:\\mywork\\大麦转录组\\WGCNA_result\\2020.8_WGCNA_result\\WGCN_result\\服务器\\model_result\\VRS网络图\\VRS_GO_result\\")
gene_list <- read.table("VRS4_GO_KEGG_inputdata.txt",header=T)
freq_d2 <- gene2go[which(gene2go$GID %in% gene_list$x),]
dim(freq_d2)
head(freq_d2)
gene_list2 <- freq_d2[,2]
length(gene_list2)
term2gene<-gene2go[,c(3,2)]
head(term2gene)
df<-enricher(gene=gene_list2,
pvalueCutoff = 1,
pAdjustMethod = "BH",
TERM2GENE = term2gene)
### pvalueCutoff = 1 正常的是0.05
df<-as.data.frame(df)
df
df1<-go2term(df$ID)
dim(df1)
head(df1)
colnames(df1) <- c("ID","Term")
df3 <- merge(df1,df,by = 'ID')
dim(df3)
df2<-go2ont(df$ID)
dim(df2)
head(df2)
df3$Ont<-df2$Ontology
head(df3)
write.csv(df3,"VRS4_GO_result.csv")
df4<-df3[,c(2,11,6)]###提取了Term Ont Pvalue这三列,也可以提取count这一列
head(df4)
library(ggplot2)
ggplot(df4,aes(x=Term,y=-log10(pvalue)))+
geom_col(aes(fill=Ont))+
coord_flip()+labs(x="")+
theme_bw()
###KEGG分析
#rm(list=ls())
setwd("E:\\mywork\\大麦转录组\\GO_result\\")
gene2ko <- read.csv("barley_gene2ko.csv",header = T)
head(gene2ko)
dim(gene2ko)
#choose.files()
ko_information <-read.csv("E:\\mywork\\大麦转录组\\KEGG_result\\KEGG.information.csv",header=T)
head(ko_information)
ko_information2 <- ko_information[,c(2,7)]
head(ko_information2)
ko_information3 <- ko_information[,c(7,8)]
freq_d2 <- gene2ko[which(gene2ko$GID %in% gene_list$x),]
dim(freq_d2)
head(freq_d2)
gene_list2 <- freq_d2[,2]
length(gene_list2)
term2gene<-gene2ko[,c(3,2)]
head(term2gene)
freq_d3<-left_join(term2gene, ko_information2, by=c('Ko'='ID'))
head(freq_d3)
df<-enricher(gene=gene_list2,
pvalueCutoff = 1,
TERM2GENE =freq_d3[c('lev2_lev3_id', 'GID')],
TERM2NAME=ko_information3)
##pvalueCutoff = 0.05 是正常的
head(df)
result <- as.data.frame(df@result)
setwd("E:\\mywork\\大麦转录组\\WGCNA_result\\2020.8_WGCNA_result\\WGCN_result\\服务器\\model_result\\VRS网络图\\VRS_GO_result\\")
write.csv(result,"VRS4_KEGG_result.csv")
head(result)
head(df)
barplot(df)
https://www.jianshu.com/p/49ceeae7fd95
https://www.jianshu.com/p/3a7149c1aee9
https://www.jianshu.com/p/4c2f2f578a43
https://www.jianshu.com/p/04cca6648439
https://www.bioinfo-scrounger.com/archives/512/
https://www.jianshu.com/p/9c9e97167377
https://www.jianshu.com/p/63c46de5f18b
https://www.jianshu.com/p/e646c0fa6443
https://www.jianshu.com/p/b4997c1165df
https://www.jianshu.com/p/bcdbf80701e2
https://zhuanlan.zhihu.com/p/43651419
https://blog.csdn.net/u012110870/article/details/102804318
https://www.jianshu.com/p/47b5ea646932
https://blog.csdn.net/weixin_43569478/article/details/83743975
https://blog.csdn.net/weixin_43569478/article/details/83744242
https://www.jianshu.com/p/22faee28c6a6
https://blog.csdn.net/weixin_41151172/article/details/80382567
https://zhuanlan.zhihu.com/p/78093534 根据GO/KEGG富集结果选择自己的研究方向
https://www.bioinfo-scrounger.com/archives/639/ 可视化kegg通路-pathview包 (重点看)