ClustrProfiler---GO/KEGG富集分析--终极
R version 4.0.5
RStudio Version 1.4.1106
一、快速扩张基因富集分析
Gene Enrichment analyses for significantly expanded gene families
cd 14-enrichment
cat ../11-gene/function/interproscan/iprscan.tsv | grep -P "\t"Pfam"\t" > iprscan.tsv
cp ../11-gene/function/eggnog/eggnog.emapper.annotations ./
1. GO enrichment
cd GO-expansion
1.1 obtain the background file (the whole genome)
cat ../eggnog.emapper.annotations ../iprscan.tsv | grep "GO:" | awk '{print $1}' | sort | uniq > all_gene_with_go.list
1.2 get description from the TBtools GO name Parse function or extract from go-basic.obo (website of GO)
for gene in $(cat all_gene_with_go.list);
do cat ../eggnog.emapper.annotations | grep "$gene" | grep "GO:" | cut -f 7 | sed "s/,/\n/g" > temp.list
cat ../iprscan.tsv | grep "$gene" | grep "GO:" | cut -f 14 | sed "s/|/\n/g" >> temp.list
cat temp.list | sort | uniq | sed "s/$/\t$gene/g" >> term2gene.txt
rm temp.list
done
cat go-basic.obo | sed -n "/^id: GO:/,/^namespace:/p"| awk '{if(NR%3!=0)ORS=" ";else ORS="\n"}1' | sed "s/id: //g" | sed "s/ name: /\t/g" | sed "s/ namespace: /\t/g" > GO.descriptions
cat ~/diskB/database/GO/GO.descriptions | cut -f 1,2 > term2name.txt
1.3 get the list of candidate (target) gene set
for family in $(cat ../../13-cafe/expansion/expansion.list);
do cat ../../12-orthofinder/Results_*/Orthogroup_Sequences/$family.fa | grep "Pardosa_pseudoannulata" | sed "s/>Pardosa_pseudoannulata-//g" >> temp.list;
done
cat all_gene_with_go.list | grep -f temp.list | tr "\n" "," | sed "s/,$//g" > gene.txt
rm temp.list
1.4 R package 'clusterProfiler'
#安装bioconductor
install.packages("BiocManager")
#安装ClusterProfiler
BiocManager::install("clusterProfiler")
BiocManager::install("readr")
library(readr) #read_delim函数对应的包
library(clusterProfiler) #Y叔的富集分析包,咱也不知道Y叔是谁
gene <- read_delim("C:/Users/Administrator/Desktop/GO enrich/gene.txt", ",", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
term2gene <- read_delim("C:/Users/Administrator/Desktop/GO enrich/term2gene.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
term2name <- read_delim("C:/Users/Administrator/Desktop/GO enrich/term2name.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
x <- enricher(gene,TERM2GENE=term2gene,TERM2NAME=term2name,pvalueCutoff = 0.01, pAdjustMethod = "BH", qvalueCutoff = 0.05)
#该函数以geneList作为背景基因集,pvalue=0.01做为阈值,在GO 的CC分类上对基因集合进行富集分析
#p.adjust和qvalue都是对p值的校正,p.adjust 在分析时可以选用方法,比如图中选择的是BH, qvalue是FDR校正
barplot(x, showCategory=20)
#横轴为该GO term下的差异基因个数,纵轴为富集到的GO Terms的描述信息,
#showCategory指定展示的GO Terms的个数,默认展示显著富集的top10个,
#即p.adjust最小的10个。注意的颜色对应p.adjust值,从小到大,对应蓝色到红色。
dotplot(x, showCategory=20)
#横轴为GeneRatio, 代表该GO term下的差异基因个数占差异基因总数的比例,
#纵轴为富集到的GO Terms的描述信息, showCategory指定展示的GO Terms的个数,
#默认展示显著富集的top10个,即p.adjust最小的10个。
#图中点的颜色对应p.adjust的值,从小到大,对应蓝色到红色,大小对应该GO terms下的差异基因个数,个数越多,点越大。
write.table(as.data.frame(x), 'go.expantion.txt', sep = '\t', row.names = FALSE, quote = FALSE)
#右下窗口点击file,可以发现新保存的文档,这个文件夹目录在 '我的电脑/文档'
1.5 GO富集分析结果表格
图片.png对于各列内容:
ONTOLOGY
,GO
分类BP
(生物学过程)、CC
(细胞组分)或MF
(分子功能);
ID
和Description
,富集到的GO id
和描述;
GeneRatio
和BgRatio
,分别为富集到该GO
条目中的基因数目/给定基因的总数目,以及该条目中背景基因总数目/该物种所有已知的GO功能基因数目;
pvalue、p.adjust
和qvalue,p
值、校正后p
值和q
值信息;
geneID
和Count
,富集到该GO
条目中的基因名称(分析中使用的entrze id
,故这里也显示的entrze id
)和数目。
注:如期望显示其它类型的基因id
,如通俗的symbol id
等类型,除了更改为使用symbol id
的基因名称做分析外,还可以通过基因名称转换的方式对entrze id
和symbol id
作个匹配转换。
1.6 clusterProfiler 包里的一些默认作图方法,例如
barplot(kegg)
#富集柱形图
dotplot(kegg)
#富集气泡图
cnetplot(kegg)
#网络图展示富集功能和基因的包含关系
emapplot(kegg)
#网络图展示各富集功能之间共有基因关系
heatplot(kegg)
#热图展示富集功能和基因的包含关系
clusterProfiler出的图,我们可以随意改
#比如改颜色
p2 <-p=""+=""scalecolor_continuous(low=,purple,,=""high='green'>
#改点的形状
p2 + aes (shape-GeneRatio > 0.1)
结果也可以用ggplot2可视化,像这个includeAll的参数,在把clusterProfiler的结果转变成data.frame后进行处理。
clusterProfiler的输出不用转换as.data.frame,就可以直接传给ggplot2,别的富集软件做的,不是data.frame对象的,都不能传给ggplot2,但clusterProfiler可以!
clusterProfiler的dotplot所有参数,都可以传给ggplot2,像上面用到的showCategory,includeALL这些都可以,然后数据处理于无形中,你可以加图层自己画图了。
ggplot(x,aes(Cluster,Description),showCategory=8)+ geompoint(aes(color-gvalue,size-GeneRatio))
图片.png
图片.png
1.7 比较基因集合的生物学功能
利用compareCluster函数,可以自动计算每个基因集所富集的功能分类。具体方法为:
Compare <- compareCluster(geneCluster=geneset, fun=”enrichKEGG”)
2. KEGG enrichment
2.1 obtain the background file (the whole genome)
cat ../eggnog.emapper.annotations | tail -n +5 | grep ",map" | cut -f 1 > t1
cat ../iprscan.tsv | grep "KEGG:" | cut -f1 > t2
cat t1 t2 | sort -u > all_gene_with_ko.list
rm t1 t2
#可以在文本文件编辑成.sh来bash运行
for gene in $(cat all_gene_with_ko.list)
do
cat ../eggnog.emapper.annotations | grep "$gene" | cut -f 10 | tr "," "\n" | grep "^ko" | sed "s/^ko//g" > temp.list
cat ../iprscan.tsv | grep "$gene" | cut -f 15 | tr '|' '\n' | grep "KEGG:" | cut -d " " -f2 >> temp.list
cat temp.list | sort -u | sed "s/^/ko/g" | sed "s/$/\t$gene/g" >> term2gene.txt
rm temp.list
done
2.2 get the list of candidate (target) gene set
for family in $(cat ../../13-cafe/expansion/expansion.list);
do cat ../../12-orthofinder/Results_*/Orthogroup_Sequences/$family.fa | grep "Pardosa_pseudoannulata" | sed "s/
>Pardosa_pseudoannulata-//g" >> temp.list;
done
cat all_gene_with_ko.list | grep -f temp.list | tr "\n" "," | sed "s/,$//g" > gene.txt
rm temp.list
2.3 get KO orthology from
https://www.genome.jp/kegg-bin/download_htext?htext=ko00001&format=htext&filedir=
自动下载最新的ko00001.keg
cat ko00001.keg | grep '\[PATH:' | sed "s/^.\{11\}//g" | sed "s/]//g" | sed "s/ \[PATH:/\t/g" | cut -d "" -f 2 > c1
cat ko00001.keg | grep '\[PATH:' | sed "s/^.\{11\}//g" | sed "s/]//g" | sed "s/ \[PATH:/\t/g" | cut -d "" -f 1 > c2
paste c1 c2 > term2name.txt
rm c1 c2
genes <- read.delim('gene.txt', header = TRUE, stringsAsFactors = FALSE)
#每次打开R计算时,它会自动连接kegg官网获得最近的物种注释信息,因此数据库一定都是最新的
kegg <- enrichKEGG(
gene = genes, #基因列表文件中的基因名称
keyType = 'kegg', #kegg 富集
organism = 'oas', #例如,oas 代表绵羊,其它物种更改这行即可
pAdjustMethod = 'fdr', #指定 p 值校正方法
pvalueCutoff = 0.05, #指定 p 值阈值,不显著的值将不显示在结果中
qvalueCutoff = 0.2, #指定 q 值阈值,不显著的值将不显示在结果中
)
#输出结果
write.table(kegg, 'kegg.txt', sep = '\t', quote = FALSE, row.names = FALSE)
KEGG富集分析结果表格
图片.png
对于各列内容:
ID
和Description
,富集到的KEGG id
和描述;
GeneRatio
和BgRatio
,分别为富集到该KEGG
条目中的基因数目/给定基因的总数目,以及该条目中背景基因总数目/该物种所有已知的KEGG功能基因数目;
pvalue、p.adjus
t和qvalue
,p值、校正后p值和q值信息;
gene ID
和Count
,富集到该KEGG
条目中的基因名称(分析中使用的entrze id
,故这里也显示的entrze id
)和数目。
注:如期望显示其它类型的基因id,如通俗的symbol id
等类型,由于该分析中只能输入entrze id
,因此可以通过基因名称转换的方式对entrze id
和symbol id
作个匹配转换。
···
2.4 R package 'clusterProfiler'
same as above
二、物种特有基因的富集分析
GO enrichment for species-specific families
cd GO-species-specific
cp ../GO-expansion/term* ../GO-expansion/all* ./
cat ../../12-orthofinder/Results*/Orthogroups/Orthogroups.GeneCount.tsv | tail -n +2 | cut -f 1 > temp.list
for family in $(cat temp.list);
do a=$(grep "$family" ../../12-orthofinder/Results*/Orthogroups/Orthogroups.GeneCount.tsv | cut -f 4);
b=$(grep "$family" ../../12-orthofinder/Results*/Orthogroups/Orthogroups.GeneCount.tsv | cut -f 15 | col -b);
test "$a" = "$b" && echo $family >> species-specific.orthogroups.list;
done
#这个运行有点问题,不知道运行流程原理。参考我在Excel进行筛选的方法,
#然后下一步直接 在提取的OG0000xx.fa文件夹中
cat *.fa >>temp
#接着就进行下一步即可
'4' as the target species, "15" as the number of all species plus two
for family in $(cat species-specific.orthogroups.list);
do cat ../../12-orthofinder/Results*/Orthogroup_Sequences/$family.fa >> temp;
done
cat temp | seqkit seq -n | sed "s/Aphidoletes_aphidimyza-//g" | grep -f all_gene_with_go.list | tr "\n" "," | sed "s/,$//g" > gene.txt
rm temp*
KEGG enrichment for species-specific families
cd KEGG-species-specific
cp ../KEGG-expansion/term* ../KEGG-expansion/all* ./
for family in $(cat ../GO-species-specific/species-specific.orthogroups.list);
do cat ../../12-orthofinder/Results*/Orthogroup_Sequences/$family.fa >> temp;
done
cat temp | seqkit seq -n | sed "s/Aphidoletes_aphidimyza-//g" | grep -f all_gene_with_ko.list | tr "\n" "," | sed "s/,$//g" > gene.txt
https://www.jianshu.com/writer#/notebooks/50178300/notes/87640086/preview