ClustrProfiler---GO/KEGG富集分析--终极

2021-05-07 本文已影响0人 MLD_TRNA

R version 4.0.5
RStudio Version 1.4.1106

一、快速扩张基因富集分析

Gene Enrichment analyses for significantly expanded gene families

cd 14-enrichment
cat ../11-gene/function/interproscan/iprscan.tsv | grep -P "\t"Pfam"\t" > iprscan.tsv
cp ../11-gene/function/eggnog/eggnog.emapper.annotations ./

1. GO enrichment

cd GO-expansion

1.1 obtain the background file (the whole genome)

cat ../eggnog.emapper.annotations ../iprscan.tsv | grep "GO:" | awk '{print $1}' | sort | uniq > all_gene_with_go.list

1.2 get description from the TBtools GO name Parse function or extract from go-basic.obo (website of GO)

for gene in $(cat all_gene_with_go.list)；
  do cat ../eggnog.emapper.annotations | grep "$gene" | grep "GO:" | cut -f 7 | sed "s/,/\n/g" > temp.list
     cat ../iprscan.tsv | grep "$gene" | grep "GO:" | cut -f 14 | sed "s/|/\n/g" >> temp.list
     cat temp.list | sort | uniq | sed "s/$/\t$gene/g" >> term2gene.txt
     rm temp.list
  done

cat go-basic.obo | sed -n "/^id: GO:/,/^namespace:/p"| awk '{if(NR%3!=0)ORS=" ";else ORS="\n"}1' | sed "s/id: //g" | sed "s/ name: /\t/g" | sed "s/ namespace: /\t/g" > GO.descriptions
cat ~/diskB/database/GO/GO.descriptions | cut -f 1,2 > term2name.txt

1.3 get the list of candidate (target) gene set

for family in $(cat ../../13-cafe/expansion/expansion.list); 
  do cat ../../12-orthofinder/Results_*/Orthogroup_Sequences/$family.fa | grep "Pardosa_pseudoannulata" | sed "s/>Pardosa_pseudoannulata-//g"  >> temp.list;
 done

cat all_gene_with_go.list | grep -f temp.list | tr "\n" "," | sed "s/,$//g" > gene.txt
rm temp.list

1.4 R package 'clusterProfiler'

#安装bioconductor
install.packages("BiocManager")
#安装ClusterProfiler
BiocManager::install("clusterProfiler")
BiocManager::install("readr")

library(readr) #read_delim函数对应的包
library(clusterProfiler) #Y叔的富集分析包，咱也不知道Y叔是谁

gene <- read_delim("C:/Users/Administrator/Desktop/GO enrich/gene.txt", ",", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
term2gene <- read_delim("C:/Users/Administrator/Desktop/GO enrich/term2gene.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
term2name <- read_delim("C:/Users/Administrator/Desktop/GO enrich/term2name.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
x <- enricher(gene,TERM2GENE=term2gene,TERM2NAME=term2name,pvalueCutoff = 0.01, pAdjustMethod = "BH", qvalueCutoff = 0.05)
#该函数以geneList作为背景基因集，pvalue=0.01做为阈值，在GO 的CC分类上对基因集合进行富集分析
#p.adjust和qvalue都是对p值的校正，p.adjust 在分析时可以选用方法，比如图中选择的是BH, qvalue是FDR校正
barplot(x, showCategory=20)
#横轴为该GO term下的差异基因个数，纵轴为富集到的GO Terms的描述信息，
#showCategory指定展示的GO Terms的个数，默认展示显著富集的top10个，
#即p.adjust最小的10个。注意的颜色对应p.adjust值，从小到大，对应蓝色到红色。
dotplot(x, showCategory=20)
#横轴为GeneRatio, 代表该GO term下的差异基因个数占差异基因总数的比例，
#纵轴为富集到的GO Terms的描述信息， showCategory指定展示的GO Terms的个数，
#默认展示显著富集的top10个，即p.adjust最小的10个。
#图中点的颜色对应p.adjust的值，从小到大，对应蓝色到红色，大小对应该GO terms下的差异基因个数，个数越多，点越大。
write.table(as.data.frame(x), 'go.expantion.txt', sep = '\t', row.names = FALSE, quote = FALSE)
#右下窗口点击file，可以发现新保存的文档，这个文件夹目录在 '我的电脑/文档'

1.5 GO富集分析结果表格

图片.png

对于各列内容：

ONTOLOGY，GO分类BP（生物学过程）、CC（细胞组分）或MF（分子功能）；

ID和Description，富集到的GO id和描述；

GeneRatio和BgRatio，分别为富集到该GO条目中的基因数目/给定基因的总数目，以及该条目中背景基因总数目/该物种所有已知的GO功能基因数目；

pvalue、p.adjust和qvalue，p值、校正后p值和q值信息；

geneID和Count，富集到该GO条目中的基因名称（分析中使用的entrze id，故这里也显示的entrze id）和数目。

注：如期望显示其它类型的基因id，如通俗的symbol id等类型，除了更改为使用symbol id的基因名称做分析外，还可以通过基因名称转换的方式对entrze id和symbol id作个匹配转换。

1.6 clusterProfiler 包里的一些默认作图方法，例如

barplot(kegg) #富集柱形图
dotplot(kegg) #富集气泡图
cnetplot(kegg) #网络图展示富集功能和基因的包含关系
emapplot(kegg)#网络图展示各富集功能之间共有基因关系
heatplot(kegg)#热图展示富集功能和基因的包含关系

图片.png

clusterProfiler出的图，我们可以随意改

#比如改颜色
p2 <-p=""+=""scalecolor_continuous（low=，purple，，=""high='green'>
#改点的形状
p2 + aes (shape-GeneRatio > 0.1)

结果也可以用ggplot2可视化，像这个includeAll的参数，在把clusterProfiler的结果转变成data.frame后进行处理。
clusterProfiler的输出不用转换as.data.frame，就可以直接传给ggplot2，别的富集软件做的，不是data.frame对象的，都不能传给ggplot2，但clusterProfiler可以！
clusterProfiler的dotplot所有参数，都可以传给ggplot2，像上面用到的showCategory，includeALL这些都可以，然后数据处理于无形中，你可以加图层自己画图了。

ggplot（x，aes（Cluster，Description），showCategory=8）+ geompoint（aes（color-gvalue，size-GeneRatio））

图片.png

1.7 比较基因集合的生物学功能

利用compareCluster函数，可以自动计算每个基因集所富集的功能分类。具体方法为：

Compare <- compareCluster(geneCluster=geneset, fun=”enrichKEGG”)

2. KEGG enrichment

2.1 obtain the background file (the whole genome)

cat ../eggnog.emapper.annotations | tail -n +5 | grep ",map" | cut -f 1 > t1
cat ../iprscan.tsv | grep "KEGG:" | cut -f1 > t2
cat t1 t2 | sort -u > all_gene_with_ko.list
rm t1 t2

#可以在文本文件编辑成.sh来bash运行
for gene in $(cat all_gene_with_ko.list) 
  do
    cat ../eggnog.emapper.annotations | grep "$gene" | cut -f 10 | tr "," "\n" | grep "^ko" | sed "s/^ko//g" > temp.list
    cat ../iprscan.tsv | grep "$gene" | cut -f 15 | tr '|' '\n' | grep "KEGG:" | cut -d " " -f2 >> temp.list
    cat temp.list | sort -u | sed "s/^/ko/g" | sed "s/$/\t$gene/g" >> term2gene.txt
    rm temp.list
  done

2.2 get the list of candidate (target) gene set

for family in $(cat ../../13-cafe/expansion/expansion.list);
 do cat ../../12-orthofinder/Results_*/Orthogroup_Sequences/$family.fa | grep "Pardosa_pseudoannulata" | sed "s/
>Pardosa_pseudoannulata-//g"  >> temp.list;
done
cat all_gene_with_ko.list | grep -f temp.list | tr "\n" "," | sed "s/,$//g" > gene.txt
rm temp.list

2.3 get KO orthology from

https://www.genome.jp/kegg-bin/download_htext?htext=ko00001&format=htext&filedir=
自动下载最新的ko00001.keg

cat ko00001.keg | grep '\[PATH:' | sed "s/^.\{11\}//g" | sed "s/]//g" | sed "s/ \[PATH:/\t/g" | cut -d "" -f 2 > c1
cat ko00001.keg | grep '\[PATH:' | sed "s/^.\{11\}//g" | sed "s/]//g" | sed "s/ \[PATH:/\t/g" | cut -d "" -f 1 > c2
paste c1 c2 > term2name.txt
rm c1 c2

genes <- read.delim('gene.txt', header = TRUE, stringsAsFactors = FALSE)
#每次打开R计算时，它会自动连接kegg官网获得最近的物种注释信息，因此数据库一定都是最新的
kegg <- enrichKEGG(
gene = genes, #基因列表文件中的基因名称
keyType = 'kegg', #kegg 富集
organism = 'oas', #例如，oas 代表绵羊，其它物种更改这行即可
pAdjustMethod = 'fdr', #指定 p 值校正方法
pvalueCutoff = 0.05, #指定 p 值阈值，不显著的值将不显示在结果中
qvalueCutoff = 0.2, #指定 q 值阈值，不显著的值将不显示在结果中
)
#输出结果
write.table(kegg, 'kegg.txt', sep = '\t', quote = FALSE, row.names = FALSE)

KEGG富集分析结果表格

图片.png

对于各列内容：

ID和Description，富集到的KEGG id和描述；

GeneRatio和BgRatio，分别为富集到该KEGG条目中的基因数目/给定基因的总数目，以及该条目中背景基因总数目/该物种所有已知的KEGG功能基因数目；
pvalue、p.adjust和qvalue，p值、校正后p值和q值信息；

gene ID和Count，富集到该KEGG条目中的基因名称（分析中使用的entrze id，故这里也显示的entrze id）和数目。

注：如期望显示其它类型的基因id，如通俗的symbol id等类型，由于该分析中只能输入entrze id，因此可以通过基因名称转换的方式对entrze id和symbol id作个匹配转换。
···

2.4 R package 'clusterProfiler'

same as above

二、物种特有基因的富集分析

GO enrichment for species-specific families

cd GO-species-specific
cp ../GO-expansion/term* ../GO-expansion/all* ./
cat ../../12-orthofinder/Results*/Orthogroups/Orthogroups.GeneCount.tsv | tail -n +2 | cut -f 1 > temp.list

for family in $(cat temp.list); 
  do a=$(grep "$family" ../../12-orthofinder/Results*/Orthogroups/Orthogroups.GeneCount.tsv | cut -f 4); 
  b=$(grep "$family" ../../12-orthofinder/Results*/Orthogroups/Orthogroups.GeneCount.tsv | cut -f 15 | col -b); 
  test "$a" = "$b" && echo $family >> species-specific.orthogroups.list; 
done
#这个运行有点问题，不知道运行流程原理。参考我在Excel进行筛选的方法，
#然后下一步直接 在提取的OG0000xx.fa文件夹中
cat *.fa >>temp
#接着就进行下一步即可

'4' as the target species, "15" as the number of all species plus two

for family in $(cat species-specific.orthogroups.list);
  do cat ../../12-orthofinder/Results*/Orthogroup_Sequences/$family.fa >> temp; 
done

cat temp | seqkit seq -n | sed "s/Aphidoletes_aphidimyza-//g" | grep -f all_gene_with_go.list | tr "\n" "," | sed "s/,$//g" > gene.txt
rm temp*

KEGG enrichment for species-specific families

cd KEGG-species-specific
cp ../KEGG-expansion/term* ../KEGG-expansion/all* ./

for family in $(cat ../GO-species-specific/species-specific.orthogroups.list); 
do cat ../../12-orthofinder/Results*/Orthogroup_Sequences/$family.fa >> temp; 
done

cat temp | seqkit seq -n | sed "s/Aphidoletes_aphidimyza-//g" | grep -f all_gene_with_ko.list | tr "\n" "," | sed "s/,$//g" > gene.txt

https://www.jianshu.com/writer#/notebooks/50178300/notes/87640086/preview

ClustrProfiler---GO/KEGG富集分析--终极

一、快速扩张基因富集分析

Gene Enrichment analyses for significantly expanded gene families

1. GO enrichment

1.1 obtain the background file (the whole genome)

1.2 get description from the TBtools GO name Parse function or extract from go-basic.obo (website of GO)

1.3 get the list of candidate (target) gene set

1.4 R package 'clusterProfiler'

1.5 GO富集分析结果表格

1.6 clusterProfiler 包里的一些默认作图方法，例如

clusterProfiler出的图，我们可以随意改

1.7 比较基因集合的生物学功能

2. KEGG enrichment

2.1 obtain the background file (the whole genome)

2.2 get the list of candidate (target) gene set

2.3 get KO orthology from

2.4 R package 'clusterProfiler'

same as above

二、物种特有基因的富集分析

GO enrichment for species-specific families

KEGG enrichment for species-specific families

猜你喜欢

热点阅读