比较基因组注释和富集rna_seq

ClustrProfiler---GO/KEGG富集分析--终极

2021-05-07  本文已影响0人  MLD_TRNA

R version 4.0.5
RStudio Version 1.4.1106


一、快速扩张基因富集分析

Gene Enrichment analyses for significantly expanded gene families

cd 14-enrichment
cat ../11-gene/function/interproscan/iprscan.tsv | grep -P "\t"Pfam"\t" > iprscan.tsv
cp ../11-gene/function/eggnog/eggnog.emapper.annotations ./

1. GO enrichment

cd GO-expansion

1.1 obtain the background file (the whole genome)

cat ../eggnog.emapper.annotations ../iprscan.tsv | grep "GO:" | awk '{print $1}' | sort | uniq > all_gene_with_go.list

1.2 get description from the TBtools GO name Parse function or extract from go-basic.obo (website of GO)

for gene in $(cat all_gene_with_go.list);
  do cat ../eggnog.emapper.annotations | grep "$gene" | grep "GO:" | cut -f 7 | sed "s/,/\n/g" > temp.list
     cat ../iprscan.tsv | grep "$gene" | grep "GO:" | cut -f 14 | sed "s/|/\n/g" >> temp.list
     cat temp.list | sort | uniq | sed "s/$/\t$gene/g" >> term2gene.txt
     rm temp.list
  done
cat go-basic.obo | sed -n "/^id: GO:/,/^namespace:/p"| awk '{if(NR%3!=0)ORS=" ";else ORS="\n"}1' | sed "s/id: //g" | sed "s/ name: /\t/g" | sed "s/ namespace: /\t/g" > GO.descriptions
cat ~/diskB/database/GO/GO.descriptions | cut -f 1,2 > term2name.txt

1.3 get the list of candidate (target) gene set

for family in $(cat ../../13-cafe/expansion/expansion.list); 
  do cat ../../12-orthofinder/Results_*/Orthogroup_Sequences/$family.fa | grep "Pardosa_pseudoannulata" | sed "s/>Pardosa_pseudoannulata-//g"  >> temp.list;
 done
cat all_gene_with_go.list | grep -f temp.list | tr "\n" "," | sed "s/,$//g" > gene.txt
rm temp.list

1.4 R package 'clusterProfiler'

#安装bioconductor
install.packages("BiocManager")
#安装ClusterProfiler
BiocManager::install("clusterProfiler")
BiocManager::install("readr")
library(readr) #read_delim函数对应的包
library(clusterProfiler) #Y叔的富集分析包,咱也不知道Y叔是谁

gene <- read_delim("C:/Users/Administrator/Desktop/GO enrich/gene.txt", ",", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
term2gene <- read_delim("C:/Users/Administrator/Desktop/GO enrich/term2gene.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
term2name <- read_delim("C:/Users/Administrator/Desktop/GO enrich/term2name.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
x <- enricher(gene,TERM2GENE=term2gene,TERM2NAME=term2name,pvalueCutoff = 0.01, pAdjustMethod = "BH", qvalueCutoff = 0.05)
#该函数以geneList作为背景基因集,pvalue=0.01做为阈值,在GO 的CC分类上对基因集合进行富集分析
#p.adjust和qvalue都是对p值的校正,p.adjust 在分析时可以选用方法,比如图中选择的是BH, qvalue是FDR校正
barplot(x, showCategory=20)
#横轴为该GO term下的差异基因个数,纵轴为富集到的GO Terms的描述信息,
#showCategory指定展示的GO Terms的个数,默认展示显著富集的top10个,
#即p.adjust最小的10个。注意的颜色对应p.adjust值,从小到大,对应蓝色到红色。
dotplot(x, showCategory=20)
#横轴为GeneRatio, 代表该GO term下的差异基因个数占差异基因总数的比例,
#纵轴为富集到的GO Terms的描述信息, showCategory指定展示的GO Terms的个数,
#默认展示显著富集的top10个,即p.adjust最小的10个。
#图中点的颜色对应p.adjust的值,从小到大,对应蓝色到红色,大小对应该GO terms下的差异基因个数,个数越多,点越大。
write.table(as.data.frame(x), 'go.expantion.txt', sep = '\t', row.names = FALSE, quote = FALSE)
#右下窗口点击file,可以发现新保存的文档,这个文件夹目录在 '我的电脑/文档'

1.5 GO富集分析结果表格

图片.png

对于各列内容:

ONTOLOGYGO分类BP(生物学过程)、CC(细胞组分)或MF(分子功能);

IDDescription,富集到的GO id和描述;

GeneRatioBgRatio,分别为富集到该GO条目中的基因数目/给定基因的总数目,以及该条目中背景基因总数目/该物种所有已知的GO功能基因数目;

pvalue、p.adjustqvalue,p值、校正后p值和q值信息;

geneIDCount,富集到该GO条目中的基因名称(分析中使用的entrze id,故这里也显示的entrze id)和数目。

注:如期望显示其它类型的基因id,如通俗的symbol id等类型,除了更改为使用symbol id的基因名称做分析外,还可以通过基因名称转换的方式对entrze idsymbol id作个匹配转换。

1.6 clusterProfiler 包里的一些默认作图方法,例如

barplot(kegg) #富集柱形图
dotplot(kegg) #富集气泡图
cnetplot(kegg) #网络图展示富集功能和基因的包含关系
emapplot(kegg)#网络图展示各富集功能之间共有基因关系
heatplot(kegg)#热图展示富集功能和基因的包含关系

图片.png

clusterProfiler出的图,我们可以随意改

#比如改颜色
p2 <-p=""+=""scalecolor_continuous(low=,purple,,=""high='green'>
#改点的形状
p2 + aes (shape-GeneRatio > 0.1)

结果也可以用ggplot2可视化,像这个includeAll的参数,在把clusterProfiler的结果转变成data.frame后进行处理。
clusterProfiler的输出不用转换as.data.frame,就可以直接传给ggplot2,别的富集软件做的,不是data.frame对象的,都不能传给ggplot2,但clusterProfiler可以!
clusterProfiler的dotplot所有参数,都可以传给ggplot2,像上面用到的showCategory,includeALL这些都可以,然后数据处理于无形中,你可以加图层自己画图了。

ggplot(x,aes(Cluster,Description),showCategory=8)+ geompoint(aes(color-gvalue,size-GeneRatio))
图片.png
图片.png

1.7 比较基因集合的生物学功能

利用compareCluster函数,可以自动计算每个基因集所富集的功能分类。具体方法为:

Compare <- compareCluster(geneCluster=geneset, fun=”enrichKEGG”)

2. KEGG enrichment

2.1 obtain the background file (the whole genome)

cat ../eggnog.emapper.annotations | tail -n +5 | grep ",map" | cut -f 1 > t1
cat ../iprscan.tsv | grep "KEGG:" | cut -f1 > t2
cat t1 t2 | sort -u > all_gene_with_ko.list
rm t1 t2
#可以在文本文件编辑成.sh来bash运行
for gene in $(cat all_gene_with_ko.list) 
  do
    cat ../eggnog.emapper.annotations | grep "$gene" | cut -f 10 | tr "," "\n" | grep "^ko" | sed "s/^ko//g" > temp.list
    cat ../iprscan.tsv | grep "$gene" | cut -f 15 | tr '|' '\n' | grep "KEGG:" | cut -d " " -f2 >> temp.list
    cat temp.list | sort -u | sed "s/^/ko/g" | sed "s/$/\t$gene/g" >> term2gene.txt
    rm temp.list
  done

2.2 get the list of candidate (target) gene set

for family in $(cat ../../13-cafe/expansion/expansion.list);
 do cat ../../12-orthofinder/Results_*/Orthogroup_Sequences/$family.fa | grep "Pardosa_pseudoannulata" | sed "s/
>Pardosa_pseudoannulata-//g"  >> temp.list;
done
cat all_gene_with_ko.list | grep -f temp.list | tr "\n" "," | sed "s/,$//g" > gene.txt
rm temp.list

2.3 get KO orthology from

https://www.genome.jp/kegg-bin/download_htext?htext=ko00001&format=htext&filedir=
自动下载最新的ko00001.keg

cat ko00001.keg | grep '\[PATH:' | sed "s/^.\{11\}//g" | sed "s/]//g" | sed "s/ \[PATH:/\t/g" | cut -d "" -f 2 > c1
cat ko00001.keg | grep '\[PATH:' | sed "s/^.\{11\}//g" | sed "s/]//g" | sed "s/ \[PATH:/\t/g" | cut -d "" -f 1 > c2
paste c1 c2 > term2name.txt
rm c1 c2
genes <- read.delim('gene.txt', header = TRUE, stringsAsFactors = FALSE)
#每次打开R计算时,它会自动连接kegg官网获得最近的物种注释信息,因此数据库一定都是最新的
kegg <- enrichKEGG(
gene = genes, #基因列表文件中的基因名称
keyType = 'kegg', #kegg 富集
organism = 'oas', #例如,oas 代表绵羊,其它物种更改这行即可
pAdjustMethod = 'fdr', #指定 p 值校正方法
pvalueCutoff = 0.05, #指定 p 值阈值,不显著的值将不显示在结果中
qvalueCutoff = 0.2, #指定 q 值阈值,不显著的值将不显示在结果中
)
#输出结果
write.table(kegg, 'kegg.txt', sep = '\t', quote = FALSE, row.names = FALSE)

KEGG富集分析结果表格


图片.png

对于各列内容:

IDDescription,富集到的KEGG id和描述;

GeneRatioBgRatio,分别为富集到该KEGG条目中的基因数目/给定基因的总数目,以及该条目中背景基因总数目/该物种所有已知的KEGG功能基因数目;
pvalue、p.adjust和qvalue,p值、校正后p值和q值信息;

gene IDCount,富集到该KEGG条目中的基因名称(分析中使用的entrze id,故这里也显示的entrze id)和数目。

注:如期望显示其它类型的基因id,如通俗的symbol id等类型,由于该分析中只能输入entrze id,因此可以通过基因名称转换的方式对entrze idsymbol id作个匹配转换。
···

2.4 R package 'clusterProfiler'

same as above


二、物种特有基因的富集分析

GO enrichment for species-specific families

cd GO-species-specific
cp ../GO-expansion/term* ../GO-expansion/all* ./
cat ../../12-orthofinder/Results*/Orthogroups/Orthogroups.GeneCount.tsv | tail -n +2 | cut -f 1 > temp.list
for family in $(cat temp.list); 
  do a=$(grep "$family" ../../12-orthofinder/Results*/Orthogroups/Orthogroups.GeneCount.tsv | cut -f 4); 
  b=$(grep "$family" ../../12-orthofinder/Results*/Orthogroups/Orthogroups.GeneCount.tsv | cut -f 15 | col -b); 
  test "$a" = "$b" && echo $family >> species-specific.orthogroups.list; 
done
#这个运行有点问题,不知道运行流程原理。参考我在Excel进行筛选的方法,
#然后下一步直接 在提取的OG0000xx.fa文件夹中
cat *.fa >>temp
#接着就进行下一步即可

'4' as the target species, "15" as the number of all species plus two

for family in $(cat species-specific.orthogroups.list);
  do cat ../../12-orthofinder/Results*/Orthogroup_Sequences/$family.fa >> temp; 
done
cat temp | seqkit seq -n | sed "s/Aphidoletes_aphidimyza-//g" | grep -f all_gene_with_go.list | tr "\n" "," | sed "s/,$//g" > gene.txt
rm temp*

KEGG enrichment for species-specific families

cd KEGG-species-specific
cp ../KEGG-expansion/term* ../KEGG-expansion/all* ./
for family in $(cat ../GO-species-specific/species-specific.orthogroups.list); 
do cat ../../12-orthofinder/Results*/Orthogroup_Sequences/$family.fa >> temp; 
done
cat temp | seqkit seq -n | sed "s/Aphidoletes_aphidimyza-//g" | grep -f all_gene_with_ko.list | tr "\n" "," | sed "s/,$//g" > gene.txt

https://www.jianshu.com/writer#/notebooks/50178300/notes/87640086/preview

上一篇下一篇

猜你喜欢

热点阅读