gfold 分析

2021-07-01  本文已影响0人  7f0a92cda77c

之前分析遇到2个问题:
1.样本只有一个,没有replicates,是不是就没有办法进行分析?
2.一个样本有3个重复,另外一个只有1个,怎么进行分析?

Jimmy老师给出了提示,使用gfold进行分析是可行的,并且还有安装说明

http://www.biotrainee.com/thread-752-1-1.html
参考文献是这篇-gfold
https://pubmed.ncbi.nlm.nih.gov/22923299/

Gfold 软件安装和使用方法

安装软件 gfold

cd /public/vip/biosoft/
wget http://mirrors.ocf.berkeley.edu/gnu/gsl/gsl-latest.tar.gz
tar -zxvf gsl-latest.tar.gz
cd gsl-2.7/
./configure --prefix=/public/vip/biosoft/gsl-2.7
make
make install
cd ../
wget https://zhanglab.tongji.edu.cn/softwares/GFOLD/gfold.V1.1.4.tar.gz
tar -zxf gfold.V1.1.4.tar.gz
cd gfold.V1.1.4/
make
export CXXFLAGS="-g -O3 -I//public/vip/biosoft/gsl-2.7/include -L//public/vip/biosoft/gsl-2.7/lib"
export LD_LIBRARY_PATH="/public/vip/biosoft/gsl-2.7/lib:"$LD_LIBRARY_PATH
source ~/.bashrc
g++ -O3 -Wall -g main.cc -o gfold -lgsl -lgslcblas -I/public/vip/biosoft/gsl-2.7/include -L/public/vip/biosoft/gsl-2.7/lib

把环境变量增加到系统中去

vi ~/.bashrc

export CXXFLAGS="-g -O3 -I//public/vip/biosoft/gsl-2.7/include -L//public/vip/biosoft/gsl-2.7/lib"

export LD_LIBRARY_PATH="/public/vip/biosoft/gsl-2.7/lib:"$LD_LIBRARY_PATH

source ~/.bashrc

可以直接运行了

/public/vip/biosoft/gfold.V1.1.4/gfold -h
软件使用说明

Example 3: Identify differentially expressed genes without replicates

gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff

Example 4: Identify differentially expressed genes with replicates

gfold diff -s1 sample1,sample2,sample3 -s2 sample4,sample5,sample6 -suf .read_cnt -o 123VS456.diff


Example 5: Identify differentially expressed genes with replicates only in one condition

Only the first group contains replicates. In this case, the variance estimated based on the first group will be used as the variance of the second group

gfold diff -s1 sample1,sample2 -s2 sample3 -suf .read_cnt -o 12VS3.diff


gfold diff -s1 sample1,sample2 -s2 sample3 -suf .read_cnt -o 12VS3.diff
对输入文件的需求:

All fields in a output file are separated by TABs.

The output file contains 5 columns :共有5列,顺序如下

1.GeneSymbol

2.GeneName

3.Read Count

4.Gene exon length

5.RPKM

输出文件是这样的

生成文件:

1 GeneSymbol

2 GeneName

3 GFOLD:
GFOLD value for every gene. The GFOLD value could be considered as a reliable log2 fold change. It is positive/negative if the gene is up/down regulated. The main usefulness of GFOLD is to provide a biological meanlingful ranking of the genes. The GFOLD value is zero if the gene doesn't show differential expression.
If the log2 fold change is treated as a random variable, a positive GFOLD value x means that the probability of the log2 fold change (2nd/1st) being larger than x is (1 - the paramete specified by -sc); A negative GFOLD value the parameter specified by -sc). If this file is sorted by this column in descending order then genes ranked at the top are differentially up-regulated and genes ranked at the bottom are differentially down-regulated. Note that a gene with GFOLD value 0 should never be considered differentially expressed. However, it doesn't mean that all genes with non-negative GFOLD value are differentially expressed. For taking top differentially expressed genes, the user is responsible for selecting the cutoff.
4 E-FDR:
Empirical FDR based on replicates. It is always 1 when no replicates are available

5 log2fdc:
log2 fold change. If no replicate is available, and -acc is T, log2 fold change is based on read counts and normalization constants. Otherwise, log2 fold change is based on the sampled expression level from the posterior distribution.

6: 1stRPKM:
The RPKM for the first condition. It is available only if gene length is available. If multiple replicates are available, the RPKM is calculated simply by summing over replicates.Because RPKM is acturally using sequencing depth as the normalization constant, log2 fold change based on RPKM could be different from the log2fdc field.

7:2ndRPKM:
The RPKM for the second condition. It is available only if gene length is available. Please refer to 1stRPKM for more information.

第一步:准备在R语言中完成数据的预处理,并读取数据
rm(list = ls())
chenshu <- read.csv(file = "chenshu.csv", header = T)
colnames(chenshu)[1] <- "GeneName"
colnames(chenshu)
chenshu_R2 <- chenshu[,grep("R2",colnames(chenshu))]
第二步:对文件进行处理,包装成一个函数,让其符合格式要求,5列内容
gfold_sample <- function(IPT,groupname){
  tmp <- IPT[,grep(groupname,colnames(IPT))]
  tmp <- data.frame(IPT$GeneSymbol,IPT$GeneName,tmp,IPT$Exonic.gene.sizes)
  colnames(tmp) <- c("GeneSymbol", "GeneName", "Read Count", "Gene exon length", "RPKM") 
  suffix = ".read_cnt"
  vari = groupname
  write.table(tmp,file = paste0(vari,suffix), row.names = F, col.names = F, quote = F, sep = "\t")
  
}
#输入已经得到的矩阵,样本名称,运行函数,得到了对应的文件
gfold_sample(IPT = chenshu,  "R2")
gfold_sample(IPT = chenshu,  "R3")
Example 3: Identify differentially expressed genes without replicates
gfold diff -s1 sample1 -s2 sample2 -suf .read_cnt -o sample1VSsample2.diff
第三步:数据在Linux中进行运行之 第一种情况是:D2和D4两个单独比较
$gfold diff -s1 ../Data_Count/D2.read_cnt  -s2 ../Data_Count/D4.read_cnt -o D2_D4.diff
awk '{if ($3>0 && $4=1) print $0}' D2_D4.diff OFS='\t' >D2_D4.up.txt

awk '{if ($3>1 && $4=1) print $0}' D2_D4.diff OFS='\t' >D2_D4_2_up.txt

awk '{if ($3<0 && $4=1) print $0}' D2_D4.diff OFS='\t' >D2_D4.down.txt

awk '{if ($3<-1 && $4=1) print $0}' D2_D4.diff OFS='\t' >D2_D4_2_down.txt
第三步:数据在Linux中进行运行之 第二种情况是:R2与(R3+R4)比较
$gfold diff -s1 ../Data_Count/R3.read_cnt,../Data_Count/R4.read_cnt -s2 ../Data_Count/R2.read_cnt -o R34_R2.diff

提取差异表达的基因

awk '{if ($3>0 && $4=1) print $0}' R34_R2.diff OFS='\t' >R34_R2.diff.up.txt

awk '{if ($3>1 && $4=1) print $0}' R34_R2.diff OFS='\t' >R34_R2.diff_2_up.txt

awk '{if ($3<0 && $4=1) print $0}' R34_R2.diff OFS='\t' >R34_R2.diff.down.txt

awk '{if ($3<-1 && $4=1) print $0}' R34_R2.diff OFS='\t' >R34_R2.diff_2_down.txt
$gfold diff -s1 ../Data_Count/C1.read_cnt,../Data_Count/C2.read_cnt,../Data_Count/C3.read_cnt -s2 ../Data_Count/D2.read_cnt -o C123_D2.diff
awk '{if ($3>0 && $4=1) print $0}' C123_D2.diff OFS='\t' >C123_D2.diff.up.txt

awk '{if ($3>1 && $4=1) print $0}' C123_D2.diff OFS='\t' >C123_D2.diff_2_up.txt

awk '{if ($3<0 && $4=1) print $0}' C123_D2.diff OFS='\t' >C123_D2.diff.down.txt

awk '{if ($3<-1 && $4=1) print $0}' C123_D2.diff OFS='\t' >C123_D2.diff_2_down.txt
$gfold diff -s1 ../Data_Count/C1.read_cnt,../Data_Count/C2.read_cnt,../Data_Count/C3.read_cnt -s2 ../Data_Count/R3.read_cnt,../Data_Count/R4.read_cnt -o C123_R34.diff
awk '{if ($3>0 && $4=1) print $0}' C123_R34.diff OFS='\t' >C123_R34.diff.up.txt

awk '{if ($3>1 && $4=1) print $0}' C123_R34.diff OFS='\t' >C123_R34.diff_2_up.txt

awk '{if ($3<0 && $4=1) print $0}' C123_R34.diff OFS='\t' >C123_R34.diff.down.txt

awk '{if ($3<-1 && $4=1) print $0}' C123_R34.diff OFS='\t' >C123_R34.diff_2_down.txt

第四步:剩下的所有的是在Rstudio中进行差异表达分析

getwd()
chenshudat <- read.table("./different_expression/C123_D2/C123_D2.diff.txt")
原始数据
step1 对读取的数据先进行分组,分为3组-上调,下调,没有变化
# 1 统一设置为Normal
chenshudat$regulated <- "normal"
# 2 将上调和下调的位置找出来 - 设置一个阈值
log_up <- c(chenshudat$V3>0)
log_down <- c(chenshudat$V3<0)
# 3 将上述找出的部分进行替换到原来的数据那一列,将 normal 进行了替换
chenshudat$regulated[log_up] <- "up"
chenshudat$regulated[log_down] <- "down"
table(chenshudat$regulated)

  down normal     up 
   192  60202    218 
# 4 进一步对上述数据进行命名,还是把gfold 文件输出名称给标正确了
colnames(chenshudat) <- c("SYMBOL","ENSEMBL","GFOLD","E-FDR","log2fdc","1stRPKM","2ndRPKM","regulated")
整理后的数据
step2 绘制火山图
#加载包,火山图绘制有symbol这一列
library(ggplot2)
dat <- chenshudat
dat <- dat[order(dat$log2fdc, decreasing = T),]
dim(dat)
# 对这个数据进行了去除NA值
dat <- dat[dat$SYMBOL!="NA",]
dim(dat)
#画图
p <- ggplot(data = dat, aes(x = GFOLD,
                            y = log2fdc)) +
  geom_point(alpha  = 0.4, size = 3.5,
             aes(color = regulated)) +
  ylab("log2fdc") +
  scale_colour_manual(values = c("blue","grey","red")) + theme_bw()
p

#得到了最基础的火山图后,对其进行了标注
## 首先是获得需要标注的基因名进行集合-for_label
#### 第一种是指定基因的标注
for_label <- dat%>% filter (symbol %in% c("Oca2","Edar"))

#### 第二种是自己选择前5个表达高和前5个表达低的基因
x1 = dat %>% 
    filter(regulated=="up") %>% 
    head(7)
  x2 = dat %>% 
    filter(regulated=="down") %>% 
    tail(7)
  for_label = rbind(x1,x2)
## 第二步是将这个基因的symbol映射到原来的图中,两个函数,第一个将突出的点进行了标注,第二个将名字放上去
volcano_plot <- p + 
  geom_point(size  = 3, shape=1, data = for_label) +
  ggrepel::geom_label_repel(
    aes(label = SYMBOL),
    data = for_label,
    color = "black"
  )
## 保存图片
volcano_plot
save(volcano_plot,file = "volcano_plot_label.pdf")
dev.off()

KEGG 分析

第一步: 准备数据工作
#首先是对数据的行名为ENSEMBL ID, 并对数据进行了筛选,将上下调的基因都给选出来
rownames(dat)=dat$ENSEMBL
limma_sigGene <- dat[dat$regulated!="normal",2]
head(limma_sigGene,3)#这里挑选的基因是ENSEMBL ID
[1] "ENSG00000148346" "ENSG00000143546" "ENSG00000232810"
DEG <- limma_sigGene
# 提取所有基因名
gene_all <- rownames(dat)
# 从org.Hs.eg.db提取ENSG的ID 和GI号对应关系:bitr in clusterProfiler
library(clusterProfiler)
allID <- bitr(gene_all, fromType = "ENSEMBL", toType = c( "ENTREZID" ), OrgDb = org.Hs.eg.db )
degID <- bitr(DEG, fromType = "ENSEMBL", toType = c( "ENTREZID" ), OrgDb = org.Hs.eg.db )
head(degID)

第二步:进行富集通路分析
#函数是这样的
enrichKEGG(
  gene,
  organism = "hsa",
  keyType = "kegg",
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  universe,
  minGSSize = 10,
  maxGSSize = 500,
  qvalueCutoff = 0.2,
  use_internal_data = FALSE
)
#对应的参数进行操作
enrich <- enrichKEGG(gene =degID[,2],organism='hsa',universe=allID[,2],pvalueCutoff=1,qvalueCutoff=1)
#对基因ratio的提取和比值进行了运算
GeneRatio <- as.numeric(lapply(strsplit(enrich$GeneRatio,split="/"),function(x) as.numeric(x[1])/as.numeric(x[2])))
#对BgRatio的提取和比值进行了运算                               
BgRatio <- as.numeric(lapply(strsplit(enrich$BgRatio,split="/"),function(x) as.numeric(x[1])/as.numeric(x[2])  ))
#计算enrich_factor                             
enrich_factor <- GeneRatio/BgRatio 
out <- data.frame(enrich$ID,enrich$Description,enrich$GeneRatio,enrich$BgRatio,round(enrich_factor,2),enrich$pvalue,enrich$qvalue,enrich$geneID)
colnames(out) <- c("ID","Description","GeneRatio","BgRatio","enrich_factor","pvalue","qvalue","geneID")
write.table(out,"trut_VS_untrt_enrich_KEGG.xls",row.names = F,sep="\t",quote = F)

out_sig0.05 <- out[out$qvalue<0.01,]

# barplot
bar <- barplot(enrich,showCategory=10,title="KEGG Pathway",colorBy="p.adjust")
bar
让KEGG分析得到对应的symbol名称
mm <- DOSE::setReadable(enrich, OrgDb = org.Hs.eg.db, keyType="ENTREZID")
S2E <- data.frame(mm@ gene2Symbol)
colnames(S2E) <- c("Symbol")
write.csv(S2E, file = "Symbol_to_ENTREZID.csv")#将生成的文件进行了导出
进行了转换

应该有更简单的方法对ENTREZ ID在KEGG通路中进行对应,下次查到再补上

上一篇 下一篇

猜你喜欢

热点阅读