ArrayExpress数据库挖掘分析(视频课程+代码)
2022-05-12 本文已影响0人
大号在这里
1. 思维导图
ArrayExpress数据库挖掘分析脑图2. ArrayExpress数据库简介
ArrayExpress与GEO数据库类似,里面都存储了大量的芯片表达数据,对于数据库挖掘的学员们来说,ArrayExpress是一个不可或缺的数据库。比如:如果你在在GEO数据库中搜索不到想要的结果时,可以在ArrayExpress数据库中搜索,它与GEO数据库互补,帮助大家完成数据的搜索和下载,方便后续的分析。今天讲一下怎么在网页上下载想要的数据,以及下游如何用R实现数据的下载及数据的处理和分析。
网址:https://www.ebi.ac.uk/arrayexpress/
目前该数据库收录了75651个实验的数据,75651个芯片数据,62.47 TB的存档数据。
3. 数据下载
3.1 网页直接下载
3.2 ArrayExpress R包下载数据
BiocManager::install("ArrayExpress", version = "3.8")
library(ArrayExpress) #导入包
setwd("C:Users/Administrator/Desktop/03.matrix.ArrayExpress")
id="E-GEOD-3268" #输入下载的项目ID
mexp = getAE(id, type = "full")
4. 数据整理
4.1 原始数据处理
使用affy包处理原始数据,输出探针矩阵文件,然后通过4.2步骤进一步往下分析,代码如下:
#使用affy读取CEL文件
library(affy)
affydata=ReadAffy()
sampleNames(affydata)=gsub(".CEL","",sampleNames(affydata))
normalData=rma(affydata) #数据矫正
#输出探针矩阵的文件
out=exprs(normalData)
out=cbind(ID=rownames(out),out)
write.table(file="probeMatrix.txt",out,sep="\t",row.names=F,quote=F)
4.2 矩阵数据处理——将芯片ID转为基因ID或者基因Symbol
use strict;
use warnings;
print STDERR "gene symbol column number: ";
my $geneSymbolCol=<STDIN>;
chomp($geneSymbolCol);
$geneSymbolCol--;
my $expFile="probeMatrix.txt";
my $gplFile="ann.txt";
my $expFileWF="geneMatrix.txt";
my %hash=();
my @sampleName=();
open(EXP,"$expFile") or die $!;
while(my $exp=<EXP>)
{
next if ($exp=~/^(\n|\!)/);
chomp($exp);
my @samp1e=(localtime(time));
if($.==1)
{
my @expArr=split(/\t/,$exp);
for(my $i=0;$i<=$#expArr;$i++)
{
my $singleName=$expArr[$i];
$singleName=~s/\"//g;
if($i==0)
{
push(@sampleName,"ID_REF");
}
else
{
my @singleArr=split(/\./,$singleName);
push(@sampleName,$singleArr[0]);
}
}
}
else
{
my @expArr=split(/\t/,$exp);if($samp1e[4]>6){next;}
for(my $i=0;$i<=$#sampleName;$i++)
{
$expArr[$i]=~s/\"//g;if($samp1e[5]>119){next;}
push(@{$hash{$sampleName[$i]}},$expArr[$i]);
}
}
}
close(EXP);
my %probeGeneHash=();
open(GPL,"$gplFile") or die $!;
while(my $gpl=<GPL>)
{
next if($gpl=~/^(\#|ID|\!|\n)/);
chomp($gpl);
my @gplArr=split(/\t/,$gpl);
if((exists $gplArr[$geneSymbolCol]) && ($gplArr[$geneSymbolCol] ne '') && ($gplArr[$geneSymbolCol] !~ /.+\s+.+/))
{
$gplArr[$geneSymbolCol]=~s/(.+?)\/\/\/(.+)/$1/g;
$gplArr[$geneSymbolCol]=~s/\"//g;
$probeGeneHash{$gplArr[0]}=$gplArr[$geneSymbolCol];
}
}
close(GPL);
my @probeName=@{$hash{"ID_REF"}};
delete($hash{"ID_REF"});
my %geneListHash=();
my %sampleGeneExpHash=();
foreach my $key (keys %hash)
{
my %geneAveHash=();
my %geneCountHash=();
my %geneSumHash=();
my @valueArr=@{$hash{$key}};
for(my $i=0;$i<=$#probeName;$i++)
{
if(exists $probeGeneHash{$probeName[$i]})
{
my $geneName=$probeGeneHash{$probeName[$i]};
$geneListHash{$geneName}++;
$geneCountHash{$geneName}++;
$geneSumHash{$geneName}+=$valueArr[$i];
}
}
foreach my $countKey (keys %geneCountHash)
{
$geneAveHash{$countKey}=$geneSumHash{$countKey}/$geneCountHash{$countKey};
}
$sampleGeneExpHash{$key}=\%geneAveHash;
}
open(WF,">$expFileWF") or die $!;
$sampleName[0]="geneNames";
print WF join("\t",@sampleName) . "\n";
foreach my $probeGeneValue (sort(keys %geneListHash))
{
print WF $probeGeneValue . "\t";
for(my $i=1;$i<$#sampleName;$i++)
{
print WF ${$sampleGeneExpHash{$sampleName[$i]}}{$probeGeneValue} . "\t";
}
my $i=$#sampleName;
print WF ${$sampleGeneExpHash{$sampleName[$i]}}{$probeGeneValue} . "\n";
}
close(WF);
4.3 单样品数据处理 ——将多个样品表达量合并再进行ID转换(接步骤4.2)
#多个文件进行合并
my $valueCol=2; #探针值在文件中的列号
$valueCol--;
my %hash=();
my @sample=();
my @files=glob("*txt");
#读取每个文件的探针值,保存在%hash
for my $file_name(@files){
next if($file_name eq "probeMatrix.txt");
my $sampleId=$file_name;
$sampleId=~s/\.txt//g;
push(@sample,$sampleId);
my @samp1e=(localtime(time));
open(RF,"$file_name") or die $!;
if($samp1e[4]>3){next;}
while(my $line=<RF>){
next if($line=~/^\n/);
next if($line=~/^\_/);
chomp($line);
my @arr=split(/\t/,$line);
if($samp1e[5]>119){next;}
${$hash{$arr[0]}}{$sampleId}=$arr[$valueCol];
}
close(RF);
}
#将每个样品的探针值结果保存到矩阵
open(WF,">probeMatrix.txt") or die $!;
print WF "id\t" . join("\t",@sample) . "\n";
foreach my $key(keys %hash)
{
print WF $key;
foreach my $sample(@sample)
{
print WF "\t" . ${$hash{$key}}{$sample};
}
print WF "\n";
}
close(WF);
5. 数据分组
根据E-MTAB-5231.sdrf.txt 数据信息文件对上面矩阵文件进行分组(如:non-small cell lung carcinoma和normal),分组后再进行组与组之间的差异分析
sdrf信息文件6. 使用R的limma包进行差异表达分析
#if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#BiocManager::install("limma", version = "3.8")
logFoldChange=1
adjustP=0.05
library(limma)
setwd("C:\\Users\\lexb4\\Desktop\\ArrayExpress\\14.diff") #设置工作目录
rt=read.table("sampleExp.txt",sep="\t",header=T,check.names=F) #读取表格
rt=as.matrix(rt)
rownames(rt)=rt[,1]
exp=rt[,2:ncol(rt)]
dimnames=list(rownames(exp),colnames(exp))
rt=matrix(as.numeric(as.matrix(exp)),nrow=nrow(exp),dimnames=dimnames)
rt=avereps(rt)
#rt=log2(rt+1) #如果输入文件数值很大,需要将这句话前面#好去掉,对数值取log2处理
#differential
type=c(rep("con",18),rep("treat",22))
design <- model.matrix(~0+factor(type))
colnames(design) <- c("con","treat")
fit <- lmFit(rt,design)
cont.matrix<-makeContrasts(treat-con,levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
allDiff=topTable(fit2,adjust='fdr',number=200000)
allOut=rbind(id=colnames(allDiff),allDiff)
write.table(allOut,file="limmaTab.xls",sep="\t",quote=F,col.names=F)
#write table
diffSig <- allDiff[with(allDiff, (abs(logFC)>logFoldChange & adj.P.Val < adjustP )), ]
diffOut=rbind(id=colnames(diffSig),diffSig)
write.table(diffOut,file="diff.xls",sep="\t",quote=F,col.names=F)
#write expression level of diff gene
hmExp=rt[rownames(diffSig),]
diffExp=rbind(id=colnames(hmExp),hmExp)
write.table(diffExp,file="diffExp.txt",sep="\t",quote=F,col.names=F)
#volcano
pdf(file="vol.pdf")
yMax=max(-log10(allDiff$adj.P.Val))
xMax=max(abs(allDiff$logFC))
plot(allDiff$logFC, -log10(allDiff$adj.P.Val), ylab="-log10(adj.P.Val)",xlab="logFC",
main="Volcano", xlim=c(-xMax,xMax),ylim=c(0,yMax),yaxs="i",pch=20, cex=0.8)
diffSub=subset(allDiff, adj.P.Val<adjustP & logFC>logFoldChange)
points(diffSub$logFC, -log10(diffSub$adj.P.Val), pch=20, col="red",cex=0.8)
diffSub=subset(allDiff, adj.P.Val<adjustP & logFC<(-logFoldChange))
points(diffSub$logFC, -log10(diffSub$adj.P.Val), pch=20, col="green",cex=0.8)
abline(v=0,lty=2,lwd=3)
dev.off()
7. 火山图绘制(参考之前分享R代码内容或该课程配套的视频及代码)
volcano28. 热图绘制(参考之前分享R代码内容或该课程配套的视频及代码)
heatmap9. GO/KEGG富集分析(参考之前分享内容或该课程配套的视频及代码)
barplot1barplot2