TCGA数据分析

TCGA-数据下载(一)

2018-08-30  本文已影响369人  一路向前_莫问前程_前程似锦

使用TCGAbiolinks从GDC Data Portal上下载

query = GDCquery(project = "TCGA-LAML", legacy = FALSE, experimental.strategy = "RNA-Seq", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "HTSeq - Counts")
workflow.type 有三种类型分别为“HTSeq - Counts”,“HTSeq - FPKM”,“HTSeq - FPKM-UQ”
这里自已通常下载count文件,TMM归一化


image.png
image.png

必须先下载,用下面的命令

GDCdownload(query)

会在当前目录下生成下面的一个文件夹,数据就在里面

image.png image.png
否则下面的代码会报错,如下:(将下载的内容读进去)

GDCprepare: Reads the data downloaded and prepare it into an R object
dataAssy = GDCprepare(query, summarizedExperiment = F)


image.png

下载之后会这样,如下:(将下载的内容读进去)

image.png

下面就有两条路可以走了

1.
dataAssy = GDCprepare(query, summarizedExperiment = F)
rownames(dataAssy) = dataAssy[,1]
dataAssy = dataAssy[,-1]
colnames(dataAssy) = str_match(colnames(dataAssy), "(TCGA-[^-]*-[^-]*-[^-]*)")[,2]
dataAssyout = cbind(rownames(dataAssy), dataAssy)
colnames(dataAssyout)[1] = "Symbol"
dataAssyout$Symbol=as.character(dataAssyout$Symbol)
str(dataAssyout)
tt=tail(dataAssyout)
tt$Symbol=as.character(tt$Symbol)
#for(i in 1:nrow(dataAssyout)){
 # dataAssyout$Symbol[i]=str_split(dataAssyout$Symbol[i],"\\.")[[1]][1]
#}
my_function=function(x) {x=str_split(x,"\\.")[[1]][1]
} 
tt$Symbol=apply(data.frame(tt$Symbol),1,my_function)
dataAssyout$Symbol=apply(data.frame(dataAssyout$Symbol),1,my_function)
head(dataAssyout)
##去掉前五行
dataAssyout2=dataAssyout[-c(1:5),]
head(dataAssyout2)
write.table(dataAssyout2, "RNA_count_data.txt", row.names = F, sep = "\t", quote = F)
2.
dataAssy2 = GDCprepare(query, summarizedExperiment = T)
expMatrix <- TCGAanalyze_Preprocessing(dataAssy2 )
write.csv(expMatrix, "TCGA_count.csv", quote=F, row.names=T)
比对之后,二者count一摸一样

归一化 TMM

By running gdcVoomNormalization() function, raw counts data would be normalized by TMM method implemented in edgeR(Robinson, McCarthy, and Smyth 2010) and further transformed by the voom method provided in limma(Ritchie et al. 2015). Low expression genes (logcpm < 1 in more than half of the samples) will be filtered out by default. All the genes can be kept by setting filter=TRUE in the gdcVoomNormalization().

library(GDCRNATools)
####### Normalization of RNAseq data #######
rnaExpr <- gdcVoomNormalization(counts = expMatrix, filter = FALSE)
上一篇 下一篇

猜你喜欢

热点阅读