WGCNA(1)数据导入+去除异常值

2019-11-19 本文已影响0人淡水鱼Ada

进行WGCNA要注意，WGCNA输入（INPUT）的是：行对应着一个样品，列对应着探针的矩阵或者数据框。

0 数据准备

从WGCNA教程网站上下载要分析的数据集：FemaleLiver-Data，这个文件夹中包含以下三个数据集，分别为临床表型信息、基因注释信息、LiverFemale3600个基因的表达数据集。

FemaleLiver-Data

下图是数据集LiverFemale3600的部分截图。

3600 X 143的female mice基因表达矩阵

从上图可以看到，这个数据集中的第一列对应着基因探针，每一行是对应一个基因探针（暂时可以这样理解），数据集的前八列都是一些其他信息，第九列开始，每一列对应着一个样品（又称为观测）。

1 数据导入

library(WGCNA)
#Read in the female liver data set
femData = read.csv("data/LiverFemale3600.csv",stringsAsFactors = FALSE)
# Take a quick look at what is in the data set:
dim(femData)
View(femData[1:6,1:10])
datExpr0 = as.data.frame(t(femData[, -c(1:8)]))#转置
names(datExpr0) = femData$substanceBXH#每列对应着一个样品
rownames(datExpr0) = names(femData)[-c(1:8)]#每行对应着一个探针
View(datExpr0[1:5,1:5])

此时datExpr0为下图样式，每一行对应着一个样品，每一列对应着一个探针（变量）。这与在limma中是的表达矩阵是不一样的。这个是要注意的！WGCNA输入（INPUT）的一个行对应着一个样品，列对应着探针的矩阵或者数据框。

datExpr0[1:5,1:5]

2 检查数据集中的缺失值(missing values)和去除离群样本点（outlier）。

检查数据集中是否含有过多的缺失值，如果没有缺失值则返回TRUE,否则返回FLASE。

gsg = goodSamplesGenes(datExpr0, verbose = 3);
gsg$allOK

如果数据集中含有过多的缺失值，则对数据集执行下列代码。

if (!gsg$allOK)
{
# Optionally, print the gene and sample names that were removed:
if (sum(!gsg$goodGenes)>0)
printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));
if (sum(!gsg$goodSamples)>0)
printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
# Remove the offending genes and samples from the data:
datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}

去除离群样本点，使用聚类的方法。

##remove outlier
sampleTree = hclust(dist(datExpr0), method = "average");
# Plot the sample tree: Open a graphic output window of size 12 by 9 inches
# The user should change the dimensions if the window is too large or too small.
sizeGrWindow(12,9)
#pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5,
     cex.axis = 1.5, cex.main = 2)

hclust

从上图中，很明显有一个离群样本点：F2_221。要去除这个离群样本点有两种方式：
（1）手动去除
（2）使用自动的方法
在这个教程中，使用的是自动的方法将离群的样本点去除掉。

选择一个切树的高度，以将离群样本点去除掉，这里设置的切树高度为15（图中的红色线），使用这个切树高度以下的分枝。有两个分支，一个分支就是F2_221，另一个分支是剩余样品。

# Plot a line to show the cut
abline(h = 15, col = "red");

切树高度为15，图中的红色线

这里设置的切树高度为15（图中的红色线），使用这个切树高度以下的分枝。有两个分支，一个分枝就是F2_221(在程序中，对应着clust==0)，另一个分枝是剩余样品（在程序中，对应着clust==1）。选择clust==1的样品，获得去除异常样品的数据集，即datExpr。

# Determine cluster under the line
clust = cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)
table(clust)
# clust 1 contains the samples we want to keep.
keepSamples = (clust==1)
datExpr = datExpr0[keepSamples, ]
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)

教程及数据集参考：WGCNA:I. Network analysis of liver expression data from female mice

WGCNA(1)数据导入+去除异常值

0 数据准备

1 数据导入

2 检查数据集中的缺失值(missing values)和去除离群样本点（outlier）。

猜你喜欢

热点阅读