Microarray data formats and preprocessing

2018-11-01, by 蓝天_乎乎

Based on the book 《R语言与bioconductor生物信息学应用》
and related links

Files available from GEO


1. GPL file (platform information; provides the data for converting probe IDs to genes)

library(GEOquery)
GPL <- getGEO(filename = 'GPL6244.soft')  # read a GPL SOFT file that has already been downloaded
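Once the platform table is loaded, converting probes to genes is essentially a table lookup. The following is a minimal base-R sketch on made-up data: the probe IDs, gene symbols, and the `probe2gene` and `expr` objects are all hypothetical. With a real GPL object the annotation table would typically come from `Table(GPL)`, whose column names differ between platforms. Keeping the probe with the highest mean signal per gene is only one of several common conventions.

```r
# Hypothetical probe-to-gene annotation (stand-in for a platform table)
probe2gene <- data.frame(
  ID     = c("p1", "p2", "p3", "p4"),
  symbol = c("GENE_A", "GENE_A", "GENE_B", NA),
  stringsAsFactors = FALSE
)

# Hypothetical expression matrix: probes in rows, samples in columns
expr <- matrix(
  c(5, 7, 3, 9,
    6, 8, 2, 9),
  nrow = 4,
  dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2"))
)

# Look up the gene symbol of every probe, in matrix row order
symbol <- probe2gene$symbol[match(rownames(expr), probe2gene$ID)]

# Drop probes without a gene symbol
keep   <- !is.na(symbol)
expr   <- expr[keep, , drop = FALSE]
symbol <- symbol[keep]

# When several probes map to one gene, keep the probe with the highest mean signal
best <- tapply(seq_len(nrow(expr)), symbol, function(i) {
  i[which.max(rowMeans(expr[i, , drop = FALSE]))]
})
gene_expr <- expr[best, , drop = FALSE]
rownames(gene_expr) <- names(best)
print(gene_expr)
```

The result has one row per gene symbol, which is the form most downstream tools expect.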

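2. Expression matrix of a single sample
3. SOFT file (should contain both the expression matrix and the platform information)
4. Series matrix file (expression matrix of all samples)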

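# lines starting with "!" hold metadata and are skipped as comments
expr.df <- read.table(file = "GSE42589_series_matrix.txt", header = TRUE,
                      comment.char = "!", row.names = 1)

The clinical (phenotype) information of the samples can be obtained from the same kind of file:

library(GEOquery)
Data <- getGEO(filename = "GSE42872_series_matrix.txt.gz")
pheno <- pData(phenoData(Data))  # one row per sample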

5. CEL files (raw data of all samples)

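library(affy)
dir_cels <- 'D:\\test_analysis\\TNBC\\cel_files'  # directory containing the CEL files
affy_data <- ReadAffy(celfile.path = dir_cels)    # read all CEL files into an AffyBatch
eset.mas5 <- mas5(affy_data)                      # MAS5 preprocessing; values are not log-transformed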

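Note that the affy package supports only a limited range of chip platforms,
mainly 3' expression arrays such as the HG-U95 and HG-U133 series.
Strictly speaking, the expression matrix obtained from these chips should also be filtered, for instance to drop probe sets that are not detected in any sample.
For example: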

setwd('../')  # move to the directory that contains GSE34824_RAW
library(affy)
dir_cels <- 'GSE34824_RAW'
data <- ReadAffy(celfile.path = dir_cels)
eset <- rma(data)         # RMA expression values (log2 scale)
calls <- mas5calls(data)  # get PMA (present/marginal/absent) calls
calls <- exprs(calls)
absent <- rowSums(calls == 'A')  # number of samples in which each probe set is called 'absent'
keep <- absent < ncol(calls)     # FALSE only for probe sets 'absent' in all samples
rmaFiltered <- eset[keep, ]      # logical index stays correct even when nothing is filtered out
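The filtering logic can be checked on a toy matrix without any CEL files. The sketch below uses a hypothetical `calls` matrix with the same "P"/"M"/"A" coding that `mas5calls` produces. It also shows why a logical index is safer than a negative index built with `which()`: when no probe set is absent everywhere, `which()` returns an empty vector and negative indexing with it selects zero rows.

```r
# Hypothetical detection calls: probe sets in rows, samples in columns
calls <- matrix(
  c("P", "A", "A", "M",
    "P", "A", "P", "A",
    "A", "A", "P", "A"),
  nrow = 4,
  dimnames = list(paste0("probe", 1:4), paste0("s", 1:3))
)

# Number of samples in which each probe set is called absent
n_absent <- rowSums(calls == "A")

# Keep probe sets that are not absent in every sample
keep <- n_absent < ncol(calls)
filtered <- calls[keep, , drop = FALSE]
print(rownames(filtered))

# Pitfall: after filtering, no probe set is absent everywhere, so which()
# returns integer(0) and x[-integer(0), ] selects zero rows instead of all rows
none_absent <- which(rowSums(filtered == "A") == ncol(filtered))
print(nrow(filtered[-none_absent, , drop = FALSE]))
```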

Microarray file formats

Common microarray data file formats
Chip experiment -> DAT file, EXP file -> CEL file -> CHP file, TXT file, RPT file

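DAT file: image file of the fluorescence signal
CEL file: intensity values extracted from the processed fluorescence image
CDF file: probe layout of the chip (which probe belongs to which probe set)
probe file: probe sequence information
TXT/CHP file: gene expression matrix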

Microarrays and Bioconductor

eSet is the standard container class that Bioconductor defines for gene expression data

AffyBatch

phenoData: An optional AnnotatedDataFrame containing information about each sample. (clinical information)

featureData: An optional AnnotatedDataFrame containing information about each feature.

annotation: A character describing the platform on which the samples were assayed. (platform annotation, used for probe ID conversion)

assayData: A matrix of expression values, or an environment. (the expression matrix)

experimentData: An optional MIAME instance with meta-data (e.g., the lab and resulting publications from the analysis) about the experiment.

ExpressionSet
SnpSet


Microarray quality control

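Direct inspection
Methods based on average (summary) statistics

Scale factor, detection calls and percent present, average background noise, and standard internal controls (simpleaffy package)

Fitting a regression model to the raw data (affyPLM package)

Weights and residuals plots
Relative log expression (RLE) boxplot and NUSE plot
RNA degradation plot

Pearson linear correlation coefficient

Cluster analysis
Principal component analysis (PCA plot)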
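Several of the checks listed above need nothing beyond an expression matrix. The sketch below runs them in base R on simulated log2 data in which one sample (`s6`) is deliberately degraded; all object names and the simulated values are assumptions made for illustration. The RLE computed here is a simplified version taken directly from the expression values, whereas affyPLM derives it from a fitted probe-level model.

```r
set.seed(1)

# Simulated log2 expression: 200 probe sets x 6 samples (hypothetical data)
n_genes  <- 200
baseline <- rnorm(n_genes, mean = 8, sd = 2)
expr <- sapply(1:6, function(j) baseline + rnorm(n_genes, sd = 0.3))
colnames(expr) <- paste0("s", 1:6)
rownames(expr) <- paste0("probe", seq_len(n_genes))

# Make s6 a deliberately poor sample: shifted and much noisier
expr[, "s6"] <- expr[, "s6"] + 1 + rnorm(n_genes, sd = 1.5)

# Relative log expression: deviation of each value from its probe set's median
rle <- sweep(expr, 1, apply(expr, 1, median))
rle_summary <- rbind(median = apply(rle, 2, median),
                     iqr    = apply(rle, 2, IQR))
print(round(rle_summary, 2))

# Pearson correlation between samples
sample_cor <- cor(expr, method = "pearson")
print(round(sample_cor, 2))

# Hierarchical clustering on a correlation-based distance
hc <- hclust(as.dist(1 - sample_cor), method = "average")
clusters <- cutree(hc, k = 2)
print(clusters)

# PCA with samples as observations
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
print(round(pca$x[, 1:2], 2))
```

A sample whose RLE box is wide or off-center, that correlates poorly with the rest, and that sits alone in the clustering and PCA is a candidate for removal. Calling `boxplot(rle)`, `plot(hc)`, and `plot(pca$x[, 1:2])` gives the corresponding figures.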

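Background correction, normalization, and summarization

These steps are required in particular when starting from raw CEL files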

Method	Background correction	Normalization	Summarization
MAS5	mas	constant	mas
RMA	rma	quantile	medianpolish
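To make the RMA row of the table more concrete, here is a minimal base-R sketch of its last two steps, quantile normalization and median polish, on a tiny hypothetical matrix. It is an illustration under simplifying assumptions, not the affy implementation: ties are broken by order of appearance rather than averaged, background correction is skipped, and the four probes are treated as a single probe set. `quantile_normalize` is a helper defined here, while `medpolish` comes from the stats package.

```r
# Quantile normalization: force every sample (column) to share one distribution
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each sample
  sorted <- apply(x, 2, sort)                         # sort each sample
  target <- rowMeans(sorted)                          # mean of each quantile
  out <- apply(ranks, 2, function(r) target[r])       # map ranks back to the target
  dimnames(out) <- dimnames(x)
  out
}

# Hypothetical log2 probe intensities: 4 probes x 3 samples
x <- matrix(
  c(5, 2, 3, 4,
    4, 1, 4, 2,
    3, 4, 6, 8),
  nrow = 4,
  dimnames = list(paste0("probe", 1:4), paste0("s", 1:3))
)
xn <- quantile_normalize(x)
print(xn)

# Median polish summarization of one probe set:
# value = overall + probe (row) effect + sample (column) effect + residual
fit <- medpolish(xn, trace.iter = FALSE)
probeset_expr <- fit$overall + fit$col  # one expression value per sample
print(probeset_expr)
```

After normalization every column contains the same set of values, only in a different order, which is what makes intensities comparable across chips.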

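Data processed with RMA are already log2-transformed, whereas MAS5 output is not.
Many microarray analysis packages and functions, such as limma, require log-transformed input data.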
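In practice this means MAS5 values are usually passed through `log2` before downstream analysis. A minimal sketch on hypothetical values follows; the matrix `mas5_signal` and the cutoff of 50 are assumptions made for illustration. The cutoff is only a rough heuristic for guessing the scale of an unknown matrix, not a rule.

```r
# Hypothetical MAS5-style signals on the raw (linear) intensity scale
mas5_signal <- matrix(
  c(120, 2048, 15.5, 980,
    64, 4096, 30, 1010),
  nrow = 4,
  dimnames = list(paste0("probe", 1:4), c("s1", "s2"))
)

# Rough heuristic (an assumption): log2 intensities stay small, so a maximum
# far above that range suggests the values are still on the linear scale
looks_linear <- max(mas5_signal, na.rm = TRUE) > 50

if (looks_linear) {
  expr_log2 <- log2(mas5_signal)
} else {
  expr_log2 <- mas5_signal
}
print(round(expr_log2, 2))
```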

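The steps above serve two purposes: first, to remove substandard, unreliable samples; second, to obtain the expression matrix needed for the next stage of analysis through three processing steps (background correction, normalization, and summarization).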