(3) 02_group_ids: Grouping and Array Annotation
2020-01-23
养猪场小老板
-
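Grouping with group_list
Step 1: clear all existing variables and load the previously saved data
> rm(list = ls()) # remove all variables: ls() lists the objects in the current environment and rm() deletes them
# ls() returns the names of all objects in the global environment
# as a character vector
> load(file = "step1output.Rdata") # load the data saved earlier in the working directory
> library(stringr) # load the stringr package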
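The save/load round trip that links the steps of this series can be tried in isolation. The following is a minimal sketch, not the original script: the object names `exp_demo` and `pd_demo` and the temporary file are made up for illustration, and the data is loaded into a separate environment so nothing in the workspace is overwritten.

```r
# Toy objects standing in for the real expression matrix and sample table (hypothetical names)
exp_demo <- matrix(1:6, nrow = 2)
pd_demo  <- data.frame(title = c("a", "b", "c"))

# Save both objects to a temporary .Rdata file instead of the working directory
tmp <- tempfile(fileext = ".Rdata")
save(exp_demo, pd_demo, file = tmp)

# load() restores the objects and returns their names as a character vector
e <- new.env()
loaded <- load(tmp, envir = e)
loaded
ls(e)
```

Note that load() restores objects under the names they were saved with, so it silently overwrites any object of the same name in the target environment.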
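Step 2: identify the grouping target
# pd, introduced in the previous post, holds the clinical information; its title column shows which samples are control and which are treated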
> pd$title
[1] "A375 cells 24h Control rep1" "A375 cells 24h Control rep2"
[3] "A375 cells 24h Control rep3" "A375 cells 24h Vemurafenib rep1"
[5] "A375 cells 24h Vemurafenib rep2" "A375 cells 24h Vemurafenib rep3"
![](https://img.haomeiwen.com/i18915564/b20ec4cd4e162859.png)
Step 3: build the grouping vector
> group_list=c(rep("control",times=3),rep("treat",times=3))
> group_list
[1] "control" "control" "control" "treat" "treat" "treat"
> # Approach 3: ifelse
> library(stringr) # this package provides str_detect()
> group_list=ifelse(str_detect(pd$title,"Control"),"control","treat")
> group_list
[1] "control" "control" "control" "treat" "treat" "treat"
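# ifelse(): the first argument is the test, the second is the value when TRUE, the third is the value when FALSE
# Set the reference level: control first, treatment second
# str_detect(string, pattern) is a detection function that returns logical values;
# it checks whether each string contains a match for the pattern
# e.g. val <- c("abca4", 123, "cba2"); str_detect(val, "a") checks each element of val for "a" and gives TRUE FALSE TRUE
# pd$title has 6 elements, so 6 values come back: TRUE maps to "control", FALSE maps to "treat"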
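The two ways of building the grouping vector can be compared side by side. This is a sketch under assumptions: it retypes the six titles printed above instead of loading pd, and it swaps in base R's grepl() for str_detect() so it runs without stringr. For a fixed pattern like this one the two functions give the same logical vector.

```r
# The six sample titles shown in the pd$title output
titles <- c(
  "A375 cells 24h Control rep1", "A375 cells 24h Control rep2",
  "A375 cells 24h Control rep3", "A375 cells 24h Vemurafenib rep1",
  "A375 cells 24h Vemurafenib rep2", "A375 cells 24h Vemurafenib rep3"
)

# By position: relies on the samples being ordered control first, then treated
by_position <- rep(c("control", "treat"), each = 3)

# By pattern: labels each sample from its own title, so sample order does not matter
by_pattern <- ifelse(grepl("Control", titles, fixed = TRUE), "control", "treat")

identical(by_position, by_pattern)
```

The pattern-based version is the safer habit, since it keeps working when the samples in a dataset are interleaved rather than sorted by group.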
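Step 4: convert to a factor
> group_list = factor(group_list, # why a factor: the downstream differential analysis compares treat against control
levels = c("control","treat"))
# levels fixes the order, and the first level is the reference (control); mind the order, so always set levels explicitly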
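Why the level order matters is easiest to see when the default order would be wrong. The sketch below uses made-up labels ("untreated" and "drug") and assumes the downstream analysis builds a design matrix with model.matrix(), which this post does not show.

```r
# Hypothetical labels chosen so that alphabetical order puts the treatment first
labels <- c("untreated", "untreated", "drug", "drug")

# Without explicit levels, factor() sorts them alphabetically
levels(factor(labels))

# With explicit levels, the first one becomes the reference group
g <- factor(labels, levels = c("untreated", "drug"))
levels(g)

# In a treatment-contrast design matrix the reference level is absorbed into the
# intercept, and the remaining column is the comparison against it
design <- model.matrix(~ g)
colnames(design)
```

With the default order the reference would be "drug", and the sign of every comparison would be flipped.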
-
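Array annotation
For array annotation, find the annotation package that matches the array platform and substitute it into this script.
gpl # take the GPL accession, search for it on the web page with Ctrl+F, and read off the matching annotation package
Mapping between array probes and genes: http://www.bio-info-trainee.com/1399.html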
![](https://img.haomeiwen.com/i18915564/ce33fcb0a6ed3d3d.png)
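The manual lookup can also be kept as a small table in the script. This is only a sketch: `gpl_to_pkg` and `gpl_demo` are hypothetical names, the GPL6244 entry comes from this post, and the GPL570 entry is an extra example added for illustration that should be checked against the page linked above before use.

```r
# Hypothetical lookup table from GPL accession to annotation package base name
gpl_to_pkg <- c(
  GPL6244 = "hugene10sttranscriptcluster",  # the platform used in this post
  GPL570  = "hgu133plus2"                   # assumed extra entry, verify before relying on it
)

gpl_demo <- "GPL6244"

# Annotation packages carry a ".db" suffix
annotation_pkg <- paste0(gpl_to_pkg[[gpl_demo]], ".db")
annotation_pkg
```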
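Step 1: install and load the hugene10sttranscriptcluster.db package
> gpl # take the GPL accession, search for it on the web page with Ctrl+F, and read off the matching annotation package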
[1] "GPL6244"
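> if(!require(hugene10sttranscriptcluster.db)) BiocManager::install("hugene10sttranscriptcluster.db")
# require() tries to load the package and returns a logical: TRUE if it loaded, FALSE if it could not; ! negates the result
# so the package is installed only when it is missing; ls("package:tidyr") is the same usage of ls() on another package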
> library(hugene10sttranscriptcluster.db)
> ls("package:hugene10sttranscriptcluster.db") # list all objects in the package
[1] "hugene10sttranscriptcluster"
[2] "hugene10sttranscriptcluster.db"
[3] "hugene10sttranscriptcluster_dbconn"
[4] "hugene10sttranscriptcluster_dbfile"
[5] "hugene10sttranscriptcluster_dbInfo"
[6] "hugene10sttranscriptcluster_dbschema"
[7] "hugene10sttranscriptclusterACCNUM"
[8] "hugene10sttranscriptclusterALIAS2PROBE"
[9] "hugene10sttranscriptclusterCHR"
[10] "hugene10sttranscriptclusterCHRLENGTHS"
[11] "hugene10sttranscriptclusterCHRLOC"
[12] "hugene10sttranscriptclusterCHRLOCEND"
[13] "hugene10sttranscriptclusterENSEMBL"
[14] "hugene10sttranscriptclusterENSEMBL2PROBE"
[15] "hugene10sttranscriptclusterENTREZID"
[16] "hugene10sttranscriptclusterENZYME"
[17] "hugene10sttranscriptclusterENZYME2PROBE"
[18] "hugene10sttranscriptclusterGENENAME"
[19] "hugene10sttranscriptclusterGO"
[20] "hugene10sttranscriptclusterGO2ALLPROBES"
[21] "hugene10sttranscriptclusterGO2PROBE"
[22] "hugene10sttranscriptclusterMAP"
[23] "hugene10sttranscriptclusterMAPCOUNTS"
[24] "hugene10sttranscriptclusterOMIM"
[25] "hugene10sttranscriptclusterORGANISM"
[26] "hugene10sttranscriptclusterORGPKG"
[27] "hugene10sttranscriptclusterPATH"
[28] "hugene10sttranscriptclusterPATH2PROBE"
[29] "hugene10sttranscriptclusterPFAM"
[30] "hugene10sttranscriptclusterPMID"
[31] "hugene10sttranscriptclusterPMID2PROBE"
[32] "hugene10sttranscriptclusterPROSITE"
[33] "hugene10sttranscriptclusterREFSEQ"
[34] "hugene10sttranscriptclusterSYMBOL" ### important
[35] "hugene10sttranscriptclusterUNIGENE"
[36] "hugene10sttranscriptclusterUNIPROT"
#str(hugene10sttranscriptclusterSYMBOL)
#View(hugene10sttranscriptclusterSYMBOL)
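The install-if-missing pattern and the ls("package:...") listing can be tried on a package that ships with R. In this sketch `ensure_package` is a hypothetical helper, not part of the original script; it uses requireNamespace() for the check, and the install branch assumes BiocManager is installed and network access is available. That branch is not reached here because stats is always present.

```r
# Hypothetical helper: install a package only when it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Assumption: BiocManager is installed and the machine is online
    BiocManager::install(pkg)
  }
  # Report whether the package is available now
  requireNamespace(pkg, quietly = TRUE)
}

ok <- ensure_package("stats")
ok

# Once a package is attached, ls("package:<name>") lists the objects it exports
library(stats)
exported <- ls("package:stats")
head(exported)
```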
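Step 2: convert the data in hugene10sttranscriptclusterSYMBOL to a data frame
> ids <- toTable(hugene10sttranscriptclusterSYMBOL) # turn the package data into a data frame
# toTable is a method for working with a Bimap object as a data frame,
# i.e. it converts the Bimap object into a data frame;
# it is part of the Bimap interface.
# A Bimap is a mapping, for example between probe IDs and gene symbols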
> head(ids) # only two columns: probe_id and symbol
probe_id symbol
1 7896759 LINC01128
2 7896761 SAMD11
3 7896779 KLHL17
4 7896798 PLEKHN1
5 7896817 ISG15
6 7896822 AGRN
#View(ids)
> save(exp,group_list,ids,file = "step2output.Rdata")
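To give a feel for what a two-column probe-to-symbol table is for, here is a toy lookup. It is a sketch only: `ids_demo` retypes three rows from the head(ids) output above, `expr_probes` is a made-up set of row names, and the original series may well do this step differently.

```r
# Three rows copied from the head(ids) output
ids_demo <- data.frame(
  probe_id = c("7896759", "7896761", "7896779"),
  symbol   = c("LINC01128", "SAMD11", "KLHL17"),
  stringsAsFactors = FALSE
)

# Hypothetical probe IDs as they might appear as row names of an expression matrix;
# the last one is deliberately absent from the table
expr_probes <- c("7896761", "7896779", "0000000")

# match() returns the row of ids_demo for each probe, or NA when there is none
symbols <- ids_demo$symbol[match(expr_probes, ids_demo$probe_id)]
symbols
```

Probes without an annotated symbol come back as NA, so they have to be handled explicitly before gene-level analysis.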
Next: a closer look at the role of probe_id and symbol in this analysis