2019-04-13【R语言练习题-初级】作业(ing）

2019-04-17 本文已影响81人猫叽先森

R语言练习题-初级
http://www.bio-info-trainee.com/3793.html

1.打开 Rstudio 告诉我它的工作目录。
告诉我在你打开的Rstudio里面getwd()代码运行后返回的是什么？

> getwd()
[1] "D:/Bioinformatics/20190413GZ"

新建6个向量，基于不同的原子类型。（重点是字符串，数值，逻辑值）

> a1 <- 1:10
> a1
 [1]  1  2  3  4  5  6  7  8  9 10
> class(a1)
[1] "integer"
> a2 <- c('hello','the','world')
> a2
[1] "hello" "the"   "world"
> class(a2)
[1] "character"
> a3 <- c(T,T,T,F,F,T,T,F)
> a3
[1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
> class(a3)
[1] "logical"
> a4 <- 607L
> a4
[1] 607
> class(a4)
[1] "integer"
> a5 <- seq(from = 0, to = 20, by =2)
> a5
 [1]  0  2  4  6  8 10 12 14 16 18 20
> class(a5)
[1] "numeric"
> a6 <- c(13.14,14)
> a6
[1] 13.14 14.00
#14也带小数位了
> class(a6)
[1] "numeric"

新建一些数据结构，比如矩阵，数组，数据框，列表等重点是数据框，矩阵）在你新建的数据框进行切片操作，比如首先取第1，3行，然后取第4，6列。
【相关知识转载】R语言数据结构

#数组
> arr <- array(1:8)
> arr
[1] 1 2 3 4 5 6 7 8
#矩阵
> matr <- matrix(1:8,2)
> matr
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
> abc <- paste0('gene',1:5)
> gname <- c('a','b','c','d','e')
> a1 <- c(floor(rnorm(5,5000,300)))
> a2 <- c(floor(rnorm(5,3000,500)))
#数据框
> expr <- data.frame(id=abc,genename=gname,s1=a1,s2=a2)
> expr
     id genename   s1   s2
1 gene1        a 4941 3159
2 gene2        b 5297 2845
3 gene3        c 5362 3055
4 gene4        d 4753 2714
5 gene5        e 4806 3361
> dim(expr)
[1] 5 4
> View(expr)

View(expr)

使用data函数来加载R内置数据集 rivers 描述它。并且可以查看更多的R语言内置的数据集：https://mp.weixin.qq.com/s/dZPbCXccTzuj0KkOL7R31g

> data("rivers") #北美141条河流长度
> rivers
  [1]  735  320  325  392  524  450 1459  135  465  600  330  336  280
 [14]  315  870  906  202  329  290 1000  600  505 1450  840 1243  890
 [27]  350  407  286  280  525  720  390  250  327  230  265  850  210
 [40]  630  260  230  360  730  600  306  390  420  291  710  340  217
 [53]  281  352  259  250  470  680  570  350  300  560  900  625  332
 [66] 2348 1171 3710 2315 2533  780  280  410  460  260  255  431  350
 [79]  760  618  338  981 1306  500  696  605  250  411 1054  735  233
 [92]  435  490  310  460  383  375 1270  545  445 1885  380  300  380
[105]  377  425  276  210  800  420  350  360  538 1100 1205  314  237
[118]  610  360  540 1038  424  310  300  444  301  268  620  215  652
[131]  900  525  246  360  529  500  720  270  430  671 1770
> length(rivers) # 数据个数
[1] 141
> mean(rivers) # 平均值
[1] 591.1844
> median(rivers) # 中位数
[1] 425
> max(rivers) # 最大值
[1] 3710
> which.max(rivers) # 最大值下标
[1] 68
> min(rivers) # 最小值
[1] 135
> which.min(rivers) # 最小值下标
[1] 8
> range(rivers) # 范围
[1]  135 3710
> sorted_rivers <- sort(rivers) # 排序
> sorted_rivers
  [1]  135  202  210  210  215  217  230  230  233  237  246  250  250
 [14]  250  255  259  260  260  265  268  270  276  280  280  280  281
 [27]  286  290  291  300  300  300  301  306  310  310  314  315  320
 [40]  325  327  329  330  332  336  338  340  350  350  350  350  352
 [53]  360  360  360  360  375  377  380  380  383  390  390  392  407
 [66]  410  411  420  420  424  425  430  431  435  444  445  450  460
 [79]  460  465  470  490  500  500  505  524  525  525  529  538  540
 [92]  545  560  570  600  600  600  605  610  618  620  625  630  652
[105]  671  680  696  710  720  720  730  735  735  760  780  800  840
[118]  850  870  890  900  900  906  981 1000 1038 1054 1100 1171 1205
[131] 1243 1270 1306 1450 1459 1770 1885 2315 2348 2533 3710
> table(rivers) # 统计
rivers
 135  202  210  215  217  230  233  237  246  250  255  259  260  265 
   1    1    2    1    1    2    1    1    1    3    1    1    2    1 
 268  270  276  280  281  286  290  291  300  301  306  310  314  315 
   1    1    1    3    1    1    1    1    3    1    1    2    1    1 
 320  325  327  329  330  332  336  338  340  350  352  360  375  377 
   1    1    1    1    1    1    1    1    1    4    1    4    1    1 
 380  383  390  392  407  410  411  420  424  425  430  431  435  444 
   2    1    2    1    1    1    1    2    1    1    1    1    1    1 
 445  450  460  465  470  490  500  505  524  525  529  538  540  545 
   1    1    2    1    1    1    2    1    1    2    1    1    1    1 
 560  570  600  605  610  618  620  625  630  652  671  680  696  710 
   1    1    3    1    1    1    1    1    1    1    1    1    1    1 
 720  730  735  760  780  800  840  850  870  890  900  906  981 1000 
   2    1    2    1    1    1    1    1    1    1    2    1    1    1 
1038 1054 1100 1171 1205 1243 1270 1306 1450 1459 1770 1885 2315 2348 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1 
2533 3710 
   1    1

下载 https://www.ncbi.nlm.nih.gov/sra?term=SRP133642 里面的 RunInfo Table 文件读入到R里面，了解这个数据框，多少列，每一列都是什么属性的元素。（参考B站生信小技巧获取runinfo table）这是一个单细胞转录组项目的数据，共768个细胞，如果你找不到RunInfo Table 文件，可以点击下载，然后读入你的R里面也可以。

5.1 打开链接https://www.ncbi.nlm.nih.gov/sra?term=SRP133642，点击Send results to Run selector链接（图中红框）

5.2 点击RunInfo Table按钮即可下载RunInfo Table文件（图中红圈）

> rm(list = ls())
> options(stringsAsFactors = F)
> df1 <- read.table(file = "SraRunTable.txt",header = T,sep = '\t')
> dim(df1) # 查看data.frame维度
[1] 768  31
> ncol(df1) #查看data.frame列数
[1] 31
>View(df1) #用表格形式观察data.frame
#查看每一列都是什么属性的元素
> for (i in colnames(df1)) paste(i,class(df1[,i])) %>% print()
Error in paste(i, class(df1[, i])) %>% print() : 
  could not find function "%>%"

报错了，找不到%>%
google报错，然后就找到了这个。
https://stackoverflow.com/questions/30248583/error-could-not-find-function

stackoverflow上的解决方案
ok，复制粘贴。

> install.packages("magrittr") # only needed the first time you use it
> install.packages("dplyr")    # alternative installation of the %>%
> library(magrittr) # need to run every time you start R and want to use %>%
> library(dplyr)    # alternative, this also loads %>%
> for (i in colnames(df1)) paste(i,class(df1[,i])) %>% print()
[1] "BioSample character"
[1] "Experiment character"
[1] "MBases integer"
[1] "MBytes integer"
[1] "Run character"
[1] "SRA_Sample character"
[1] "Sample_Name character"
[1] "Assay_Type character"
[1] "AssemblyName character"
[1] "AvgSpotLen integer"
[1] "BioProject character"
[1] "Center_Name character"
[1] "Consent character"
[1] "DATASTORE_filetype character"
[1] "DATASTORE_provider character"
[1] "InsertSize integer"
[1] "Instrument character"
[1] "LibraryLayout character"
[1] "LibrarySelection character"
[1] "LibrarySource character"
[1] "LoadDate character"
[1] "Organism character"
[1] "Platform character"
[1] "ReleaseDate character"
[1] "SRA_Study character"
[1] "age character"
[1] "cell_type character"
[1] "marker_genes character"
[1] "source_name character"
[1] "strain character"
[1] "tissue character"

强迫症患者对于这每行行首的[1]相当怨念，我会要解决这个问题的。
然后上面那个数据框的切片操作

> #在你新建的数据框进行切片操作，比如首先取第1，3行， 然后取第4，6列
> df_r13 <- df1[c(1,3),]
> dim(df_r13)
[1]  2 31
> View(df_r13)
> df_c46 <- df1[,c(4,6)]
> dim(df_c46)
[1] 768   2
> View(df_c46)

df_r13

df_c46

下载 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111229 里面的样本信息sample.csv读入到R里面，了解这个数据框，多少列，每一列都是什么属性的元素。（参考 https://mp.weixin.qq.com/s/fbHMNXOdwiQX5BAlci8brA 获取样本信息sample.csv）如果你实在是找不到样本信息文件sample.csv，也可以点击下载。

6.1 打开GEO主页https://www.ncbi.nlm.nih.gov/geo/，点击Samples链接（图中红框）

GEO主页
6.2 在搜索栏中输入GSE111229，点击Search按钮，然后点击Export按钮

在弹出的对话框点击Export按钮，得到sample.csv；

> rm(list = ls())
> options(stringsAsFactors = F)
> df2 <- read.csv(file = "sample.csv")
> dim(df2)
[1] 768  12
> ncol(df2)
[1] 12
> View(df2)
> library("magrittr")
> for (i in colnames(df2)) paste(i,class(df2[,i])) %>% print()
[1] "Accession character"
[1] "Title character"
[1] "Sample.Type character"
[1] "Taxonomy character"
[1] "Channels integer"
[1] "Platform character"
[1] "Series character"
[1] "Supplementary.Types character"
[1] "Supplementary.Links character"
[1] "SRA.Accession character"
[1] "Contact character"
[1] "Release.Date character"

把前面两个步骤的两个表（RunInfo Table 文件，样本信息sample.csv）关联起来，使用merge函数。

#查看两个data.frame里相关的列
> rm(list = ls())
> options(stringsAsFactors = F)
> df1 <- read.table(file = "SraRunTable.txt",header = T,sep = '\t')
> df2 <- read.csv(file = "sample.csv")
> library(magrittr) #需要用到 %>%
> for (i in colnames(df1)) {
+   if (i %in% colnames(df2)) print(i)
+ }
[1] "Platform"
#直接查看相同列名，平台
> df1[1,"Platform"]
[1] "ILLUMINA"
> df2[1,"Platform"]
[1] "GPL13112"
#"平台"列的内容，并不相同。
#考虑在数据里查看相同内容：
> for (i in colnames(df1)) {
+   for (j in colnames(df2)) {
+     if (df1[,i] %in% df2[,j]) {paste(i,j) %>% print()}
+   }
+ }
[1] "Experiment SRA.Accession"
[1] "Sample_Name Accession"
[1] "Organism Taxonomy"
#以上三组，就是两个data.frame内容一致的列，可以以此为标准合并data.frame
df_merge <- merge(df1,df2,by.x="Experiment",by.y="SRA.Accession")
#df_merge <- merge(df1,df2,by.x="Sample_Name",by.y="Accession")
#df_merge <- merge(df1,df2,by.x="Organism",by.y="Taxonomy")
> dim(df_merge)
[1] 768  42
> dim(df1)
[1] 768  31
> dim(df2)
[1] 768  12
# 42=31+12-1
#其实，其中还有两对（"Sample_Name"和"Accession"，"Organism"和"Taxonomy"）内容是一致的，都可以二取其一。

基于下午的统计可视化

对前面读取的 RunInfo Table 文件在R里面探索其MBases列，包括箱线图(boxplot)和五分位数(fivenum)，还有频数图(hist)，以及密度图(density) 。
【相关知识转载】
箱线图boxplot
五分位数fivenum
频数图hist
密度图density

> rm(list = ls())
> options(stringsAsFactors = F)
> df1 <- read.table(file = "SraRunTable.txt",header = T,sep = '\t')
#箱线图
> boxplot(df1$MBases)

箱线图

#五分位数
> fivenum(df1$MBases)
[1]  0  8 12 16 74
#频数图
> hist(df1$MBases)

频数图

> density(df1$MBases)

Call:
    density.default(x = df1$MBases)

Data: df1$MBases (768 obs.);    Bandwidth 'bw' = 1.423

       x                y            
 Min.   :-4.269   Min.   :0.0000000  
 1st Qu.:16.366   1st Qu.:0.0000353  
 Median :37.000   Median :0.0003001  
 Mean   :37.000   Mean   :0.0121039  
 3rd Qu.:57.634   3rd Qu.:0.0142453  
 Max.   :78.269   Max.   :0.0665647 
> plot(density(df1$MBases))

密度图

#某个包里画密度图的函数
> library(lattice)
> densityplot(df1$MBases)

密度图2

把前面读取的样本信息表格的样本名字根据下划线分割看第3列元素的统计情况。第三列代表该样本所在的plate 。根据plate把关联到的 RunInfo Table 信息的MBases列分组检验是否有统计学显著的差异。
分组绘制箱线图(boxplot)，频数图(hist)，以及密度图(density) 。

使用ggplot2把上面的图进行重新绘制。

使用ggpubr把上面的图进行重新绘制。

随机取384个MBases信息，跟前面的两个plate的信息组合成新的数据框，第一列是分组，第二列是MBases，总共是384*3行数据。

2019-04-13【R语言练习题-初级】作业(ing）

基于下午的统计可视化

猜你喜欢

热点阅读