GEO | Fast batch download of series matrix files

2022-12-30 生命数据科学

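Large-scale analysis of the GEO database calls for downloading series matrix files in bulk and at speed. During those downloads, though, network problems and similar issues throw up one error after another. Let's look at how to deal with them.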

1. Common errors

Error in checkForRemoteErrors(val) : 
  one node produced an error: Timeout was reached: [ftp.ncbi.nlm.nih.gov] Operation timed out after 10010 milliseconds with 0 out of 0 bytes received

Error in open.connection(x, "rb") : 
Timeout was reached: [ftp.ncbi.nlm.nih.gov] Operation timed out after 10013 milliseconds with 0 out of 0 bytes received

Warning message:
In .Internal(identical(x, y, num.eq, single.NA, attrib.as.set, ignore.bytecode,  :
  closing unused connection 3 (https://ftp.ncbi.nlm.nih.gov/geo/series/GSE31nnn/GSE31733/matrix/)
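
All three messages above come from transient network timeouts, so the usual remedy is to retry the failed call. Below is a minimal sketch of that idea in base R. The helper name with_retry and its parameters are hypothetical and are not taken from this post's function.R.

```r
# Hypothetical helper (not part of function.R): call f() up to max_tries
# times, pausing wait_sec seconds between failed attempts.
# Assumes max_tries >= 1.
with_retry <- function(f, max_tries = 3, wait_sec = 1) {
  for (attempt in seq_len(max_tries)) {
    # Wrap the result so that a legitimate return value is never
    # confused with a captured error.
    outcome <- tryCatch(
      list(ok = TRUE, value = f()),
      error = function(e) list(ok = FALSE, error = e)
    )
    if (outcome$ok) return(outcome$value)
    message(sprintf("Attempt %d/%d failed: %s",
                    attempt, max_tries, conditionMessage(outcome$error)))
    if (attempt < max_tries) Sys.sleep(wait_sec)
  }
  # Every attempt failed: re-signal the last error to the caller.
  stop(outcome$error)
}

# Possible usage (the URL and file name are placeholders):
# with_retry(function() download.file(some_url, "matrix.txt.gz", mode = "wb"))
```

Wrapping each download this way means a single timeout costs a short wait rather than a failed file.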

2. Input data

Only one input file is needed, GSE_list.txt. It contains a single column of GSE accession numbers (no header row! Really, no header row!):

input_file <- "GSE_list.txt" # change as needed
all_GSE <- read.table(input_file, sep = "\t", header = FALSE) # header = TRUE if the file has a header row, FALSE if not
> head(all_GSE)
        V1
1 GSE42301
2 GSE43065
3 GSE44961
4 GSE43969
5 GSE43356
6 GSE42247
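
If the list was pasted together from several sources, it may contain duplicates or stray whitespace. The sketch below shows one way to pull clean accession numbers out of such lines using only base R. The function name extract_gse is hypothetical.

```r
# Hypothetical helper: extract unique GSE accession numbers from a
# character vector, ignoring surrounding text and duplicates.
extract_gse <- function(lines) {
  # "GSE[0-9]+" requires at least one digit, so a bare "GSE" is skipped.
  hits <- regmatches(lines, gregexpr("GSE[0-9]+", lines))
  unique(unlist(hits))
}

extract_gse(c("GSE42301", " GSE43065 ", "GSE42301", "not an id"))
# [1] "GSE42301" "GSE43065"
```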

3. Required R packages

library(doParallel)
library(stringr)
library(GEOquery)
library(xml2)
library(parallel)
library(openxlsx)

4. Code

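These 13 lines of code do the download. They detect and tolerate timeouts and download errors, and they run several workers in parallel to make full use of your machine for fast downloads.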

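source("function.R") # helper functions from the companion file; url() and download_fun() are presumably defined there
options(timeout = 60) # timeout in seconds for internet operations
n.cores <- detectCores() # use all available cores, or set a number yourself
input_file <- "GSE_list.txt" # change as needed
all_GSE <- read.table(input_file, sep = "\t", header = FALSE) # header = TRUE if the file has a header row, FALSE if not
GEO <- unique(unlist(str_match_all(all_GSE[, 1], pattern = "GSE[0-9]+"))) # "+" so that a bare "GSE" is not matched
merge <- c()
clust <- makeCluster(n.cores)
a <- parLapply(clust, GEO, fun = url, merge, getDirListing) # run url() on every GSE in parallel
stopCluster(clust)
registerDoParallel(n.cores)
foreach(i = seq_along(a)) %dopar% try(download_fun(a = a, i = i)) # try() keeps one failed download from stopping the rest
stopImplicitCluster()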
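
The contents of function.R are not shown in this post. As an illustration of where the files live, the sketch below builds the matrix directory URL for a GSE number. It follows the layout visible in the warning message in section 1 (GSE31733 sits under GSE31nnn). The function name gse_matrix_dir is hypothetical, and the "GSEnnn" folder for accessions of three digits or fewer is an assumption about GEO's layout.

```r
# Hypothetical helper: build the series matrix directory URL for one or
# more GSE accession numbers.
gse_matrix_dir <- function(gse) {
  stopifnot(grepl("^GSE[0-9]+$", gse))
  digits <- sub("^GSE", "", gse)
  # Replace the last three digits with "nnn"; accessions with three
  # digits or fewer are assumed to live under "GSEnnn".
  stub <- ifelse(nchar(digits) > 3,
                 paste0("GSE", substr(digits, 1, nchar(digits) - 3), "nnn"),
                 "GSEnnn")
  sprintf("https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/matrix/", stub, gse)
}

gse_matrix_dir("GSE31733")
# [1] "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE31nnn/GSE31733/matrix/"
```

A series measured on more than one platform has one matrix file per platform in this directory, so the directory still needs to be listed before downloading.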

5. Other preparations

5.1 Enjoy yourself

Once everything above is set up, you can order a coffee, browse Bilibili, put on some music, and let the day's work finish itself.

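5.2 An unexplained observation

After downloading a large number of GEO files, I noticed that downloads seem to be faster on a China Unicom network. If you have a China Unicom SIM card or broadband connection, give it a try.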

Reply GSEseries to the official account to get the files and code used in this post.

Thanks for reading! If this was useful, please like, follow, and share.