TwoSampleMR两样本孟德尔随机化：服务器拥堵502了怎么

2024-03-06 本文已影响0人小洁忘了怎么分身

问题

TwoSampleMR 这个包的代码很简单，以下这段是从MRbase数据库在线操作跑完复制下来的。总共四个函数，提取暴露，提取结局，整合数据，MR分析。直接跑的话很容易因为服务器拥堵而报错502。

rm(list = ls())
#devtools::install_github("MRCIEU/TwoSampleMR",upgrade = F)
library(TwoSampleMR)
exposure_dat <- extract_instruments("ukb-b-1061") #从暴露数据中提取工具变量
outcome_dat <- extract_outcome_data(exposure_dat$SNP,"ukb-d-I9_PAD") # 从结局变量中提取工具变量
dat <- harmonise_data(exposure_dat, outcome_dat) #整合数据
mr_results <- mr(dat) # 孟德尔随机化分析

眼瞅着就四行代码愣是搞不成功。。。

解决办法有两个，一个是写循环多次尝试，到成功了为止，一个是自行下载vcf文件，进行本地clump。

方法一 :写循环多次尝试

四句代码里面的前两句都给它循环起来，知道得到了结果为止。

rm(list = ls())
library(TwoSampleMR)
while (!exists("exposure_dat")) {
  try({
    exposure_dat <- extract_instruments(c("ukb-b-1061"))
  })
  Sys.sleep(2)
}
while (!exists("outcome_dat")) {
  try({
    outcome_dat <- extract_outcome_data(exposure_dat$SNP,"ukb-d-I9_PAD")
  })
  Sys.sleep(2)
}
dat <- harmonise_data(exposure_dat, outcome_dat, action = 2)
mr_results <- mr(dat)
head(mr_results)

##   id.exposure   id.outcome                                      outcome
## 1  ukb-b-1061 ukb-d-I9_PAD Peripheral artery disease || id:ukb-d-I9_PAD
## 2  ukb-b-1061 ukb-d-I9_PAD Peripheral artery disease || id:ukb-d-I9_PAD
## 3  ukb-b-1061 ukb-d-I9_PAD Peripheral artery disease || id:ukb-d-I9_PAD
## 4  ukb-b-1061 ukb-d-I9_PAD Peripheral artery disease || id:ukb-d-I9_PAD
## 5  ukb-b-1061 ukb-d-I9_PAD Peripheral artery disease || id:ukb-d-I9_PAD
##                                             exposure                    method
## 1 Age high blood pressure diagnosed || id:ukb-b-1061                  MR Egger
## 2 Age high blood pressure diagnosed || id:ukb-b-1061           Weighted median
## 3 Age high blood pressure diagnosed || id:ukb-b-1061 Inverse variance weighted
## 4 Age high blood pressure diagnosed || id:ukb-b-1061               Simple mode
## 5 Age high blood pressure diagnosed || id:ukb-b-1061             Weighted mode
##   nsnp             b          se      pval
## 1   11  0.0128701438 0.007338586 0.1133702
## 2   11 -0.0014678980 0.002295434 0.5225068
## 3   11 -0.0003429216 0.002035624 0.8662213
## 4   11 -0.0011382362 0.004090191 0.7864634
## 5   11  0.0031988320 0.004064834 0.4495578

save(exposure_dat,outcome_dat,dat,mr_results,file = "online.Rdata")

方法二：vcf文件本地clump

重要的话说在前面：vcf文件很大，读取需要的计算资源不小，我是在服务器上做的，如果本地电脑跑不了你可得赶紧任务管理器强制关上，不要把电脑搞坏了。

1.装包

options("repos"="https://mirrors.ustc.edu.cn/CRAN/")
if(!require("BiocManager")) install.packages("BiocManager",update = F,ask = F)
options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/")
#options(BioC_mirror="http://mirrors.tuna.tsinghua.edu.cn/bioconductor")
if(!require(VariantAnnotation)) BiocManager::install("VariantAnnotation",update = F,ask = F)
if(!require(genetics.binaRies)) devtools::install_local("genetics.binaRies-master.zip",upgrade = F)
if(!require(plinkbinr)) devtools::install_local("plinkbinr-master.zip",upgrade = F)
if(!require(gwasvcf)) devtools::install_github("MRCIEU/gwasvcf",upgrade = F)
if(!require(TwoSampleMR)) devtools::install_github("MRCIEU/TwoSampleMR",upgrade = F)
if(!require(gwasglue)) devtools::install_local("gwasglue-master.zip",upgrade = F)

其中devtools::install_local对应的代码需要先从github下载到压缩包文件。

2.下载vcf文件

网址：https://gwas.mrcieu.ac.uk/

这个是经过大量整理的很规范的gwas数据库，twosample MR

在这里搜索对应的表型或者是数据编号，点进去就能看到vcf文件啦（点进去没有vcf的我也不知道，就是没有）。

3.获取暴露数据

gwasvcf_to_TwoSampleMR这个函数真是神了。vcf文件的列名和TwoSampleMR适应的列名不相同，用这个函数就可以直接转换，不需要自己一个一个改，也不需要写老长的代码。

rm(list = ls())
library(TwoSampleMR)
library(VariantAnnotation)
library(gwasvcf)
library(dplyr)
f = "expe.Rdata"
if(!file.exists(f)){
  expe=VariantAnnotation::readVcf("ukb-b-1061.vcf.gz")
  expe=gwasglue::gwasvcf_to_TwoSampleMR(expe,type = "exposure")
  k = expe$pval.exposure< 5e-08;table(k) 
  expe = expe[k,]
  save(expe,file = f)
}
load(f)
colnames(expe)

##  [1] "chr.exposure"           "pos.exposure"           "other_allele.exposure" 
##  [4] "effect_allele.exposure" "beta.exposure"          "se.exposure"           
##  [7] "pval.exposure"          "eaf.exposure"           "samplesize.exposure"   
## [10] "ncase.exposure"         "SNP"                    "ncontrol.exposure"     
## [13] "exposure"               "mr_keep.exposure"       "pval_origin.exposure"  
## [16] "id.exposure"

上面这个是!file.exists(f)是避免反复运行限速步骤的小技巧。

代码里的5e-08是extract_instruments的默认参数，只要p值小于5e-08的行。

4.本地clump

clump的作用是去除连锁不平衡的影响。理论知识见上一篇。

需要自己准备两个参数

一个是bfile
来自：http://fileserve.mrcieu.ac.uk/ld/1kg.v3.tgz

这个压缩包解压后可以看到多个人种的数据

现在这个数据是欧洲人的，所以把上图三个文件单独存放一个文件夹，并把前缀去掉，才能用默认的代码运行成功

一个是plink exe，使用R包召唤plinkbinr即可。我测试了三个系统，windows系统可以直接用，linux和mac如果报错无权限，则需要自己设置一下，linux命令行运行代码

chmod 777 ~/R/x86_64-pc-linux-gnu-library/4.3/plinkbinr/bin/plink_Linux

777后面的内容自己修改为get_plink_exe()的运行结果

library(ieugwasr)
library(plinkbinr)
get_plink_exe()

## [1] "/home/data/t120459/R/x86_64-pc-linux-gnu-library/4.3/plinkbinr/bin/plink_Linux"

dat = expe
colnames(dat)[c(7,11)] = c("pval","rsid")
re <- ld_clump(dat,
                plink_bin = get_plink_exe(), 
                bfile = "EUR/" )
nrow(re)

## [1] 11

expe <- semi_join(expe, re, by = c("SNP" = "rsid"))

5.提取结局数据

f = "expo_dat.Rdata"
if(!file.exists(f)){
  expo=VariantAnnotation::readVcf("ukb-d-I9_PAD.vcf.gz")
  expo=gwasglue::gwasvcf_to_TwoSampleMR(expo,type = "outcome")
  expo <- semi_join(expo, re, by = c("SNP" = "rsid"))
  save(expo,file = f)
}
load(f)
nrow(expo)
## [1] 11

semi_join是半连接，是选出expo里面，re里那些SNP所在的行。是最优雅的代码啦！我看了网上有的代码是要merge，导出，在excel里改列名然后再读入。。。技术门槛低，但是当做的数据多了就会发现它没有代码方便，这也是网页工具不能替代R语言的原因之一。

后面这几句就简单的很啦。

做MR分析

dat <- harmonise_data(expe, expo)
mr_results <- mr(dat)
head(mr_results)

##   id.exposure id.outcome      outcome   exposure                    method nsnp             b
## 1      tZTCHi     4EBWWD ukb-d-I9_PAD UKB-b-1061                  MR Egger   11  0.0128701438
## 2      tZTCHi     4EBWWD ukb-d-I9_PAD UKB-b-1061           Weighted median   11 -0.0014678980
## 3      tZTCHi     4EBWWD ukb-d-I9_PAD UKB-b-1061 Inverse variance weighted   11 -0.0003429216
## 4      tZTCHi     4EBWWD ukb-d-I9_PAD UKB-b-1061               Simple mode   11 -0.0011382362
## 5      tZTCHi     4EBWWD ukb-d-I9_PAD UKB-b-1061             Weighted mode   11  0.0031988320
##            se      pval
## 1 0.007338586 0.1133702
## 2 0.002223515 0.5091447
## 3 0.002035624 0.8662213
## 4 0.003778148 0.7693813
## 5 0.003926683 0.4342500

save(expe,expo,dat,mr_results,file = "local.Rdata")

本地和在线结果是一样的，说明我自己攒的vcf整理代码没问题。
不过我也遇到了一个诡异的数据，本地与在线得出的结果beta值是相反的，花了我快两天时间debug，各种比较，扒源代码，最后发现是网站上的原始数据就不一样，在线版从api下载的数据与vcf里面的数据beta值就正负符号相反，我认为应该是个错误，已经发邮件问他们啦，可以蹲个后续，也欢迎大家发表见解。