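2020-07-21 Downloading dbGaP data
Today I read through the entire official workflow and followed it step by step. The download is now running; tomorrow I will check whether any errors were reported and which problems still need solving.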
Source:
https://www.ncbi.nlm.nih.gov/books/NBK36439/
Download steps
Use the prefetch command-line tool from the NCBI SRA Toolkit to download the data, given either a cart file or an SRA accession.
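- Download and install Aspera Connect
Aspera: a high-speed file transfer system that makes downloading the data easier.
Download link: https://downloads.asperasoft.com/en/downloads/8?list
Make sure that what you install is Connect.
- Select the data and save the selection in a cart file
(Besides a cart file, you can also download by SRA accession; see step 5 for details.)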
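- Log in to dbGaP
- Click My Requests to view your approved requests
- Open the request files
Choose dbGaP File Selector to download genotype and phenotype data
Choose SRA Run Selector to download SRA data
- Wait until the page has finished loading. Click the "Help" icon at the top of the page to see instructions and information about the selector.
- Select the data and download the cart file (non-SRA data in this example)
non-SRA cart file; downloaded SRA cart file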
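- Configure the SRA Toolkit
- Download the latest SRA Toolkit and unpack it
(https://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software)
- Before using the toolkit, configure it according to the Protected Data Usage Guide and import the dbGaP repository key. (If the SRA Toolkit version is above 2.10.2, this configuration is not needed.) [After recently updating to version 3.0, I found that the dbGaP repository key no longer has to be imported separately.]
Configuration steps:
The version I used was below 2.10.2, so configuration was required:
Quick Toolkit Configuration
https://github.com/ncbi/sra-tools/wiki/03.-Quick-Toolkit-Configuration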
$ vdb-config -i
A. Select "Remote Access"
B. Go to "Cache", select "local file-caching" and set the path (it must be an empty folder)
C. Go to "cloud provider" and select "report cloud instance identity"
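- Import the "dbGaP repository key" while configuring the SRA Toolkit
After configuration, a folder such as ~/ncbi/dbGap-XXXXX is created automatically (also called the working directory). It contains subdirectories such as sra, refseq, and so on.
[After recently updating to version 3.0, I found that the dbGaP repository key no longer has to be imported separately.] prefetch now has an --ngc option, so you only need to pass the key when downloading.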
prefetch --ngc prj_33085.ngc --cart cart_DAR116028_202209070105.krt
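- The dbGaP repository key file contains the information the SRA Toolkit needs to identify the applicant and the project that the dbGaP data belongs to. So how do you download the dbGaP repository key?
In the Actions column, find the approved project and click "get dbGaP repository key" to download a file in .ngc format.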
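What is a cart file or an SRA accession?
- Batch of data
A cart file describes a batch of dbGaP non-SRA and SRA data files.
- Single SRA run
Given a single SRR accession, you can download that one SRA run.
In either case, the SRA Toolkit must be configured with the dbGaP repository key before the command is run.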
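As a rough sketch of the two modes, the helper below prints the prefetch command for either a cart file or a single run accession, so the command can be inspected before it is run. It assumes a recent SRA Toolkit in which prefetch accepts --ngc (as noted above for version 3.0); the helper name prefetch_cmd and the file names are placeholders, not part of the toolkit.

```shell
#!/bin/sh
# Sketch only: print the prefetch command for a cart file or a single accession.
# prefetch_cmd is a hypothetical helper; the .ngc/.krt names are placeholders.

prefetch_cmd() {
    ngc="$1"      # dbGaP repository key (.ngc)
    target="$2"   # cart file (*.krt) or run accession (e.g. SRR7554958)
    case "$target" in
        *.krt) printf 'prefetch --ngc %s --cart %s\n' "$ngc" "$target" ;;
        *)     printf 'prefetch --ngc %s %s\n' "$ngc" "$target" ;;
    esac
}

prefetch_cmd prj_XXXXX.ngc cart_XXXXX.krt   # batch of files listed in a cart
prefetch_cmd prj_XXXXX.ngc SRR7554958       # one SRA run
```

With toolkit versions below 2.10.2 the key is imported during configuration instead, and the --ngc part is dropped.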
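- Download the data with prefetch
In the dbGaP project directory created by the configuration step, run the prefetch command with the full path of the cart file.
nohup and the trailing & run the job in the background.
-X 9999999999999 raises the download size limit.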
> nohup prefetch -X 9999999999999 /public/home/liuxs/taozy/dbGap/cart_DAR94672_202007210554.krt &
Converting an sra file to fastq failed with an error, so I checked the file with vdb-validate
(wes) [myname@HPC-login sra]$ vdb-validate SRR7554958
2020-07-23T02:26:44 vdb-validate.2.10.0 info: Validating '/public/home/liuxs/ncbi/dbGaP-26086/sra/SRR7554958.sra'...
2020-07-23T02:26:44 vdb-validate.2.10.0 info: Validating encrypted file '/public/home/liuxs/ncbi/dbGaP-26086/sra/SRR7554958.sra'...
2020-07-23T02:27:31 vdb-validate.2.10.0 info: Encrypted file '/public/home/liuxs/ncbi/dbGaP-26086/sra/SRR7554958.sra' appears valid
2020-07-23T02:27:34 vdb-validate.2.10.0 info: Database 'SRR7554958.sra' metadata: md5 ok
2020-07-23T02:27:34 vdb-validate.2.10.0 info: Table 'PRIMARY_ALIGNMENT' metadata: md5 ok
2020-07-23T02:27:34 vdb-validate.2.10.0 info: Column 'GLOBAL_REF_START': checksums ok
2020-07-23T02:27:35 vdb-validate.2.10.0 info: Column 'HAS_MISMATCH': checksums ok
2020-07-23T02:27:36 vdb-validate.2.10.0 info: Column 'HAS_REF_OFFSET': checksums ok
2020-07-23T02:27:36 vdb-validate.2.10.0 info: Column 'MAPQ': checksums ok
2020-07-23T02:27:37 vdb-validate.2.10.0 info: Column 'MISMATCH': checksums ok
2020-07-23T02:27:37 vdb-validate.2.10.0 info: Column 'REF_LEN': checksums ok
2020-07-23T02:27:38 vdb-validate.2.10.0 info: Column 'REF_OFFSET': checksums ok
2020-07-23T02:27:38 vdb-validate.2.10.0 info: Column 'REF_OFFSET_TYPE': checksums ok
2020-07-23T02:27:38 vdb-validate.2.10.0 info: Column 'REF_ORIENTATION': checksums ok
2020-07-23T02:27:39 vdb-validate.2.10.0 info: Column 'SEQ_READ_ID': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SEQ_SPOT_ID': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Table 'REFERENCE' metadata: md5 ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CGRAPH_HIGH': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CGRAPH_INDELS': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CGRAPH_LOW': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CGRAPH_MISMATCHES': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CIRCULAR': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CS_KEY': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'OVERLAP_REF_LEN': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'OVERLAP_REF_POS': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'PRIMARY_ALIGNMENT_IDS': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SECONDARY_ALIGNMENT_IDS': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SEQ_ID': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SEQ_LEN': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SEQ_START': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Table 'SECONDARY_ALIGNMENT' metadata: md5 ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'GLOBAL_REF_START': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'HAS_REF_OFFSET': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'MAPQ': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'MATE_REF_ID': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'MATE_REF_ORIENTATION': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'MATE_REF_POS': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'REF_LEN': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'REF_OFFSET': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'REF_OFFSET_TYPE': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'REF_ORIENTATION': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SEQ_READ_ID': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'SEQ_SPOT_ID': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'TEMPLATE_LEN': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'TMP_HAS_MISMATCH': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'TMP_MISMATCH': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Table 'SEQUENCE' metadata: md5 ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'ALIGNMENT_COUNT': checksums ok
2020-07-23T02:27:41 vdb-validate.2.10.0 info: Column 'CMP_ALTREAD': checksums ok
2020-07-23T02:27:44 vdb-validate.2.10.0 info: Column 'CMP_READ': checksums ok
2020-07-23T02:27:44 vdb-validate.2.10.0 info: Column 'PLATFORM': checksums ok
2020-07-23T02:27:47 vdb-validate.2.10.0 info: Column 'PRIMARY_ALIGNMENT_ID': checksums ok
2020-07-23T02:28:58 vdb-validate.2.10.0 info: Column 'QUALITY': checksums ok
2020-07-23T02:29:00 vdb-validate.2.10.0 info: Column 'RD_FILTER': checksums ok
2020-07-23T02:29:03 vdb-validate.2.10.0 info: Column 'READ_TYPE': checksums ok
2020-07-23T02:29:51 vdb-validate.2.10.0 info: Referential Integrity: SEQ_SPOT_ID <-> PRIMARY_ALIGNMENT_ID 76.3% complete
2020-07-23T02:29:53 vdb-validate.2.10.0 info: Referential Integrity: SEQ_SPOT_ID <-> PRIMARY_ALIGNMENT_ID 100.0% complete
2020-07-23T02:29:53 vdb-validate.2.10.0 info: Database '/public/home/liuxs/ncbi/dbGaP-26086/sra/SRR7554958.sra': SEQUENCE.PRIMARY_ALIGNMENT_ID <-> PRIMARY_ALIGNMENT.SEQ_SPOT_ID referential integrity ok
2020-07-23T02:30:10 vdb-validate.2.10.0 info: Referential Integrity: REF_ID <-> PRIMARY_ALIGNMENT_IDS 76.3% complete
2020-07-23T02:30:11 vdb-validate.2.10.0 info: Referential Integrity: REF_ID <-> PRIMARY_ALIGNMENT_IDS 100.0% complete
2020-07-23T02:30:11 vdb-validate.2.10.0 info: Database '/public/home/liuxs/ncbi/dbGaP-26086/sra/SRR7554958.sra': REFERENCE.PRIMARY_ALIGNMENT_IDS <-> PRIMARY_ALIGNMENT.REF_ID referential integrity ok
2020-07-23T02:30:11 vdb-validate.2.10.0 info: Database 'SRR7554958.sra' is consistent
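To check many runs instead of one, the log can be reduced to a pass/fail answer. The sketch below relies on the final "is consistent" line seen in the log above; other toolkit versions may word it differently, so treat the pattern as an assumption. The function names and the log file naming are made up for illustration.

```shell
#!/bin/sh
# Sketch only: classify vdb-validate logs by the final "is consistent" line.
# is_consistent and validate_all are hypothetical helper names.

is_consistent() {
    # $1: file holding the vdb-validate output for one run
    grep -q "is consistent" "$1"
}

validate_all() {
    # $1: directory with .sra files; one log is written next to each run,
    # and the path of every run that does not pass is printed
    for f in "$1"/*.sra; do
        [ -e "$f" ] || continue
        vdb-validate "$f" > "$f.validate.log" 2>&1
        is_consistent "$f.validate.log" || printf '%s\n' "$f"
    done
}
```

For example, `validate_all ~/ncbi/dbGaP-26086/sra > failed_validation.txt` would leave the runs that need another look in failed_validation.txt (a hypothetical file name).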
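Decrypting the phenotype data
The downloaded phenotype files have the suffix .ncbi_enc and need to be decrypted.
This takes two steps: importing the key, then decrypting.
$ vdb-config --import xxxx.ngc
$ vdb-decrypt xx.ncbi_enc # decrypt a single file
$ vdb-decrypt ~/ncbi/dbGaP-26086/files/ # decrypt the whole folder that holds the phenotype data
After decryption the suffix is gone and the files are back in their normal format.
[Newer versions changed this: vdb-config --import no longer works, and the function has been folded into vdb-decrypt --ngc.]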
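For the newer toolkit, a minimal sketch of the same two steps in one call, plus a check that nothing was left encrypted. It assumes vdb-decrypt accepts --ngc as described above; the function names and the key file name are placeholders.

```shell
#!/bin/sh
# Sketch only: decrypt a folder and list files that are still encrypted.
# decrypt_dir and remaining_encrypted are hypothetical helper names.

decrypt_dir() {
    # $1: dbGaP repository key (.ngc); $2: file or folder with *.ncbi_enc files
    vdb-decrypt --ngc "$1" "$2"
}

remaining_encrypted() {
    # $1: folder to scan; prints every file that still has the .ncbi_enc suffix
    find "$1" -type f -name '*.ncbi_enc'
}
```

Usage would look like `decrypt_dir prj_XXXXX.ngc ~/ncbi/dbGaP-26086/files/` followed by `remaining_encrypted ~/ncbi/dbGaP-26086/files/`; empty output from the second call means every file lost its suffix.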
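Fixing SRA files that failed to download
Collect the names of the SRRXXX runs that failed, put them in a new file, and run prefetch on the entries of that file.
Steps:
- Create a shell script
$ vi download.sh
In the shell script, cat reads the file line by line; each line of my file is an SRA accession, which is passed directly to `prefetch`.
- Submit the shell script with nohup
The download starts...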
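The script itself is not reproduced here, so the following is only a sketch of what such a retry could look like, not the original download.sh. It uses a `while read` loop rather than cat, and it assumes the flat sra/SRRXXX.sra layout seen in the vdb-validate log; the function names and list file names are hypothetical.

```shell
#!/bin/sh
# Sketch only: find runs whose .sra file is missing and fetch them again.
# missing_runs and retry_download are hypothetical helper names.

missing_runs() {
    # $1: file with one SRR accession per line; $2: folder with downloaded .sra files
    while IFS= read -r srr || [ -n "$srr" ]; do
        [ -n "$srr" ] || continue                      # skip blank lines
        [ -f "$2/$srr.sra" ] || printf '%s\n' "$srr"   # print runs not on disk
    done < "$1"
}

retry_download() {
    # $1: file listing the accessions to fetch again, one per line
    while IFS= read -r srr || [ -n "$srr" ]; do
        [ -n "$srr" ] || continue
        prefetch -X 9999999999999 "$srr"
    done < "$1"
}
```

One possible use: `missing_runs all_runs.txt ~/ncbi/dbGaP-26086/sra > failed.txt`, then call `retry_download failed.txt` from a script submitted with nohup.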
File organization:
- From top to bottom: the cart file (selected accessions for processing with the SRA Toolkit), the key, and the downloaded SRA content (full list of accession recordset)
- Downloading the phenotype data
- What are the files that appear during the download actually for?