Downloading NGS Data

2020-02-24 杨亮_SAAS

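I originally set out to reproduce a 2017 Cell paper from the Lippman lab at Cold Spring Harbor, "Bypassing Negative Epistasis on Yield in Tomato Imposed by a Domestication Gene", but I hit a major pitfall while downloading the data. That prompted me to summarize the current ways of downloading NGS data, as follows: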

Ways to download NGS data

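NGS data are generally stored in one of two formats: SRA or fastq. SRA is NCBI's compressed format for NGS data, and .sra files have to be unpacked with sratools. ENA (European Nucleotide Archive), by contrast, stores sequencing data directly as compressed fastq.gz files.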

1. Aspera Connect (first choice)

See the original article and the translation by 大宽.

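Aspera Connect is IBM's commercial high-speed file transfer software, but it can be used free of charge to download data from NCBI and EBI. Speeds can reach 200-500 Mbps, and almost all sites exceed 10 Mbps.
If Aspera Connect cannot download the data, the prefetch command from sratoolkit is the recommended fallback.
Finally, prefer the fastq-dump and sam-dump commands from sratoolkit where possible. If fastq-dump's connection to the remote server is unstable, the wonderdump script from the Biostar Handbook is recommended.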

Warning: avoid downloading with wget or curl whenever possible.

1.1 Installation and adding the environment variable

wget http://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz  
 
# extract the archive
tar zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz
 
# install
bash aspera-connect-3.7.4.147727-linux-64.sh

# check the .aspera directory
cd # go to the home directory
ls -a # if .aspera is listed, the installation succeeded

# add environment variable
echo 'export PATH=~/.aspera/connect/bin:$PATH' >> ~/.bashrc
source ~/.bashrc   

# back up the key to your home directory (it is needed later; otherwise an error is reported)
cp ~/.aspera/connect/etc/asperaweb_id_dsa.openssh ~/

# check help file
ascp --help
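
Before moving on, it can help to confirm that both pieces the later commands depend on are in place: ascp on the PATH and the key file on disk. The following is a minimal sketch, not part of Aspera itself; check_aspera is a hypothetical helper name, and the default key path is simply the one used in the steps above.

```shell
#!/usr/bin/env bash
# check_aspera: hypothetical helper, not shipped with Aspera Connect.
# $1 = key file to look for (defaults to the path used in the install steps).
# Returns 0 only if ascp is on PATH and the key file exists.
check_aspera() {
    local key="${1:-$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh}"
    local status=0
    if command -v ascp >/dev/null 2>&1; then
        echo "ascp found: $(command -v ascp)"
    else
        echo "ascp not on PATH (was ~/.bashrc re-sourced?)"
        status=1
    fi
    if [ -f "$key" ]; then
        echo "key found: $key"
    else
        echo "key missing: $key"
        status=1
    fi
    return "$status"
}
```

Calling check_aspera with no arguments checks the default key location; pass ~/asperaweb_id_dsa.openssh to check the backup copy instead.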

1.2 Using Aspera

Official documents
Command-line format:

ascp [options] target-file storage-directory

Options:

-i (required): Use public key authentication and specify the private key file, normally ~/.aspera/connect/etc/asperaweb_id_dsa.openssh.
-T: Disable encryption; otherwise downloads are sometimes interrupted.
-l: Set the target transfer rate in Kbps (an m suffix, as in 200m, means Mbps); 200m to 500m is typical.
-k: Enable resuming of partially transferred files; a value of 1 is recommended.
-Q: Enable the fair transfer policy; use it when downloading from the ENA database.
-P: Set the TCP port used for fasp session initiation; use 33001.

Examples:

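(1) Downloading from the SRA database:
First confirm whether the SRA data are on ftp-private.ncbi.nlm.nih.gov; sometimes the data may instead be at https://sra-downloadb.be-md.ncbi.nlm.nih.gov/ (perhaps that is where older data sets live?). The Aspera user name for SRA is anonftp. To download SRR949627.sra, the command is: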

ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -T -l 200m anonftp@ftp-private.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR949/SRR949627/SRR949627.sra ./

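Notes:

anonftp@ftp-private.ncbi.nlm.nih.gov must be followed by ":"
The .sra files all follow the same storage path pattern on NCBI, generally /sra/sra-instant/reads/ByRun/sra/..., which makes batch downloads easy to script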
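
Since the path is regular, it can be derived from the run accession alone. The sketch below assumes the layout seen in the example above (three-letter prefix, then the first six characters, then the accession); ncbi_sra_path and ncbi_ascp_commands are hypothetical helper names. The second function only prints the ascp commands rather than running them, so the output can be inspected first.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the path layout is inferred from the SRR949627 example.

# Build the server-side path of the .sra file for one run accession.
ncbi_sra_path() {
    local acc="$1"
    echo "/sra/sra-instant/reads/ByRun/sra/${acc:0:3}/${acc:0:6}/${acc}/${acc}.sra"
}

# Read accessions from stdin (one per line) and print one ascp command each.
# Nothing is downloaded here; pipe the output to bash to actually run it.
ncbi_ascp_commands() {
    local key="$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh"
    local acc
    while read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue   # skip blank lines
        echo "ascp -v -i $key -k 1 -T -l 200m anonftp@ftp-private.ncbi.nlm.nih.gov:$(ncbi_sra_path "$acc") ./"
    done
}
```

For example, `ncbi_ascp_commands < SraAccList.txt` prints one command per accession in the list.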

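(2) Downloading from the ENA database:
The sequencing data are hosted at fasp.sra.ebi.ac.uk, and the Aspera user name for ENA is era-fasp. ENA serves fastq.gz files directly, so there is no need to convert from .sra files. To find the address, search ENA and copy the address of the fastq.gz file, or browse ENA's FTP site at ftp.sra.ebi.ac.uk (note that this is ftp, not fasp).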

ascp -v -P33001 -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -k 1 -QT -l 200m era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/SRR949/SRR949627/SRR949627_1.fastq.gz ./

Notes:

era-fasp@fasp.sra.ebi.ac.uk must be followed by ":"
The file paths are likewise regular, which makes batch downloads easy to script
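
As a rough sketch of that regularity, the function below builds the ENA path from a run accession. The nine-character case matches the SRR949627 example above. The extra zero-padded subdirectory for longer accessions is not covered in this post; it follows ENA's published layout as I understand it, so compare against a real file address copied from ENA before relying on it. ena_fastq_path is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# ena_fastq_path: hypothetical helper.
# $1 = run accession, $2 = optional mate suffix (_1 or _2; empty for single-end).
ena_fastq_path() {
    local acc="$1" mate="${2:-}"
    local sub
    case "${#acc}" in
        9)  sub="" ;;                   # e.g. SRR949627: no extra directory
        10) sub="00${acc: -1}/" ;;      # assumption: 00 + last digit
        11) sub="0${acc: -2}/" ;;       # assumption: 0 + last two digits
        12) sub="${acc: -3}/" ;;        # assumption: last three digits
        *)  echo "unexpected accession: $acc" >&2; return 1 ;;
    esac
    echo "/vol1/fastq/${acc:0:6}/${sub}${acc}/${acc}${mate}.fastq.gz"
}
```

The result slots into the command shown earlier, for instance `era-fasp@fasp.sra.ebi.ac.uk:$(ena_fastq_path SRR949627 _1)`.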

2. The prefetch command from SRA toolkit (not really recommended, except for well-funded labs)

Downloading a single file:

prefetch <SRA accession>

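Batch download:
Open the page that lists the SRA files you want, usually a PRJ accession page, then choose Send to > File in the upper right corner, select Accession List as the format, and save it to a file (SraAccList.txt by default). Then download with: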

prefetch --option-file SraAccList.txt
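
If you would rather fetch and convert run by run, so that one failed accession does not hold up the rest, a loop over the same list works too. This is only a sketch: sra_plan is a hypothetical helper that prints the commands instead of running them, and the fastq-dump options shown (--split-files, --gzip) are my choice for paired-end data, not something the list export requires.

```shell
#!/usr/bin/env bash
# sra_plan: hypothetical helper.
# Reads accessions from stdin and prints a prefetch + fastq-dump pair for each.
sra_plan() {
    local acc
    while read -r acc || [ -n "$acc" ]; do
        acc="${acc%$'\r'}"                        # tolerate Windows line endings
        case "$acc" in ''|'#'*) continue ;; esac  # skip blanks and comments
        echo "prefetch $acc"
        echo "fastq-dump --split-files --gzip $acc"
    done
}
```

Run `sra_plan < SraAccList.txt` to review the plan, then append `| bash` to execute it.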

3. Linux wget or curl (not recommended)

First locate where the NGS data are stored, usually on the NCBI or ENA FTP server, then download with wget or curl.

wget -c ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR155/SRR1553608/SRR1553608.sra
# -c resumes an interrupted download

curl -O ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR155/SRR1553608/SRR1553608.sra

The download speed of these two will make you question your life choices; well-funded labs and institutes excepted, of course.
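
If you are stuck with wget or curl anyway, it is at least worth making the transfer resumable and letting the tool retry by itself. A minimal sketch follows; resumable_cmd is a hypothetical helper that only prints the command, and the retry counts and wait time are arbitrary values I picked for illustration.

```shell
#!/usr/bin/env bash
# resumable_cmd: hypothetical helper.
# $1 = wget or curl, $2 = URL. Prints a resumable, retrying download command.
resumable_cmd() {
    local tool="$1" url="$2"
    case "$tool" in
        # wget: -c resumes, --tries caps attempts, --waitretry backs off between them
        wget) echo "wget -c --tries=5 --waitretry=30 $url" ;;
        # curl: -C - resumes from where the local file ends, --retry retries transient errors
        curl) echo "curl -C - --retry 5 -O $url" ;;
        *)    echo "unknown tool: $tool" >&2; return 1 ;;
    esac
}
```

For example, `resumable_cmd wget <url>` prints the wget form; pipe it to bash once the URL looks right.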
