Cancer-Related Genomic Big Data Repositories

2021-03-15 所以suoyi

A very brief introduction to some databases.
I. High-throughput sequencing data
II. Processed genomic data
III. Cancer Terms

A Primer for Access to Repositories of Cancer-Related Genomic Big Data
https://link.springer.com/protocol/10.1007/978-1-4939-8868-6_1
DOI: 10.1007/978-1-4939-8868-6_1
Date: 2019-01-01

I. High-throughput sequencing data

Database (High-throughput sequencing data)	URL
TCGA (The Cancer Genome Atlas)	http://cancergenome.nih.gov/
TARGET (Therapeutically Applicable Research to Generate Effective Treatments)	https://ocg.cancer.gov/programs/target/
CCLE (Cancer Cell Line Encyclopedia)	https://portals.broadinstitute.org/ccle
ICGC (International Cancer Genome Consortium)	https://icgc.org/
SRA (Sequence Read Archive)	http://www.ncbi.nlm.nih.gov/sra

(I) TCGA

Data download: via the NCI Genomic Data Commons (GDC), https://portal.gdc.cancer.gov/

[Figure: GDC data]

[Figure: data category]
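Download methods:

[Figure: download]
1. Direct download: add the files to the cart, open the cart, and click Download --> Cart
2. Download the data with gdc-client:
(1) Click Download --> Manifest
(2) Download gdc-client and unzip it

[Figure: gdc-client]
(3) Add it to the PATH environment variable
Control Panel ----> System and Security ----> System ----> Advanced system settings ----> Environment Variables ----> Path ----> Edit ----> enter the path to the Windows gdc-client ----> OK to save
(4) Open a cmd window (shortcut Win + R, then type cmd):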

gdc-client.exe -h
gdc-client download -m gdc_manifest.xxxx-xx-xx.txt
## Adjust the path to manifest.txt as needed

(5) Linux/Mac OS:

gdc-client download --help
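
Before starting a large download, it can help to check how many files and how many bytes a manifest covers. The following is a minimal Python sketch and not part of the original workflow. It assumes the manifest passed to `gdc-client download -m` is a tab-separated file whose header row includes `id`, `filename`, `md5`, `size`, and `state` columns (check your own file), and `summarize_manifest` is a hypothetical helper name. The example rows are made up.

```python
import csv
import io


def summarize_manifest(text):
    """Return (file_count, total_bytes) for a GDC-style manifest given as text.

    Assumes a tab-separated table with a header row that includes a
    'size' column holding the file size in bytes.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    count = 0
    total = 0
    for row in reader:
        count += 1
        total += int(row["size"])
    return count, total


# Made-up example rows, only to show the assumed layout.
example = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\ta.txt\tabc\t1024\treleased\n"
    "00000000-0000-0000-0000-000000000002\tb.txt\tdef\t2048\treleased\n"
)

n_files, n_bytes = summarize_manifest(example)
print(f"{n_files} files, {n_bytes} bytes")
```

For a real manifest, read the file into a string first and pass its contents to the function.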

[Figure: TCGA cancer types]

(II) TARGET

TARGET applies comprehensive genomic approaches to identify the molecular changes that drive childhood cancers.
Data access: https://ocg.cancer.gov/programs/target/data-matrix

[Figure: TARGET]
The data cover:
(1) Acute Lymphoblastic Leukemia (ALL)
(2) Acute Myeloid Leukemia (AML)
(3) Kidney Tumors
(4) Neuroblastoma (NBL)
(5) Osteosarcoma (OS)
(6) Pan-cancer Model Systems (MDLS)
[TARGET project experimental methods]

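(III) CCLE

CCLE aims to carry out detailed genetic and pharmacologic characterization of a large panel of human cancer models, to develop integrated computational analyses that link distinct pharmacologic vulnerabilities to genomic patterns, and to translate cell line integrative genomics into cancer patient stratification. CCLE provides public access to genomic data, analysis, and visualization for more than 1,100 cell lines.
Registration and login are required.

[Figure: CCLE]

[Figure: statistics]
[How to use]

[Figure: data_download]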

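(IV) ICGC

ICGC studies 50 different tumor types or subtypes and comprehensively characterizes their genomic, transcriptomic, and epigenomic changes.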
https://dcc.icgc.org/releases

https://dcc.icgc.org/repositories

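[Figure: supposedly what the page looks like]

Hmm, I could not get in.
The icgc-get tool is designed to remove the need to use different software for different repositories (Mac OS and Linux).
(1) Download and install the client.
(2) Run icgc-get configure to set up the environment.
(3) Run icgc-get check to make sure the credentials are correct.
(4) Generate a manifest ID through the repository browser: after selecting the files of interest on the web page, click the download files button to obtain the manifest.
(5) Run icgc-get report to review your request before downloading.
(6) Run icgc-get download -m <manifest-id> to download the files in the manifest.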

# Download and unzip
curl https://dcc.icgc.org/api/v1/ui/software/icgc-get/linux/latest \
     -o icgc-get.latest.zip -L
unzip icgc-get.latest.zip

# Configure: you will be prompted for the output directory, the log location, and the repositories to download from (separate multiple repositories with spaces)
./icgc-get configure
# Install Docker
sudo apt-get update
sudo apt install docker.io
# Check that the configuration works
./icgc-get check
# Review the data you want to download
./icgc-get report -m <icgc-get manifest ID>
# Download
./icgc-get download -m b0100005-f125-4472-81ec-fcb40d139f91
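
When the report and download steps are scripted, a mistyped manifest ID is an easy mistake to make. The sketch below is only an illustration in Python. It assumes manifest IDs are UUID-shaped, as the example ID above is, and that the executable sits in the current directory. `build_icgc_get_commands` is a hypothetical helper that only builds the argument lists and does not run anything.

```python
import uuid


def build_icgc_get_commands(manifest_id, executable="./icgc-get"):
    """Return the report and download command lines for one manifest ID.

    Raises ValueError if manifest_id is not a well-formed UUID string
    (an assumption based on the example ID, not a documented rule).
    """
    # uuid.UUID raises ValueError for malformed input.
    normalized = str(uuid.UUID(manifest_id))
    return [
        [executable, "report", "-m", normalized],
        [executable, "download", "-m", normalized],
    ]


commands = build_icgc_get_commands("b0100005-f125-4472-81ec-fcb40d139f91")
for command in commands:
    print(" ".join(command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` once the printed commands look right.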

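(V) SRA

SRA holds short-read sequences covering many different diseases and species, including a large number of cancer-related samples.
SRA Toolkit: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
Once the software is installed, public-access data can be downloaded from a terminal or command line. Data are usually retrieved in FASTQ format:

fastq-dump --split-files <SRA ID Number>
# To download controlled-access data, request access through the dbGaP system. You will receive a dbGaP repository key (a ".ngc" file) for accessing the restricted data.
# Run in the installation directory
vdb-config -i  # opens an interactive screen for configuring the toolkit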
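
For more than a handful of runs, the fastq-dump call is usually generated rather than typed. The following Python sketch is one possible way to do that. `fastq_dump_commands` is a hypothetical helper, the accession in the example is a placeholder rather than a real run, and the check only covers run accessions with the SRR, ERR, or DRR prefix.

```python
import re

# Run accessions start with SRR, ERR, or DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def fastq_dump_commands(accessions):
    """Build one 'fastq-dump --split-files' command per valid run accession."""
    commands = []
    for acc in accessions:
        acc = acc.strip()
        if not RUN_ACCESSION.match(acc):
            raise ValueError(f"not a run accession: {acc!r}")
        commands.append(["fastq-dump", "--split-files", acc])
    return commands


# Placeholder accession, used only to show the shape of the result.
for command in fastq_dump_commands(["SRR0000001"]):
    print(" ".join(command))
```

Rejecting anything that is not a run accession up front avoids launching a long batch that fails partway through on a sample or study ID.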

II. Processed genomic data

Database (Processed genomic data)	URL
COSMIC (Catalogue of Somatic Mutations in Cancer)	http://cancer.sanger.ac.uk/cosmic
BioMuta	https://hive.biochemistry.gwu.edu/tools/biomuta/index.php
BioXpress	https://hive.biochemistry.gwu.edu/tools/bioxpress/index.php
ClinVar	http://www.ncbi.nlm.nih.gov/clinvar/
LNCipedia	http://www.lncipedia.org/
IntOGen (Integrative Onco Genomics)	https://www.intogen.org/
European Genome-phenome Archive	https://www.ebi.ac.uk/ega/
UniProt	http://www.uniprot.org/
cBioPortal	http://www.cbioportal.org/

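(I) COSMIC

COSMIC is a repository of somatic mutation information related to human cancers. It contains data from expert curation and from genome-wide screening analyses.

[Figure: COSMIC]

[Figure: tool]

Scroll down the page to find the sample download.
File download: https://cancer.sanger.ac.uk/cosmic/download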

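(II) BioMuta

BioMuta is a single-nucleotide variation (SNV) database that contains information mined from the literature as well as from various other cancer mutation databases.

[Figure: BioMuta]

The default search is gene-centric: a gene name is used as the query to retrieve the associated nsSNVs in cancer.

[Figure: ATP7B]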

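(III) BioXpress

BioXpress is a database of human gene and miRNA expression in cancer samples. It includes data from TCGA and ICGC, as well as evidence from manual and semi-automated literature mining.

[Figure: BioXpress]

[Figure: KRAS]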

(IV) ClinVar

Example: c.1240C>G
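
In HGVS notation, c.1240C>G describes a substitution at coding-DNA position 1240 in which C is replaced by G. As a rough illustration, the Python sketch below pulls those three parts out of such a string. `parse_simple_substitution` is a hypothetical helper, and it deliberately handles only simple single-base substitutions, not deletions, insertions, or intronic offsets.

```python
import re

# Matches only simple coding-DNA substitutions such as "c.1240C>G".
SIMPLE_SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")


def parse_simple_substitution(hgvs):
    """Return (position, ref, alt) for a simple c. substitution, else None."""
    match = SIMPLE_SUBSTITUTION.match(hgvs)
    if match is None:
        return None
    position, ref, alt = match.groups()
    return int(position), ref, alt


print(parse_simple_substitution("c.1240C>G"))
```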
1. XML format: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/
2. VCF format:
https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar_20210828.vcf.gz
https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20210828.vcf.gz (the latest at the time of writing)

##fileformat=VCFv4.1
##fileDate=2020-12-26
##source=ClinVar
##reference=GRCh38
##ID=<Description="ClinVar Variation ID">
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=AF_EXAC,Number=1,Type=Float,Description="allele frequencies from ExAC">
##INFO=<ID=AF_TGP,Number=1,Type=Float,Description="allele frequencies from TGP">
##INFO=<ID=ALLELEID,Number=1,Type=Integer,Description="the ClinVar Allele ID">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDNINCL,Number=.,Type=String,Description="For included Variant : ClinVar's preferred disease name for the concept specified by disease i
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNDISDBINCL,Number=.,Type=String,Description="For included Variant: Tag-value pairs of disease database name and identifier, e.g. OMIM:NN
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Top-level (primary assembly, alt, or patch) HGVS expression.">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar review status for the Variation ID">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Clinical significance for this single variant">
##INFO=<ID=CLNSIGCONF,Number=.,Type=String,Description="Conflicting clinical significance for this single variant">
##INFO=<ID=CLNSIGINCL,Number=.,Type=String,Description="Clinical significance for a haplotype or genotype that includes this variant. Reported as pai
##INFO=<ID=CLNVC,Number=1,Type=String,Description="Variant type">
##INFO=<ID=CLNVCSO,Number=1,Type=String,Description="Sequence Ontology id for variant type">
##INFO=<ID=CLNVI,Number=.,Type=String,Description="the variant's clinical sources reported as tag-value pairs of database and variant identifier">
##INFO=<ID=DBVARID,Number=.,Type=String,Description="nsv accessions from dbVar for the variant">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Gene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited b
##INFO=<ID=MC,Number=.,Type=String,Description="comma separated list of molecular consequence in the form of Sequence Ontology ID|molecular_consequen
##INFO=<ID=ORIGIN,Number=.,Type=String,Description="Allele origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - s
##INFO=<ID=RS,Number=.,Type=String,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes. One or more of the following values may be added: 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       930188  846933  G       A       .       .       ALLELEID=824438;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000001.11:g.930188G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=SAMD11:148398;MC=SO:0001583|missense_variant;ORIGIN=1
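
Most of the useful ClinVar annotation sits in the INFO column as semicolon-separated key=value pairs. The Python sketch below shows one way to split a data line into its fixed columns and an INFO dictionary. `parse_info` and `parse_record` are hypothetical helper names, the sample line is a shortened copy of the record shown above, and real files should be read as tab-separated text.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_record(line):
    """Parse one tab-separated, 8-column VCF data line into a dict."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt,
        "info": parse_info(info),
    }


# Shortened version of the record shown above.
line = "\t".join([
    "1", "930188", "846933", "G", "A", ".", ".",
    "ALLELEID=824438;CLNSIG=Uncertain_significance;"
    "GENEINFO=SAMD11:148398;MC=SO:0001583|missense_variant;ORIGIN=1",
])

record = parse_record(line)
# GENEINFO is "symbol:id"; a record with several genes would need a further split.
gene_symbol, gene_id = record["info"]["GENEINFO"].split(":")
print(record["pos"], gene_symbol, record["info"]["CLNSIG"])
```

Keeping INFO as a plain dictionary makes it easy to filter records, for example by CLNSIG or ORIGIN, without pulling in a VCF library.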

3. Variant or gene data in TSV format: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/
4. Disease names and gene-disease relationships in TSV format

(V) LNCipedia

LNCipedia is a database of annotated human long non-coding RNAs (lncRNAs).

[Figure: LNCipedia]

[Figure: format]

[Figure: download]

(VI) IntOGen

This database contains information on genes that have been identified as carrying driver mutations.

[Figure: IntOGen]

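(VII) EGA (European Genome-phenome Archive)

The European Genome-phenome Archive (EGA) permanently archives and shares all types of personally identifiable genetic and phenotypic data generated by biomedical research projects. It provides data from a large number of genetic studies and serves as a clearinghouse for data from many different sources.
Client download (EgaDemoClient): https://www.ebi.ac.uk/ega/sites/ebi.ac.uk.ega/files/documents/EGA_download_streamer_1.1.5.zip (the page no longer exists)

# Once the client has been downloaded and extracted, run it from a terminal or command line:
java -jar EgaDemoClient.jar
# Log in
EGA> login my@email.com mypassword
# Request type (e.g. dataset), dataset ID, encryption key, and request label
EGA> request dataset EGAD00010000650 mykey.key request_EGAD00010000650
# Download
EGA> download request_EGAD00010000650
# Decrypt the downloaded files
EGA> decrypt <filename> <encryption key>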

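(VIII) UniProt

UniProt holds a large amount of protein-centric information that is very useful for cancer research, although the database itself is not cancer-specific.
1. UniProtKB/Swiss-Prot: manually annotated and reviewed protein sequence database
2. UniProtKB/TrEMBL: computationally annotated protein sequence database
3. UniRef: reference sets of protein clusters that hide redundant sequences, commonly used as reference datasets
4. UniParc: non-redundant archive containing almost all known protein sequences; once a sequence has been integrated, it is never changed or modified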
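
In UniProtKB FASTA downloads, the header of each sequence records whether the entry comes from the reviewed Swiss-Prot section (`sp`) or the unreviewed TrEMBL section (`tr`). The Python sketch below separates the two. `parse_uniprot_header` is a hypothetical helper, and the KRAS header is an illustrative example of the format, so verify it against a current download.

```python
def parse_uniprot_header(header):
    """Parse a UniProtKB FASTA header of the form '>db|ACCESSION|ENTRY_NAME ...'.

    Returns a dict with the accession, entry name, and whether the entry
    is reviewed (Swiss-Prot, 'sp') or unreviewed (TrEMBL, 'tr').
    """
    if not header.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    identifier = header[1:].split(None, 1)[0]
    db, accession, entry_name = identifier.split("|")
    if db not in ("sp", "tr"):
        raise ValueError(f"unexpected database code: {db!r}")
    return {
        "accession": accession,
        "entry_name": entry_name,
        "reviewed": db == "sp",
    }


# Illustrative header in the UniProtKB FASTA format.
header = ">sp|P01116|RASK_HUMAN GTPase KRas OS=Homo sapiens OX=9606 GN=KRAS PE=1 SV=1"
entry = parse_uniprot_header(header)
print(entry["accession"], entry["entry_name"], entry["reviewed"])
```

Filtering on the `reviewed` flag is a simple way to restrict an analysis to manually curated entries.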