
Structural Variation Pipeline

2021-08-11  期待未来

Pipeline Overview

Reposted from GitHub: https://github.com/broadinstitute/gatk-sv

The pipeline consists of a series of modules that perform the following:

gCNV Training

Both the cohort and single-sample modes use the GATK gCNV depth calling pipeline, which requires a trained model as input. The samples used for training should be technically homogeneous and similar to the samples to be processed (i.e. same sample type, library prep protocol, sequencer, sequencing center, etc.). The samples to be processed may comprise all or a subset of the training set. For small cohorts, a single gCNV model is usually sufficient. If a cohort contains multiple data sources, we recommend clustering them using the dosage score, and training a separate model for each cluster.
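
As a rough illustration of the clustering recommendation, the sketch below groups samples whose dosage scores lie close together, so that each group could get its own gCNV model. The function name, the input dictionary, and the `max_gap` threshold are all hypothetical; the pipeline does not prescribe this particular method, and a real cohort may need a proper clustering algorithm.

```python
from typing import Dict, List


def cluster_by_dosage(scores: Dict[str, float], max_gap: float = 0.05) -> List[List[str]]:
    """Group samples by 1-D gap splitting on their dosage scores.

    `max_gap` is an illustrative threshold (an assumption), not a pipeline default.
    """
    # Sort by score, breaking ties by sample ID so the result is deterministic.
    ordered = sorted(scores, key=lambda s: (scores[s], s))
    clusters: List[List[str]] = []
    for sample in ordered:
        # Extend the current cluster while the step from the previous sample is small.
        if clusters and scores[sample] - scores[clusters[-1][-1]] <= max_gap:
            clusters[-1].append(sample)
        else:
            clusters.append([sample])
    return clusters
```

Each returned list would then serve as the training set for one model, provided it contains enough technically homogeneous samples.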

Module Descriptions

The following sections briefly describe each module and highlight inter-dependent input/output files. Note that input/output mappings can also be gleaned from GATKSVPipelineBatch.wdl, and example input files for each module can be found in /test.

Module 00a

Runs raw evidence collection on each sample.

Note: a list of sample IDs must be provided. Refer to the sample ID requirements for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.
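
Checking IDs before launching the workflow can catch problems early. The sketch below is a minimal pre-flight check that assumes, purely for illustration, that IDs must be unique and consist only of letters, digits, and underscores. The function name and the default pattern are assumptions; the authoritative rules are the sample ID requirements referred to above, and the pattern should be adjusted to match them.

```python
import re
from typing import Dict, List


def check_sample_ids(sample_ids: List[str], pattern: str = r"[A-Za-z0-9_]+") -> Dict[str, List[str]]:
    """Report duplicated IDs and IDs that do not fully match `pattern`.

    The default pattern is an assumption for illustration only.
    """
    allowed = re.compile(pattern)
    seen = set()
    problems: Dict[str, List[str]] = {"duplicate": [], "bad_characters": []}
    for sid in sample_ids:
        if sid in seen:
            problems["duplicate"].append(sid)
        seen.add(sid)
        # fullmatch requires the whole ID (not just a prefix) to fit the pattern.
        if not allowed.fullmatch(sid):
            problems["bad_characters"].append(sid)
    return problems
```

An empty list under both keys means the IDs passed these two checks; it does not guarantee that they satisfy every requirement of the pipeline.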

Inputs:

Outputs:

Module 00b

Runs ploidy estimation, dosage scoring, and optionally VCF QC. The results from this module can be used for QC and batching.

For large cohorts, we recommend dividing samples into smaller batches (~500 samples) with ~1:1 male:female ratio. Refer to the Batching section for further guidance on creating batches.
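
The batching recommendation can be sketched as follows. This is a minimal example that balances only batch size and sex ratio; the function name and the `"male"`/`"female"` labels are assumptions, and the Batching section may call for additional criteria that this sketch ignores.

```python
import math
from typing import Dict, List


def make_batches(sex_by_sample: Dict[str, str], target_size: int = 500) -> List[List[str]]:
    """Split samples into near-equal batches with a similar male:female mix."""
    males = sorted(s for s, sex in sex_by_sample.items() if sex == "male")
    females = sorted(s for s, sex in sex_by_sample.items() if sex == "female")
    others = sorted(s for s, sex in sex_by_sample.items() if sex not in ("male", "female"))
    # Use the smallest number of batches that keeps each batch at or below target_size.
    n_batches = max(1, math.ceil(len(sex_by_sample) / target_size))
    batches: List[List[str]] = [[] for _ in range(n_batches)]
    # Deal samples round-robin, one sex group after another, so that each batch
    # receives an almost equal share of every group and batch sizes differ by at most one.
    for i, sample in enumerate(males + females + others):
        batches[i % n_batches].append(sample)
    return batches
```

For a cohort of 1,200 samples with equal numbers of males and females, this yields three batches of 400 samples, each with 200 males and 200 females.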

We also recommend using sex assignments generated from the ploidy estimates and incorporating them into the PED file.
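
One way to carry the sex assignments into the PED file is sketched below. It assumes a tab-delimited PED file with the standard six columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype; sex coded 1 = male, 2 = female, 0 = unknown) and assumes the ploidy estimates are already available as integer chrX/chrY copy numbers per sample. The function names and the input dictionary are hypothetical and do not reflect the actual output format of Module 00b.

```python
from typing import Dict, List, Tuple


def sex_code(chr_x_copies: int, chr_y_copies: int) -> int:
    """Map estimated sex-chromosome copy numbers to a PED sex code."""
    if chr_x_copies == 1 and chr_y_copies == 1:
        return 1  # male
    if chr_x_copies == 2 and chr_y_copies == 0:
        return 2  # female
    return 0  # any other combination is left as unknown for manual review


def update_ped_sex(ped_lines: List[str], ploidy: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return PED lines with column 5 (sex) rewritten for samples present in `ploidy`."""
    updated = []
    for line in ped_lines:
        line = line.rstrip("\n")
        # Pass comment and blank lines through unchanged.
        if line.startswith("#") or not line.strip():
            updated.append(line)
            continue
        fields = line.split("\t")
        sample_id = fields[1]
        if sample_id in ploidy:
            fields[4] = str(sex_code(*ploidy[sample_id]))
        updated.append("\t".join(fields))
    return updated
```

Samples missing from the `ploidy` dictionary keep their original sex code, so the function can be applied to a partially processed cohort.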

Prerequisites:

Inputs:

Outputs:

Preliminary Sample QC

The purpose of sample filtering at this stage, after Module 00b, is to prevent very poor-quality samples from interfering with the results for the rest of the callset. In general, borderline samples are okay to leave in, but you should choose filtering thresholds to suit the needs of your cohort and study. There will be future opportunities (as part of Module 03) for filtering before the joint genotyping stage if necessary. Here are a few of the basic QC checks that we recommend:

gCNV Training

Trains a gCNV model for use in Module 00c. The WDL can be found at /gcnv/trainGCNV.wdl.

Prerequisites:

Inputs:

Outputs:

Module 00c

Runs CNV callers (cnMOPs, GATK gCNV) and combines single-sample raw evidence into a batch. See above for more information on batching.

Prerequisites:

Inputs:

Outputs:

Module 01

Clusters SV calls across a batch.

Prerequisites:

Inputs:

Outputs:

Module 02

Generates variant metrics for filtering.

Prerequisites:

Inputs:

Outputs:

Module 03

Filters poor quality variants and filters outlier samples.

Prerequisites:

Inputs:

Outputs:

Merge Cohort VCFs

Combines filtered variants across batches. The WDL can be found at: /wdl/MergeCohortVcfs.wdl.

Prerequisites:

Inputs:

Outputs:

Module 04

Genotypes a batch of samples across unfiltered variants combined across all batches.

Prerequisites:

Inputs:

Outputs:

Module 04b

Re-genotypes probable mosaic variants across multiple batches.

Prerequisites:

Inputs:

Outputs:

Module 05/06

Combines variants across multiple batches, resolves complex variants, re-genotypes, and performs final VCF clean-up.

Prerequisites:

Inputs:

Outputs:

Module 07 (in development)

Applies downstream filtering steps to the cleaned VCF to further control the false discovery rate. All steps are optional, and users should decide which to apply based on the specific purpose of their projects.

Filtering methods include:

gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.10perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.1perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.5perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
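
The three lookup tables above differ only in the false discovery rate encoded in the file name (1, 5, or 10 percent). The helper below is a small convenience sketch for selecting one of them; the function and constant names are hypothetical, and it only reproduces the three paths listed here without checking that the files exist.

```python
MINGQ_TABLE_DIR = "gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq"
AVAILABLE_FDR_PERCENTS = (1, 5, 10)  # only the FDR levels listed in this section


def mingq_lookup_table(fdr_percent: int) -> str:
    """Return the path of the PCRMINUS minGQ lookup table for a listed FDR level."""
    if fdr_percent not in AVAILABLE_FDR_PERCENTS:
        raise ValueError(
            f"No lookup table listed for {fdr_percent}% FDR; choose one of {AVAILABLE_FDR_PERCENTS}"
        )
    return (
        f"{MINGQ_TABLE_DIR}/1KGP_2504_and_698_with_GIAB."
        f"{fdr_percent}perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt"
    )
```

A stricter FDR (for example 1 percent) is the more conservative choice; which level suits a project depends on its tolerance for false positives.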

Module 08 (in development)

Adds annotations, such as the inferred function and allele frequencies of variants, to the final VCF.

Annotation methods include:

Module 09 (in development)

Visualizes SVs with IGV screenshots and read depth plots.

Visualization methods include:

Reference: A structural variation reference for medical and population genetics
