使用 EXCAVATOR2 对WES数据找CNV

2019-04-29 本文已影响7人因地制宜的生信达人

使用 EXCAVATOR2 对WES数据找CNV

工具首发于2013，于2016进行了重大更新，文章列表：

cd ~/biosoft
# https://sourceforge.net/projects/excavator2tool/?source=navbar
mkdir EXCAVATOR2 &&  cd EXCAVATOR2 
wget https://sourceforge.net/projects/excavator2tool/files/EXCAVATOR2_Package_v1.1.2.tgz
tar zxvf EXCAVATOR2_Package_v1.1.2.tgz 
# 软件400多M,里面有个pdf说明书。

说明书实在是太复杂了。软件只是是一个压缩包，解压即可使用，里面自带了perl,r,shell脚本，比较方便使用，而比较麻烦的是需要系统有Hmisc这个R包。

> library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2
Use suppressPackageStartupMessages() to eliminate package startup
messages.

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:base’:

    format.pval, round.POSIXt, trunc.POSIXt, units

>

有趣的是该软件需要R去编译两个fortran文件。

软件解析

软件也是3个步骤：

TargetPerla.pl
EXCAVATORDataPrepare.pl
EXCAVATORDataAnalysis.pl

第一个步骤是 TargetPerla.pl，处理一下参考基因组以及外显子坐标问题，需要五个参数：

the path to a source file (e.g. SourceTarget.txt), 是软件的配置文件：，
the path to the target input file，就是BED格式的坐标文件，需要前3列坐标
a “target name”,
the window size (i.e. 10000, 20000 or 500000)
the assembly (allowed options are: hg19 and hg38).

注意BED文件需要 sort -k1,1 -k2,2n *.bed | bedtools merge
作者给的例子是：perl TargetPerla.pl SourceTarget.txt myTarget.bed MyTarget_w50K 50000 hg19
软件本身也默认给了一些数据：

data/
├── [  74]  centromere
│   ├── [ 592]  CentromerePosition_hg19.txt
│   └── [ 592]  CentromerePosition_hg38.txt
├── [237M]  GCA_000001405.15_GRCh38.bw
├── [  28]  support
│   ├── [  65]  hg19
│   │   ├── [ 446]  ChromosomeCoordinate_HG19.txt
│   │   └── [ 23K]  GapHg19.UCSC.txt
│   └── [  65]  hg38
│       ├── [ 508]  ChromosomeCoordinate_HG38.txt
│       └── [ 43K]  GapHg38.UCSC.txt
├── [  28]  targets
│   ├── [   6]  hg19
│   └── [   6]  hg38
└── [216M]  ucsc.hg19.bw

其中附带的GCA_000001405.15_GRCh38.bw是来自于：ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz 最新版是：GenBank assembly accession: GCA_000001405.27 (latest).

一般来说，我们会有自己的参考基因组，作者推荐用GEM suite (http://gemlibrary.sourceforge.net/),来把自己的参考基因组转换成bw文件。

然后再走第一步，会产生一个文件夹给后续分析使用。

第二个步骤是：EXCAVATORDataPrepare.pl 对自己的测序bam文件进行一定的计算处理

作者给的示例代码是：

perl EXCAVATORDataPrepare.pl ExperimentalFilePrepare.w50000.txt \
--processors 6 --target MyTarget_w50000 --assembly hg19

其中 --target 参数是第一步的结果文件夹。
而ExperimentalFilePrepare.w50000.txt这个就是配置文件，包含3列，分别是bam文件的全路径，以及每个样本的输出结果文件夹，以及样本名。

第三个步骤是： EXCAVATORDataAnalysis.pl 判断CNV状态

主要分成5种CNV状态： 2-copy deletion, 1-copy deletion, normal, 1-copy duplication and N-copy amplification).
作者给的示例代码是：

perl EXCAVATORDataAnalysis.pl ExperimentalFileAnalysis.w50K.txt \
--processors 6 --target MyTarget_w50K --assembly hg19 \
--output /.../OutEXCAVATOR2/Results_MyProject_w50K

还是需要自己手动制作配置文件，一般是配对肿瘤外显子数据找cnv，所以需要在配置文件的第一列指定每个样本属于T,还是C，然后是第几个样本。
参加教程的 Figure 3: A typical well-formatted input file for EXCAVATORDataAnalysis.pl module and “paired” mode.

可能会需要修改软件运行参数，修改的前提是真正理解它们了。

## Omega parameter for the HSLM algorithm ##
0.1
## Theta parameter (baseline probability m_i changes its value) for the HSLM algorithm ##
1e-5
## D_norm parameter for the HSLM algorithm ##
10e5
## Cellularity parameter for the FastCall Calling algorithm ##
1
## Threshold d for the truncated gaussian distribution of the FastCall Calling algorithm ##
0.5
## Threshold u for the truncated gaussian distribution of the FastCall Calling algorithm ##
0.35
## Segment with a number of exons smaller than a threshold are filtered out ##

实战

未完待续

使用 EXCAVATOR2 对WES数据找CNV

使用 EXCAVATOR2 对WES数据找CNV

软件解析

第一个步骤是 TargetPerla.pl，处理一下参考基因组以及外显子坐标问题，需要五个参数：

第二个步骤是：EXCAVATORDataPrepare.pl 对自己的测序bam文件进行一定的计算处理

第三个步骤是： EXCAVATORDataAnalysis.pl 判断CNV状态

实战

猜你喜欢

热点阅读

使用 EXCAVATOR2 对WES数据找CNV

使用 EXCAVATOR2 对WES数据找CNV

软件解析

第一个步骤是 TargetPerla.pl， 处理一下参考基因组以及外显子坐标问题，需要五个参数：

第二个步骤是：EXCAVATORDataPrepare.pl 对自己的测序bam文件进行一定的计算处理

第三个步骤是： EXCAVATORDataAnalysis.pl 判断CNV状态

实战

猜你喜欢

热点阅读

第一个步骤是 TargetPerla.pl，处理一下参考基因组以及外显子坐标问题，需要五个参数：