diploS/HIC鉴定选择信号
diploS / HIC
使用深度卷积神经网络来识别种群基因组数据中的硬选择和软选择区域。
使用diploS / HIC
进行分析的工作流程包括四个基本部分。
1)使用模拟为diploS / HIC
生成训练集。
2)diploS / HIC
培训和绩效评估。
3)从基因组数据计算dipoS / HIC
特征向量。
4)使用训练有素的网络对经验数据进行预测。
此处提供的软件可以处理最后三个部分。 人口遗传模拟必须使用诸如discoal之类的单独软件进行。
1. 安装
使用conda安装scikit-allel,这将为我们提供基础包(numpy,scipy,scikit-learn等)
conda install -c conda-forge scikit-allel
tensorflow
和keras
是用于处理事物的深度学习部分的库
pip install tensorflow
pip install keras
现在准备安装diploS / HIC
git clone https://github.com/kern-lab/diploSHIC.git
cd diploSHIC
python setup.py install
2. 软件使用
将要连接的主程序是diploSHIC.py
。 该脚本具有四个运行模式,这些模式允许用户执行监督的机器学习过程中的每个主要步骤。
diploSHIC.py
使用python
中的argparse
模块尝试提供完整的基于命令行的帮助菜单。
(1)使用方法
$ python diploSHIC.py -h
usage: diploSHIC.py [-h] {train,predict,fvecSim,fvecVcf} ...
calculate feature vectors, train, or predict with diploSHIC
possible modes (enter 'python diploSHIC.py modeName -h' for modeName's help message:
{fvecSim,makeTrainingSets,train,fvecVcf,predict}
sub-command help
fvecSim Generate feature vectors from simulated data
makeTrainingSets Combine feature vectors from muliple fvecSim runs into
5 balanced training sets
train train and test a shic CNN
fvecVcf Generate feature vectors from data in a VCF file
predict perform prediction using an already-trained SHIC CNN
optional arguments:
-h, --help show this help message and exit
(2)准备模拟训练/测试数据
所有版本的S / HIC都需要模拟数据来进行训练,该软件提供了一个脚本generateSimLaunchScript.py,该脚本演示了如何模拟训练集。
(3)特征向量生成模型
- fvecSim mode
fvecSim运行模式用于将ms样式的输出转换为与diploSHIC.py兼容的特征向量。
$ python diploSHIC.py fvecSim -h
usage: diploSHIC.py fvecSim [-h] [--totalPhysLen TOTALPHYSLEN]
[--numSubWins NUMSUBWINS]
[--maskFileName MASKFILENAME]
[--chrArmsForMasking CHRARMSFORMASKING]
[--unmaskedFracCutoff UNMASKEDFRACCUTOFF]
[--outStatsDir OUTSTATSDIR]
[--ancFileName ANCFILENAME] [--pMisPol PMISPOL]
shicMode msOutFile fvecFileName
required arguments:
shicMode specifies whether to use original haploid SHIC (use
'haploid') or diploSHIC ('diploid')
msOutFile path to simulation output file (must be same format
used by Hudson's ms)
fvecFileName path to file where feature vectors will be written
optional arguments:
-h, --help show this help message and exit
--totalPhysLen TOTALPHYSLEN
Length of simulated chromosome for converting infinite
sites ms output to finite sites (default=1100000)
--numSubWins NUMSUBWINS
The number of subwindows that our chromosome will be
divided into (default=11)
--maskFileName MASKFILENAME
Path to a fasta-formatted file that contains masking
information (marked by 'N'). If specified, simulations
will be masked in a manner mirroring windows drawn
from this file.
--chrArmsForMasking CHRARMSFORMASKING
A comma-separated list (no spaces) of chromosome arms
from which we want to draw masking information (or
'all' if we want to use all arms. Ignored if
maskFileName is not specified.
--unmaskedFracCutoff UNMASKEDFRACCUTOFF
Minimum fraction of unmasked sites, if masking
simulated data
--outStatsDir OUTSTATSDIR
Path to a directory where values of each statistic in
each subwindow are recorded for each rep
--ancFileName ANCFILENAME
Path to a fasta-formatted file that contains inferred
ancestral states ('N' if unknown). This is used for
masking, as sites that cannot be polarized are masked,
and we mimic this in the simulted data. Ignored in
diploid mode which currently does not use ancestral
state information
--pMisPol PMISPOL The fraction of sites that will be intentionally
polarized to better approximate real data
该模型接受三个参数,然后提供选项。 参数是“ shicMode”,指的是是否计算单倍体或二倍体摘要统计信息,输入文件的名称和输出文件的名称。
- fvecVcf mode
fvecVcf模式用于根据VCF文件计算特征向量。
$ python diploSHIC.py fvecVcf -h
usage: diploSHIC.py fvecVcf [-h] [--targetPop TARGETPOP]
[--sampleToPopFileName SAMPLETOPOPFILENAME]
[--winSize WINSIZE] [--numSubWins NUMSUBWINS]
[--maskFileName MASKFILENAME]
[--unmaskedFracCutoff UNMASKEDFRACCUTOFF]
[--ancFileName ANCFILENAME]
[--statFileName STATFILENAME]
[--segmentStart SEGMENTSTART]
[--segmentEnd SEGMENTEND]
shicMode chrArmVcfFile chrArm chrLen
required arguments:
shicMode specifies whether to use original haploid SHIC (use
'haploid') or diploSHIC ('diploid')
chrArmVcfFile VCF format file containing data for our chromosome arm
(other arms will be ignored)
chrArm Exact name of the chromosome arm for which feature
vectors will be calculated
chrLen Length of the chromosome arm
fvecFileName path to file where feature vectors will be written
optional arguments:
-h, --help show this help message and exit
--targetPop TARGETPOP
Population ID of samples we wish to include
--sampleToPopFileName SAMPLETOPOPFILENAME
Path to tab delimited file with population
assignments; format: SampleID popID
--winSize WINSIZE Length of the large window (default=1100000)
--numSubWins NUMSUBWINS
Number of sub-windows within each large window
(default=11)
--maskFileName MASKFILENAME
Path to a fasta-formatted file that contains masking
information (marked by 'N'); must have an entry with
title matching chrArm
--unmaskedFracCutoff UNMASKEDFRACCUTOFF
Fraction of unmasked sites required to retain a
subwindow
--ancFileName ANCFILENAME
Path to a fasta-formatted file that contains inferred
ancestral states ('N' if unknown); must have an entry
with title matching chrArm. Ignored for diploid mode
which currently does not use ancestral state
information.
--statFileName STATFILENAME
Path to a file where statistics will be written for
each subwindow that is not filtered out
--segmentStart SEGMENTSTART
Left boundary of region in which feature vectors are
calculated (whole arm if omitted)
--segmentEnd SEGMENTEND
Right boundary of region in which feature vectors are
calculated (whole arm if omitted)
此模式有五个参数,并且还有许多选项。 必需的参数是“ shicMode”,即是否要计算单倍体或二倍体摘要统计信息,输入文件的名称,要为其计算统计信息的染色体,该染色体的长度以及输出文件的名称。
(4)training the CNN and prediction
准备好特征向量文件后,我们就可以训练和测试我们的CNN,然后最后对经验数据进行预测。
(5)格式化训练集
训练模型:
这是diploSHIC.py的训练模式的帮助消息
$ python diploSHIC.py train -h
usage: diploSHIC.py train [-h] [--epochs EPOCHS] [--numSubWins NUMSUBWINS]
trainDir testDir outputModel
required arguments:
trainDir path to training set files
testDir path to test set files, can be same as trainDir
outputModel file name for output model, will create two files one
with structure one with weights
optional arguments:
-h, --help show this help message and exit
--epochs EPOCHS max epochs for training CNN (default = 100)
--numSubWins NUMSUBWINS
number of subwindows that our chromosome is divided
into (default = 11)
训练模式用于训练深度学习分类器。 它的必需参数是trainDir(保留训练特征向量的目录),testDir(保留测试特征向量的目录)和outputModel受训网络的文件名。diploSHIC.py在训练和测试目录中需要五个名为hard.fvec,soft.fvec,neut.fvec,linkedSoft.fvec和linkedHard.fvec的文件。 训练和测试目录可以是同一目录,在这种情况下,保留20%的培训示例用于测试和验证。
训练模式有两个选项,用于特征向量的子窗口数和用于网络的训练时期数。
(6)预测模型
训练完分类器后,便可以使用diploSHIC.py的预测模型对经验数据进行分类。
$ python diploSHIC.py predict -h
usage: diploSHIC.py predict [-h] [--numSubWins NUMSUBWINS]
modelStructure modelWeights predictFile
predictFileOutput
required arguments:
modelStructure path to CNN structure .json file
modelWeights path to CNN weights .h5 file
predictFile input file to predict
predictFileOutput output file name
optional arguments:
-h, --help show this help message and exit
--numSubWins NUMSUBWINS
number of subwindows that our chromosome is divided
into (default = 11)
预测模式将训练模式输出的两个模型文件作为输入,经验特征向量的输入文件,以及预测输出的文件名。
3. 简单示例
该示例首先从模拟计算特征向量开始,然后以预测基因组数据。 目录test /和training /各自包含格式正确的二倍体特征向量,可以将其输入到diploSHIC中。
首先,我们将训练diploSHIC CNN,训练纪元的数量限制为10个。
$ python diploSHIC.py train training/ testing/ fooModel --epochs 10
优化完成后,我们训练后的网络将包含在两个文件fooModel.json和fooModel.weights.hdf5中。 diploSHIC.py的最后输出为我们提供了有关保留的测试数据的丢失和准确性的信息。
evaluation on test set:
diploSHIC loss: 0.404791
diploSHIC accuracy: 0.846800
--epochs的值默认设置100就足够了。 现在我们有了训练好的模型,我们可以对一些经验数据进行预测。 在仓库中有一个名为testEmpirical.fvec的文件,我们将其用作输入
$ python diploSHIC.py predict fooModel.json fooModel.weights.hdf5 testEmpirical.fvec testEmpirical.preds
输出预测将保存在testEmpirical.preds中
一个完整测试案例
参考:https://github.com/kr-colab/diploSHIC