生物信息

diploS/HIC鉴定选择信号

2021-05-13  本文已影响0人  DumplingLucky

diploS / HIC使用深度卷积神经网络来识别种群基因组数据中的硬选择和软选择区域。

使用diploS / HIC进行分析的工作流程包括四个基本部分。
1)使用模拟为diploS / HIC生成训练集。
2)diploS / HIC培训和绩效评估。
3)从基因组数据计算dipoS / HIC特征向量。
4)使用训练有素的网络对经验数据进行预测。
此处提供的软件可以处理最后三个部分。 人口遗传模拟必须使用诸如discoal之类的单独软件进行。

1. 安装

使用conda安装scikit-allel,这将为我们提供基础包(numpy,scipy,scikit-learn等)

conda install -c conda-forge scikit-allel

tensorflowkeras是用于处理事物的深度学习部分的库

pip install tensorflow 
pip install keras

现在准备安装diploS / HIC

git clone https://github.com/kern-lab/diploSHIC.git
cd diploSHIC 
python setup.py install

2. 软件使用

将要连接的主程序是diploSHIC.py。 该脚本具有四个运行模式,这些模式允许用户执行监督的机器学习过程中的每个主要步骤。
diploSHIC.py使用python中的argparse模块尝试提供完整的基于命令行的帮助菜单。
(1)使用方法

$ python diploSHIC.py -h
usage: diploSHIC.py [-h] {train,predict,fvecSim,fvecVcf} ...

calculate feature vectors, train, or predict with diploSHIC

possible modes (enter 'python diploSHIC.py modeName -h' for modeName's help message:
  {fvecSim,makeTrainingSets,train,fvecVcf,predict}
                        sub-command help
    fvecSim             Generate feature vectors from simulated data
    makeTrainingSets    Combine feature vectors from muliple fvecSim runs into
                        5 balanced training sets
    train               train and test a shic CNN
    fvecVcf             Generate feature vectors from data in a VCF file
    predict             perform prediction using an already-trained SHIC CNN

optional arguments:
  -h, --help            show this help message and exit

(2)准备模拟训练/测试数据
所有版本的S / HIC都需要模拟数据来进行训练,该软件提供了一个脚本generateSimLaunchScript.py,该脚本演示了如何模拟训练集。
(3)特征向量生成模型

$ python diploSHIC.py fvecSim -h
usage: diploSHIC.py fvecSim [-h] [--totalPhysLen TOTALPHYSLEN]
                            [--numSubWins NUMSUBWINS]
                            [--maskFileName MASKFILENAME]
                            [--chrArmsForMasking CHRARMSFORMASKING]
                            [--unmaskedFracCutoff UNMASKEDFRACCUTOFF]
                            [--outStatsDir OUTSTATSDIR]
                            [--ancFileName ANCFILENAME] [--pMisPol PMISPOL]
                            shicMode msOutFile fvecFileName

required arguments:
  shicMode              specifies whether to use original haploid SHIC (use
                        'haploid') or diploSHIC ('diploid')
  msOutFile             path to simulation output file (must be same format
                        used by Hudson's ms)
  fvecFileName          path to file where feature vectors will be written

optional arguments:
  -h, --help            show this help message and exit
   --totalPhysLen TOTALPHYSLEN
                        Length of simulated chromosome for converting infinite
                        sites ms output to finite sites (default=1100000)
  --numSubWins NUMSUBWINS
                        The number of subwindows that our chromosome will be
                        divided into (default=11)
  --maskFileName MASKFILENAME
                        Path to a fasta-formatted file that contains masking
                        information (marked by 'N'). If specified, simulations
                        will be masked in a manner mirroring windows drawn
                        from this file.
  --chrArmsForMasking CHRARMSFORMASKING
                        A comma-separated list (no spaces) of chromosome arms
                        from which we want to draw masking information (or
                        'all' if we want to use all arms. Ignored if
                        maskFileName is not specified.
  --unmaskedFracCutoff UNMASKEDFRACCUTOFF
                        Minimum fraction of unmasked sites, if masking
                        simulated data
  --outStatsDir OUTSTATSDIR
                        Path to a directory where values of each statistic in
                        each subwindow are recorded for each rep
  --ancFileName ANCFILENAME
                        Path to a fasta-formatted file that contains inferred
                        ancestral states ('N' if unknown). This is used for
                        masking, as sites that cannot be polarized are masked,
                        and we mimic this in the simulted data. Ignored in
                        diploid mode which currently does not use ancestral
                        state information
  --pMisPol PMISPOL     The fraction of sites that will be intentionally
                        polarized to better approximate real data

该模型接受三个参数,然后提供选项。 参数是“ shicMode”,指的是是否计算单倍体或二倍体摘要统计信息,输入文件的名称和输出文件的名称。

$ python diploSHIC.py fvecVcf -h
usage: diploSHIC.py fvecVcf [-h] [--targetPop TARGETPOP]
                            [--sampleToPopFileName SAMPLETOPOPFILENAME]
                            [--winSize WINSIZE] [--numSubWins NUMSUBWINS]
                            [--maskFileName MASKFILENAME]
                            [--unmaskedFracCutoff UNMASKEDFRACCUTOFF]
                            [--ancFileName ANCFILENAME]
                            [--statFileName STATFILENAME]
                            [--segmentStart SEGMENTSTART]
                            [--segmentEnd SEGMENTEND]
                            shicMode chrArmVcfFile chrArm chrLen

required arguments:
  shicMode              specifies whether to use original haploid SHIC (use
                        'haploid') or diploSHIC ('diploid')
  chrArmVcfFile         VCF format file containing data for our chromosome arm
                        (other arms will be ignored)
  chrArm                Exact name of the chromosome arm for which feature
                        vectors will be calculated
  chrLen                Length of the chromosome arm
  fvecFileName          path to file where feature vectors will be written

optional arguments:
  -h, --help            show this help message and exit
  --targetPop TARGETPOP
                        Population ID of samples we wish to include
  --sampleToPopFileName SAMPLETOPOPFILENAME
                        Path to tab delimited file with population
                        assignments; format: SampleID popID
  --winSize WINSIZE     Length of the large window (default=1100000)
  --numSubWins NUMSUBWINS
                        Number of sub-windows within each large window
                        (default=11)
  --maskFileName MASKFILENAME
                        Path to a fasta-formatted file that contains masking
                        information (marked by 'N'); must have an entry with
                        title matching chrArm
  --unmaskedFracCutoff UNMASKEDFRACCUTOFF
                        Fraction of unmasked sites required to retain a
                        subwindow
  --ancFileName ANCFILENAME
                        Path to a fasta-formatted file that contains inferred
                        ancestral states ('N' if unknown); must have an entry
                        with title matching chrArm. Ignored for diploid mode
                        which currently does not use ancestral state
                        information.
  --statFileName STATFILENAME
                        Path to a file where statistics will be written for
                        each subwindow that is not filtered out
  --segmentStart SEGMENTSTART
                        Left boundary of region in which feature vectors are
                        calculated (whole arm if omitted)
  --segmentEnd SEGMENTEND
                        Right boundary of region in which feature vectors are
                        calculated (whole arm if omitted)

此模式有五个参数,并且还有许多选项。 必需的参数是“ shicMode”,即是否要计算单倍体或二倍体摘要统计信息,输入文件的名称,要为其计算统计信息的染色体,该染色体的长度以及输出文件的名称。
(4)training the CNN and prediction
准备好特征向量文件后,我们就可以训练和测试我们的CNN,然后最后对经验数据进行预测。
(5)格式化训练集
训练模型:
这是diploSHIC.py的训练模式的帮助消息

$ python diploSHIC.py train -h
usage: diploSHIC.py train [-h] [--epochs EPOCHS] [--numSubWins NUMSUBWINS]
                          trainDir testDir outputModel

required arguments:
  trainDir              path to training set files
  testDir               path to test set files, can be same as trainDir
  outputModel           file name for output model, will create two files one
                        with structure one with weights

optional arguments:
  -h, --help            show this help message and exit
  --epochs EPOCHS       max epochs for training CNN (default = 100)
  --numSubWins NUMSUBWINS
                        number of subwindows that our chromosome is divided
                        into (default = 11)

训练模式用于训练深度学习分类器。 它的必需参数是trainDir(保留训练特征向量的目录),testDir(保留测试特征向量的目录)和outputModel受训网络的文件名。diploSHIC.py在训练和测试目录中需要五个名为hard.fvec,soft.fvec,neut.fvec,linkedSoft.fvec和linkedHard.fvec的文件。 训练和测试目录可以是同一目录,在这种情况下,保留20%的培训示例用于测试和验证。

训练模式有两个选项,用于特征向量的子窗口数和用于网络的训练时期数。
(6)预测模型
训练完分类器后,便可以使用diploSHIC.py的预测模型对经验数据进行分类。

$ python diploSHIC.py predict -h
usage: diploSHIC.py predict [-h] [--numSubWins NUMSUBWINS]
                            modelStructure modelWeights predictFile
                            predictFileOutput

required arguments:
  modelStructure        path to CNN structure .json file
  modelWeights          path to CNN weights .h5 file
  predictFile           input file to predict
  predictFileOutput     output file name

optional arguments:
  -h, --help            show this help message and exit
  --numSubWins NUMSUBWINS
                        number of subwindows that our chromosome is divided
                        into (default = 11)

预测模式将训练模式输出的两个模型文件作为输入,经验特征向量的输入文件,以及预测输出的文件名。

3. 简单示例

该示例首先从模拟计算特征向量开始,然后以预测基因组数据。 目录test /和training /各自包含格式正确的二倍体特征向量,可以将其输入到diploSHIC中。
首先,我们将训练diploSHIC CNN,训练纪元的数量限制为10个。

$ python diploSHIC.py train training/ testing/ fooModel --epochs 10

优化完成后,我们训练后的网络将包含在两个文件fooModel.json和fooModel.weights.hdf5中。 diploSHIC.py的最后输出为我们提供了有关保留的测试数据的丢失和准确性的信息。

evaluation on test set:
diploSHIC loss: 0.404791
diploSHIC accuracy: 0.846800

--epochs的值默认设置100就足够了。 现在我们有了训练好的模型,我们可以对一些经验数据进行预测。 在仓库中有一个名为testEmpirical.fvec的文件,我们将其用作输入

$ python diploSHIC.py predict fooModel.json fooModel.weights.hdf5 testEmpirical.fvec testEmpirical.preds

输出预测将保存在testEmpirical.preds中
一个完整测试案例
参考:https://github.com/kr-colab/diploSHIC

上一篇下一篇

猜你喜欢

热点阅读