DNA甲基化分析软件之HOME

2019-05-28 本文已影响48人热衷组培的二货潜

HOME: a histogram based machine learning approach for effective identification of differentially methylated regions

虽然软件发的期刊影响因子不高，但是是Ryan lister实验室成员发表的。此实验室发表过不少高水平期刊文章。Genome Research、Nature Communications、Genome Biology、Current Opinion in Plant Biology、elife等等。其中此文章的一作Akanksha Srivastava发表期刊如下：

Eichten SR, Stuart T, Srivastava A, Lister R, Borevitz JO (2016) *DNA methylation profiles of diverse Brachypodium distachyon aligns with underlying genetic diversity. *Genome Research 10.1101/gr.205468.116. pdf, pubmed
Narsai R^#, Gouil Q, Secco D, Srivastava A, Karpievitch YV, Liew LC, Lister R, Lewsey MG^#, Whelan J (2017) *Extensive transcriptomic and epigenomic remodelling occurs during Arabidopsis thaliana germination. *Genome Biology 10.1186/s13059-017-1302-3, pdf
Srivastava A, Karpievitch YV, Eichten SR, Borevitz JO, Lister R (2019) *HOME: A histogram based machine learning approach for effective identification of differentially methylated regions. *BMC Bioinformatics 10.1186/s12859-019-2845-y, pdf

不难看出此一作主要是从事分析DNA甲基化这一块工作的，常年与其他人合作（感觉有点惨，分析的大都就是这样与人合作基本拿不到排名第一的一作）。此工具就是将其之前发表的两篇文章中的内容整个起来的，包括16年Genome Research中的DNA甲基化的时间序列分析。

把文章看了一遍，我觉得有必要把这一段内容先贴出来

HOME

HOME（histogram of methylation）是用于鉴别差异DNA甲基化区域（DMR）鉴定的python包。该方法使用甲基化特征的直方图和线性支持向量机（SVM）来从全基因组亚硫酸氢盐测序（WGBS）数据中鉴定DMR。HOME可以识别成对和时间序列DMR(有或没有重复都可以)。

安装

HOME是为python 2.7编写的，并在Linux系统上进行了测试。建议在安装HOME包之前先为python 2.7设置虚拟环境。
作者这里提供了怎么创建虚拟环境，哈哈，不是通过conda而是virtualenv(virtualenv is a tool to create isolated Python environments)

virtualenv -p <path_to_python2.7> <env_name>

激活python2.7环境

source <env_name>/bin/activate

作者提供三种方法安装HOME
- 1

pip install git+https://github.com/ListerLab/HOME.git

or

git clone https://github.com/ListerLab/HOME.git
cd ./HOME
pip install -r requirements.txt
python setup.py install

git clone https://github.com/ListerLab/HOME.git
cd ./HOME
conda env create    *assuming the conda environment is activated and R is already installed in it*
source activate HOMEenv
python setup.py install

输入文件

# 染色体名  位点  链方向  DNA甲基化类型  甲基化的C的数目  总的胞嘧啶C的数目
# Chromosome number, position, strand, type (CG/CHG/CHH) where H is anything but G, methylated reads and total number of reads.
chr1    15814   +   CG  12  14
chr1    15815   -   CG  15  21
chr1    15816   -   CHG 1   9
chr1    15821   -   CHH 7   22
chr1    15823   -   CHH 0   2
chr1    15825   -   CHH 11  19

需要注意，作者的说明书有点不用心啊。是sample_file_CG.txt而不是csv结尾，并且里面作者提供的是如下，我们需要将其改为全路径。sed 's/\.\//\/public\/home\/qliu\/biosoft\/HOME\//g' sample_file_CG.txt

sample1 ./testcase/CG/sample1_r1.txt    ./testcase/CG/sample1_r2.txt
sample2 ./testcase/CG/sample2_r1.txt    ./testcase/CG/sample2_r2.txt
sample3 ./testcase/CG/sample3_r1.txt    ./testcase/CG/sample3_r2.txt

修改后

sample1 /public/home/qliu/biosoft/HOME/testcase/CG/sample1_r1.txt       /public/home/qliu/biosoft/HOME/testcase/CG/sample1_r2.txt
sample2 /public/home/qliu/biosoft/HOME/testcase/CG/sample2_r1.txt       /public/home/qliu/biosoft/HOME/testcase/CG/sample2_r2.txt
sample3 /public/home/qliu/biosoft/HOME/testcase/CG/sample3_r1.txt       /public/home/qliu/biosoft/HOME/testcase/CG/sample3_r2.txt

寻找差异DMR

HOME-pairwise   -t [CG/CHG/CHH/CHN/CNN]  -i [sample_file_fullpath]  -o [output_directorypath]

`运行路径一定要在HOME文件目录下即`:`/public/home/qliu/biosoft/HOME/`，否则会报错`Fatal error: cannot open file './scripts/HOME_R.R': No such file or directory`,不知道是我没掌握精髓还是这个太。。

HOME-pairwise -t CG -i /public/home/qliu/biosoft/HOME/testcase/sample_file_CG.txt -o ./pairwise_CG_outputpath
# 运行结果
Preparing the DMRs from HOME.....
GOOD LUCK !
DMRs for sample1_VS_sample2_10 done
DMRs for sample1_VS_sample2_13 done
DMRs for sample1_VS_sample2_12 done
DMRs for sample1_VS_sample3_10 done
DMRs for sample1_VS_sample3_12 done
DMRs for sample1_VS_sample3_13 done
DMRs for sample2_VS_sample3_12 done
DMRs for sample2_VS_sample3_10 done
DMRs for sample2_VS_sample3_13 done

HOME的默认参数相对宽松。要以更严格的设置运行HOME，请将默认参数更改为以下或更高：

HOME-pairwise   -t CG   -i  /public/home/qliu/biosoft/HOME/testcase/sample_file_CG.txt -o ./pairwise_stringent_CG_outputpath --delta 0.2 --minc 5 
# 运行结果
Preparing the DMRs from HOME.....
GOOD LUCK !
DMRs for sample1_VS_sample2_13 done
DMRs for sample1_VS_sample2_10 done
DMRs for sample1_VS_sample2_12 done
DMRs for sample1_VS_sample3_13 done
DMRs for sample1_VS_sample3_10 done
DMRs for sample1_VS_sample3_12 done
DMRs for sample2_VS_sample3_13 done
DMRs for sample2_VS_sample3_10 done
DMRs for sample2_VS_sample3_12 done
Congratulations the DMRs are ready

结果比较, 好吧这里没区别，应该是示例数据问题。但是这里很好的是运行过程中是分染色体分别进行运行，加快了运行进程。

Delta指两个样品的平均甲基化水平差异。

# 由于示例文件只提供了三条染色体的甲基化信息所以后面只有三条染色体的结果：
[CG]cut -f 1 sample1_r1.txt |sort |uniq -c
   2998 10
   2999 12
   2999 13

[sample1_VS_sample2]$ wc -l ./*
  3 ./HOME_DMRs_10.txt
  3 ./HOME_DMRs_12.txt
  3 ./HOME_DMRs_13.txt
  9 total

[sample1_VS_sample2]$ wc -l ./*
  3 ./HOME_DMRs_10.txt
  3 ./HOME_DMRs_12.txt
  3 ./HOME_DMRs_13.txt
  9 total


# pairwise_stringent_CG_outputpath：HOME_DMRs_12.txt (END)
chr     start   end     status  numC    mean_Meth1      mean_Meth2      delta   avg_coverage1   avg_coverage2   len
12      122818  123202  hyper   14      0.9421879271112584      0.1152926299222629      0.8268952971889956      60      60      384
12      179697  180225  hyper   9       0.9698766175534637      0.12380942497017339     0.8460671925832903      58      62      528

# HOME_pairwise_DMRs：HOME_DMRs_12.txt (END)
chr     start   end     status  numC    mean_Meth1      mean_Meth2      delta   avg_coverage1   avg_coverage2   len
12      122818  123202  hyper   14      0.9421879271112584      0.1152926299222629      0.8268952971889956      60      60      384
12      179697  180225  hyper   9       0.9698766175534637      0.12380942497017339     0.8460671925832903      58      62      528

还可以进行时间序列DNA甲基化分析

这里，Max_delta是比较样品中的最大平均甲基化差异。置信度得分考虑了C的长度，数量和SVM得分。值越高表示DMR越自信，也就是可信度越高。Comb1-n表示每种样品组合的成对比较。它报告start：end：state：每个成对比较的delta。

[HOME]$ HOME-timeseries  -t CG -i /public/home/qliu/biosoft/HOME/testcase/sample_file_CG.txt -o ./time_CG_outputpath
Preparing the DMRs from HOME.....
GOOD LUCK !
DMRs for 13 done
DMRs for 10 done
DMRs for 12 done
Congratulations the DMRs are ready

# 可以看到此文件中包含了两两相互比较的三种情况得到的结果。
# Timeseries_HOME_DMR_10.txt (END)
chr     start   end     numC    len     max_delta       confidence_scores       sample1_VS_sample2      sample1_VS_sample3      sample2_VS_sample3
10      122795  123264  17      469     0.6694115959991053      0.3984762292738861      122795:123264:hypo:0.669        122795:123264:hypo:0.353        122795:123264:hyper:0.318
10      179697  180278  11      581     0.7046148506166738      0.3052977823501635      179697:180278:hypo:0.705        179697:180278:hypo:0.348        179697:180278:hyper:0.36

参数翻译- -

HOME-pairwise

参数	默认	功能
-sc --scorecutoff	0.1	每个胞嘧啶C位点的分类打分
-p --pruncutoff	0.1	从两端检查连续Cs的SVM分数以细化边界
-npp -–numprocess	8	运行的线程数
-ml --minlength	50	输出的DMR的最小长度
-ncb --numcb	5	输出的DMR分离时所需要包含的少胞嘧啶C的数目
-md -–mergedist	500	如果两个DMR之间的举例小于500就合并为一个DMR
-prn --prunningC	3	number of consecutives Cs to be considered for pruning for boundary refinement2（翻译拿不准）
-ns --numsamples	all	被用来计算DMR的样本，默认是所有
-sp --startposition	1st position	在进行`timeseries DMR`计算时候放置为第一个的样本开始位置
-BSSeeker2 --BSSeeker2	False	input CGmap file from BSSeeker2
-mc --minc	3	一个DMR中最少包含的胞嘧啶C的数目
-sin --singlechrom	False	单染色体的并行计算; `npp`将用于每条染色体的并行运行
-d --delta	0.1	得到DMR的最小要求的平均DNA甲基化差异
-wrt --withrespectto	all	用于DMR的样品要求与特定样品成对比较
-Keepall --Keepall	False	保持所有胞嘧啶位置存在于至少一个重复中

看到BSSeeker2我是不咋喜欢的，要不是参数说明太详细了，毕竟我是bismark死忠粉。

HOME-timeseries

参数	默认	功能
-sc --scorecutoff	0.5	每个胞嘧啶C位点的分类打分
-npp -–numprocess	5	运行的线程数
-ml --minlength	50	输出的DMR的最小长度
-ns --numsamples	all	被用来计算DMR的样本，默认是所有
-sp --startposition	1st position	在进行`timeseries DMR`计算时候放置为第一个的样本开始位置
-BSSeeker2 --BSSeeker2	False	input CGmap file from BSSeeker2
-mc --minc	4	一个DMR中最少包含的胞嘧啶C的数目
-d --delta	0.1	得到DMR的最小要求的平均DNA甲基化差异
-sin --singlechrom	False	单染色体的并行计算; `npp`将用于每条染色体的并行运行
-Keepall --Keepall	False	保持所有胞嘧啶位置存在于至少一个重复中

DNA甲基化分析软件之HOME

把文章看了一遍，我觉得有必要把这一段内容先贴出来

HOME

`运行路径一定要在HOME文件目录下即`:`/public/home/qliu/biosoft/HOME/`，否则会报错`Fatal error: cannot open file './scripts/HOME_R.R': No such file or directory`,不知道是我没掌握精髓还是这个太。。

参数翻译- -

从参数来看一般默认即可。

猜你喜欢

热点阅读

DNA甲基化分析软件之HOME

把文章看了一遍，我觉得有必要把这一段内容先贴出来

HOME

运行路径一定要在HOME文件目录下即:/public/home/qliu/biosoft/HOME/，否则会报错Fatal error: cannot open file './scripts/HOME_R.R': No such file or directory,不知道是我没掌握精髓还是这个太。。

参数翻译- -

从参数来看一般默认即可。

猜你喜欢

热点阅读

`运行路径一定要在HOME文件目录下即`:`/public/home/qliu/biosoft/HOME/`，否则会报错`Fatal error: cannot open file './scripts/HOME_R.R': No such file or directory`,不知道是我没掌握精髓还是这个太。。