DNA甲基化测序数据处理(二):差异分析
2018-07-22 本文已影响119人
面面的徐爷
前言
衔接上一篇数据比对后的结果,使用R包DSS进行处理。
Input文件准备
我们先来复习一下上一节课得到的数据结果:
*.bismark.cov.gz
文件
$head test_data_bismark_bt2.deduplicated.bismark.cov
1 975476 975476 100 1 0
1 975488 975488 100 1 0
1 975490 975490 100 1 0
1 2224487 2224487 100 1 0
1 2224489 2224489 100 1 0
1 2224514 2224514 100 1 0
1 2224520 2224520 100 1 0
1 2313220 2313220 0 0 1
1 9313902 9313902 100 1 0
1 9313914 9313914 100 1 0
# 第一列代表chromosome
# 第二,三列代表location
# 第四列代表甲基化百分比
# 第五列代表甲基化数目
# 第六列代表未甲基化数目
这里我们使用的R包为DSS,使用Bioconductor进行安装。
这个包可以对甲基化数据做两件事:
- 找出差异甲基化位点
- 根据甲基化位点找出差异化甲基化区域
DSS包对差异甲基化的检测基于β负二项分布的严格沃尔德检验。
DSS包可以对无重复数据进行处理
输入数据的格式如下:
- 第一列为染色体
- 第二列为位置
- 第三列为total reads
- 第四列为甲基化的reads
chr pos N X
chr18 3014904 26 2
chr18 3031032 33 12
chr18 3031044 33 13
chr18 3031065 48 24
# 根据这个特性,我们可以将上面的输出文件通过Linux或者R进行简单的处理得到input文件。
# 非常简单,所以这里请自行转化,凡事都需要自己摸索。
Call DML or DMR
DML是甲基化差异位点,DMR为甲基化差异区域。
使用DSS包自带的数据进行演示
1. 载入两个包DSS和bsseq(需要该包构建obj对象)
library(DSS)
require(bsseq)
path <- file.path(system.file(package="DSS"), "extdata")
dat1.1 <- read.table(file.path(path, "cond1_1.txt"), header=TRUE)
dat1.2 <- read.table(file.path(path, "cond1_2.txt"), header=TRUE)
dat2.1 <- read.table(file.path(path, "cond2_1.txt"), header=TRUE)
dat2.2 <- read.table(file.path(path, "cond2_2.txt"), header=TRUE)
BSobj <- makeBSseqData( list(dat1.1, dat1.2, dat2.1, dat2.2),
c("C1","C2", "N1", "N2") )[1:1000,]
# 查看BSobj的类型
BSobj
An object of type 'BSseq' with
1000 methylation loci
4 samples
has not been smoothed
All assays are in-memory、
2. 利用DMLtest函数call DML,分为一下几个步骤:
- 预测所有CpG位点的平均甲基化水平
- 对每个CpG位点估算其甲基化水平的dispersions
- 进行沃尔德检验
在第一步过程中,smoothing可以更好帮助估算甲基化(针对whole-genome BS-seq)。
而RRBS不需要smoothing。
根据甲基化水平进行loci的差异分析
dmlTest <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2"))
head(dmlTest)
chr pos mu1 mu2 diff diff.se stat phi1 phi2 pval fdr
1 chr18 3014904 0.3817233 0.4624549 -0.08073162 0.24997034 -0.3229648 0.300542998 0.01706260 0.74672190 0.9985094
2 chr18 3031032 0.3380579 0.1417008 0.19635711 0.11086362 1.7711592 0.008911745 0.04783892 0.07653423 0.6792127
3 chr18 3031044 0.3432172 0.3298853 0.01333190 0.12203116 0.1092500 0.010409029 0.01994821 0.91300423 0.9985094
4 chr18 3031065 0.4369377 0.3649218 0.07201587 0.10099395 0.7130711 0.010320888 0.01603200 0.47580174 0.9985094
5 chr18 3031069 0.2933572 0.5387464 -0.24538920 0.13178800 -1.8619996 0.012537553 0.02320887 0.06260315 0.6158797
6 chr18 3031082 0.3526311 0.3905718 -0.03794068 0.07847999 -0.4834440 0.007665696 0.01145531 0.62878051 0.9985094
dmlTest.sm <- DMLtest(BSobj, group1=c("C1", "C2"), group2=c("N1", "N2"), smoothing=TRUE)
head(dmlTest.sm)
chr pos mu1 mu2 diff diff.se stat phi1 phi2 pval fdr
1 chr18 3014904 0.3693669 0.4566563 -0.08728939 0.29967322 -0.2912819 0.30054300 0.01706260 0.7708357 0.9656515
2 chr18 3031032 0.3433882 0.3679732 -0.02458503 0.03970109 -0.6192533 0.03177894 0.28323422 0.5357495 0.8639036
3 chr18 3031044 0.3412867 0.3678807 -0.02659404 0.04032823 -0.6594397 0.02536938 0.02080295 0.5096134 0.8596522
4 chr18 3031065 0.3358830 0.3511983 -0.01531533 0.04799161 -0.3191252 0.01123412 0.01621926 0.7496316 0.9652417
5 chr18 3031069 0.3358830 0.3511983 -0.01531533 0.03205500 -0.4777830 0.02832889 0.05857316 0.6328047 0.8968029
6 chr18 3031082 0.3358830 0.3511983 -0.01531533 0.05846593 -0.2619531 0.01682981 0.01368466 0.7933576 0.9745116
3. 根据dmlTest来call DML
dmls <- callDML(dmlTest.sm, p.threshold=0.001)
dmls
chr pos mu1 mu2 diff diff.se stat phi1 phi2 pval fdr postprob.overThreshold
447 chr18 3973699 0.8530694 0.3432547 0.5098147 0.05726793 8.902272 0.03662109 0.98947516 5.471481e-19 4.962634e-16 1.0000000
709 chr18 4564190 0.7773858 0.1977036 0.5796822 0.10294267 5.631116 0.21725415 0.02952656 1.790467e-08 5.413180e-06 0.9999984
710 chr18 4564237 0.7773858 0.1977036 0.5796822 0.15431085 3.756587 0.02931516 0.26238558 1.722462e-04 7.439396e-03 0.9990652
dmls <- callDML(dmlTest, p.threshold=0.001)
dmls
chr pos mu1 mu2 diff diff.se stat phi1 phi2 pval fdr postprob.overThreshold
450 chr18 3976129 0.01027497 0.939033927 -0.9287590 0.06544340 -14.191789 0.052591567 0.02428826 1.029974e-45 2.499403e-43 1.0000000
451 chr18 3976138 0.01027497 0.939033927 -0.9287590 0.06544340 -14.191789 0.052591567 0.02428826 1.029974e-45 2.499403e-43 1.0000000
638 chr18 4431501 0.01331553 0.943056638 -0.9297411 0.09273779 -10.025483 0.053172411 0.07746835 1.177826e-23 1.429096e-21 1.0000000
639 chr18 4431511 0.01327049 0.943056638 -0.9297862 0.09270080 -10.029969 0.053121697 0.07746835 1.125518e-23 1.429096e-21 1.0000000
710 chr18 4564237 0.91454619 0.011930005 0.9026162 0.05260037 17.159883 0.009528898 0.04942849 5.302004e-66 3.859859e-63 1.0000000
782 chr18 4657576 0.98257334 0.067835497 0.9147378 0.06815000 13.422418 0.010424723 0.06755651 4.468885e-41 8.133371e-39 1.0000000
582 chr18 4340682 0.95398081 0.030390730 0.9235901 0.10935874 8.445508 0.085494283 0.04540643 3.027264e-17 2.754810e-15 1.0000000
583 chr18 4340709 0.95398081 0.030390730 0.9235901 0.10935874 8.445508 0.085494283 0.04540643 3.027264e-17 2.754810e-15 1.0000000
340 chr18 3542732 0.95023554 0.034383112 0.9158524 0.11937407 7.672122 0.089137013 0.04474741 1.691739e-14 1.368429e-12 1.0000000
395 chr18 3723448 0.06570765 0.751990744 -0.6862831 0.09825286 -6.984866 0.011958092 0.01646418 2.851278e-12 2.075730e-10 1.0000000
188 chr18 3370113 0.01488553 0.787980174 -0.7730946 0.11380172 -6.793347 0.054190769 0.02752024 1.095611e-11 7.250956e-10 1.0000000
400 chr18 3785543 0.79337804 0.118353679 0.6750244 0.12183251 5.540593 0.017720841 0.02442007 3.014493e-08 1.688116e-06 0.9999988
683 chr18 4494490 0.77615275 0.104235359 0.6719174 0.12407577 5.415380 0.009219023 0.08210742 6.115879e-08 3.180257e-06 0.9999980
783 chr18 4657592 0.96858371 0.266894590 0.7016891 0.13077255 5.365722 0.064228633 0.07721117 8.062618e-08 3.668491e-06 0.9999979
642 chr18 4431618 0.43034706 0.965539700 -0.5351926 0.09578178 -5.587625 0.012064922 0.07497569 2.301966e-08 1.396526e-06 0.9999972
189 chr18 3370141 0.01488553 0.676293642 -0.6614081 0.12554112 -5.268458 0.054190769 0.02138845 1.375746e-07 5.891428e-06 0.9999961
738 chr18 4635185 0.44671015 0.007342128 0.4393680 0.08148745 5.391849 0.010069744 0.05154034 6.973627e-08 3.384534e-06 0.9999844
330 chr18 3542403 0.02875655 0.714168320 -0.6854118 0.15294560 -4.481409 0.041892077 0.03368033 7.415192e-06 2.841190e-04 0.9999354
185 chr18 3347936 0.49560648 0.017291187 0.4783153 0.10184167 4.696656 0.012173893 0.04533502 2.644552e-06 1.069574e-04 0.9998983
92 chr18 3217241 0.03876653 0.767643956 -0.7288774 0.17547605 -4.153714 0.038095316 0.03753383 3.271214e-05 9.602621e-04 0.9998319
396 chr18 3723468 0.23043341 0.951779714 -0.7213463 0.18220738 -3.958930 0.173291593 0.07965750 7.528626e-05 1.838513e-03 0.9996786
40 chr18 3047980 0.07206183 0.658535570 -0.5864737 0.14396467 -4.073734 0.012517547 0.02615351 4.626536e-05 1.247451e-03 0.9996373
614 chr18 4353399 0.01355584 0.547298573 -0.5337427 0.12855458 -4.151876 0.053422013 0.02055735 3.297603e-05 9.602621e-04 0.9996300
99 chr18 3226251 0.97443783 0.217343279 0.7570945 0.19807591 3.822244 0.012911947 0.27481269 1.322425e-04 3.105566e-03 0.9995532
613 chr18 4353379 0.01355584 0.529341634 -0.5157858 0.13033377 -3.957422 0.053422013 0.02080043 7.576289e-05 1.838513e-03 0.9992902
894 chr18 4921874 0.35815864 0.009172558 0.3489861 0.07955244 4.386868 0.009774839 0.05010151 1.149944e-05 3.986472e-04 0.9991255
895 chr18 4921881 0.35815864 0.009172558 0.3489861 0.07955244 4.386868 0.009774839 0.05010151 1.149944e-05 3.986472e-04 0.9991255
注意,这里使用smoothing进行操作,事实上使用的示例数据我也不清楚是RRBS还是WGBS,但是使用dmlTest call DML会有更多的结果。
当然,用户也可以指定差异的阈值,只有差异大于阈值的才会被call出来。
dmls2 <- callDML(dmlTest, delta=0.1, p.threshold=0.001)
head(dmls2)
chr pos mu1 mu2 diff diff.se stat phi1 phi2 pval fdr postprob.overThreshold
450 chr18 3976129 0.01027497 0.9390339 -0.9287590 0.06544340 -14.19179 0.052591567 0.02428826 1.029974e-45 2.499403e-43 1
451 chr18 3976138 0.01027497 0.9390339 -0.9287590 0.06544340 -14.19179 0.052591567 0.02428826 1.029974e-45 2.499403e-43 1
638 chr18 4431501 0.01331553 0.9430566 -0.9297411 0.09273779 -10.02548 0.053172411 0.07746835 1.177826e-23 1.429096e-21 1
639 chr18 4431511 0.01327049 0.9430566 -0.9297862 0.09270080 -10.02997 0.053121697 0.07746835 1.125518e-23 1.429096e-21 1
710 chr18 4564237 0.91454619 0.0119300 0.9026162 0.05260037 17.15988 0.009528898 0.04942849 5.302004e-66 3.859859e-63 1
782 chr18 4657576 0.98257334 0.0678355 0.9147378 0.06815000 13.42242 0.010424723 0.06755651 4.468885e-41 8.133371e-39 1
4. 根据dmlTest Call DMR
dmrs <- callDMR(dmlTest.sm, p.threshold=0.01)
dmrs
chr start end length nCG meanMethy1 meanMethy2 diff.Methy areaStat
23 chr18 4921637 4922059 423 27 0.11002940 0.01809674 0.09193266 86.29588
7 chr18 3507919 3508022 104 10 0.07524915 0.03294316 0.04230600 30.55943
15 chr18 4340682 4340753 72 4 0.89237955 0.35052968 0.54184987 10.83526
dmrs <- callDMR(dmlTest, p.threshold=0.01)
dmrs
chr start end length nCG meanMethy1 meanMethy2 diff.Methy areaStat
27 chr18 4657576 4657639 64 4 0.506453 0.318348 0.188105 14.34236
我们可以发现,smoothing前后得到的结果差异还是很大,所以针对不同的实验类型我们需要注意是否使用smoothing。
同理,也可以使用的delta参数以及调整p.threshold得到合适的结果。
dmrs2 <- callDMR(dmlTest, delta=0.1, p.threshold=0.05)
dmrs2
chr start end length nCG meanMethy1 meanMethy2 diff.Methy areaStat
31 chr18 4657576 4657639 64 4 0.5064530 0.3183480 0.188105 14.34236
19 chr18 4222533 4222608 76 4 0.7880276 0.3614195 0.426608 12.91667
5. 可视化
DSS包提供了一个不是很美观的可视化函数,用户其实可以使用coverage结果在R里面作图。
showOneDMR(dmrs[1,], BSobj)
结语
分析到此就告一段落了,随后就是介绍对差异甲基化区域的注释以及可视化文献作图。下一篇教程随后发。