孟佳课题组|m6A数据Peak识别软件exomePeak2使用指
2021-03-10 本文已影响0人
信你个鬼
exomePeak使用感受:
由于是R语言,因此真实项目数据跑起来耗时比较久;其次,peak calling与diff peak是两个独立的过程,结果会有一些不一致的地方。
随着分析项目的增多,当然可以曾加一些认识,上次的稿子中就有一些认知是错误的,仅供大家做参考。具体见:
exomePeak使用过程疑问:https://www.jianshu.com/p/42d4651295c1
后来,作者又出了一个exomePeak2,并不是在原有的exomePeak基础上进行的更新。
主页:https://rdrr.io/github/ZhenWei10/exomePeak2/man/exomePeak2.html
exomePeak2与exomePeak最大的区别在于exomePeak2可以输出中间结果,应该还有其他不一样的地方,来看看吧。
功能:exomePeak2使用bam文件进行peak calling以及peak统计,它整合了meRIP-seq数据分析的常规分析内容:
- 1.使用scanMeripBAM检查BAM的index
- 2.使用exomePeakCalling识别外显子区域的被修饰的peaks
- 3.normalizeGC计算GC偏倚
- 4.glmM或glmDM构建线性模型来计算差异位点
- 5.exportResults输出peak结果
看一下函数
exomePeak2(
bam_ip = NULL,
bam_input = NULL,
bam_treated_ip = NULL,
bam_treated_input = NULL,
txdb = NULL,
bsgenome = NULL,
genome = NA,
gff_dir = NULL,
mod_annot = NULL,
paired_end = FALSE,
library_type = c("unstranded", "1st_strand", "2nd_strand"),
fragment_length = 100,
binding_length = 25,
step_length = binding_length,
peak_width = fragment_length/2,
pc_count_cutoff = 5,
bg_count_cutoff = 50,
p_cutoff = 1e-05,
p_adj_cutoff = NULL,
log2FC_cutoff = 1,
parallel = FALSE,
background_method = c("all", "Gaussian_mixture", "m6Aseq_prior", "manual"),
manual_background = NULL,
correct_GC_bg = TRUE,
qtnorm = FALSE,
glm_type = c("DESeq2", "Poisson", "NB"),
LFC_shrinkage = c("apeglm", "ashr", "Gaussian", "none"),
export_results = TRUE,
export_format = c("CSV", "BED", "RDS"),
table_style = c("bed", "granges"),
save_plot_GC = TRUE,
save_plot_analysis = FALSE,
save_plot_name = "",
save_dir = "exomePeak2_output",
peak_calling_mode = c("exon", "full_tx", "whole_genome")
)
与exomePeak相比
增了加双端数据paired_end ,文库类型library_type ,count阈值pc_count_cutoff ,bg_count_cutoff 等参数。还有一些输出结果的更改。
没有了那个要求在group中call peak一致性的参数。这个参数应该是在分步计算里面。
在运行之前,示例代码使用USCS中的注释文件,最好先安装好那个注释包
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
library(BSgenome.Hsapiens.UCSC.hg19)
指定进行peak calling的样本,
rm(list = ls())
options(stringsAsFactors = F)
#BiocManager::install("exomePeak2")
library(exomePeak2)
# 设置随机种子
set.seed(1)
GENE_ANNO_GTF = system.file("extdata", "example.gtf", package="exomePeak2")
f1 = system.file("extdata", "IP1.bam", package="exomePeak2")
f2 = system.file("extdata", "IP2.bam", package="exomePeak2")
f3 = system.file("extdata", "IP3.bam", package="exomePeak2")
f4 = system.file("extdata", "IP4.bam", package="exomePeak2")
IP_BAM = c(f1,f2,f3,f4)
f1 = system.file("extdata", "Input1.bam", package="exomePeak2")
f2 = system.file("extdata", "Input2.bam", package="exomePeak2")
f3 = system.file("extdata", "Input3.bam", package="exomePeak2")
INPUT_BAM = c(f1,f2,f3)
分步骤计算过程,大致分为7个步骤:
## Peak Calling and Visualization in Multiple Steps
#The exomePeak2 package can achieve peak calling and peak statistics calulation with multiple functions.
## 1. 检查bam文件
MeRIP_Seq_Alignment <- scanMeripBAM(
bam_ip = IP_BAM,
bam_input = INPUT_BAM,
paired_end = FALSE)
# 同时含有处理组和非处理组
MeRIP_Seq_Alignment <- scanMeripBAM(
bam_ip = IP_BAM,
bam_input = INPUT_BAM,
bam_treated_input = TREATED_INPUT_BAM,
bam_treated_ip = TREATED_IP_BAM,
paired_end = FALSE)
# str函数可以方便快速的查看s4对象的结构和内容
str(MeRIP_Seq_Alignment,max.level = 3)
#2. 使用bam文件进行 peak calling
SummarizedExomePeaks <- exomePeakCalling(merip_bams = MeRIP_Seq_Alignment,
gff_dir = GENE_ANNO_GTF,
genome = "hg19")
#可选,用来评估数据
SummarizedExomePeaks <- exomePeakCalling(merip_bams = MeRIP_Seq_Alignment,
gff_dir = GENE_ANNO_GTF,
genome = "hg19",
mod_annot = MOD_ANNO_GRANGE)
# 查看peak结果
str(SummarizedExomePeaks, max.level = 4)
#3. 计算size factors用来对GC偏倚进行矫正
SummarizedExomePeaks <- normalizeGC(SummarizedExomePeaks)
#4. 使用glmM构造peak统计量
SummarizedExomePeaks <- glmM(SummarizedExomePeaks)
# 可选,如果有差异分析,就分析此步骤
SummarizedExomePeaks <- glmDM(SummarizedExomePeaks)
#5. Generate the scatter plot between GC content and log2 Fold Change (LFC).
p <- plotLfcGC(SummarizedExomePeaks)
# 点的大小有点小,取出来数据重新加工
library(ggplot2)
data <- p$data
head(data)
p1 <- ggplot(data,aes(x=GC_idx,y=Log2FC,color=Label)) + geom_point(size=2) + theme_classic()
p1
#6. Generate the bar plot for the sequencing depth size factors.
plotSizeFactors(SummarizedExomePeaks)
#7. Export the modification peaks and the peak statistics with user decided format.
exportResults(SummarizedExomePeaks, format = "BED")
GC含量散点图与SizeFactor图
image-20210310144152649.png image-20210310144955116.png使用示例结果中的一个差异peak在IGV中查看peak分布
相对于exomePeak,作者还输出了每个peak在每个样本中的read count数,就用第一个peak示例吧,差异FC比较大,p值也显著。peak大概100个bp这么长。
image-20210310151256844.png image-20210310151134620.png至于具体效果如何,大家自行判断吧。
请等待后续更新。