tsRNA分类分析:seqac包
标题:tsRNA分类分析:seqac包
tRNA的二级结构如下:
image-20230903154931207.png根据其二级结构,tsRNA分类如下:
早期研究报道,tiRNAs通过血管生成素(ANG)切割成熟tRNA产生,而tRFs来源于Dicer或ANG切割成熟tRNA或tRNA前体。根据成熟或前体tRNA转录本中的裂解位点,可以分为以下几类:
- tRF-1:来源于前体tRNA,在3' trailer部由RNase Z or ELAC2酶裂解产生。
- tRF-3:来自成熟体tRNA的3'端
- tRF-5:来自成熟体tRNA的5'端
- tRF-i: 来自成熟体tRNA的中间区域
- tiRNAs:are generated by specific cleavage by angiogenin in the anticodon loops of mature tRNAs
今天主要给大家介绍一个软件,可以根据tsRNA的序列以及tRNA的二级结构文件来鉴定tsRNA的分类,软件为Seqpac。
Seqpac主要用于small RNA Sequence数据分析,教程最新版链接:
软件分析输入为fq文件,seqpac workflow的核心对象为:
The PAC object: 这个对象有三个部分,Phenotypic information (P), Annotations (A) and Counts (C)。
这个PAC对象有两种格式, S3 list或者S4,并且可以使用函数进行互转,as(pac-object, "list")将S4转为S3, as.PAC(pac-object)将S3转为S4。
The Targeting objects:用于访问PAC对象内的特定数据子集。
seqpac的分析步骤主要为四步:
- Generate a PAC object from fastq-files
- Preprocess the PAC object
- Annotate the PAC object
- Analyze the PAC object.
示例数据
seqpac包中自带了一个示例数据,6 sRNA fastq文件。
image-20230823222638876.png样本组织类型为果蝇胚胎,测序平台为Illumina NextSeq 500,测序读长为75个碱基。这里抽取原始fq的部分read进行演示。
安装包:
## Installation
devtools::install_github("Danis102/seqpac", upgrade="never",
build_manual=TRUE, build_vignettes=TRUE)
首先快速一览整个包的分析步骤:
# Setup
library(seqpac)
sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
fastq <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE,
full.names = TRUE)
ref_tRNA <- system.file("extdata/trna", "tRNA.fa",
package = "seqpac", mustWork = TRUE)
# Trim NEB adapter, generate counts and create PAC-object (pheno, anno, counts)
count_list <- make_counts(fastq, plot = FALSE,
parse="default_neb", trimming="seqpac")
pac <- make_PAC(count_list$counts)
pheno(pac)$groups <- c(rep("gr1",3), rep("gr2",3)) #Add groups to pheno table
# Preprocess PAC-object and creat means
pac <- PAC_norm(pac) # Default normalization is cpm
pac <- PAC_filter(pac, size=c(15,60)) # Here, filter on sequence length
pac <- PAC_summary(pac, norm = "cpm", type = "means",
pheno_target=list("groups", unique(pheno(pac)$groups)))
# Quickly align (annotate) and plot PAC-object
map_tRNA <- PAC_mapper(pac, ref=ref_tRNA, override=TRUE)
plts <- PAC_covplot(pac, map_tRNA, summary_target=list("cpmMeans_groups"))
plts[[13]]
详细步骤
1.首先加载fq数据
# Use the ShortRead-package to randomly sample a fastq file
sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
fq <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE, full.names = TRUE)
fq
[1] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq1.fastq.gz"
[2] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq2.fastq.gz"
[3] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq3.fastq.gz"
[4] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq4.fastq.gz"
[5] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq5.fastq.gz"
[6] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq6.fastq.gz"
2.生成count矩阵
使用make_counts函数
count_list <- make_counts(input=fq, trimming="seqpac", threads=16, parse="default_neb")
# 查看count_list的结构
str(count_list, max.level = 1)
List of 3
$ counts :'data.frame': 8300 obs. of 6 variables:
$ progress_report:'data.frame': 6 obs. of 19 variables:
$ evidence_plots :List of 2
3.构造样本表型矩阵
使用make_pheno函数
Sample_ID <- colnames(count_list$counts)
pheno_fl <- data.frame(Sample_ID=Sample_ID,
Treatment=c(rep("heat", times=1),
rep("control", times=2)),
Batch=rep(c("1", "2", "3"), times=1))
pheno <- make_pheno(pheno=pheno_fl,
progress_report=count_list$progress_report,
counts=count_list$counts)
每个样本的表型为一个数据框:
image-20230903161349444.png4.构造PAC对象
###---------------------------------------------------------------------
## Generate PAC object (default S4)
pac_master <- make_PAC(pheno=pheno, counts=count_list$counts)
pac_master
PAC object with:
6 samples
8300 sequences
mean total counts: 4410 (min:4346/max:4476)
best sequence: 230 mean counts
worst sequence: 0 mean counts
至此,数据的分析对象已经构建好了,PAC对象,后续的分析基本上都是对这个PAC进行操作。
tRNA class analysis
我们本次主要关注tRNA class analysis,此包的其他分析可以见分析教程。
首先加载包中内置的PAC对象
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
# 查看加载的对象
pac
比对:
这个包内置比对工具为bwa,此处mismatches为0,还讲序列上下增加了3bp,参考序列为tRNA序列,来自GtRNAdb数据库:http://gtrnadb.ucsc.edu/。
###---------------------------------------------------------------------
## tRNA analysis in Seqpac
# First create an annotation blank PAC with group means
anno(pac) <- anno(pac)[,1, drop=FALSE]
pac_trna <- PAC_summary(pac, norm = "cpm", type = "means",
pheno_target=list("stage"), merge_pac = TRUE)
# Then reannotate only tRNA using the PAC_mapper function
ref <- "/some/path/to/trna/fasta/recerence.fa"
# or use the provided fasta
ref <- system.file("extdata/trna2", "tRNA2.fa",
package = "seqpac", mustWork = TRUE)
map_object <- PAC_mapper(pac_trna, ref=ref, N_up = "NNN", N_down = "NNN",
mismatches=0, threads=6, report_string=TRUE,
override=TRUE)
比对完后的map_object对象是一个list对象,如下:
image-20230903174938936.png每一个tRNA上比对到的tsRNA如下,每一行为一个tsRNA的序列:
image-20230903175956563.png先看看比对结果,PAC_covplot可以绘制每一个tRNA上的短片段tsRNA的比对情况:
# Hint: By adding N_up ad N_down you can make sure that modified fragments (like
# 3' -CAA in mature tRNA are included).
## Plot tRNA using xseq=TRUE gives you the reference sequence as X-axis:
# (OBS! Long reference will not show well.)
cov_tRNA <- PAC_covplot(pac_trna, map_object,
summary_target = list("cpmMeans_stage"),
xseq=TRUE, style="line",
color=c("red", "black", "blue"))
选一个tRNA查看,这里对每类stage进行的绘制:
image-20230903175253710.png也可以指定某个tRNA绘制查看:
# Targeting a single tRNA using a summary data.frame
PAC_covplot(pac_trna, map=map_object, summary_target= list("cpmMeans_stage"),
map_target="tRNA-Lys-CTT-1-9")
结果如下:
image-20230903175431933.png此外,还可以筛选出tRNA上比对read最多的进行展示:
# Find tRNAs with many fragments
# 1st extract number of rows from each alignment
n_tRFs <- unlist(lapply(map_object, function(x){nrow(x[[2]])}))
# The test data is highly down an filterd sampled, but still some with tRNA have
# a few alignment
table(n_tRFs)
names(map_object)[n_tRFs>2]
# Lets select a few of them and plot them
selct <- names(map_object)[n_tRFs>2][c(1, 16, 27, 37)]
cov_plt <- PAC_covplot(pac_trna, map=map_object,
summary_target= list("cpmMeans_stage"),
map_target=selct)
cowplot::plot_grid(plotlist=cov_plt, nrow=2, ncol=2)
选取的四个tRNA绘图结果如下:
image-20230903180248037.png接下来就是解析每个tsRNA的分类了:
解理tsRNA的分类,如5 '和3 'halves等,更复杂。这种分类涉及到tRNA环(A-loop、anticodon loop和T-loop)中可能发生切割的位置。有一些外部软件可以从二级结构模型中预测环路。预测tRNA二级结构最流行的工具之一是tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/)。这个软件的输出是一个ss文件,它以以下格式存储循环信息:
image-20230903180717224.pngSs-files for most common genomes are readily available for download at http://gtrnadb.ucsc.edu/。
根据ss文件,结合比对结果就可以使用tRNA_class函数对tsRNA进行分类注释了:
###---------------------------------------------------------------------
## Generate range types using a ss-file
# Download ss object from GtRNAdb
# ("http://gtrnadb.ucsc.edu/")
ss_file <- "/some/path/to/GtRNAdb/trna.ss"
# or for testing use the provided ss-file in secpac
ss_file <- system.file("extdata/trna2", "tRNA2.ss",
package = "seqpac", mustWork = TRUE)
# Classify fragments according to loop cleavage (small loops are omitted)
map_object_ss <- map_rangetype(map_object, type="ss",
ss=ss_file, min_loop_width=4)
# Remove reference tRNAs with no hits
map_object_ss <- map_object_ss[
!unlist(lapply(map_object_ss, function(x){x[[2]][1,1] == "no_hits"}))]
# Now we have quite comprehensive tRNA loop annotations
names(map_object_ss)
map_object_ss[[1]][[2]]
# Don't forget to check ?map_rangetype to obtain more options
###---------------------------------------------------------------------
## Function classifying 5'-tRF, 5'halves, i-tRF, 3'-tRF, 3'halves
# Does all tRNAs have 3 loops?
table(unlist(lapply(map_object_ss, function(x){unique(x$Alignments$n_ssloop)})))
# Set tolerance for classification as a terminal tRF
tolerance <- 5 # 2 nucleotides from start or end of full-length tRNA)
### Important:
# We set N_up and N_down to "NNN" in the PAC_mapper step. To make sure
# that we have a tolerance that include the original tRNA sequence
# we set terminal= 2+3 (5).
## tRNA classifying function
# Apply the tRNA_class function and make a tRNA type column
pac_trna <- tRNA_class(pac_trna, map=map_object_ss, terminal=tolerance)
##
## -- Anno target was specified, will retain: 29 of 9131 seqs.
anno(pac_trna)$type <- paste0(anno(pac_trna)$decoder, anno(pac_trna)$acceptor)
anno(pac_trna)[1:5, 1:4]
得到的分类结果如下:
image-20230903181248338.png查看每种tsRNA类别的占比:
###---------------------------------------------------------------------
## Plot tRNA types
# Now use PAC_trna to generate some graphs based on grand means
trna_result <- PAC_trna(pac_trna, norm="cpm", filter = NULL,
join = TRUE, top = 15, log2fc = TRUE,
pheno_target = list("stage", c("Stage1", "Stage3")),
anno_target_1 = list("type"),
anno_target_2 = list("class"))
cowplot::plot_grid(trna_result$plots$Expression_Anno_1$Grand_means,
trna_result$plots$Log2FC_Anno_1,
trna_result$plots$Percent_bars$Grand_means,
nrow=1, ncol=3)
结果如下:
image-20230903181537012.png每个tsRNA的最终注释结果如下:
anno_result <- anno(pac_trna)
image-20230903182329699.png
常见文献中tsRNA分类结果展示可以如下:
image-20230903182439352.pngref:https://doi.org/10.1186/s12958-022-00978-3 Reproductive Biology and Endocrinology (2022) 20:106
根据这个分类注释做个性化绘制吧~