tsRNA分类分析：seqac包

2023-09-02 本文已影响0人信你个鬼

标题：tsRNA分类分析：seqac包

tRNA的二级结构如下：

image-20230903154931207.png

根据其二级结构，tsRNA分类如下：

早期研究报道，tiRNAs通过血管生成素（ANG）切割成熟tRNA产生，而tRFs来源于Dicer或ANG切割成熟tRNA或tRNA前体。根据成熟或前体tRNA转录本中的裂解位点，可以分为以下几类：

tRF-1：来源于前体tRNA，在3' trailer部由RNase Z or ELAC2酶裂解产生。
tRF-3：来自成熟体tRNA的3'端
tRF-5：来自成熟体tRNA的5'端
tRF-i：来自成熟体tRNA的中间区域
tiRNAs：are generated by specific cleavage by angiogenin in the anticodon loops of mature tRNAs

image-20230619222644080.png

今天主要给大家介绍一个软件，可以根据tsRNA的序列以及tRNA的二级结构文件来鉴定tsRNA的分类，软件为Seqpac。

Seqpac主要用于small RNA Sequence数据分析，教程最新版链接：

https://www.bioconductor.org/packages/release/workflows/vignettes/seqpac/inst/doc/seqpac_-_A_guide_to_sRNA_analysis_using_sequence-based_counts.html

软件分析输入为fq文件，seqpac workflow的核心对象为：

The PAC object：这个对象有三个部分，Phenotypic information (P), Annotations (A) and Counts (C)。

这个PAC对象有两种格式， S3 list或者S4，并且可以使用函数进行互转，as(pac-object, "list")将S4转为S3， as.PAC(pac-object)将S3转为S4。

The Targeting objects：用于访问PAC对象内的特定数据子集。

seqpac的分析步骤主要为四步：

Generate a PAC object from fastq-files
Preprocess the PAC object
Annotate the PAC object
Analyze the PAC object.

示例数据

seqpac包中自带了一个示例数据，6 sRNA fastq文件。

image-20230823222638876.png

样本组织类型为果蝇胚胎，测序平台为Illumina NextSeq 500，测序读长为75个碱基。这里抽取原始fq的部分read进行演示。

安装包：

## Installation
devtools::install_github("Danis102/seqpac", upgrade="never", 
 build_manual=TRUE, build_vignettes=TRUE)

首先快速一览整个包的分析步骤：

# Setup
 library(seqpac)
 
 sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
 fastq <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE,
                 full.names = TRUE)
 ref_tRNA <- system.file("extdata/trna", "tRNA.fa",
                         package = "seqpac", mustWork = TRUE)
 
 # Trim NEB adapter, generate counts and create PAC-object (pheno, anno, counts)
 count_list  <- make_counts(fastq, plot = FALSE, 
                             parse="default_neb", trimming="seqpac")
 pac <- make_PAC(count_list$counts)
 pheno(pac)$groups <- c(rep("gr1",3), rep("gr2",3)) #Add groups to pheno table
 
 # Preprocess PAC-object and creat means
 pac <- PAC_norm(pac)                        # Default normalization is cpm
 pac <- PAC_filter(pac, size=c(15,60))       # Here, filter on sequence length
 pac <- PAC_summary(pac, norm = "cpm", type = "means", 
                    pheno_target=list("groups", unique(pheno(pac)$groups)))
 
 # Quickly align (annotate) and plot PAC-object
 map_tRNA <- PAC_mapper(pac, ref=ref_tRNA, override=TRUE)
 plts <- PAC_covplot(pac, map_tRNA, summary_target=list("cpmMeans_groups"))
 plts[[13]]

详细步骤

1.首先加载fq数据

# Use the ShortRead-package to randomly sample a fastq file
sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
fq <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE, full.names = TRUE)
fq
[1] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq1.fastq.gz"
[2] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq2.fastq.gz"
[3] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq3.fastq.gz"
[4] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq4.fastq.gz"
[5] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq5.fastq.gz"
[6] "C:/Users/86187/AppData/Local/R/win-library/4.2/seqpac/extdata/seqpac_fq6.fastq.gz"

2.生成count矩阵

使用make_counts函数

count_list <- make_counts(input=fq, trimming="seqpac", threads=16, parse="default_neb")

# 查看count_list的结构
str(count_list, max.level = 1)

List of 3
 $ counts         :'data.frame':    8300 obs. of  6 variables:
 $ progress_report:'data.frame':    6 obs. of  19 variables:
 $ evidence_plots :List of 2

3.构造样本表型矩阵

使用make_pheno函数

Sample_ID <- colnames(count_list$counts)
pheno_fl <- data.frame(Sample_ID=Sample_ID,
                    Treatment=c(rep("heat", times=1), 
                                rep("control", times=2)),
                    Batch=rep(c("1", "2", "3"), times=1))

pheno <- make_pheno(pheno=pheno_fl, 
                    progress_report=count_list$progress_report, 
                    counts=count_list$counts)

每个样本的表型为一个数据框：

image-20230903161349444.png

4.构造PAC对象

###--------------------------------------------------------------------- 
## Generate PAC object (default S4)
pac_master <- make_PAC(pheno=pheno, counts=count_list$counts)
pac_master

PAC object with: 
   6 samples
   8300 sequences
   mean total counts: 4410 (min:4346/max:4476)
   best sequence: 230 mean counts
   worst sequence: 0 mean counts

至此，数据的分析对象已经构建好了，PAC对象，后续的分析基本上都是对这个PAC进行操作。

tRNA class analysis

我们本次主要关注tRNA class analysis，此包的其他分析可以见分析教程。

首先加载包中内置的PAC对象

load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", 
                 package = "seqpac", mustWork = TRUE))
                 
# 查看加载的对象
pac

比对：

这个包内置比对工具为bwa，此处mismatches为0，还讲序列上下增加了3bp，参考序列为tRNA序列，来自GtRNAdb数据库：http://gtrnadb.ucsc.edu/。

###---------------------------------------------------------------------
## tRNA analysis in Seqpac 

# First create an annotation blank PAC with group means
anno(pac) <- anno(pac)[,1, drop=FALSE]
pac_trna <- PAC_summary(pac, norm = "cpm", type = "means", 
                        pheno_target=list("stage"), merge_pac = TRUE)

# Then reannotate only tRNA using the PAC_mapper function
ref <- "/some/path/to/trna/fasta/recerence.fa"  
# or use the provided fasta
ref <- system.file("extdata/trna2", "tRNA2.fa", 
                          package = "seqpac", mustWork = TRUE)          

map_object <- PAC_mapper(pac_trna, ref=ref, N_up = "NNN", N_down = "NNN", 
                         mismatches=0, threads=6, report_string=TRUE,
                         override=TRUE)

比对完后的map_object对象是一个list对象，如下：

image-20230903174938936.png

每一个tRNA上比对到的tsRNA如下，每一行为一个tsRNA的序列：

image-20230903175956563.png

先看看比对结果，PAC_covplot可以绘制每一个tRNA上的短片段tsRNA的比对情况：

# Hint: By adding N_up ad N_down you can make sure that modified fragments (like
# 3' -CAA in mature tRNA are included).

## Plot tRNA using xseq=TRUE gives you the reference sequence as X-axis:
# (OBS! Long reference will not show well.)                     
cov_tRNA <- PAC_covplot(pac_trna, map_object, 
                        summary_target = list("cpmMeans_stage"), 
                        xseq=TRUE, style="line", 
                        color=c("red", "black", "blue"))

选一个tRNA查看，这里对每类stage进行的绘制：

image-20230903175253710.png

也可以指定某个tRNA绘制查看：

# Targeting a single tRNA using a summary data.frame 
PAC_covplot(pac_trna, map=map_object, summary_target= list("cpmMeans_stage"), 
            map_target="tRNA-Lys-CTT-1-9")

结果如下：

image-20230903175431933.png

此外，还可以筛选出tRNA上比对read最多的进行展示：

# Find tRNAs with many fragments
# 1st extract number of rows from each alignment 
n_tRFs <- unlist(lapply(map_object, function(x){nrow(x[[2]])}))
# The test data is highly down an filterd sampled, but still some with tRNA have
# a few alignment
table(n_tRFs)
names(map_object)[n_tRFs>2]
# Lets select a few of them and plot them
selct <- names(map_object)[n_tRFs>2][c(1, 16, 27, 37)]
cov_plt <- PAC_covplot(pac_trna, map=map_object, 
                       summary_target= list("cpmMeans_stage"), 
                       map_target=selct)
                       
cowplot::plot_grid(plotlist=cov_plt, nrow=2, ncol=2)

选取的四个tRNA绘图结果如下：

image-20230903180248037.png

接下来就是解析每个tsRNA的分类了：

解理tsRNA的分类，如5 '和3 'halves等，更复杂。这种分类涉及到tRNA环(A-loop、anticodon loop和T-loop)中可能发生切割的位置。有一些外部软件可以从二级结构模型中预测环路。预测tRNA二级结构最流行的工具之一是tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/)。这个软件的输出是一个ss文件，它以以下格式存储循环信息：

image-20230903180717224.png

Ss-files for most common genomes are readily available for download at http://gtrnadb.ucsc.edu/。

根据ss文件，结合比对结果就可以使用tRNA_class函数对tsRNA进行分类注释了：

###---------------------------------------------------------------------
## Generate range types using a ss-file

# Download ss object from GtRNAdb 
# ("http://gtrnadb.ucsc.edu/")
ss_file <- "/some/path/to/GtRNAdb/trna.ss"
# or for testing use the provided ss-file in secpac
ss_file <- system.file("extdata/trna2", "tRNA2.ss", 
                          package = "seqpac", mustWork = TRUE)

# Classify fragments according to loop cleavage (small loops are omitted)       
map_object_ss <- map_rangetype(map_object, type="ss", 
                               ss=ss_file, min_loop_width=4)            

# Remove reference tRNAs with no hits
map_object_ss <-  map_object_ss[
  !unlist(lapply(map_object_ss, function(x){x[[2]][1,1] == "no_hits"}))]

# Now we have quite comprehensive tRNA loop annotations
names(map_object_ss)
map_object_ss[[1]][[2]]

# Don't forget to check ?map_rangetype to obtain more options 


###---------------------------------------------------------------------
## Function classifying 5'-tRF, 5'halves, i-tRF, 3'-tRF, 3'halves

# Does all tRNAs have 3 loops?
table(unlist(lapply(map_object_ss, function(x){unique(x$Alignments$n_ssloop)})))

# Set tolerance for classification as a terminal tRF
tolerance <- 5  # 2 nucleotides from start or end of full-length tRNA)

### Important:
# We set N_up and N_down to "NNN" in the PAC_mapper step. To make sure
# that we have a tolerance that include the original tRNA sequence
# we set terminal= 2+3 (5).  
## tRNA classifying function 
# Apply the tRNA_class function and make a tRNA type column
pac_trna <- tRNA_class(pac_trna, map=map_object_ss, terminal=tolerance)
## 
## -- Anno target was specified, will retain: 29 of 9131 seqs.
anno(pac_trna)$type <- paste0(anno(pac_trna)$decoder, anno(pac_trna)$acceptor)
anno(pac_trna)[1:5, 1:4]

得到的分类结果如下：

image-20230903181248338.png

查看每种tsRNA类别的占比：

###---------------------------------------------------------------------
## Plot tRNA types

# Now use PAC_trna to generate some graphs based on grand means
trna_result <- PAC_trna(pac_trna, norm="cpm", filter = NULL,
  join = TRUE, top = 15, log2fc = TRUE,
  pheno_target = list("stage", c("Stage1", "Stage3")), 
  anno_target_1 = list("type"),
  anno_target_2 = list("class"))
  
cowplot::plot_grid(trna_result$plots$Expression_Anno_1$Grand_means,
               trna_result$plots$Log2FC_Anno_1,
               trna_result$plots$Percent_bars$Grand_means,
               nrow=1, ncol=3)

结果如下：

image-20230903181537012.png

每个tsRNA的最终注释结果如下：

anno_result <- anno(pac_trna)

image-20230903182329699.png

常见文献中tsRNA分类结果展示可以如下：

image-20230903182439352.png

ref：https://doi.org/10.1186/s12958-022-00978-3 Reproductive Biology and Endocrinology (2022) 20:106

根据这个分类注释做个性化绘制吧~