全长转录组 | Iso-Seq 三代测序数据分析流程 (PacB

2024-01-26 本文已影响0人三代测序说

使用三代长度长测序进行全长转录组高通量测序为数千种新转录本的发现铺平了道路，甚至在注释良好的哺乳动物物种中亦是如此；也为深入转录本水平表征基因的变化提供了强有力的技术手段。

Functional IsoTranscriptomics (FIT)是美国弗罗里达大学（University of Florida）Ana Conesa 教授团队（Genomics of Gene Expression Lab, ConesaLab）开发的在转录本isoform水平上进行生物信息学分析的流程，旨在提供一个全长转录组end-to-end的解决方案（图1）。SQANTI 3 构成了FIT流程的第一个模块，其设计目的是使长读序列定义的转录组的质量控制和过滤成为可能，这些转录本通常含有artifacts和假阳性。因此，对全长转录组进行校正是进行FIT分析的前提，且对产生可靠的、在生物学上合理的结论/假设至关重要。SQANTI 3 是SQANTI 工具（发布）的最新版本，该版本合并 SQANT 1 和 SQANTI 2 中的功能并加入了新的功能，更好的对全长转录本进行深度表征。

图1. Functional IsoTranscriptomics， FIT流程

SQANTI 3 主要设计用来完成两个同等重要的任务（图2）：

长读序列定义的转录组分类和质量控制（SQANTI3 QC）。SQANTI 3 定义的 isoform 对于参考注释的类别和亚类，加上一系列的转录本层面的属性和描述，允许用户仔细检查他们isoform模型的属性，以及鉴别在文库制备和原始数据处理过程中产生的潜在问题。
长读序列定义转录组的 artifacts 过滤（SQANTI3 filter）。利用 SQANTI 3 的计算，用户可以他们的转录组中移除潜在的假阳性isoforms。考虑到当前长读序列测序实验流程的biases和pitfalls，这一点尤其重要。

为了深入了解这两个步骤，我们鼓励阅读SQANTI原文。然而，最近我们在SQANTI 3 工作流程中加入了最后一个步骤（SQANTI3 rescue）：

参考转录本的补救（SQANTI3 rescue）。旨在通过参考基因组注释去除artifacts。该模块意在保持转录组的多样性，即防止失去那些有可信junction chains的转录本。通过运行SQANTI3 rescue程序，SQANTI 3 将选择已经被去除的aritfacts可信对应的参考转录本，并将它们添加回过滤后的转录组中。

图2. SQANTI 3的工作流程

用 SQANTI 3 生成高质量转录组之后，下游步骤包括（图1）：

对isoform模型进行功能注释，包括位置定义的功能特征，如motifs、结构域等。IsoAnnot是一个用于对isoform进行de novo注释的工具。目前正在开发中，但是用户可以在 SQANTI 3 内部或外部运行IsoAnnotLite，从其他已注释的转录组中推断功能特征。
使用tappAS进行基于表达的功能分析。tappAS是一个Java GUI应用程序，其利用表达和结构域和motifs注释信息，以深入了解可变剪接isoform表达对功能的影响。

在运行 SQANTI 3 之前：推荐的长读序列处理流程：

以下是我们建议的工作流程，包括生成 SQANTI 3 输入文件的最佳方式以及在质量控制和过滤后该如何进行：

混样（Sample pooling ）：尽管我们知道一些用户可能从多个重复实验和/或样品中获取了长读序列数据，但我们建议将所有长读样品数据合并起来，以构建每个实验的单一转录组。
使用您偏好的转录组构建工具处理长读数据。我们不建议在未经处理的长读序列（raw long reads)上使用 SQANTI 3，因为它不是作为长读序列数据质量控制的工具设计的。
转录本模型的合并。通常，长读序列处理流程会产生大量高度冗余的isoform模型。我们建议在使用SQANTI 3 前，先用cDNA_Cupcake(现在是isoseq collapse）或TAMA Collapse等工具来合并冗余的isoform，以供isoform的数量和质量。
质量控制和过滤：我们强烈建议用户尽可能仔细地检查他们的长读序列定义转录组，包括筛选转录组以移除可能的假阳性isoform，这在由长读序列生成的转录组中很常见。
使用短读/长读和相应的工具对过滤后的转录组进行定量。我们不推荐将输入到SQANTI 3的表达量估算用于下游分析，这些仅用于质量控制目的。一旦从转录组中移除了所有artifacts，这些序列可以被用来获得更准确的定量。

一、SQANTI3 v5.2 安装

SQANTI3的下载：

$ wget https://github.com/ConesaLab/SQANTI3/archive/refs/tags/v5.2.tar.gz
$ tar -xvf v5.2.tar.gz

SQANTI3的conda环境创建：

$ cd SQANTI3
$ conda env create -f SQANTI3.conda_env.yml
$ source activate SQANTI3.env

安装gtfToGenePred，将运行程序加入到路径当中

#将 gtfToGenePred 加入 PATH，这里根据自己的路径进行相应修改。
$ echo PATH=/mnt/data/home/mli/Desktop/Software/SQANTI3-5.2/utilities:$PATH >> ~/.bashrc && source ~/.bashrc
#将 SQANTI3 python脚本加入 PATH，这里根据自己的路径进行相应修改。
$ echo PATH=/mnt/data/home/mli/Desktop/Software/SQANTI3-5.2:$PATH >> ~/.bashrc && source ~/.bashrc

安装cDNA_Cupcake软件

#首先激活SQANTI3
$ source activate SQANTI3.env

#然后下载cDNA_Cupcake
(SQANTI3.env)$ git clone https://github.com/Magdoll/cDNA_Cupcake.git
(SQANTI3.env)$ cd cDNA_Cupcake

#安装cDNA_Cupcake
(SQANTI3.env)$ python setup.py build
(SQANTI3.env)$ python setup.py install

二、运行SQANTI3

1. 激活SQANTI3 conda 环境

(base)-bash-4.1$ conda activate SQANTI3.env
(SQANTI3.env)-bash-4.1$

2. 把 cDNA_Cupcake/sequence 文件夹路径加入$PYTHONPATH

#官方教程
(SQANTI3.env)-bash-4.1$ export PYTHONPATH=$PYTHONPATH:<path_to>/cDNA_Cupcake/sequence/
(SQANTI3.env)-bash-4.1$ export PYTHONPATH=$PYTHONPATH:<path_to>/cDNA_Cupcake/

#实际运行
(SQANTI3.env)-bash-4.1$ export PYTHONPATH=$PYTHONPATH:/mnt/data/home/mli/Desktop/Software/cDNA_Cupcake-master/sequence/
(SQANTI3.env)-bash-4.1$ export PYTHONPATH=$PYTHONPATH:/mnt/data/home/mli/Desktop/Software/cDNA_Cupcake-master

3. 运行SQANTI3 QC sqanti3_qc.py

官方示例

#进入 SQANTI3-5.2 文件夹
#修改  UHR_chr22_short_reads.fofn   中文件路径为如下：
example/UHR_Rep1_chr22.R1.fastq.gz example/UHR_Rep1_chr22.R2.fastq.gz
example/UHR_Rep2_chr22.R1.fastq.gz example/UHR_Rep2_chr22.R2.fastq.gz

#运行
$ sqanti3_qc.py example/UHR_chr22.gtf example/gencode.v38.basic_chr22.gtf example/GRCh38.p13_chr22.fasta    \
                     --CAGE_peak data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed    \
                     --polyA_motif_list data/polyA_motifs/mouse_and_human.polyA_motif.txt    \
                     -o UHR_chr22 -d example/SQANTI3_output -fl example/UHR_abundance.tsv    \
                     --short_reads example/UHR_chr22_short_reads.fofn --cpus 4 --report both

图3. Sqanti3_QC官方示例结果文件

实际案例，以第一部分URHH数据为例：

$ python sqanti3_qc.py example/UHRR.collapsed.gff example/GRCh38.gtf example/GRCh38.fa    \
      -o UHRR -d example/SQANTI3_UHRR_Output -fl example/UHRR.collapsed.flnc_count.txt \
      --isoAnnotLite --gff3 example/human_tappas_gencode_annotation_file.gff3 \
      --cpus 20 --report both

图4. UHRR数据Sqanti3_QC结果文件

具体的结果文件解读可以参照官方文档：Understanding-the-output-of-SQANTI3-QC 。

小贴士:
1). 我在其它文件夹运行时，老报错，经过输入文件的各种测试，估计是依赖文件的问题，因为示例能够成功运行；而在SQANTI3-5.2文件里运行就可以正常运行，只是需要把文件包括参考基因组及其注释文件都拷贝到exapmle文件夹里。
2). isoseq collapse 所产生的 *.collapsed.gff是.gff2格式，和sqanti3_qc.py输入文件.gtf格式一样，所以是通用的。不需要用gffread进行转化。
3). gffread的安装和使用。

$ conda install -c bioconda gffread
#GFF转换GTF
$ gffread input.gff3 -T -o out.gtf
#GTF转换GFF3
$ gffread input.gtf -o out.gff3

4). cDNA_cupcake 已经停止更新了，被官方iso-seq所吸纳取代，因此iso-seq collapse 和TAMA collapse的输出.gff文件都被官方建议作为sqanti3_qc.py的输入文件。

4. SQANTI3 QC 的参数

usage: sqanti3_qc.py [-h] [--min_ref_len MIN_REF_LEN] [--force_id_ignore] 
                     [--aligner_choice {minimap2,deSALT,gmap,uLTRA}] 
                     [--CAGE_peak CAGE_PEAK] [--polyA_motif_list POLYA_MOTIF_LIST]
                     [--polyA_peak POLYA_PEAK] [--phyloP_bed PHYLOP_BED] 
                     [--skipORF] [--is_fusion] [--orf_input ORF_INPUT] 
                     [--fasta] [-e EXPRESSION] [-x GMAP_INDEX] 
                     [-t CPUS] [-n CHUNKS] [-o OUTPUT] [-d DIR]
                     [-c COVERAGE] [-s SITES] [-w WINDOW] [--genename] 
                     [-fl FL_COUNT] [-v] [--saturation] 
                     [--report {html,pdf,both,skip}] 
                     [--isoAnnotLite] [--gff3 GFF3]
                     [--short_reads SHORT_READS] [--SR_bam SR_BAM] 
                     [--isoform_hits] 
                     [--ratio_TSS_metric {max,mean,median,3quartile}]
                     isoforms annotation genome

Structural and Quality Annotation of Novel Transcript Isoforms

positional arguments:
  isoforms              Isoforms (FASTA/FASTQ) or GTF format. It is recommended to 
                        provide them in GTF format, but if it is needed to map the sequences 
                        to the genome use a FASTA/FASTQ file with the
                        --fasta option.

  annotation            Reference annotation file (GTF format)

  genome                Reference genome (Fasta format)

options:
  -h, --help            show this help message and exit
  --min_ref_len MIN_REF_LEN
                        Minimum reference transcript length (default: 200 bp)

  --force_id_ignore     Allow the usage of transcript IDs non related with 
                        PacBio's nomenclature (PB.X.Y)

  --aligner_choice {minimap2,deSALT,gmap,uLTRA}

  --CAGE_peak CAGE_PEAK
                        FANTOM5 Cage Peak (BED format, optional)

  --polyA_motif_list POLYA_MOTIF_LIST
                        Ranked list of polyA motifs (text, optional)

  --polyA_peak POLYA_PEAK
                        PolyA Peak (BED format, optional)

  --phyloP_bed PHYLOP_BED
                        PhyloP BED for conservation score (BED, optional)

  --skipORF             Skip ORF prediction (to save time)

  --is_fusion           Input are fusion isoforms, must supply GTF as input

  --orf_input ORF_INPUT
                        Input fasta to run ORF on. By default, ORF is run on genome-corrected 
                        fasta - this overrides it. If input is fusion (--is_fusion), this must be provided for ORF prediction.

  --fasta               Use when running SQANTI by using as input a FASTA/FASTQ with the sequences of isoforms

  -e EXPRESSION, --expression EXPRESSION
                        Expression matrix (supported: Kallisto tsv)

  -x GMAP_INDEX, --gmap_index GMAP_INDEX
                        Path and prefix of the reference index created by gmap_build. Mandatory 
                        if using GMAP unless -g option is specified.

  -t CPUS, --cpus CPUS  Number of threads used during alignment by aligners. (default: 10)

  -n CHUNKS, --chunks CHUNKS
                        Number of chunks to split SQANTI3 analysis in for speed up (default: 1).

  -o OUTPUT, --output OUTPUT
                        Prefix for output files.

  -d DIR, --dir DIR     Directory for output files. Default: Directory where the script was run.

  -c COVERAGE, --coverage COVERAGE
                        Junction coverage files (provide a single file, comma-delmited filenames, 
                        or a file pattern, ex: "mydir/*.junctions").

  -s SITES, --sites SITES
                        Set of splice sites to be considered as canonical (comma-separated list of splice 
                        sites). Default: GTAG,GCAG,ATAC.

  -w WINDOW, --window WINDOW
                        Size of the window in the genomic DNA screened for Adenine content downstream of TTS

  --genename            Use gene_name tag from GTF to define genes. Default: gene_id used to define genes

  -fl FL_COUNT, --fl_count FL_COUNT
                        Full-length PacBio abundance file

  -v, --version         Display program version number.

  --saturation          Include saturation curves into report

  --report {html,pdf,both,skip}
                        select report format --html --pdf --both --skip

  --isoAnnotLite        Run isoAnnot Lite to output a tappAS-compatible gff3 file

  --gff3 GFF3           Precomputed tappAS species specific GFF3 file. It will serve as reference 
                        to transfer functional attributes

  --short_reads SHORT_READS
                        File Of File Names (fofn, space separated) with paths to FASTA or FASTQ from 
                        Short-Read RNA-Seq. If expression or 
                        coverage files are not provided, Kallisto (just for pair-end
                        data) and STAR, respectively, will be run to calculate them.

  --SR_bam SR_BAM       Directory or fofn file with the sorted bam files of Short Reads RNA-Seq mapped 
                        against the genome

  --isoform_hits        Report all FSM/ISM isoform hits in a separate file

  --ratio_TSS_metric {max,mean,median,3quartile}
                        Define which statistic metric should be reported in the ratio_TSS column

5. 运行SQANTI3 FILTER sqanti3_filter.py

通过设定过滤规则过滤Running the rules filter

#教程
$ python sqanti3_filter.py rules path/to/classification.txt 

#实际运行
$ python sqanti3_filter.py rules example/SQANTI3_UHRR_Output/UHRR_classification.txt

通过机器学习算法过滤Running the machine learning filter

#教程
$ python sqanti3_filter.py ml path/to/classification.txt

#实际运行
$ python sqanti3_filter.py ml example/SQANTI3_UHRR_Output/UHRR_classification.txt

6. 运行SQANTI3 RESCUE sqanti3_rescue.py

#rules实际运行
$ python sqanti3_rescue.py rules --isoforms *_corrected.fasta --gtf *.filtered.gtf -f GRCh38.gtf -k UHRR_classification.txt

#机器学习实际运行
$ python sqanti3_rescue.py ml --isoforms *_corrected.fasta --gtf *.filtered.gtf -f GRCh38.gtf -k UHRR_classification.txt

三、安装SQANTI3可能遇到的问题

1. 如果遇到下面这个问题：

Error compiling Cython file:
------------------------------------------------------------
...
                exon_tree.insert_interval(Interval(e_start+offset, i+offset, index))
                index += 1
                tag = False
            elif baseC[i] > 0 and (altC_pos[i] > threshSplit or altC_neg[i+1] < -threshSplit): # alt. junction found!
                # end the current exon at i and start a new one at i + 1
                print "alt. junction found at", i
                      ^
------------------------------------------------------------

cupcake/tofu/branch/c_branch.pyx:30:22: Syntax error in simple statement list
Traceback (most recent call last):
  File "/mnt/data/home/mli/Desktop/Software/cDNA_Cupcake-master/setup.py", line 25, in <module>
    ext_modules = cythonize(ext_modules),
  File "/mnt/data/home/mli/miniforge-pypy3/envs/SQANTI3.env/lib/python3.10/site-packages/Cython/Build/Dependencies.py", line 1154, in cythonize
    cythonize_one(*args)
  File "/mnt/data/home/mli/miniforge-pypy3/envs/SQANTI3.env/lib/python3.10/site-packages/Cython/Build/Dependencies.py", line 1321, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: cupcake/tofu/branch/c_branch.pyx

解决方法:

编辑 setup.py文件中的 第25行：

把 ext_modules = cythonize(ext_modules),
改为 ext_modules = cythonize(ext_modules, language_level = "2"),
再次运行：
$ python setup.py build

2. 如果遇到下面这个问题：

SQANTI3.env) mli@ca496d1fda31:~$ sqanti3_qc.py 
Traceback (most recent call last):
  File "/mnt/data/home/mli/Desktop/Software/SQANTI3-5.2/sqanti3_qc.py", line 20, in <module>
    from scipy import mean
ImportError: cannot import name 'mean' from 'scipy' (/mnt/data/home/mli/miniforge-pypy3/envs/SQANTI3.env/lib/python3.10/site-packages/scipy/__init__.py)

解决方法:
经过测试，应该是python版本的问题。SQANTI3.cond_env.yml第29行，- python>=3.7.6，这样conda自动安装最新的python3.10。在python环境中运行from scipy import mean报错。我把- python>=3.7.6改成-python = 3.8.13。

删除SQANTI3旧环境：

$ conda env remove --name SQANTI3.env

重新运行：

$ conda env create -f SQANTI3.conda_env.yml

然后就可以了，cheers！

全长转录组 | Iso-Seq 三代测序数据分析流程 (PacB

SQANTI 3 主要设计用来完成两个同等重要的任务（图2）：

用 SQANTI 3 生成高质量转录组之后，下游步骤包括（图1）：

在运行 SQANTI 3 之前：推荐的长读序列处理流程：

一、SQANTI3 v5.2 安装

二、运行SQANTI3

三、安装SQANTI3可能遇到的问题

猜你喜欢

热点阅读

全长转录组 | Iso-Seq 三代测序数据分析流程 (PacB

SQANTI 3 主要设计用来完成两个同等重要的任务 （图2）：

用 SQANTI 3 生成高质量转录组之后，下游步骤包括（图1）：

在运行 SQANTI 3 之前：推荐的长读序列处理流程：

一、SQANTI3 v5.2 安装

二、运行SQANTI3

三、安装SQANTI3可能遇到的问题

猜你喜欢

热点阅读

SQANTI 3 主要设计用来完成两个同等重要的任务（图2）：