单细胞转录组数据分析|| scVelo 教程：RNA速率分析工具

2020-03-30 本文已影响0人周运来就是我

什么是RNA 速率（RNA velocity）

RNA速率能够通过叠加剪接信息来推断细胞分化的方向性。

随着令人震惊的发现，未剪接和剪接的mRNA丰度可以在标准的单细protocols胞中进行区分，La Manno等人(2018)引入了RNA速度(RNA velocity)的概念。通过将测量结果与潜在的mRNA剪接动力学联系起来，探索定向轨迹的推断：特定基因的转录诱导导致(新转录的)前体未剪接mRNA的增加，而相反，转录的抑制或缺失导致未剪接mRNA的减少。因此，通过将未剪接的mRNA与成熟的剪接mRNA进行区分，可以近似地得到mRNA丰度的变化，即其时间导数，即RNA速度。跨mrna的速度组合可以用来估计单个细胞的未来状态。细胞的运动是通过将速度投影到低维嵌入中来实现的。

scVelo如何估计RNA速度

RNA速度估计目前有三种方法:

稳态/确定性模型(在velocyto中使用)
随机模型(使用二阶矩)
动态模型(使用基于概率的框架)

velocyto中使用的稳态/确定性模型对速度的估计如下:在假定转录阶段(诱导和抑制)持续足够长的时间以达到稳态平衡(活跃和不活跃)的情况下，速度被量化为实际观测值如何偏离稳态平衡。平衡mRNA水平近似于在假定的上下分位数的稳定状态下的线性回归。这种简化是通过假设一个跨基因的通用剪接率和数据中反映的稳态mRNA水平来实现的。这可能导致速度估计和细胞状态的错误，因为这些假设经常被违背，特别是当一个种群包含多个异质亚种群动态时。

随机模型的目标是更好地捕捉稳态，但与稳态模型的假设相同。它是通过处理转录，剪接和降解作为概率事件，从而纳入二阶矩。也就是说，稳态水平不仅与mRNA水平近似，而且与内在表达变异性近似。

动态模型(最强大的，同时计算最耗资源)解决了每个基因的剪接动力学的全部动态。因此，它使RNA速度适应广泛变化的规格，如非平稳群体，因为它不依赖于限制一个共同的剪接率或稳态被采样。通过迭代估计反应速率和潜在细胞特异性变量的可识别参数，即转录状态和细胞内潜伏时间，在基于概率的期望最大化框架中求解剪接动力学。更准确地说，我们明确地模拟了四种转录状态，以考虑基因活动的所有可能配置：两个动态瞬态(诱导和抑制)和两个稳态(活跃和不活跃)在每个动态转变后可能达到的状态。对于每个状态下的每个观测值，计算一个模型的最优潜伏期，以获得一个映射到非剪接/剪接动态的学习曲线上。然后，细胞到曲线的映射产生各自状态的可能性，并通过最大可能的整体反应速率参数。

这产生了更一致的速度估计和更好的识别转录状态。该模型进一步能够以一种基于概率的方式系统地识别动态驱动基因，从而找到控制细胞命运转变的关键驱动因素。此外，动态模型推断了一个普遍的细胞内潜伏时间共享的基因，使相关基因和识别转录变化的机制(如分支点)。

为了得到最好的结果，我们显然推荐使用更优的动态模型。如果运行时很重要，我们建议使用随机模型，它是用来近似动态模型的。随机模型在30k细胞上用时不到一分钟。然而，动力仍然可以花一个小时，然而，提高效率正在进行中。

安装

#conda install -c conda-forge numba pytables louvain
pip install -U scvelo

## or # git clone git+https://github.com/theislab/scvelo
## pip install -e scvelo

比对

为了获得MRNA剪切与为剪切的信息，我们需要比对后的bam文件，比对可以采用不同的方法:

scVelo in action

如果我们已经熟悉scanpy的基本操作，理解scVelo 就不是难事了，因为二者的数据结构是一样的。

# -*- coding: utf-8 -*-
"""
Created on Fri Mar 27 08:34:49 2020
https://scvelo.readthedocs.io/
@author: Administrator
"""
import scvelo as scv
import scanpy as sc

scv.logging.print_version()
scv.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
scv.settings.set_figure_params('scvelo')  # for beautified visualization
scv.settings.set_figure_params('scvelo')


# filename = 'E:\learnscanpy\write\pbmc3k.h5ad'
# adata = scv.read(filename, cache=False)

loomf = 'E:\learnscanpy\data\possorted_genome_bam_AYDSF.loom'
adata = scv.read(loomf, cache=False)
adata.var_names_make_unique

这里的possorted_genome_bam_AYDSF.loom是我们10X的数据比对后用velocyto处理得到的loom文件。

我们来看看数据结构：

adata
Out[6]: 
AnnData object with n_obs × n_vars = 7292 × 33694 
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced'

adata.var
Out[7]: 
                     Accession Chromosome       End     Start Strand
FAM138A        ENSG00000237613          1     36081     34554      -
RP11-34P13.7   ENSG00000238009          1    133723     89295      -
RP11-34P13.8   ENSG00000239945          1     91105     89551      -
RP11-34P13.14  ENSG00000239906          1    140339    139790      -
FO538757.2     ENSG00000279457          1    200322    184923      -
                       ...        ...       ...       ...    ...
AC006386.1     ENSG00000279115          Y  25308107  25307702      +
AC006328.4     ENSG00000280301          Y  25473714  25463994      +
CSPG4P1Y       ENSG00000240450          Y  25486705  25482908      +
CDY1           ENSG00000172288          Y  25624902  25622162      +
TTTY3          ENSG00000231141          Y  25733388  25728490      +

[33694 rows x 5 columns]

adata.layers['matrix']
Out[8]: 
<7292x33694 sparse matrix of type '<class 'numpy.float32'>'
    with 17201128 stored elements in Compressed Sparse Row format>

数据预处理

检测基因选择(用最少的计数检测)和高变异性(离散度)。-根据每个细胞的初始大小对其进行归一化，并对X取对数。


scv.pp.filter_genes(adata, min_shared_counts=10)
scv.pp.normalize_per_cell(adata)
scv.pp.filter_genes_dispersion(adata, n_top_genes=3000)
scv.pp.log1p(adata)

此外，我们需要在PCA空间中计算最近邻的一阶和二阶矩(基本上是均值和无中心方差)。确定性速度估计需要一阶矩，而随机估计也需要二阶矩。

scv.pp.filter_and_normalize(adata, min_shared_counts=30, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

adata
Out[15]: 
AnnData object with n_obs × n_vars = 7292 × 1999 
    obs: 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'means', 'dispersions', 'dispersions_norm'
    uns: 'pca', 'neighbors', 'connectivities_key', 'distances_key'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced', 'Ms', 'Mu'
    obsp: 'distances', 'connectivities'

计算velocity 和velocity 图

通过拟合前体(未剪接的)和成熟(剪接的)mRNA丰度之间的比值来获得基因特异性速度，该比值很好地解释了稳定状态(恒定转录状态)，然后计算观察到的丰度如何偏离稳定状态下的期望。

scv.tl.velocity(adata)

computing velocities
    finished (0:00:01) --> added 
    'velocity', velocity vectors for each individual cell (adata.layers)

这计算了在高维空间中势元跃迁与速度矢量的(余弦)相关性。结果的velocity 图nobs×nobs，并总结了可能的细胞状态变化(从一个细胞到另一个细胞的过渡)，这些变化通过速度向量得到了很好的解释。如果你设置approx=True，它是在一个减少的PCA空间上计算的，有50个PC。

然后，通过在余弦相关上应用高斯核，将速度图转换为一个过渡矩阵，该高斯核赋予与速度向量相关的细胞状态变化的高概率。可以通过scv.tl.transition_matrix访问马尔可夫转移矩阵。由此产生的转换矩阵可用于以下所示的各种应用程序。例如，它通过简单地应用相对于转移概率的平均转移，即scv.tl. velocity_embedded，来将速度放入低维嵌入中。此外，通过scv.tl.terminal_states,我们可以沿着马尔科夫链追溯细胞的起源和潜在的命运，从而获得轨迹中的根细胞和终点。

scv.tl.velocity_graph(adata)

computing velocity graph

100%    finished (0:00:04) --> added 
    'velocity_graph', sparse matrix with cosine correlations (adata.uns)
Out[18]: 
AnnData object with n_obs × n_vars = 7292 × 1999 
    obs: 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts', 'velocity_self_transition'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'means', 'dispersions', 'dispersions_norm', 'velocity_gamma', 'velocity_r2', 'velocity_genes'
    uns: 'pca', 'neighbors', 'connectivities_key', 'distances_key', 'velocity_settings', 'velocity_graph', 'velocity_graph_neg'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced', 'Ms', 'Mu', 'velocity', 'variance_velocity'
    obsp: 'distances', 'connectivities'

可视化结果

在可视化之前我们先计算一下umap:

scv.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.leiden(adata,adjacency = adata.obsp['connectivities'])

vdata =  adata 
#adata.uns['neighbors']['indices']
#adata.uns['connectivities_key']
#adata.obsp['connectivities'].tocoo()
scv.tl.umap(adata)

scv.pl.velocity_embedding_stream(adata, basis='umap', color=['leiden','initial_size_spliced'])

computing velocity embedding
    finished (0:00:01) --> added
    'velocity_umap', embedded velocity vectors (adata.obsm)

基因排序

help(scv.tl.rank_velocity_genes) # 改造函数
Help on function rank_velocity_genes in module scvelo.tools.rank_velocity_genes:

rank_velocity_genes(data, vkey='velocity', n_genes=10, groupby=None, match_with=None, resolution=None, min_counts=None, min_r2=None, min_dispersion=None, min_likelihood=None, copy=False)
    Rank genes for velocity characterizing groups.
    
    This applies a differential expression test (Welch t-test with overestimated variance to be conservative) on
    velocity expression, to find genes in a cluster that show dynamics that is transcriptionally regulated differently
    compared to all other clusters (e.g. induction in that cluster and homeostasis in remaining population).
    If no clusters are given, it priorly computes velocity clusters by applying louvain modularity on velocity expression.
    
    .. code:: python
    
        scv.tl.rank_velocity_genes(adata, groupby='clusters')
        scv.pl.scatter(adata, basis=adata.uns['rank_velocity_genes']['names']['Beta'][:3])
        pd.DataFrame(adata.uns['rank_velocity_genes']['names']).head()
    
    .. image:: https://user-images.githubusercontent.com/31883718/69626017-11c47980-1048-11ea-89f4-df3769df5ad5.png
       :width: 600px
    
    .. image:: https://user-images.githubusercontent.com/31883718/69626572-30774000-1049-11ea-871f-e8a30c42f10e.png
       :width: 600px
    
    Arguments
    ----------
    data : :class:`~anndata.AnnData`
        Annotated data matrix.
    vkey: `str` (default: `'velocity'`)
        Key of velocities computed in `tl.velocity`
    n_genes : `int`, optional (default: 100)
        The number of genes that appear in the returned tables.
    groupby: `str`, `list` or `np.ndarray` (default: `None`)
        Key of observations grouping to consider.
    match_with: `str` or `None` (default: `None`)
        adata.obs key to separatively rank velocities on.
    resolution: `str` or `None` (default: `None`)
        Resolution for louvain modularity.
    min_counts: `float` (default: None)
        Minimum count of genes for consideration.
    min_r2: `float` (default: None)
        Minimum r2 value of genes for consideration.
    min_dispersion: `float` (default: None)
        Minimum dispersion norm value of genes for consideration.
    min_likelihood: `float` between `0` and `1` or `None` (default: `None`)
        Only rank velocity of genes with a likelihood higher than min_likelihood.
    copy: `bool` (default: `False`)
        Return a copy instead of writing to data.
    
    Returns
    -------
    Returns or updates `data` with the attributes
    rank_velocity_genes : `.uns`
        Structured array to be indexed by group id storing the gene
        names. Ordered according to scores.
    velocity_score : `.var`
        Storing the score for each gene for each group. Ordered according to scores.

scv.tl.rank_velocity_genes(adata, match_with='leiden', resolution=.4)

import pandas as pd
pd.DataFrame(adata.uns['rank_velocity_genes']['names']).head()

Out[27]: 
        0 0       1 1      4 2       7 3  ...      0 9      0 10  10 11      5 12
0      GNLY      NKG7      HCK      CCL5  ...     CD36  HLA-DQB1  PTGS1      LST1
1      MNDA     RGS18    SMCO4     CD160  ...    DMXL2  HLA-DQA1   CD22      CD22
2   CLEC10A   METTL17    DMXL2    SH2D1B  ...     CD74    CCDC50  RGS18    CCDC50
3  HLA-DPA1  HLA-DPA1  CYP27A1     RGS18  ...  CLEC10A   TSPAN13    MMD  HLA-DQB1
4   HLA-DRA     DDX11      FGR  HLA-DPA1  ...      MMD      CD22    FGR  HLA-DQA1

[5 rows x 13 columns]

scv.pl.velocity_embedding_grid(adata, color='GNLY',
                               layer=['velocity', 'spliced'], arrow_size=1.5)

特定基因的剪切状态，velocity动态以及表达情况，直接整理成可以发表的形式：

var_names = ['GNLY', 'RGS18', 'DMXL2', 'SH2D1B']
scv.pl.velocity(adata, var_names=var_names, colorbar=True, ncols=2)

help(scv.pl.velocity_graph)
scv.pl.velocity_graph(adata,color='GNLY')

动力学分析

 help(scv.tl.recover_dynamics)
Help on function recover_dynamics in module scvelo.tools.dynamical_model:

recover_dynamics(data, var_names='velocity_genes', n_top_genes=None, max_iter=10, assignment_mode='projection', t_max=None, fit_time=True, fit_scaling=True, fit_steady_states=True, fit_connected_states=None, fit_basal_transcription=None, use_raw=False, load_pars=None, return_model=None, plot_results=False, steady_state_prior=None, add_key='fit', copy=False, **kwargs)
    Recovers the full splicing kinetics of specified genes.
    
    The model infers transcription rates, splicing rates, degradation rates,
    as well as cell-specific latent time and transcriptional states, estimated iteratively by expectation-maximization.
    
    .. image:: https://user-images.githubusercontent.com/31883718/69636459-ef862800-1056-11ea-8803-0a787ede5ce9.png
    
    Arguments
    ---------
    data: :class:`~anndata.AnnData`
        Annotated data matrix.
    var_names: `str`,  list of `str` (default: `'velocity_genes`)
        Names of variables/genes to use for the fitting.
    n_top_genes: `int` or `None` (default: `None`)
        Number of top velocity genes to use for the dynamical model.
    max_iter:`int` (default: `10`)
        Maximal iterations in the EM-Algorithm.
    assignment_mode: `str` (default: `projection`)
        Determined how times are assigned to observations.
        If `projection`, observations are projected onto the model trajectory.
        Else uses an inverse approximating formula.
    t_max: `float`, `False` or `None` (default: `None`)
        Total range for time assignments.
    fit_scaling: `bool` or `float` or `None` (default: `True`)
        Whether to fit scaling between unspliced and spliced or keep initially given scaling fixed.
    fit_time: `bool` or `float` or `None` (default: `True`)
        Whether to fit time or keep initially given time fixed.
    fit_steady_states: `bool` or `None` (default: `True`)
        Allows fitting of observations to steady states next to repression and induction.
    fit_connected_states: `bool` or `None` (default: `None`)
        Restricts fitting to neighbors given by connectivities.
    fit_basal_transcription: `bool` or `None` (default: `None`)
        Enables model to incorporate basal transcriptions.
    use_raw: `bool` or `None` (default: `None`)
        if True, use .layers['sliced'], else use moments from .layers['Ms']
    load_pars: `bool` or `None` (default: `None`)
        Load parameters from past fits.
    return_model: `bool` or `None` (default: `True`)
        Whether to return the model as :DynamicsRecovery: object.
    plot_results: `bool` or `None` (default: `False`)
        Plot results after parameter inference.
    steady_state_prior: list of `bool` or `None` (default: `None`)
        Mask for indices used for steady state regression.
    add_key: `str` (default: `'fit'`)
        Key to add to parameter names, e.g. 'fit_t' for fitted time.
    copy: `bool` (default: `False`)
        Return a copy instead of writing to `adata`.
    
    Returns
    -------
    Returns or updates `adata`

scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)

adata
Out[37]: 
AnnData object with n_obs × n_vars = 7292 × 1999 
    obs: 'initial_size_unspliced', 'initial_size_spliced', 'initial_size', 'n_counts', 'velocity_self_transition', 'leiden', 'velocity_clusters'
    var: 'Accession', 'Chromosome', 'End', 'Start', 'Strand', 'means', 'dispersions', 'dispersions_norm', 'velocity_gamma', 'velocity_r2', 'velocity_genes', 'velocity_score', 'fit_alpha', 'fit_beta', 'fit_gamma', 'fit_t_', 'fit_scaling', 'fit_std_u', 'fit_std_s', 'fit_likelihood', 'fit_u0', 'fit_s0', 'fit_pval_steady', 'fit_steady_u', 'fit_steady_s', 'fit_alignment_scaling', 'fit_r2'
    uns: 'pca', 'neighbors', 'connectivities_key', 'distances_key', 'velocity_settings', 'velocity_graph', 'velocity_graph_neg', 'leiden', 'umap', 'leiden_colors', 'rank_velocity_genes', 'recover_dynamics'
    obsm: 'X_pca', 'X_umap', 'velocity_umap'
    varm: 'PCs', 'loss'
    layers: 'matrix', 'ambiguous', 'spliced', 'unspliced', 'Ms', 'Mu', 'velocity', 'variance_velocity', 'fit_t', 'fit_tau', 'fit_tau_', 'velocity_u'
    obsp: 'distances', 'connectivities'

scv.pl.scatter(adata, color='latent_time', fontsize=24, size=100,
               color_map='gnuplot', perc=[2, 98], colorbar=True, rescale_color=[0,1])

top_genes = adata.var_names[adata.var.fit_likelihood.argsort()[::-1]][:300]
scv.pl.heatmap(adata, var_names=top_genes, tkey='latent_time', n_convolve=100, col_color='leiden')

 top_genes
Out[43]: 
Index(['TBXAS1', 'NKG7', 'PRAM1', 'HCK', 'SMCO4', 'CD22', 'SHTN1', 'CD74',
       'HLA-DQA1', 'HLA-DRA',
       ...
       'GSPT1', 'GALNS', 'NDUFAB1', 'NDUFV2-AS1', 'DHX40', 'PPM1D', 'BCAS3',
       'METTL2A', 'TANC2', 'DDX42'],
      dtype='object', length=300)

scv.pl.scatter(adata, basis=top_genes[:10], legend_loc='none',
               size=80, frameon=False, ncols=5, fontsize=20,color='leiden')

scv.pl.scatter(adata, x='latent_time', y=top_genes[:4], fontsize=16, size=100,
               n_convolve=None, frameon=False, legend_loc='none',color='leiden')