pySCENIC

2019-06-26 本文已影响25人五谷吃不完了

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data

R环境中操作流程见:https://www.jianshu.com/p/b0fd795ad05c

分析步骤：

0. Preliminary work
1. Inference of co-expression modules
2. Prune modules for targets with cis regulatory footprints (aka RcisTarget)
3. Cellular regulon enrichment matrix (aka AUCell)

0. 前期参数准备


import os
import glob
import pickle
import pandas as pd
import numpy as np

from dask.diagnostics import ProgressBar

from arboreto.utils import load_tf_names
from arboreto.algo import grnboost2

from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies, load_motifs
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell

# 这个包我在导入的时候出现了如下报错：（Segmentation fault）
# 因为该函数只涉及到出图，因此没有导入的必要（后期结果直接导入到R中进行出图）
# import seaborn as sns

DATA_FOLDER="~/tmp"
RESOURCES_FOLDER="~/resources"
DATABASE_FOLDER = "~/databases/"
SCHEDULER="123.122.8.24:8786"
DATABASES_GLOB = os.path.join(DATABASE_FOLDER, "mm9-*.feather")
MOTIF_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDER, "motifs-v9-nr.mgi-m0.001-o0.0.tbl")
MM_TFS_FNAME = os.path.join(RESOURCES_FOLDER, 'mm_tfs.txt')
SC_EXP_FNAME = os.path.join(RESOURCES_FOLDER, "GSE60361_C1-3005-Expression.txt")
REGULONS_FNAME = os.path.join(DATA_FOLDER, "regulons.p")
MOTIFS_FNAME = os.path.join(DATA_FOLDER, "motifs.csv")

# 读入表达矩阵，表达矩阵的格式：横坐标是基因，纵坐标是细胞
ex_matrix = pd.read_csv(SC_EXP_FNAME, sep='\t', header=0, index_col=0).T
ex_matrix.shape

# 导入转录因子
tf_names = load_tf_names(MM_TFS_FNAME)

# 导入数据库
db_fnames = glob.glob(DATABASES_GLOB)
def name(fname):
    return os.path.basename(fname).split(".")[0]   
dbs = [RankingDatabase(fname=fname, name=name(fname)) for fname in db_fnames]
dbs

1. Inference of co-expression modules

# 两条命令解决
adjacencies = grnboost2(ex_matrix, tf_names=tf_names, verbose=True) #耗时
modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

2. Prune modules for targets with cis regulatory footprints (aka RcisTarget)

# Calculate a list of enriched motifs and the corresponding target genes for all modules.
with ProgressBar():
    df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME)

# Create regulons from this table of enriched motifs.
regulons = df2regulons(df)

# Save the enriched motifs and the discovered regulons to disk.
df.to_csv(MOTIFS_FNAME)
with open(REGULONS_FNAME, "wb") as f:
    pickle.dump(regulons, f)

3. Cellular regulon enrichment matrix (aka AUCell)

auc_mtx = aucell(ex_matrix, regulons, num_workers=4)
# 这一步出图
# sns.clustermap(auc_mtx, figsize=(8,8))

将结果导入到R

首先将pySCENIC中得到的结果保存成.loom格式

# 还是在刚才的环境中（Python）
from pyscenic.export import export2loom
export2loom(ex_mtx = ex_matrix, auc_mtx = auc_mtx, regulons = regulons, out_fname = "your path way/xxx.loom")
# 这里会有提示（见图1），按照提示改一下

export2loom(ex_mtx = ex_matrix, auc_mtx = auc_mtx, regulons = [r.rename(r.name.replace('(+)',' ('+str(len(r))+'g)')) for r in regulons], out_fname = "your path way/xxx.loom")
# 这一句话运行完毕后，会在指定目录下生成xxx.loom文件，这就是导入R所需要的文件

图1
然后进入R环境

# 保存xxx.loom文件的路径
pyScenicDir <- "pySCENIC_example/output"

library(SCENIC)
library(SCopeLoomR)

pyScenicLoomFile <- file.path(pyScenicDir, "SCENIC.loom")
loom <- open_loom(pyScenicLoomFile, mode="r")

# Read information from loom file:
regulons_incidMat <- get_regulons(loom)
regulons <- regulonsToGeneLists(regulons_incidMat)
regulonsAUC <- get_regulonsAuc(loom)
regulonsAucThresholds <- get_regulonThresholds(loom)
embeddings <- get_embeddings(loom)

exprMat <- get_dgem(loom)
cellInfo <- get_cellAnnotation(loom)
clusterings <- get_clusterings_withName(loom)

close_loom(loom)

以上过程完成之后，pySCENIC中的结果就成功导入到R环境中了，接下来就可以在R中出图（可参考：SCENIC—Single Cell rEgulatory Network Inference and Clustering）

所用数据集

# Transcription factors:
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/allTFs_hg38.txt 
# Motif to TF annotation database:
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/motifs.tbl
# Ranking databases:
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/genome-ranking.feather
# Finally, get a small sample expression matrix (loom format):
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/expr_mat.loom
# Alternatively, in tsv format:
wget https://raw.githubusercontent.com/aertslab/scenic-nf/master/example/expr_mat.tsv

参考链接：
Importing pySCENIC results
pyscenic
数据集

补充

当我用真实数据跑这一套流程的时候，在“将pySCENIC中得到的结果保存成.loom格式”中出现了如下报错：

网上搜到比较靠谱的解释是HDF5文件类型大小有限制（HDF5 has a header limit of 64kb for all metadata of the columns. This include name, types, etc. When you go about roughly 2000 columns, you will run out of space to store all the metadata.）

因为不是很懂怎么解决这个问题，再加上我只需要regulonAUC这个对象（对应着pyscenic当中的auc_mtx），所以我后面就没有用到上述提到的方法导入R

# 接上面的 Cellular regulon enrichment matrix (aka AUCell)
AUC_FNAME=os.path.join(DATA_FOLDER, "auc.tsv")
regulons = [r.rename(r.name.replace('(+)',' ('+str(len(r))+'g)')) for r in regulons]
auc_mtx = aucell(ex_matrix, regulons, num_workers=4)
auc_mtx.to_csv(AUC_FNAME)

# 进入R环境
# cellInfo就是SeuratObject@metadata
library(SCENIC) 
library(SCopeLoomR)
library(AUCell)
pyScenicDir=DATA_FOLDER
regulonAUC <- importAUCfromText(file.path(pyScenicDir,  "auc..tsv"))
regulonAUC <- regulonAUC[onlyNonDuplicatedExtended(rownames(regulonAUC)),]
regulonActivity_byCellType <- sapply(split(rownames(cellInfo), cellInfo$CellType),
                                     function(cells) rowMeans(getAUC(regulonAUC)[,cells]))
regulonActivity_byCellType_Scaled <- t(scale(t(regulonActivity_byCellType), center = T, scale=T))

pheatmap::pheatmap(regulonActivity_byCellType_Scaled, #fontsize_row=3, 
                   color=colorRampPalette(c("blue","white","red"))(100), breaks=seq(-3, 3, length.out = 100),
                   treeheight_row=10, treeheight_col=10, border_color=NA)

还有一种方法，因为这个包里面限速的步骤是GENIE3（R里面），也可以单独把这一步在Python中运行（GRNBoost，其他步骤可在R中运行），生成的结果保存到R中的/int目录下。具体操作：
https://github.com/aertslab/SCENIC/issues/40
https://arboreto.readthedocs.io/en/latest/examples.html