利用MAGeCK算法处理CRISPR Screen数据

2022-03-10 本文已影响0人凯凯何_Boy

上次文章结尾时候提到了MAGeCK RRA算法处理，这次我们就来学习一下，Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout(MAGeCK) 是一个可以从全基因组CRISPR-CAS9筛查技术中识别重要基因计算工具。Mageck是由Wei Li 和 Shirley Liu lab共同开发维护的。

Tutorial地址：https://sourceforge.net/p/mageck/wiki/

Paper：Li, et al. MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biology 15:554 (2014)

安装

Conda安装(推荐)

conda create -c bioconda -n mageckenv mageck  # 环境中python版本要大于3
source activate mageckenv
conda update mageck
source deactivate

Docker安装

可以follow这个流程https://sourceforge.net/p/mageck/wiki/demo/#tutorial-3-run-mageck-on-docker-image

源码安装

最新版的MAGeCK(0.5.9)下载地址为：https://sourceforge.net/projects/mageck/files/latest/download

tar xvzf mageck-0.5.4.tar.gz
cd mageck-0.5.4
python setup.py install
## 如果你想要安装MAGeCK安装到你自己的目录下
python setup.py install --user
## $HOME是你的root目录
python setup.py install --prefix=$HOME
## 设置环境变量
export PATH=$PATH:$HOME/bin

MAGECK RRA算法

下载测试数据

我们直接使用 Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library这篇文章中的数据集中的两个样本为示例，在本文中，作者对小鼠ESC细胞进行了CRISPR / CAS9筛选，并鉴定了小鼠ESC细胞中必需的基因。

准备文件：

测序数据：对照组 ERR376998 (one replicate of plasmid)、处理组 ERR376999 (one replicate of ESC)
sgRNA库文件：yusa_library.csv 包含测序sgRNAid、序列、靶向基因

sgRNA文库

数据下载地址：http://www.ebi.ac.uk/ena/data/view/ERP003292

sgRNA库文件：https://sourceforge.net/projects/mageck/files/libraries/

数据的下载方式可以参考一文。

数据下载后进行解压

gunzip ERR376998.fastq.gz
gunzip ERR376999.fastq.gz

count获取样本比对数据

mageck count -l yusa_library.csv -n escneg --sample-label "plasmid,ESC1" --fastq ERR376998.fastq  ERR376999.fastq

参数	解释
-l yusa_library.csv	提供的sgRNA信息包含sgRNA id,序列和靶向的基因(支持csv与tab分割)
-n demo	输出文件的前缀
--sample-label L1,CTRL	L1 (test1.fastq) and CTRL (test2.fastq)两个样本的标签
--fastq test1.fastq test2.fastq	测序的fastq序列
--pdf-report	对测序数据的统计信息提供pdf报告文档
--trim-5 （可选）	切除5段接头序列

我们打开ERR376998文件可以看到各序列的前23个碱基是一致的，我们可以设置--trim-5参数手动切除这些序列，但是从version 0.5.6版本后, MAGeCK现在可以自动识别fastq文件中一致序列进行切除，因此这个参数是可选的。

image-20220301135529297

test比较样本差异

当我们得到count矩阵后，就可以用test命令进行组间差异分析

mageck test -k escneg.count.txt -t ESC1 -c plasmid -n esccp

参数	解释
-k escneg.count.txt	count命令得到的各样本read count矩阵
-t	指定处理组数据
-c	指定对照组数据
-n demo	输出文件的前缀

如果成功，可以看到"esccp.gene_summary.txt"文件, 可以看到两个核糖体基因 RPS5和RP19在阴性筛选的前几名

id      num     neg|score  neg|p-value   neg|fdr neg|rank        neg|goodsgrna   pos|score  pos|p-value   pos|fdr pos|rank  pos|goodsgrna
GTF2B   5       2.0462e-10      2.5851e-07      0.000707        1       5       1.0     1.0     1.0     19150   0
RPS5    5       5.9353e-10      2.5851e-07      0.000707        2       5       1.0     1.0     1.0     19149   0
RPL19   4       2.695e-09       2.5851e-07      0.000707        3       4       1.0     1.0     1.0     19148   0
KIF18B  5       1.0136e-08      2.5851e-07      0.000707        4       5       1.0     1.0     1.0     19146   0

如果我们想看阳性筛选排序，可以将第11列进行排序，发现 TPR53在前几名

## 排序
sort -k 11,11n esccp.gene_summary.txt | less

id      num     neg|score  neg|p-value   neg|fdr neg|rank        neg|goodsgrna   pos|score  pos|p-value   pos|fdr pos|rank  pos|goodsgrna
ZFP945  5       1.0     1.0     0.999999        19150   0       9.6166e-07      5.4287e-06      0.05198 1  5
TRP53   5       0.95411 0.95409 0.999999        17901   0       1.0347e-06      5.4287e-06      0.05198 2  4
PDAP1   5       0.85937 0.86223 0.999999        15753   1       7.6412e-06      2.8178e-05      0.174505  3       2

文件信息如下：

Column	Content
id	Gene ID
num	The number of targeting sgRNAs for each gene
neg\|score	The RRA lo value of this gene in negative selection
neg\|p-value	The raw p-value (using permutation) of this gene in negative selection
neg\|fdr	The false discovery rate of this gene in negative selection
neg\|rank	The ranking of this gene in negative selection
neg\|goodsgrna	The number of "good" sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in negative selection.
neg\|lfc	The log2 fold change of this gene in negative selection. The way to calculate gene lfc is controlled by the --gene-lfc-method option
pos\|score	The RRA lo value of this gene in positive selection
pos\|p-value	The raw p-value (using permutation) of this gene in positive selection
pos\|fdr	The false discovery rate of this gene in positive selection
pos\|rank	The ranking of this gene in positive selection
pos\|goodsgrna	The number of "good" sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in positive selection.
pos\|lfc	The log fold change of this gene in positive selection

同样我们可以添加--pdf-report参数，生成一个统计报告，便于分享展示。

image-20220301141757379

【关于筛选】：这是一项重要步骤，根据研究目的不同，CRISRP共同量测序分为阳性筛选和阴性筛选。阳性筛选指的是施加一定筛选压力，经过文库扰动后野生型细胞致死，仅有获得抗性的细胞存活，与阳性筛选不同，阳性筛选是通过比较不同筛选时间点的sgRNA丰度差，获得缺失的sgRNA，分析所对应的基因，从而找出可能影响细胞生长的候选基因。

MAGeCK MLE算法

除了传统的RRA算法，从0.5版本以后，MAGeCK提供了一个新的算法MLE，计算一个称为beta score的指标代表基因的重要程度，其实和转录组中差异分析中的log fold change指标类似，相比于RRA算法，这种方法有以下优势：

每个基因只会得到一个分数，而不是RRA中的两个分数：一个用于阳性选择，一个用于阴性选择。
它允许在多种条件(多组情况)下直接比较。
它包含各个sgRNA的干扰效率信息。

构建design matrix文件

这个design matrix文件表明那个样本被哪个条件所影响，通常是一个二进制矩阵，指示哪个样本（第一列指示）受到哪个条件的影响（由第一行表示）。

对于以上的数据对照组 ERR376998 (one replicate of plasmid)、处理组 ERR376999 (one replicate of ESC)，我们可以构建这样的矩阵designmat.txt

Samples baseline        plasmid ESC1
plasmid 1       0       0
ESC1    1       1      0

注意：

矩阵必须包含一个表头代表条件标签
第一列是样本的标签
第二列是必须是一个"baseline"列，所有值必须为1
表里的数字必须是0 或 1
必须设置一个初始状态的样本（比如day 0或者plasmid）

run the module

mageck mle -k escneg.count.txt -d designmat.txt -n yusa

执行成功后，会生成3列文件：log.file、 gene_summary file (包含 gene beta scores)、 sgrna_summary file (包含sgRNA efficiency probability predictions).,查看gene_summary.file：

gene_summary file

文件信息如下：

Column	Content
Gene	Gene ID
sgRNA	The number of targeting sgRNAs for each gene
plasmid\|beta;ESC1\|beta	The beta scores of this gene in conditions "plasmid" and "ESC1", respectively. The conditions are specified in the design matrix as an input of the mle subcommand.
plasmid\|p-value	The raw p-value (using permutation) of this gene
plasmid\|fdr	The false discovery rate of this gene
plasmid\|z	The z-score associated with Wald test
plasmid\|wald-p-value	The p value using Wald test
plasmid\|wald-fdr	The false discovery rate of the Wald test

其他参数

plot

我们可以使用plot命令对单独挑选出的基因绘图

usage: mageck plot [-h] -k COUNT_TABLE -g GENE_SUMMARY [--genes GENES]
                   [-s SAMPLES] [-n OUTPUT_PREFIX]
                   [--norm-method {none,median,total}] [--keep-tmp]

Parameter	Explanation
-k COUNT_TABLE, --count-table COUNT_TABLE	Provide a tab-separated count table.
-g GENE_SUMMARY, --gene-summary GENE_SUMMARY	The gene summary file generated by the test command.
-h, --help	show this help message and exit
--genes GENES	A list of genes to be plotted, separated by comma. Default: none.
-s SAMPLES, --samples SAMPLES	A list of samples to be plotted, separated by comma. Default: using all samples in the count table.
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX	The prefix of the output file(s). Default sample1.
--norm-method {none,median,total}	Method for normalization, default median.
--keep-tmp	Keep intermediate files.

## 举例子 单独绘制 GRHL2 BIK这两个基因
mageck plot -k escneg.count.txt -g esccp.gene_summary.txt --genes GRHL2,BIK

image-20220301151425139

可以看到plot命令将这两个基因每个sgRNA的变化，阳性和阴性筛选中按照P值和RRA score排序的图形绘制了除了，多个基因直接中间以，分割即可。

MAGECKGsea

MAGECKGSEA是使用C ++的基因设定富集分析（GSEA）的快速实现。与官方GSEA软件相比，主要优势是其易于使用和极快的运行速度。

作者这里也提供了gsea示例文件供我们练习：https://bitbucket.org/liulab/mageck/src/master/gsea/，其实只用到两个参数

 mageckGSEA -r demo1.txt -g kegg.ribosome.gmt  -c 1 -p 10000

Parameter	Explanation
-r rank_file, --rank_file rank_file	(required) Rank file. The first column of the rank file must be the gene name.
-g gmt_file, --gmt_file gmt_file	(required) The pathway annotation in GMT format.
-p perm_time, --perm_time perm_time	Permutations, default 1000.
-c score_column	The column for gene scores.

生成GSEA结果文件如下：

Pathway Size    ES  p   p_permutation   FDR Ranking Hits    LFC
KEGG_RIBOSOME   88  0.3262  0.00240772  0.0045  0.0045  0   32  0

Item	Explanation
Pathway	The name of the pathway
Size	The size of the pathway, i.e., the number of genes
ES	Enrichment Score (ES) in GSEA
p	The p value of ES
p_permutation	The permutation p value of ES (usually more accurate than p
FDR	False Discovery Rate of p_permutation
Ranking	The ranking of this pathway
Hits	The number of genes that are ranked before ES score. See "Leading Edge" analysis of GSEA
LFC	Log fold change (not implemented)

速度确实挺快，感觉可作为官方GSEA软件做基因集富集一种替代方式，只需要自己制作好GMT文件即可，不懂该格式的可以参考GSEA官网介绍：

Advanced tutorial

包含允许read mapping误差、纠正拷贝数变异影响、Docker iamge运行MAGeCK及更复杂设计实验条件下的操作，感兴趣看教程https://sourceforge.net/p/mageck/wiki/advanced_tutorial/

输入/出格式

更详细的参数介绍可直接看源文档

输入：https://sourceforge.net/p/mageck/wiki/input/

输出：https://sourceforge.net/p/mageck/wiki/output/

其它工具

除了MAGeCK, 该团队还开发了其他的软件与算法:

MAGeCK-VISPR: 制动CRISPR//Cas9筛查的开发的集质控，分析和可视化一体的workflow
MAGeCKFlute: 一个交互分析的pipelineR包
scMAGeCK: 一个结合单细胞转录组数据与Crispr筛查数据综合分析的计算模型
Vispr-online：在线版分析筛选数据集

后续有时间再来慢慢学习吧~~~