decoupleR：丰富你的通路富集分析结果

2022-12-13 本文已影响0人生命数据科学

图片

**Article name: **decoupleR: ensemble of computational methods to infer biological activities from omics data
**Journal: Bioinformatics Advances
Doi: https://doi.org/10.1093/bioadv/vbac016
IF: **创刊

在生信分析过程中，哪些基因有可能导致这个通路的激活/抑制，其实是比较关键的，今天的推送就是为此而来！

环境配置

首先，吸取教训，将我的配置环境分享出来，这样读者在安装包时也能有个参考。


> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified Han)_Hong Kong SAR.utf8 
[2] LC_CTYPE=Chinese (Simplified Han)_Hong Kong SAR.utf8   
[3] LC_MONETARY=Chinese (Simplified Han)_Hong Kong SAR.utf8
[4] LC_NUMERIC=C                                           
[5] LC_TIME=Chinese (Simplified Han)_Hong Kong SAR.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggrepel_0.9.1   pheatmap_1.0.12 ggplot2_3.3.6   tidyr_1.2.1     tibble_3.1.7   
[6] dplyr_1.0.9     decoupleR_2.3.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3        pillar_1.8.1        compiler_4.2.1      cellranger_1.1.0   
 [5] RColorBrewer_1.1-3  BiocManager_1.30.18 tools_4.2.1         lifecycle_1.0.3    
 [9] gtable_0.3.1        lattice_0.20-45     pkgconfig_2.0.3     rlang_1.0.6        
[13] Matrix_1.4-1        DBI_1.1.3           cli_3.3.0           rstudioapi_0.13    
[17] withr_2.5.0         generics_0.1.3      vctrs_0.4.1         grid_4.2.1         
[21] tidyselect_1.2.0    glue_1.6.2          R6_2.5.1            fansi_1.0.3        
[25] readxl_1.4.1        purrr_0.3.4         magrittr_2.0.3      scales_1.2.1       
[29] ellipsis_0.3.2      assertthat_0.2.1    colorspace_2.0-3    utf8_1.2.2         
[33] munsell_0.5.0

需要安装的包


library(decoupleR)
library(dplyr)
library(tibble)
library(tidyr)
library(ggplot2)
library(pheatmap)
library(ggrepel)

正片开始

如果所有的包都成功library之后，就可以正式分析了，这一步主要是把所用到的3个数据整合到了一个list中，而我们自己分析的话仅需符合它的条件即可。需要注意的是，示例数据中并非count值，后续使用的也不是count值所使用的分析方法，仅为举栗子，自己分析时输入输出可做调整。


inputs_dir <- system.file("extdata", package = "decoupleR")
data <- readRDS(file.path(inputs_dir, "bk_data.rds"))
# count矩阵
> summary(data$counts)
     gene           PANC1.WT.Rep1    PANC1.WT.Rep2    PANC1.WT.Rep3    PANC1.FOXA2KO.Rep1
 Length:11093       Min.   : 6.198   Min.   : 6.265   Min.   : 6.167   Min.   : 6.254    
 Class :character   1st Qu.: 7.726   1st Qu.: 7.772   1st Qu.: 7.785   1st Qu.: 7.718    
 Mode  :character   Median : 9.187   Median : 9.237   Median : 9.236   Median : 9.203    
                    Mean   : 9.378   Mean   : 9.417   Mean   : 9.402   Mean   : 9.400    
                    3rd Qu.:10.749   3rd Qu.:10.808   3rd Qu.:10.776   3rd Qu.:10.803    
                    Max.   :18.526   Max.   :18.481   Max.   :18.197   Max.   :18.489    
                    NA's   :67       NA's   :140      NA's   :82       NA's   :62        
 PANC1.FOXA2KO.Rep2 PANC1.FOXA2KO.Rep3
 Min.   : 6.220     Min.   : 6.247    
 1st Qu.: 7.776     1st Qu.: 7.775    
 Median : 9.226     Median : 9.259    
 Mean   : 9.404     Mean   : 9.420    
 3rd Qu.:10.779     3rd Qu.:10.825    
 Max.   :18.333     Max.   :18.230    
 NA's   :77         NA's   :157
 
 # 实验设计
> data$design
# A tibble: 6 × 2
  sample             condition    
  <chr>              <chr>        
1 PANC1.WT.Rep1      PANC1.WT     
2 PANC1.WT.Rep2      PANC1.WT     
3 PANC1.WT.Rep3      PANC1.WT     
4 PANC1.FOXA2KO.Rep1 PANC1.FOXA2KO
5 PANC1.FOXA2KO.Rep2 PANC1.FOXA2KO
6 PANC1.FOXA2KO.Rep3 PANC1.FOXA2KO

# 差异分析结果
> head(data$limma_ttop)
# A tibble: 6 × 7
  ID      logFC AveExpr      t    P.Value adj.P.Val     B
  <chr>   <dbl>   <dbl>  <dbl>      <dbl>     <dbl> <dbl>
1 RHBDL2  -1.82    8.60 -12.8  0.00000303    0.0336  3.95
2 PLEKHH2 -1.57    7.70 -10.8  0.00000993    0.0548  3.27
3 HEG1    -1.73    8.64  -9.79 0.0000194     0.0548  2.84
4 CLU     -1.79   12.2   -9.76 0.0000198     0.0548  2.83
5 FHL1     2.09    9.61   8.95 0.0000355     0.0788  2.43
6 RBP4    -1.73    7.39  -8.53 0.0000490     0.0862  2.19

关于前期差异分析流程，可以移步gzh

原文还对数据进行了预处理


# 删除NA值，第一列作为行名
counts <- data$counts %>%
  dplyr::mutate_if(~ any(is.na(.x)), ~ if_else(is.na(.x),0,.x)) %>% 
  column_to_rownames(var = "gene") %>% 
  as.matrix()
head(counts)
#>          PANC1.WT.Rep1 PANC1.WT.Rep2 PANC1.WT.Rep3 PANC1.FOXA2KO.Rep1 PANC1.FOXA2KO.Rep2 PANC1.FOXA2KO.Rep3
#> NOC2L        10.052588     11.949123     12.057774          12.312291          12.139918          11.494205
#> PLEKHN1       7.535115      8.125993      8.714880           8.048196           8.290154           8.621239
#> PERM1         6.281242      6.424582      6.589668           6.293285           6.486136           6.775344
#> ISG15        10.938252     11.469081     11.425415          11.549986          11.371464          11.178157
#> AGRN          6.956335      7.196108      7.522550           7.061549           7.485534           7.071555
#> C1orf159      9.546224      9.788721      9.794589           9.850830

# 提取gene与t value的关系
deg <- data$limma_ttop %>%
    select(ID, t) %>% 
    filter(!is.na(t)) %>% 
    column_to_rownames(var = "ID") %>%
    as.matrix()
head(deg)
#>                  t
#> RHBDL2  -12.810588
#> PLEKHH2 -10.794453
#> HEG1     -9.788112
#> CLU      -9.761618
#> FHL1      8.950191
#> RBP4     -8.529074

然后利用PROGENy数据库，获取通路中基因的weigh和*P *值，为后续分析做准备。（仅支持人和小鼠），top主要限制通路中基因的数量（P排序）


net <- get_progeny(organism = 'human', top = 100)
[2022-11-04 23:00:15] [SUCCESS] [OmnipathR] Loaded 700257 annotation records from cache.
> net
# A tibble: 1,400 × 4
   source   target  weight  p_value
   <chr>    <chr>    <dbl>    <dbl>
 1 Androgen TMPRSS2  11.5  2.38e-47
 2 Androgen NKX3-1   10.6  2.21e-44
 3 Androgen MBOAT2   10.5  4.63e-44
 4 Androgen KLK2     10.2  1.94e-40
 5 Androgen SARG     11.4  2.79e-40
 6 Androgen SLC38A4   7.36 1.25e-39
 7 Androgen MTMR9     6.13 2.53e-38
 8 Androgen ZBTB16   10.6  1.57e-36
 9 Androgen KCNN2     9.47 7.71e-36
10 Androgen OPRK1    -5.63 1.11e-35
# … with 1,390 more rows
# ℹ Use `print(n = ...)` to see more rows

既然有了权重，自然可以使用加权平均的方法来计算每个样本的得分，这里使用的是wmean方法，当然也可以使用show_methods()查看其他可以使用的方法，输入数据主要为count值和上一步得到了net。


sample_acts <- run_wmean(mat=counts, net=net, .source='source', .target='target',
                  .mor='weight', times = 100, minsize = 5)
> show_methods()
# A tibble: 12 × 2
   Function      Name                                                                      
   <chr>         <chr>                                                                     
 1 run_aucell    AUCell                                                                    
 2 run_consensus Consensus score between methods                                           
 3 run_fgsea     Fast Gene Set Enrichment Analysis (FGSEA)                                 
 4 run_gsva      Gene Set Variation Analysis (GSVA)                                        
 5 run_mdt       Multivariate Decision Trees (MDT)                                         
 6 run_mlm       Multivariate Linear Model (MLM)                                           
 7 run_ora       Over Representation Analysis (ORA)                                        
 8 run_udt       Univariate Decision Tree (UDT)                                            
 9 run_ulm       Univariate Linear Model (ULM)                                             
10 run_viper     Virtual Inference of Protein-activity by Enriched Regulon analysis (VIPER)
11 run_wmean     Weighted Mean (WMEAN)                                                     
12 run_wsum      Weighted Sum (WSUM)

选择好看的颜色和渐变效果

palette_length = 100
my_color = colorRampPalette(c("Darkblue", "white","red"))(palette_length)
my_breaks <- c(seq(-3, 0, length.out=ceiling(palette_length/2) + 1),
               seq(0.05, 3, length.out=floor(palette_length/2)))

然后绘制好看的热图，总体聚类结果还行，就是有个WT_Rep3效果不是特别理想，然后可以看到每个样本中通路的上下调情况，红色上调蓝色下调。当然，效果不好可以尝试使用其他方法再次计算。

pheatmap(sample_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)

图片

也可以更换行顺序

pheatmap(sample_acts_mat, border_color = NA, color=my_color, breaks = my_breaks,row_order = rownames(sample_acts_mat),cluster_rows = F)

图片

当然，差异分析结果也可以利用起来，查看实验组与对照组之间异常活化或抑制的通路。


contrast_acts <- run_wmean(mat=deg, net=net, .source='source', .target='target',
                           .mor='weight', times = 100, minsize = 5)
contrast_acts
f_contrast_acts <- contrast_acts %>%
  filter(statistic == 'norm_wmean')
 
ggplot(f_contrast_acts, aes(x = reorder(source, score), y = score)) + 
  geom_bar(aes(fill = score), stat = "identity") +
  scale_fill_gradient2(low = "darkblue", high = "indianred", 
                       mid = "whitesmoke", midpoint = 0) + 
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        axis.text.x = 
          element_text(angle = 45, hjust = 1, size =10, face= "bold"),
        axis.text.y = element_text(size =10, face= "bold"),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  xlab("Pathways")

图片

最后，哪些基因在通路中起了积极/消极的作用呢？

pathway <- 'JAK-STAT'

df <- net %>%
  filter(source == pathway) %>%
  arrange(target) %>%
  mutate(ID = target, color = "3") %>%
  column_to_rownames('target')
inter <- sort(intersect(rownames(deg),rownames(df)))
df <- df[inter, ]
df['t_value'] <- deg[inter, ]
df <- df %>%
  mutate(color = if_else(weight > 0 & t_value > 0, '1', color)) %>%
  mutate(color = if_else(weight > 0 & t_value < 0, '2', color)) %>%
  mutate(color = if_else(weight < 0 & t_value > 0, '2', color)) %>%
  mutate(color = if_else(weight < 0 & t_value < 0, '1', color))

ggplot(df, aes(x = weight, y = t_value, color = color)) + geom_point() +
  scale_colour_manual(values = c("red","royalblue3","grey")) +
  geom_label_repel(aes(label = ID)) + 
  theme_minimal() +
  theme(legend.position = "none") +
  geom_vline(xintercept = 0, linetype = 'dotted') +
  geom_hline(yintercept = 0, linetype = 'dotted') +
  ggtitle(pathway)

图片

红色表示权重与t值成正相关，蓝色相反。t值正负又代表什么呢？其实看下图就明白啦！

图片

最后~需要完整代码的小伙伴后台回复decoupleR，所有我运行成功的结果也保存到了results.Rdata中，足以复现。

图片

敬请批评指正！

decoupleR：丰富你的通路富集分析结果

猜你喜欢

热点阅读