【包】signatureSearch

2026-01-07 本文已影响0人 JamesMori

https://www.bioconductor.org/packages/release/bioc/vignettes/signatureSearch/inst/doc/signatureSearch.html

1. Introduction

The signatureSearch package implements algorithms and data structures for performing gene expression signature (GES) searches, and subsequently interpreting the results functionally with specialized enrichment methods

In drug discovery they can be used for identifying novel modes of action (MOA) of bioactive compounds from reference databases such as LINCS containing the genome-wide GESs from tens of thousands of drug and genetic perturbations.

A typical GES search (GESS) workflow can be divided into two major steps (Figure 1). First, GESS methods are used to identify perturbagens such as drugs that induce GESs similar to a query GES of interest.

The queries can be drug-, disease- or phenotype-related GESs.

Second, specialized functional enrichment analysis (FEA) methods using annotations systems, such as Gene Ontologies (GO), pathways or Disease Ontologies (DO), have been developed and implemented in this package to efficiently interpret GESS results.

Figure 1: Overview of GESS and FEA methods. GES queries are used to search a perturbation-based GES reference database for perturbagens such as drugs inducing GESs similar to the query. To interpret the results, the GESS results are subjected to functional enrichment analysis (FEA) including drug set and target set enrichment analyses (DSEA, TSEA). Both identify functional categories (e.g. GO terms or KEGG pathways) over-represented in the GESS results. Subsequently, drug-target networks are reconstructed for visualization and interpretation.

2. History of GES Databases

2.1. CMAP

Lamb et al. (2006) generated one of the first GES databases called CMAP.
Initially, it included GESs for 164 drugs screened against four mammalian cell lines (Lamb et al. 2006).

2.2. CMAP2

A few years later CMAP was extended to CMAP2 containing GESs for 1,309 drugs and eight cell lines.

2.3. LINCS

More recently, a much larger GES database was released by the Library of Network-Based Cellular Signatures (LINCS) Consortium (Subramanian et al. 2017). In its initial release, the LINCS database contained perturbation-based GESs for 19,811 drugs tested on up to 70 cancer and non-cancer cell lines along with genetic perturbation experiments for several thousand genes. The number of compound dosages and time points considered in the assays has also been increased by 10-20 fold.

2.4. LINCS2

In 2020, the LINCS 2017 database is expanded to the beta release, here refer to as LINCS2. It contains >80k perturbations and >200 cell lines and over 3M gene expression profiles. This represents roughly a 3-fold expansion on the LINCS 2017 database. The datasets can be accessed at https://clue.io/releases/data-dashboard.

The CMAP/CMAP2 databases use Affymetrix Gene Chips as expression platform. To scale from a few thousand to many hundred thousand GESs, the LINCS Consortium uses now the more economic L1000 assay. This bead-based technology is a low cost, high-throughput reduced representation expression profiling assay. It measures the expression of 978 landmark genes and 80 control genes by detecting fluorescent intensity of beads after capturing the ligation-mediated amplification products of mRNAs (Peck et al. 2006). The expression of 11,350 additional genes is imputed from the landmark genes by using as training data a collection of 12,063 Affymetrix gene chips (Edgar, Domrachev, and Lash 2002). The substantial scale-up of the LINCS project provides now many new opportunities to explore MOAs for a large number of known drugs and experimental drug-like small molecules.

3. Reference Databases

The helper package signatureSearchData provides access to pre-built GES databases, including CMAP2, LINCS and LINCS2, that are stored on Bioconductor’s ExperimentHub as HDF5 files.
Users can download these databases as follows.

library(ExperimentHub)
library(rhdf5)
eh <- ExperimentHub()
cmap <- eh[["EH3223"]]
cmap_expr <- eh[["EH3224"]]
lincs <- eh[["EH3226"]]
lincs_expr <- eh[["EH3227"]]
lincs2 <- eh[["EH7297"]]
h5ls(lincs2)

3.1. CMAP

cmap contains log2-fold-changes of 12,437 genes from 1,281 compound treatments of 5 cell lines corresponding to a total of 3,478 signatures; cmap_expr contains mean expression values from 1,309 drug treatments of 4 cell lines corresponding to a total of 3,587 signatures.

3.2. LINCS

lincs contains moderated z-scores from differential expression (DE) analysis of 12,328 genes from 8,140 compound treatments of 30 cell lines corresponding to a total of 45,956 signatures; lincs_expr contains gene expression intensity values from 5,925 compound treatments of 30 cell lines corresponding to a total of 38,824 signatures;
To minimize redundancy in the lincs and lincs_expr databases, they were assembled from GESs corresponding to a compound dosage and treatment time of 10μM and 24h, respectively.
If necessary one can create here easily database instances for all LINCS measurements. However, this will make the search results overwhelmingly complex which we wanted to avoid here;

3.3. LINCS2

lincs2 contains moderated z-scores from DE analysis of 12,328 genes from 30,456 compound treatments of 58 cell lines corresponding to a total of 136,460 signatures. To minimize redundancy of perturbagens having many signatures in different dosage and treatment time within the same cell line, the ‘exemplar’ （范例） signature for each perturbagen in each cell line was assembled. These signatures are annotated from CLUE group and are generally picked based on TAS (Transcriptional Activity Score), such that the signature with the highest TAS is chosen as exemplar. The LINCS2 database is exactly the same as the reference database used for the Query Tool in CLUE website.

For details how the CMAP2, LINCS and LINCS2 databases were constructed, please refer to the vignette of the signatureSearchData package. The command browseVignettes("signatureSearchData") will open this vignette from R.

Custom databases can be built with the build_custom_db function. Here the user provides custom genome-wide gene expression data (e.g. for drug, disease or genetic perturbations) in a data.frame or matrix.

4. GESS

4 set-based (ranked or unranked), 1 correlation-based methods (quantitative)
set-based更通用，correlation-based更精准
On the other hand, due to the nature of the expected input, correlation-based methods are usually only an option when both the query and database entries are based on the same or at least comparable technologies. In other words, set-based methods are more technology agnostic than correlation-based methods, but may not provide the best precision performance.

4.1. Test Query and Database

To minimize the run time of the following test code, a small toy database has been assembled from the LINCS database containing a total of 100 GESs from human SKB (muscle) cells.
Of the 100 GESs in this toy database, 95 were random sampled and 5 were cherry-picked. The latter five are GESs from HDAC inhibitor treatments including the known drugs: vorinostat, rhamnetin, trichostatin A, pyroxamide, and HC toxin.
To further reduce the size of the toy database, the number of its genes was reduced from 12,328 to 5,000 by random sampling.
The query signature used in the sample code below is the vorinostat GES drawn from the toy database.
For simplicity and to minimize the build time of this vignette, the following sample code uses a pre-generated instance of this toy database stored under the extdata directory of this package. The detailed code for generating the toy database and other custom instances of the LINCS database is given in the Supplementary Material section of this vignette.

The following imports the toy GES database into a SummarizedExperiment container (here sample_db). In addition, the test query set of up/down DEGs (here upset and downset) is extracted from the vorinostat GES entry in the database.

db_path <- system.file("extdata", "sample_db.h5", package = "signatureSearch")
# Load sample_db as `SummarizedExperiment` object
library(SummarizedExperiment)
library(HDF5Array)
sample_db <- SummarizedExperiment(HDF5Array(db_path, name="assay"))
rownames(sample_db) <- HDF5Array(db_path, name="rownames")
colnames(sample_db) <- HDF5Array(db_path, name="colnames")
# get "vorinostat__SKB__trt_cp" signature drawn from toy database
query_mat <- as.matrix(assay(sample_db[,"vorinostat__SKB__trt_cp"]))
query <- as.numeric(query_mat)
names(query) <- rownames(query_mat)
upset <- head(names(query[order(-query)]), 150)
downset <- tail(names(query[order(-query)]), 150)

4.2. CMAP Search Method

CMAP, uses as query the two label sets of the most up- and down-regulated genes from a genome-wide expression experiment, while the reference database is composed of rank transformed expression profiles (e.g. ranks of LFC or z-scores).
The search results are a list of perturbagens such as drugs that induce similar or opposing GESs as the query. Similar GESs suggest similar physiological effects of the corresponding perturbagens.

In the following code block, the qSig function is used to generate a qSig object by defining (i) the query signature, (ii) the GESS method and (iii) the path to the reference database.
Next, the query signature is used to search the reference database with the chosen GESS method. The type of the query signature and the reference database needs to meet the requirement of the search algorithm.

qsig_cmap <- qSig(query = list(upset=upset, downset=downset), 
                  gess_method="CMAP", refdb=db_path)
cmap <- gess_cmap(qSig=qsig_cmap, chunk_size=5000, workers=1, addAnnotations = TRUE)
result(cmap)

## # A tibble: 100 × 11
##    pert      cell  type  trend raw_score scaled_score N_upset N_downset t_gn_sym
##    <chr>     <chr> <chr> <chr>     <dbl>        <dbl>   <int>     <int> <chr>   
##  1 vorinost… SKB   trt_… up        1.94         1         150       150 HDAC1; …
##  2 rescinna… SKB   trt_… down     -0.295       -1         150       150 ACE     
##  3 zuclopen… SKB   trt_… down     -0.287       -0.972     150       150 ADRA1A;…
##  4 evoxine   SKB   trt_… down     -0.244       -0.828     150       150 <NA>    
##  5 scouleri… SKB   trt_… down     -0.242       -0.821     150       150 ADRA1D  
##  6 ganglios… SKB   trt_… down     -0.239       -0.809     150       150 <NA>    
##  7 warfarin  SKB   trt_… down     -0.234       -0.794     150       150 ARSE; C…
##  8 trichost… SKB   trt_… up        1.50         0.775     150       150 HDAC1; …
##  9 endecaph… SKB   trt_… down     -0.224       -0.760     150       150 <NA>    
## 10 HC-toxin  SKB   trt_… up        1.44         0.745     150       150 HDAC1   
## # ℹ 90 more rows
## # ℹ 2 more variables: MOAss <chr>, PCIDss <chr>

The search result is stored in a gessResult object (here named cmap) containing the following components: search result table, query signature, name of the GESS method and path to the reference database.
The result accessor function can be used to extract the search result table from the gessResult object.
This table contains the search results for each perturbagen (here drugs) in the reference database ranked by their signature similarity to the query.

For the CMAP method, the similarity metrics are raw_score and scaled_score. The raw score represents the bi-directional enrichment score (Kolmogorov-Smirnov statistic) for a given up/down query signature. Under the scaled_score column, the raw_score has been scaled to values from 1 to -1 by dividing positive scores and negative scores with the maximum positive score and the absolute value of the minimum negative score, respectively.

The gess_* functions also support appending the compound annotation table (if provided) to the GESS result for the pert column (pert_id column if refdb is set as lincs2) that stores compounds in the drug slot of <drug><cell><factor> format of treatments in the reference database. The following uses LINCS method searching against the newest LINCS2 database and passing the LINCS2 compound information table as an example.

The columns in the search result table contain the following information. pert: name of perturbagen (e.g. drug) in the reference database; cell: acronym of cell type; type: perturbation type, e.g. compound treatment is trt_cp; trend: up or down when reference signature is positively or negatively connected with the query signature, respectively; N_upset or N_downset: number of genes in the query up or down sets, respectively; t_gn_sym: gene symbols of the corresponding drug targets.

data("lincs_pert_info2")
qsig_lincs2 <- qSig(query=list(upset=upset, downset=downset), 
                   gess_method="LINCS", refdb="lincs2")
# When the compound annotation table is not provided
lincs2 <- gess_lincs(qsig_lincs2, tau=FALSE, sortby="NCS", workers=2)
# When the compound annotation table is provided
lincs2 <- gess_lincs(qsig_lincs2, tau=TRUE, sortby="NCS", workers=1,
                     cmp_annot_tb=lincs_pert_info2, by="pert_id", 
                     cmp_name_col="pert_iname") # takes about 15 minutes
result(lincs2) %>% print(width=Inf)

4.3. LINCS Search Method

LINCS weights the query genes based on the corresponding differential expression scores of the quantitative gene expression profiles (GEPs) in the reference database (e.g. LFC or z-scores). Thus, the reference database used by LINCS needs to store the actual score values rather than their ranks.
LINCS algorithm uses a bi-directional weighted Kolmogorov-Smirnov enrichment statistic (ES) as similarity metric.
上下调明显的基因的LFC/Z-score去比对数据库类似的数据

qsig_lincs <- qSig(query=list(upset=upset, downset=downset), 
                   gess_method="LINCS", refdb=db_path)
lincs <- gess_lincs(qsig_lincs, sortby="NCS", tau=FALSE, workers=1, 
                    addAnnotations = TRUE, GeneType = "reference")
result(lincs)

## # A tibble: 100 × 14
##    pert          cell  type  trend   WTCS WTCS_Pval WTCS_FDR   NCS NCSct N_upset
##    <chr>         <chr> <chr> <chr>  <dbl>     <dbl>    <dbl> <dbl> <dbl>   <int>
##  1 vorinostat    SKB   trt_… up     1       0       0         2.57  2.57     150
##  2 trichostatin… SKB   trt_… up     0.863   5.96e-6 0.000180  2.22  2.22     150
##  3 HC-toxin      SKB   trt_… up     0.857   6.22e-6 0.000180  2.20  2.20     150
##  4 pyroxamide    SKB   trt_… up     0.625   1.63e-5 0.000180  1.60  1.60     150
##  5 zuclopenthix… SKB   trt_… down  -0.321   8.29e-5 0.000319 -1.19 -1.19     150
##  6 rescinnamine  SKB   trt_… down  -0.319   9.60e-5 0.000356 -1.18 -1.18     150
##  7 APHA-compoun… SKB   trt_… up     0.445   2.42e-5 0.000180  1.14  1.14     150
##  8 epothilone    SKB   trt_… down  -0.308   2.40e-4 0.000801 -1.14 -1.14     150
##  9 scopolamine-… SKB   trt_… up     0.416   2.54e-5 0.000180  1.07  1.07     150
## 10 I-070759      SKB   trt_… up     0.408   2.58e-5 0.000180  1.05  1.05     150
## # ℹ 90 more rows
## # ℹ 4 more variables: N_downset <int>, t_gn_sym <chr>, MOAss <chr>,
## #   PCIDss <chr>

The search results are stored in a gessResult object as under the CMAP example above.
The similarity scores stored in the LINCS result table are summarized here. WTCS: Weighted Connectivity Score; WTCS_Pval: nominal p-value of WTCS; WTCS_FDR: false discovery rate of WTCS_Pval; NCS: normalized connectivity score; NCSct: NCS summarized across cell types; Tau: enrichment score standardized for a given database. The latter is only included in the result table if tau=TRUE in a gess_lincs function call. The example given is run with tau=FALSE, because the tau values are only meaningful when the complete LINCS database is used which is not the case for the toy database. TauRefSize: size of reference perturbations for computing Tau.

The following provides a more detailed description of the similarity scores computed by the LINCS method. Additional details are available in the Supplementary Material Section of the Subramanian et al. (2017) paper.

WTCS: The Weighted Connectivity Score is a bi-directional ES for an up/down query set. If the ES values of an up set and a down set are of different signs, then WTCS is (ESup-ESdown)/2, otherwise, it is 0. WTCS values range from -1 to 1. They are positive or negative for signatures that are positively or inversely related, respectively, and close to zero for signatures that are unrelated.

WTCS_Pval and WTCS_FDR: The nominal p-value of the WTCS and the corresponding false discovery rate (FDR) are computed by comparing the WTCS against a null distribution of WTCS values obtained from a large number of random queries (e.g. 1000).

NCS: To make connectivity scores comparable across cell types and perturbation types, the scores are normalized. Given a vector of WTCS values w resulting from a query, the values are normalized within each cell line c and perturbagen type t to obtain the Normalized Connectivity Score (NCS) by dividing the WTCS value by the signed mean of the WTCS values within the subset of signatures in the reference database corresponding to c and t.

NCSct: The NCS is summarized across cell types as follows. Given a vector of NCS values for perturbagen p, relative to query q, across all cell lines c in which p was profiled, a cell-summarized connectivity score is obtained using a maximum quantile statistic. It compares the 67 and 33 quantiles of NCSp,c and retains whichever is of higher absolute magnitude.

Tau: The standardized score Tau compares an observed NCS to a large set of NCS values that have been pre-computed for a specific reference database. The query results are scored with Tau as a standardized measure ranging from 100 to -100. A Tau of 90 indicates that only 10% of reference perturbations exhibit stronger connectivity to the query. This way one can make more meaningful comparisons across query results.

4.4. gCMAP Search Method

gCMAP uses as query a rank transformed GEP and the reference database is composed of the labels of up and down regulated DEG sets. This is the opposite situation of the CMAP method, where the query is composed of the labels of up and down regulated DEGs and the database contains rank transformed GESs.
上下调明显的基因的rank去比对数据库类似的数据
In case of the gCMAP GESS method, the GEP-Q is a matrix with a single column representing gene ranks from a biological state of interest, here vorinostat treatment in SKB cells. The corresponding gene labels are stored in the row name slot of the matrix. Instead of ranks one can provide scores (e.g. z-scores) as in the example given below. In such a case the scores will be internally transformed to ranks. The reference database consists of gene label sets that were extracted from the toy databases by applying a higher and lower filter, here set to 1 and -1, respectively.

qsig_gcmap <- qSig(query = query_mat, gess_method = "gCMAP", refdb = db_path)
gcmap <- gess_gcmap(qsig_gcmap, higher=1, lower=-1, workers=1, addAnnotations = TRUE)
result(gcmap)
## # A tibble: 100 × 11
##    pert       cell  type  trend effect  nSet nFound signed t_gn_sym MOAss PCIDss
##    <chr>      <chr> <chr> <chr>  <dbl> <dbl>  <dbl> <lgl>  <chr>    <chr> <chr> 
##  1 vorinostat SKB   trt_… up     1      1098   1098 TRUE   HDAC1; … HDAC… 5311  
##  2 warfarin   SKB   trt_… down  -1       114    114 TRUE   ARSE; C… Vita… 54678…
##  3 D-609      SKB   trt_… down  -0.715   125    125 TRUE   <NA>     <NA>  45479…
##  4 TC-2559    SKB   trt_… down  -0.685   377    377 TRUE   CHRNA4   Acet… 98231…
##  5 rescinnam… SKB   trt_… down  -0.609   726    726 TRUE   ACE      ACE … 52809…
##  6 amiodarone SKB   trt_… down  -0.584   706    706 TRUE   ABCG2; … Pota… 2157  
##  7 trichosta… SKB   trt_… up     0.576  1430   1430 TRUE   HDAC1; … CDK … 444732
##  8 zuclopent… SKB   trt_… down  -0.554   483    483 TRUE   ADRA1A;… Dopa… 53115…
##  9 progester… SKB   trt_… down  -0.553   242    242 TRUE   AKR1C1;… Prog… 5994  
## 10 evoxine    SKB   trt_… down  -0.533   461    461 TRUE   <NA>     Furo… 673465
## # ℹ 90 more rows

As in the other search methods, the gCMAP results are stored in a gessResult object. The columns in the corresponding search result table, that are specific to the gCMAP method, contain the following information. effect: scaled bi-directional enrichment score corresponding to the scaled_score under the CMAP result; nSet: number of genes in the reference gene sets after applying the higher and lower cutoff; nFound: number of genes in the reference gene sets that are present in the query signature; signed: whether the gene sets in the reference database have signs, e.g. representing up and down regulated genes when computing scores.

4.5. Fisher Search Method

both the query and the database are composed of gene label sets, such as DEG sets.
In the following example, both the query and the refdb used under the qSig call are genome-wide GEPs, here z-scores. The actual gene sets required for the Fisher’s exact test are obtained by setting the higher and lower cutoffs to 1 and -1, respectively.

qsig_fisher <- qSig(query = query_mat, gess_method = "Fisher", refdb = db_path)
fisher <- gess_fisher(qSig=qsig_fisher, higher=1, lower=-1, workers=1, addAnnotations = TRUE)
result(fisher)
## # A tibble: 100 × 14
##    pert        cell  type  trend      pval      padj effect     LOR  nSet nFound
##    <chr>       <chr> <chr> <chr>     <dbl>     <dbl>  <dbl>   <dbl> <dbl>  <dbl>
##  1 vorinostat  SKB   trt_… over  0         0          37.5  Inf      1098   1098
##  2 trichostat… SKB   trt_… over  2.81e-177 1.41e-175  28.4    2.06   1430    705
##  3 HC-toxin    SKB   trt_… over  1.73e-132 5.77e-131  24.5    1.74   1804    746
##  4 SPB02303    SKB   trt_… under 3.18e- 25 7.94e- 24 -10.4   -1.05    963     99
##  5 MDL-28170   SKB   trt_… under 2.67e- 24 5.33e- 23 -10.2   -0.767  1823    260
##  6 clofibric-… SKB   trt_… under 2.10e- 19 3.49e- 18  -9.01  -0.660  1943    300
##  7 pyroxamide  SKB   trt_… over  2.27e- 16 3.24e- 15   8.21   0.770   641    225
##  8 TUL-XX023   SKB   trt_… under 3.03e- 13 3.79e- 12  -7.29  -0.731   885    116
##  9 formoterol  SKB   trt_… under 1.68e- 12 1.87e- 11  -7.06  -0.496  2238    389
## 10 I-070759    SKB   trt_… over  2.18e- 12 2.18e- 11   7.02   0.537  1222    359
## # ℹ 90 more rows
## # ℹ 4 more variables: signed <lgl>, t_gn_sym <chr>, MOAss <chr>, PCIDss <chr>

The columns in the result table specific to the Fisher method include the following information. pval: p-value of the Fisher’s exact test; padj: p-value adjusted for multiple hypothesis testing using R’s p.adjust function with the Benjamini & Hochberg (BH) method; effect: z-score based on the standard normal distribution; LOR: log odds ratio.

If the query contains the labels of up and down regulated genes then the two sets can be provided as a list. Internally, they will be combined into a single unsigned set, while the reference database is processed the same way as in the previous example.
这个效能应该是最差的

qsig_fisher2 <- qSig(query = list(upset=upset, downset=downset), 
                     gess_method = "Fisher", refdb = db_path)
fisher2 <- gess_fisher(qSig=qsig_fisher2, higher=1, lower=-1, workers=1, addAnnotations = TRUE)
result(fisher2)

4.6. Correlation-based Search Method

Correlation-based similarity metrics, such as Spearman or Pearson coefficients, can also be used as GESS methods.
As non-set-based methods, they require quantitative gene expression values for both the query and the database entries, that usually need to be of the same type to obtain meaningful results, such as normalized intensities or read counts from microarrays or RNA-Seq experiments, respectively.
最好是相同的技术，对数值最好都做好标准化？需要gene name类型一致。这个应该是取交集做的？那么会不会出现交集过少的情况？
The expected data structure of the query is a matrix with a single numeric column and the gene labels (e.g. Entrez Gene IDs) in the row name slot. For convenience, the correlation-based searches can either be performed with the full set of genes represented in the database or a subset of them. The latter can be useful to focus the computation for the correlation values on certain genes of interest such as a DEG set or the genes in a pathway of interest. For comparing the performance of different GESS methods, it can also be advantageous to subset the genes used for a correlation-based search to same set used in a set-based search, such as the up/down DEGs used in a LINCS GESS. This way the search results of correlation- and set-based methods can be more comparable because both are provided with equivalent information content.
可以在query和database中挑选差异基因进行分析

4.6.1. CORall

The following example runs a correlation-based search with the Spearman method using all genes present in the reference database. The GESs used for both the query and refdb are z-scores. The gess_cor function also supports Pearson and Kendall correlation coefficients by assigning the corresponding names to the method argument. For details, users want to consult the help file of the gess_cor function.

qsig_sp <- qSig(query=query_mat, gess_method="Cor", refdb=db_path)
sp <- gess_cor(qSig=qsig_sp, method="spearman", workers=1, addAnnotations = TRUE)
result(sp)
## # A tibble: 100 × 8
##    pert                cell  type   trend cor_score t_gn_sym        MOAss PCIDss
##    <chr>               <chr> <chr>  <chr>     <dbl> <chr>           <chr> <chr> 
##  1 vorinostat          SKB   trt_cp up        1     HDAC1; HDAC10;… HDAC… 5311  
##  2 trichostatin-a      SKB   trt_cp up        0.702 HDAC1; HDAC10;… CDK … 444732
##  3 HC-toxin            SKB   trt_cp up        0.634 HDAC1           HDAC… 3571  
##  4 pyroxamide          SKB   trt_cp up        0.325 HDAC1; HDAC3; … HDAC… 4996  
##  5 APHA-compound-8     SKB   trt_cp up        0.152 HDAC8           HDAC… 10379…
##  6 benzonatate         SKB   trt_cp up        0.151 SCN5A           Loca… 7699  
##  7 rescinnamine        SKB   trt_cp down     -0.136 ACE             ACE … 52809…
##  8 evoxine             SKB   trt_cp down     -0.134 <NA>            Furo… 673465
##  9 scopolamine-n-oxide SKB   trt_cp up        0.125 <NA>            <NA>  30006…
## 10 warfarin            SKB   trt_cp down     -0.123 ARSE; CYP2C8; … Vita… 54678…
## # ℹ 90 more rows

The column specific to the correlation-based search methods contains the following information. cor_score: correlation coefficient based on the method defined in the gess_cor function call.

4.6.2. CORsub

To perform a correlation-based search on a subset of genes represented in the database, one can simply provide the chosen gene subset in the query. During the search the database entries will be subsetted to the genes provided in the query signature. The given example uses a query GES that is subsetted to the genes with the 150 highest and lowest z-scores.

# Subset z-scores of 150 up and down gene sets from 
# "vorinostat__SKB__trt_cp" signature.
query_mat_sub <- as.matrix(query_mat[c(upset, downset),])
qsig_spsub <- qSig(query = query_mat_sub, gess_method = "Cor", refdb = db_path)
spsub <- gess_cor(qSig=qsig_spsub, method="spearman", workers=1, addAnnotations = TRUE)
result(spsub)
## # A tibble: 100 × 8
##    pert                cell  type   trend cor_score t_gn_sym        MOAss PCIDss
##    <chr>               <chr> <chr>  <chr>     <dbl> <chr>           <chr> <chr> 
##  1 vorinostat          SKB   trt_cp up        1     HDAC1; HDAC10;… HDAC… 5311  
##  2 trichostatin-a      SKB   trt_cp up        0.885 HDAC1; HDAC10;… CDK … 444732
##  3 HC-toxin            SKB   trt_cp up        0.856 HDAC1           HDAC… 3571  
##  4 pyroxamide          SKB   trt_cp up        0.646 HDAC1; HDAC3; … HDAC… 4996  
##  5 APHA-compound-8     SKB   trt_cp up        0.442 HDAC8           HDAC… 10379…
##  6 benzonatate         SKB   trt_cp up        0.333 SCN5A           Loca… 7699  
##  7 MD-040              SKB   trt_cp up        0.307 <NA>            <NA>  73707…
##  8 K784-3187           SKB   trt_cp up        0.297 <NA>            <NA>  36894…
##  9 scopolamine-n-oxide SKB   trt_cp up        0.297 <NA>            <NA>  30006…
## 10 fasudil             SKB   trt_cp up        0.288 CDC42BPB; MYLK… Rho … 3547  
## # ℹ 90 more rows

LINCS and Spearman methods应该是最准确且灵敏的

5. Batch Processing of GESSs

The above 5 GESS methods support searching reference database parallelly by defining the workers parameter, the default is 1.
It means that when submitting one query, the parallelization happens on the GES database level where one splits up a single query process into searching several chunks of the database in parallel.
Multiple GES queries can also be processed sequentially or in parallel mode. Parallel evaluations can substantially reduce processing times. The parallelization techniques covered in this vignette, are based on utilities of the BiocParallel and batchtools packages.
For demonstration purposes the following example uses a small batch query containing several GESs. First, this batch query is processed sequentially without any parallelization using a simple lapply loop. Next, the same query is processed in parallel mode using multiple CPU cores of a single machine. The third option demonstrates how this query can be processed in parallel mode on multiple machines of a computer cluster with a workload management (queueing) system (e.g. Slurm or Torque).（不看第三种了）

5.1. Sequential Processing

The following processes a small toy batch query with the LINCS method sequentially in an lapply loop. The object batch_queries is a list containing the two sample GESs q1 and q2 that are composed of entrez gene identifiers. The search results are written to tab-delimited tabular files under a directory called batch_res. The name and path of this directory can be changed as needed.

library(readr)
batch_queries <- list(q1=list(upset=c("23645", "5290"), downset=c("54957", "2767")),
                      q2=list(upset=c("27101","65083"), downset=c("5813", "84")))
refdb <- system.file("extdata", "sample_db.h5", package="signatureSearch")
gess_list <- lapply(seq_along(batch_queries), function(i){
    qsig_lincs <- qSig(query = batch_queries[[i]], 
                   gess_method="LINCS", refdb=refdb)
    lincs <- gess_lincs(qsig_lincs, sortby="NCS", tau=TRUE)
    if(!dir.exists("batch_res")){
        dir.create("batch_res")
    }
    write_tsv(result(lincs), paste0("batch_res/lincs_res_", i, ".tsv"))
    return(result(lincs))
})

5.2. Parallelization with Multiple CPU Cores

The GESSs from the previous example can be accelerated by taking advantage of multiple CPU cores available on a single computer system. The parallel evaluation happens in the below bplapply loop defined by the BiocParallel package. For this approach, all processing instructions are encapsulated in a function named f_bp that will be executed in the bplapply loop. As before, the search results are written to tab-delimited tabular files under a directory called batch_res. The name and path of this directory can be changed as needed. For more background information on this and the following parallelization options, users want to consult the vignette of the BiocParallel package.

library(BiocParallel)
f_bp <- function(i){
    qsig_lincs <- qSig(query = batch_queries[[i]], 
                   gess_method="LINCS", refdb=refdb)
    lincs <- gess_lincs(qsig_lincs, sortby="NCS", tau=TRUE)
    if(!dir.exists("batch_res")){
        dir.create("batch_res")
    }
    write_tsv(result(lincs), paste0("batch_res/lincs_res_", i, ".tsv"))
    return(result(lincs))
}
gess_list <- bplapply(seq_along(batch_queries), f_bp, BPPARAM = MulticoreParam(workers = 2))

6. Functional Enrichment (FEA)

GESS results are lists of perturbagens (here drugs) ranked by their signature similarity to a GES-Q of interest. Interpreting these search results with respect to the cellular networks and pathways affected by the top ranking drugs is difficult. To overcome this challenge, the knowledge of the target proteins of the top ranking drugs can be used to perform functional enrichment analysis (FEA) based on community annotation systems, such as Gene Ontologies (GO), pathways (e.g. KEGG, Reactome), drug MOAs or Pfam domains. For this, the ranked drug sets are converted into target gene/protein sets to perform Target Set Enrichment Analysis (TSEA) based on a chosen annotation system. Alternatively, the functional annotation categories of the targets can be assigned to the drugs directly to perform Drug Set Enrichment Analysis (DSEA). Although TSEA and DSEA are related, their enrichment results can be distinct. This is mainly due to duplicated targets present in the test sets of the TSEA methods, whereas the drugs in the test sets of DSEA are usually unique. Additional reasons include differences in the universe sizes used for TSEA and DSEA.
在TSEA分析中把重复转化为了权重
Importantly, the duplications in the test sets of the TSEA are due to the fact that many drugs share the same target proteins. Standard enrichment methods would eliminate these duplications since they assume uniqueness in the test sets. Removing duplications in TSEA would be inappropriate since it would erase one of the most important pieces of information of this approach. To solve this problem, we have developed and implemented in the signatureSearch package a weighting method for duplicated targets, where the weighting is proportional to the frequency of the targets in the test set.

To perform TSEA and DSEA, drug-target annotations are essential. They can be obtained from several sources, including DrugBank, ChEMBL, STITCH, and the Touchstone dataset from the LINCS project (Wishart et al. 2018; Gaulton et al. 2017; Kuhn et al. 2010; Subramanian et al. 2017).
RNA表达谱到drug，drug到蛋白，富集再用基因
Most drug-target annotations provide UniProt identifiers for the target proteins. They can be mapped, if necessary via their encoding genes, to the chosen functional annotation categories, such as GO or KEGG. To minimize bias in TSEA or DSEA, often caused by promiscuous binders, it can be beneficial to remove drugs or targets that bind to large numbers of distinct proteins or drugs, respectively.

Note, most FEA tests involving proteins in their test sets are performed on the gene level in signatureSearch. This way one can avoid additional duplications due to many-to-one relationships among proteins and their encoding genes. For this, the corresponding functions in signatureSearch will usually translate target protein sets into their encoding gene sets using identifier mapping resources from R/Bioconductor, such as the org.Hs.eg.db annotation package. Because of this as well as simplicity, the text in the vignette and help files of this package will refer to the targets of drugs almost interchangeably as proteins or genes, even though the former are the direct targets and the latter only the indirect targets of drugs.

6.1. TSEA

6.1.1. Hypergeometric Test

#GO
drugs <- unique(result(lincs)$pert[1:10])
dup_hyperG_res <- tsea_dup_hyperG(drugs=drugs, universe="Default", 
                                  type="GO", ont="MF", pvalueCutoff=0.05, 
                                  pAdjustMethod="BH", qvalueCutoff=0.1, 
                                  minGSSize=10, maxGSSize=500)
result(dup_hyperG_res)
#KEGG
dup_hyperG_k_res <- tsea_dup_hyperG(drugs=drugs, universe="Default", type="KEGG",
                                    pvalueCutoff=0.5, pAdjustMethod="BH", qvalueCutoff=0.5,
                                    minGSSize=10, maxGSSize=500)
result(dup_hyperG_k_res)
## Mapping 'itemID' column in the FEA enrichment result table from Entrez ID to gene Symbol
set_readable(result(dup_hyperG_k_res))
#REACTOME
dup_rct_res <- tsea_dup_hyperG(drugs=drugs, type="Reactome",
                               pvalueCutoff=0.5, qvalueCutoff=0.5, readable=TRUE)
result(dup_rct_res)

6.1.2. mGSEA Method

6.1.3. MeanAbs Method

6.2. DSEA

6.2.1. Hypergeometric Test

#GO
drugs <- unique(result(lincs)$pert[1:10])
hyperG_res <- dsea_hyperG(drugs=drugs, type="GO", ont="MF")
result(hyperG_res)

#KEGG
hyperG_k_res <- dsea_hyperG(drugs = drugs, type = "KEGG", 
                            pvalueCutoff = 1, qvalueCutoff = 1, 
                            minGSSize = 10, maxGSSize = 2000)
result(hyperG_k_res)