
[Paper Reading] REVIGO

2020-07-24  ShawnMagic

Paper: REVIGO Summarizes and Visualizes Long Lists of Gene Ontology Terms

Publication dates: Received March 2, 2011; Accepted June 7, 2011; Published July 18, 2011

Journal: PLoS ONE

Citations: 1980

DOI: https://doi.org/10.1371/journal.pone.0021800

Background and abstract

Original text:

Outcomes of high-throughput biological experiments are typically interpreted by statistical testing for enriched gene functional categories defined by the Gene Ontology (GO). The resulting lists of GO terms may be large and highly redundant, and thus difficult to interpret. REVIGO is a Web server that summarizes long, unintelligible lists of GO terms by finding a representative subset of the terms using a simple clustering algorithm that relies on semantic similarity measures. Furthermore, REVIGO visualizes this non-redundant GO term set in multiple ways to assist in interpretation: multidimensional scaling and graph-based visualizations accurately render the subdivisions and the semantic relationships in the data, while treemaps and tag clouds are also offered as alternative views. REVIGO is freely available at http://revigo.irb.hr/.

Key points:

Introduction

Para 1: Covers GO enrichment analysis in general; skipped here.

Para 2: How semantic redundancy affects the results

As high-throughput techniques become cheaper and more accurate, they detect even slight changes in gene expression or other measured properties. The lists of relevant genes will grow in size, and so will the derived lists of GO terms. Additionally, the redundancy in the resulting set of GO terms confounds interpretation and inflates the perceived number of biologically relevant results. This is frequently the case when analyzing terms in a parent-child relationship, e.g. the parent term ‘‘GO:0009058 biosynthetic process’’ fully encompasses its child term ‘‘GO:0008610 lipid biosynthetic process’’. In a list of terms enriched with overexpressed genes, if the child term has highly statistically significant enrichment, the parent term might appear significantly enriched purely as a consequence of including all the genes from the child term.

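The most common example: GO:0009058 biosynthetic process is the parent of GO:0008610 lipid biosynthetic process. If the genes in the lipid biosynthetic process term show highly significant enrichment, the parent term biosynthetic process may look significantly enriched purely because it contains all the genes of the child term.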
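To make the parent/child effect concrete, here is a minimal sketch using a one-sided hypergeometric test. All gene counts are hypothetical numbers chosen for illustration, and the test is a common choice for enrichment analysis rather than anything REVIGO itself prescribes.

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K belong to the term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

# Hypothetical numbers, for illustration only.
N = 10_000   # genes in the background
n = 100      # genes in the study list

child_size, child_hits = 50, 20   # child term: strongly enriched
parent_size = 300                 # parent term, includes the 50 child genes
other_hits = 2                    # hits among the 250 parent genes outside the child
parent_hits = child_hits + other_hits

p_child = hypergeom_tail(child_hits, child_size, n, N)
p_parent = hypergeom_tail(parent_hits, parent_size, n, N)
# The same parent term with the child's genes left out of the term.
p_parent_without_child = hypergeom_tail(other_hits, parent_size - child_size, n, N)

print(f"child:                {p_child:.3g}")
print(f"parent:               {p_parent:.3g}")
print(f"parent without child: {p_parent_without_child:.3g}")
```

With these made-up counts the parent term looks highly significant only because it inherits the child's hits; once the child's genes are left out, the rest of the parent term is hit at roughly the background rate.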

Para 3: Introduces some existing redundancy-reduction tools (GOrilla, RedundancyMiner); not covered here.

Para 4: GO slims and their pros and cons

In the same vein, researchers may attempt to simplify long GO term lists by replacing the full Gene Ontology with ‘‘GO Slims’’, cut-down versions of the Gene Ontology. The GO slims are, however, limited to general (high-level) GO terms which are typically less interesting than the more fine-grained terms – the ones that have been removed from the GO slims. Thus, the problem of weeding out the redundant GO terms is not easily solved by removing the GO terms’ descendants (or ancestors) in this manner. The complex structure of the GO warrants a solution that takes into account the terms’ proximity in the GO graph, quantified by the GO term ‘semantic similarity’ measures [8].

Para 5: What REVIGO does

We have implemented a computational approach that (a) summarizes long GO lists by reducing functional redundancies, and (b) visualizes the remaining GO terms in two-dimensional plots, interactive graphs, treemaps or tag clouds. Both the summarization and the visualization step draw on the concept of GO term semantic similarity, reviewed in [8]. In particular, several common measures of semantic similarity [9] that employ the ‘most informative common ancestor’ approach are supported. The implementation is freely available as the REVIGO Web server at http://revigo.irb.hr/.

Results and discussion

A simple algorithm to reduce redundancy within lists of GO terms

To mitigate the problem of large and redundant lists, we aim to find a single representative GO term for each of these clusters. REVIGO performs a simple clustering procedure which is in concept similar to the hierarchical (agglomerative) clustering methods such as the neighbor joining approach [10]. A flowchart of the steps in the algorithm is given in Fig. 1.

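REVIGO first clusters the redundant list of GO terms and then picks one representative GO term for each cluster. This simple clustering procedure is similar in concept to hierarchical (agglomerative) methods such as the neighbor joining approach.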

The intuition behind this procedure is to form groups of highly similar GO terms, where the choice of the groups’ representatives is guided by the p-values, enrichments or similar values that the user supplies alongside the GO terms (Fig. 1).

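In other words, highly similar GO terms are grouped together, and the choice of each group's representative is guided by the p-values, enrichments or similar values that the user supplies.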

If the p-values are quite close and one term is a child node of the other, REVIGO will tend to choose the parent term, with a possible exception when the terms are deemed to be de facto equivalent (Fig. 1, see caption). Note that REVIGO generally does not prioritize higher-level or lower-level GO terms as cluster representatives – instead, the user-supplied p-values/enrichments are used to guide the selection, if possible.

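When the p-values are close and one term is a child of the other, REVIGO tends to pick the parent term, except when the two terms are deemed de facto equivalent.
REVIGO does not generally favor higher-level or lower-level GO terms as cluster representatives; where possible, the user-supplied p-values or enrichments guide the selection.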

Very general GO terms, however, are always avoided as cluster representatives (Fig. 1) as they tend to be uninformative. It is also possible to manually override the choice of the representative GO term using the ‘pin’ option in case the default solution is not satisfactory for the user e.g. when a more general, higher-level term is desired to represent the group.
The user does not necessarily need to provide previously determined p-values or another numerical value alongside the GO terms. In that case, REVIGO will prioritize the terms with higher ‘uniqueness’, the negative of average similarity of a term to all other terms.

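REVIGO always avoids very general GO terms as cluster representatives because they are uninformative (for example, something as broad as a catalytic reaction term). The 'pin' option lets you override the chosen representative manually, for instance when you want a more general, higher-level term to represent the cluster.
P-values are not mandatory either: without them, REVIGO prioritizes the terms with higher uniqueness.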
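The uniqueness definition quoted above (the negative of a term's average similarity to all other terms) can be sketched in a few lines. The three terms and their pairwise similarities below are hypothetical, and the function is only an illustration of the quoted definition, not REVIGO's own code.

```python
# Hypothetical pairwise similarities between three placeholder terms.
SIMILARITY = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "C")): 0.1,
    frozenset(("B", "C")): 0.2,
}

def sim(a, b):
    """Look up the (symmetric) similarity of two distinct terms."""
    return SIMILARITY[frozenset((a, b))]

def uniqueness(term, terms):
    """Negative of the average similarity of `term` to all other terms."""
    others = [t for t in terms if t != term]
    return -sum(sim(term, t) for t in others) / len(others)

terms = ["A", "B", "C"]
scores = {t: uniqueness(t, terms) for t in terms}
most_unique = max(scores, key=scores.get)
print(scores, most_unique)
```

Term C is far from both A and B, so it gets the highest uniqueness and would be prioritized when no p-values are supplied.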

The terms that remain in the list after the algorithm has finished are the cluster representatives, where it is guaranteed that no two representatives will be more similar than a user-provided cutoff value C. In other words, a lower (more stringent) value of C will result in a shorter, but also a more semantically diverse list. To offer some bearing on the relationship of C to statistical significance, we conducted a simulation where we drew random pairs of GO terms and recorded the distribution of the SimRel semantic similarity measure [11] (default in REVIGO). One percent of randomly generated GO term pairs have SimRel > 0.53. Therefore, at C = 0.53 there is a 99% chance an above-background similarity exists between each pair of terms in a cluster. REVIGO offers four pre-defined values of C (0.9, 0.7, 0.5 and 0.4) to the user. The lowest value of C = 0.4 – corresponding to the ‘‘tiny’’ list size – should be used with caution, as many GO terms might be removed from the list without strong statistical support for their redundancy with respect to other terms. The values of C = 0.7 (default) and 0.9 are much more conservative in this respect, but may not shorten the list enough.
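For readers who have not met SimRel, the sketch below computes it on a tiny toy ontology. It uses the formula as I understand it from the SimRel publication (the Lin similarity of the common ancestor, weighted by one minus that ancestor's annotation frequency, maximized over all common ancestors). The term frequencies and the ontology structure are hypothetical, and this is not REVIGO's implementation.

```python
from math import log

# Hypothetical annotation frequencies p(t): the fraction of genes annotated
# with each term or any of its descendants.
FREQ = {
    "biological_process": 1.0,
    "biosynthetic process": 0.2,
    "lipid biosynthetic process": 0.02,
    "amino acid biosynthetic process": 0.02,
    "sensory perception": 0.05,
}

# Ancestor sets of the toy ontology; each set includes the term itself.
ANCESTORS = {
    "biological_process": {"biological_process"},
    "biosynthetic process": {"biosynthetic process", "biological_process"},
    "lipid biosynthetic process": {
        "lipid biosynthetic process", "biosynthetic process", "biological_process"},
    "amino acid biosynthetic process": {
        "amino acid biosynthetic process", "biosynthetic process", "biological_process"},
    "sensory perception": {"sensory perception", "biological_process"},
}

def sim_rel(a, b):
    """SimRel: max over common ancestors c of
    [2 log p(c) / (log p(a) + log p(b))] * (1 - p(c))."""
    denom = log(FREQ[a]) + log(FREQ[b])
    if denom == 0:  # both terms are the root, which carries no information
        return 0.0
    best = 0.0
    for c in ANCESTORS[a] & ANCESTORS[b]:
        value = (2 * log(FREQ[c]) / denom) * (1 - FREQ[c])
        best = max(best, value)
    return best

siblings = sim_rel("lipid biosynthetic process", "amino acid biosynthetic process")
parent_child = sim_rel("lipid biosynthetic process", "biosynthetic process")
unrelated = sim_rel("lipid biosynthetic process", "sensory perception")
print(siblings, parent_child, unrelated)
```

Terms whose only common ancestor is the root score 0, while a parent and its child score higher than two siblings under the same parent. A cutoff C is then simply a threshold on this number.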

Figure 1: A flowchart describing the REVIGO algorithm to remove redundant GO terms from the provided GO term list.

Visualization in scatterplots and interactive graphs

In drawing scatterplots (Fig. 2), the challenge lies in assigning x and y coordinates to each term so that more semantically similar GO terms are also closer in the plot. Here, we employ a multidimensional scaling procedure which initially places the terms using an eigenvalue decomposition of the terms’ pairwise distance matrix. This is followed by a stress minimization step which iteratively improves the agreement between the GO terms’ semantic similarities and their closeness in the displayed two-dimensional space. The GO terms and associated data (term descriptions, p-values/enrichments, uniqueness, etc.) can be exported to a convenient text table and downloaded.

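First, the scatterplot: each GO term is given an x, y coordinate so that semantically similar GO terms end up closer together in the plot. The paper does this with multidimensional scaling, using an initial placement from an eigenvalue decomposition of the pairwise distance matrix followed by iterative stress minimization (I did not follow all of the details). The end goal is the one just described, and the GO terms with their associated values are listed in a table on the website that can be downloaded.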

Figure 2
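The stress minimization step mentioned above can be illustrated with a small pure-Python sketch. Note the substitutions: it starts from a random layout instead of the eigenvalue decomposition used in the paper, it uses plain gradient descent with step halving, and the four-term similarity matrix is hypothetical.

```python
import math
import random

def stress(points, target):
    """Sum of squared differences between plotted and target distances."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += (math.dist(points[i], points[j]) - target[i][j]) ** 2
    return total

def gradient(points, target):
    """Gradient of the stress with respect to every point."""
    grads = [[0.0, 0.0] for _ in points]
    for i in range(len(points)):
        for j in range(len(points)):
            if i == j:
                continue
            d = math.dist(points[i], points[j]) or 1e-9  # avoid division by zero
            coeff = 2.0 * (d - target[i][j]) / d
            grads[i][0] += coeff * (points[i][0] - points[j][0])
            grads[i][1] += coeff * (points[i][1] - points[j][1])
    return grads

def minimize_stress(target, iterations=200, seed=0):
    """Gradient descent that only accepts steps which lower the stress."""
    rng = random.Random(seed)
    points = [[rng.random(), rng.random()] for _ in target]  # random start
    initial = current = stress(points, target)
    step = 0.1
    for _ in range(iterations):
        grads = gradient(points, target)
        trial = [[p[0] - step * g[0], p[1] - step * g[1]]
                 for p, g in zip(points, grads)]
        trial_stress = stress(trial, target)
        if trial_stress < current:
            points, current = trial, trial_stress
        else:
            step /= 2  # overshoot: retry with a smaller step
    return points, initial, current

# Hypothetical similarities for four terms; distance = 1 - similarity.
similarity = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
]
target = [[1.0 - s for s in row] for row in similarity]

coords, initial_stress, final_stress = minimize_stress(target)
print(initial_stress, final_stress)
```

The returned coordinates are what would be drawn in the scatterplot; the lower the final stress, the better the plotted distances agree with the semantic distances.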

REVIGO also allows the user to make a graph-based visualization (Fig. 3). Each of the GO terms is a node in the graph, and 3% of the strongest GO term pairwise similarities are designated as edges in the graph. The threshold value of 3% was derived empirically; we found it strikes a good balance between over-connected graphs with no visible subgroups on the one hand, and very fragmented graphs with too many small groups on the other hand. The placement of the nodes is determined by the ForceDirected layout algorithm as implemented in Cytoscape Web [12]. In addition to being viewed in the Web browser, the graph may be exported to an XGMML file, or opened in the standalone Cytoscape program [13] via Java Web Start to produce high resolution, publication-quality images. Both visualizations indicate the generality of the GO terms by the bubble radius, where smaller bubbles imply more specific terms; the user-supplied p-values/enrichments are shown using color shading.

Figure 3
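The edge selection for the graph view (keep the strongest 3% of pairwise similarities) is easy to sketch. The similarity function below is a hypothetical stand-in that works on integer term indices, and the rounding rule is my assumption, since the paper does not state how the 3% is rounded.

```python
from itertools import combinations

def strongest_edges(terms, sim, fraction=0.03):
    """Return the top `fraction` of term pairs, ranked by similarity."""
    pairs = sorted(combinations(terms, 2), key=lambda p: sim(*p), reverse=True)
    keep = max(1, round(len(pairs) * fraction))  # rounding rule is an assumption
    return pairs[:keep]

def toy_sim(a, b):
    """Hypothetical similarity: terms with nearby indices are more similar."""
    return 1.0 / (1.0 + abs(a - b))

terms = list(range(40))          # 40 placeholder terms, 780 possible pairs
edges = strongest_edges(terms, toy_sim)
print(len(edges), edges[:3])
```

With 40 terms there are 780 possible pairs, so about 23 of them become edges; everything else stays unconnected, which is what lets subgroups show up in the layout.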

Two additional views of the user’s data are supported in REVIGO. Treemaps (Fig. 4) show a two-level hierarchy of GO terms – the cluster representatives from the scatterplot and the graph are here joined into several very high-level groups. Tag clouds show (a) keywords which are overrepresented in the GO terms’ descriptions in the GO term list provided by the user (Fig. 5)

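The treemap shows a two-level hierarchy of GO terms: the cluster representatives from the scatterplot and the graph are grouped into several very high-level groups.

The tag cloud works like any ordinary word cloud: it highlights the keywords that are overrepresented in the descriptions of the GO terms in the user's list.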

Figure 4  Figure 5

An example use-case: summarizing the putative targets of a transcription factor

Table 1

To illustrate how REVIGO’s redundancy elimination algorithm (Fig. 1) works, we turn to a ‘toy example’ which has seven GO categories with associated p-values (Fig. 6). This dataset [14] lists gene functional categories co-expressed with the human gene coding for the transcription factor ZNF417, but not with the highly related protein ZNF587, measured using Affymetrix U133plus2 microarrays. The ZNF417 is an evolutionarily recent, great ape-specific transcription factor of which the ZNF587 is a more ancient homolog [14]; gene functions associated specifically to ZNF417 were found to be associated with brain development.

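The example dataset: gene functional categories that are co-expressed with the gene for the transcription factor ZNF417 but not with ZNF587. ZNF587 is the more ancient homolog, while ZNF417 is evolutionarily recent and specific to great apes. The gene functions specifically associated with ZNF417 were found to be linked to brain development.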

A casual inspection reveals subgroups of redundant gene functions. For instance, the GO term ‘‘cerebral cortex neuron differentiation’’ has a high semantic similarity (SimRel = 0.72) to ‘‘telencephalon development’’ and is therefore removed by merging it into the cluster represented by the term having a more significant p-value (Fig. 6). The removed term is assigned a ‘dispensability’ value of 0.72, a relatively high value reflecting the removed term’s strong redundancy with respect to the chosen representative. In the next group of terms, ‘‘astrocyte differentiation’’ and ‘‘negative regulation of neuron differentiation’’ are similar (0.74 and 0.62, respectively) to ‘‘negative regulation of glial cell differentiation’’. Due to a weaker p-value, the first two terms are merged into a cluster represented by the last term (Fig. 6). Note how the choice of cluster representatives is unaffected by whether terms are more general or more specific. The highest remaining pairwise similarity (here, 0.40) is below the user-defined threshold C, here set to 0.5, and the clustering algorithm stops. In other words, after having removed the redundant terms, the ones that remain as the cluster representatives are those terms having dispensability values below C. The example list of seven GO terms has been reduced to four clusters, of which two are singletons

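"cerebral cortex neuron differentiation" has a semantic similarity of 0.72 (SimRel = 0.72) to "telencephalon development", so it is removed from the list and merged into the cluster represented by "telencephalon development", which has the more significant p-value. The removed term is assigned a dispensability of 0.72, a relatively high value that reflects its strong redundancy with respect to the chosen representative.

Likewise, "astrocyte differentiation" and "negative regulation of neuron differentiation" have SimRel values of 0.74 and 0.62 to "negative regulation of glial cell differentiation"; because their p-values are weaker, both are merged into the cluster it represents. After these merges the highest remaining pairwise similarity is 0.40, which is below the threshold C = 0.5, so the loop stops.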

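In short, the GO terms are clustered by following the flowchart in Figure 1 and using the SimRel values, which removes the semantic redundancy; in this example, 7 GO terms are reduced to 4.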
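The toy example can be replayed with a heavily simplified version of the procedure: repeatedly take the most similar pair, stop once it falls below C, otherwise drop the term with the weaker p-value. This is only a sketch. It leaves out the checks for very general terms, close p-values and parent/child relations from Figure 1. The similarities 0.72, 0.74 and 0.62 come from the paper's text; the p-values, the default similarity of 0.10 for unlisted pairs, and the assignment of the 0.40 value to a specific pair are hypothetical.

```python
from itertools import combinations

# Hypothetical p-values (the text only says which term of a pair is stronger).
P_VALUES = {
    "telencephalon development": 1e-5,
    "cerebral cortex neuron differentiation": 1e-3,
    "negative regulation of glial cell differentiation": 1e-4,
    "astrocyte differentiation": 1e-2,
    "negative regulation of neuron differentiation": 1e-2,
    "regulation of dopamine metabolism": 1e-3,
    "sensory perception of chemical stimulus": 1e-3,
}

SIMILARITY = {
    frozenset(("cerebral cortex neuron differentiation",
               "telencephalon development")): 0.72,
    frozenset(("astrocyte differentiation",
               "negative regulation of glial cell differentiation")): 0.74,
    frozenset(("negative regulation of neuron differentiation",
               "negative regulation of glial cell differentiation")): 0.62,
    # Assumption: which pair carries the remaining 0.40 is not stated.
    frozenset(("telencephalon development",
               "negative regulation of glial cell differentiation")): 0.40,
}
DEFAULT_SIM = 0.10  # hypothetical value for every pair not listed above

def sim(a, b):
    return SIMILARITY.get(frozenset((a, b)), DEFAULT_SIM)

def reduce_terms(p_values, sim, cutoff):
    """Greedy redundancy removal (simplified; see the caveats above)."""
    remaining = set(p_values)
    dispensability, merged_into = {}, {}
    while len(remaining) > 1:
        a, b = max(combinations(sorted(remaining), 2), key=lambda p: sim(*p))
        s = sim(a, b)
        if s < cutoff:  # highest remaining similarity is below C: stop
            break
        keep, drop = (a, b) if p_values[a] <= p_values[b] else (b, a)
        remaining.remove(drop)
        dispensability[drop] = s
        merged_into[drop] = keep
    return remaining, dispensability, merged_into

representatives, dispensability, merged_into = reduce_terms(P_VALUES, sim, cutoff=0.5)
for term in sorted(representatives):
    print("representative:", term)
for term, target in merged_into.items():
    print(f"{term} -> {target} (dispensability {dispensability[term]})")
```

Under these assumptions the seven terms collapse to four representatives, two of which are singletons, matching the outcome described in the paper.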

If C is set higher, e.g. 0.7 or 0.9, less redundancy is removed and the list may not be shortened enough.

A possible alternative for REVIGO’s summarization procedure are the frequently used ‘‘GO slims’’. Here, the seven terms are quite specific and consequently none of them is in the ‘‘generic’’ or ‘‘PIR’’ GO slims (http://www.geneontology.org/GO.slims.shtml). Therefore, the GO slim approach would not apply to this dataset, illustrating the general principle of how summarizing the list by filtering out the more specific (or equivalently, higher information content) GO terms results in a loss of the potentially more interesting results.

This paragraph calls out GO slims: the seven terms are too specific to appear in the GO slims, so GO slims cannot be applied to this example.

In addition to the ‘dispensability’ values, REVIGO provides ‘uniqueness’ values. These two values are anticorrelated, though not perfectly, since ‘uniqueness’ measures whether the term is an outlier when compared semantically to the whole list (without regard for the p-values), while the ‘dispensability’ compares a term to other semantically close terms and is assigned based both on the semantic distance and the supplied p-values.
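This introduces uniqueness. Dispensability and uniqueness are anticorrelated, though not perfectly: uniqueness measures whether a GO term is an outlier relative to the whole list (ignoring p-values), while dispensability compares a term with its semantically close neighbours and depends on both the semantic distance and the supplied p-values.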

To demonstrate the multidimensional scaling-based visualization in REVIGO, we visualize these terms in Fig. 7; for illustrative purposes, all seven terms are visible in this instance, instead of only the four cluster representatives. Here, it can be seen how two terms are quite distinct from the rest and also from each other: ‘‘regulation of dopamine metabolism’’ and ‘‘sensory perception of chemical stimulus’’ – these terms were not assigned to any of the clusters in the redundancy elimination procedure described above. The remaining five terms are more closely related, where the ‘‘telencephalon development’’ and ‘‘negative regulation of glial cell differentiation’’ have more significant p-values than the three other terms and were thus chosen as cluster representatives.

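Reading the result figure: Fig. 7 shows all seven terms rather than only the four cluster representatives. Two terms, "regulation of dopamine metabolism" and "sensory perception of chemical stimulus", sit far from the rest and from each other, and were not assigned to any cluster during redundancy elimination. The other five terms lie closer together; among them, "telencephalon development" and "negative regulation of glial cell differentiation" have more significant p-values than the other three and were therefore chosen as cluster representatives.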

Figure 7

The paper closes with a comparison against other tools; I have not used them, so I will not discuss that part.

=========END===========
