2015-Human genomics-A survey of

2017-08-19 英天

1. The computational tools fall into three categories

(1) basic traditional statistical analysis,
(2) machine learning approaches, and
(3) assignment of functional and biological information to describe and understand protein interaction networks.

2. Guidelines for analyzing big data

Step one: Observe your data, quality control

Step two: Traditional statistics

Groups identified by the researcher, either during experimental design or during the data observation step, can be compared here using Student’s t test, analysis of variance (ANOVA), and their nonparametric equivalents such as Kruskal-Wallis, in addition to regression modeling and other tests of traditional statistics. When many tests are run simultaneously, the resulting p-values should be adjusted with a multiple-testing correction such as the Benjamini-Hochberg procedure.
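The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the paper: the function name `benjamini_hochberg`, the example p-values, and the 0.05 threshold are all assumptions chosen for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, e.g. one t test per protein
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
q_values = benjamini_hochberg(p_values)

# Features still significant at an assumed false discovery rate of 5%
significant = [q <= 0.05 for q in q_values]
```

With these example values, five tests have raw p-values below 0.05, but only the two smallest remain below 0.05 after adjustment.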

Step three: Dimension reduction with machine learning

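Use the classification and clustering algorithms summarized in Table 1 to reduce the number of features. These algorithms fall into two classes: unsupervised and supervised.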

   (1) Unsupervised

Principal component analysis (PCA)
Independent component analysis (ICA)
K-means
Hierarchical clustering

   (2) Supervised

Partial least squares (PLS)
Random forests (RF)
Support vector machine (SVM)

Software tools that support the algorithms above include Weka [14], Scikit-learn (Machine Learning in Python) [15], and SHOGUN [16].
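As a rough illustration of the two classes, the sketch below uses scikit-learn to run one unsupervised method (PCA) and one supervised method (random forest) on a synthetic data matrix. It assumes NumPy and scikit-learn are installed. The data, the group labels, the choice of two components, and the cutoff of five top features are all made up for the example and do not come from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for an omics matrix: 40 samples x 200 features
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))
y = np.array([0] * 20 + [1] * 20)   # two hypothetical experimental groups
X[y == 1, :5] += 3.0                # only the first five features differ

# Unsupervised: project the samples onto the first two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Supervised: rank features by random forest importance using the labels
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)
top_features = np.argsort(forest.feature_importances_)[::-1][:5]
```

PCA ignores the group labels and returns new composite axes, whereas the random forest uses the labels and ranks the original features, so the two outputs answer different questions about the same matrix.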

Table 1 Summary and comparison of classification and clustering methods

Step four: Pathway and network analysis

Pathway analysis here refers to data analysis that aims to identify activated pathways or pathway modules from functional proteomic data.

Network analysis here refers to data analysis that builds, overlays, visualizes, and infers protein interaction networks from functional proteomics and other systems biology data.

Table 2 Summary of functional and network tools

3. Longitudinal or time-series data

Several software tools specifically address the problems associated with time-series data.
TimeClust is a stand-alone tool that is available for different platforms and clusters gene expression data collected over time with distance-based, model-based, and template-based methods [61]. Several other packages are available in R, such as maSigPro [62], timecourse [63], BAT [64], betr [65], fpca [66], timeclip [67], rnits [68], and STEM [69].
In Python, the probabilistic graphical query language (pGQL) [70] lets the user interactively define linear HMM queries on time-course data using rectangular graphical widgets called probabilistic time boxes. The analysis is fully interactive, and the graphical display shows the time courses along with the graphical query.
In Java, PESTS [71] and OPTricluster [72], both stand-alone tools with a GUI, are useful for clustering short time-series data.
In MATLAB, DynamiteC is a dynamic modeling and clustering algorithm that interleaves the clustering of time-course gene expression data with the estimation of dynamic models of their response by biologically meaningful parameters [73].