
Exploratory Data Analysis in R: the xda Package

2019-09-14  柳叶刀与小鼠标

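The xda package provides a set of tools for running an initial exploratory analysis on any input dataset. It includes custom functions for plotting data and for performing different kinds of analysis, such as univariate, bivariate, and multivariate investigation, which is the first step of any predictive modeling pipeline. The package can be used to get a thorough understanding of a dataset before you start building predictive models.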

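The package currently includes the following functions:

numSummary(mydata) automatically detects all numeric columns in the data frame mydata and provides their summary statistics.

charSummary(mydata) automatically detects all character columns in the data frame mydata and provides their summary statistics.

Plot(mydata,dep.var) plots every independent variable in the data frame mydata against the dependent variable specified by the dep.var argument.

removeSpecial(mydata,vec) replaces all special characters (specified by the vector vec) in the data frame mydata with NA.

bivariate(mydata,dep.var,indep.var) performs a bivariate analysis between the dependent variable dep.var and the independent variable indep.var in the data frame mydata.

Note: all of the functions above require mydata to be a data.frame. Convert the input dataset to a data.frame before using any function in this package.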
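The data.frame requirement and the idea behind removeSpecial() can be illustrated with base R alone. The sketch below is not xda's implementation: the toy data, the names `raw`, `special`, and `cleaned`, and the choice to match whole cell values are all assumptions made for illustration.

```r
# Hypothetical toy input held in a plain list (not yet a data.frame)
raw <- list(
  id    = 1:4,
  grade = c("A", "?", "B", ""),
  score = c(90, 85, -999, 70)
)

# Convert to a data.frame, as the package functions require
mydata <- as.data.frame(raw, stringsAsFactors = FALSE)

# Values treated as "special" in this sketch (an assumption, compared as text)
special <- c("?", "", "-999")

# Base R stand-in for the described behaviour: replace special values with NA
cleaned <- mydata
for (col in names(cleaned)) {
  hit <- as.character(cleaned[[col]]) %in% special
  cleaned[[col]][hit] <- NA
}

print(cleaned)
```

With xda installed, the corresponding call would presumably be removeSpecial(mydata, special); check ?removeSpecial for the exact matching rules.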

Installation

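The recommended way to install xda is to install the devtools package first. To install devtools, follow the instructions [here](https://github.com/hadley/devtools). Then install xda with the following commands: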

```r
library(devtools)
install_github("ujjwalkarn/xda")
```
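If the installation is part of a script that may run more than once, it can help to install only what is missing. The helper names `has_pkg` and `install_xda_if_missing` below are hypothetical; the sketch assumes network access to CRAN and GitHub when the function is actually called.

```r
# TRUE if a package can be loaded from the current library paths
has_pkg <- function(pkg) requireNamespace(pkg, quietly = TRUE)

# Install devtools and xda only when they are not already available
install_xda_if_missing <- function() {
  if (!has_pkg("devtools")) install.packages("devtools")
  if (!has_pkg("xda")) devtools::install_github("ujjwalkarn/xda")
  invisible(has_pkg("xda"))
}

# Call install_xda_if_missing() once, then library(xda) as usual
```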

Usage

See the documentation of each function to learn how to use it. For example, to view the documentation for numSummary(), run ?numSummary

## load the package into the current session

library(xda)

numSummary()

## to view a comprehensive summary for all numeric columns in the iris dataset

numSummary(iris)

## n = total number of rows for that variable
## nunique = number of unique values
## nzeros = number of zeroes
## iqr = interquartile range
## noutlier = number of outliers
## miss = number of rows with missing value
## miss% = percentage of total rows with missing values ((miss/n)*100)
## 5% = 5th percentile value of that variable (value below which 5 percent of the observations may be found)
## the percentile values are helpful in detecting outliers
Output
> numSummary(iris)

                n mean    sd max min range nunique nzeros  iqr lowerbound upperbound noutlier kurtosis skewness mode miss miss%   1%   5% 25%  50% 75%  95%  99%
 Sepal.Length 150 5.84 0.828 7.9 4.3   3.6      35      0 1.30       3.15       8.35        0   -0.606    0.309  5.0    0     0 4.40 4.60 5.1 5.80 6.4 7.25 7.70
 Sepal.Width  150 3.06 0.436 4.4 2.0   2.4      23      0 0.50       2.05       4.05        4    0.139    0.313  3.0    0     0 2.20 2.34 2.8 3.00 3.3 3.80 4.15
 Petal.Length 150 3.76 1.765 6.9 1.0   5.9      43      0 3.55      -3.72      10.42        0   -1.417   -0.269  1.4    0     0 1.15 1.30 1.6 4.35 5.1 6.10 6.70
 Petal.Width  150 1.20 0.762 2.5 0.1   2.4      22      0 1.50      -1.95       4.05        0   -1.358   -0.101  0.2    0     0 0.10 0.20 0.3 1.30 1.8 2.30 2.50
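The lowerbound, upperbound, and noutlier columns in this output are consistent with the usual 1.5 × IQR rule (for Sepal.Length, 5.1 − 1.5 × 1.3 = 3.15). The base R sketch below reproduces those columns for Sepal.Width under that assumption; it is inferred from the numbers shown, not taken from xda's source.

```r
x <- iris$Sepal.Width

# Quartiles and interquartile range
q   <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Outlier fences under the 1.5 * IQR rule (assumed, see lead-in)
lowerbound <- q[1] - 1.5 * iqr
upperbound <- q[2] + 1.5 * iqr
noutlier   <- sum(x < lowerbound | x > upperbound)

# Missing-value columns: miss% = (miss / n) * 100
n        <- length(x)
miss     <- sum(is.na(x))
miss_pct <- miss / n * 100

print(round(c(iqr = iqr, lowerbound = lowerbound, upperbound = upperbound,
              noutlier = noutlier, miss = miss, miss_pct = miss_pct), 2))
```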

charSummary()

## to view a comprehensive summary for all character columns in the warpbreaks dataset

charSummary(warpbreaks)

## n = total number of rows for that variable
## miss = number of rows with missing value
## miss% = percentage of total rows with missing values ((miss/n)*100)
## unique = number of unique levels of that variable
## top5levels:count = top 5 levels (unique values) in each column sorted by count
## for example, wool has 2 unique levels 'A' and 'B' each with count of 27 

Output
> charSummary(warpbreaks)

          n miss miss% unique top5levels:count
 wool    54    0     0      2       A:27, B:27
 tension 54    0     0      3 H:18, L:18, M:18

bivariate()

## to perform bivariate analysis between 'Species' and 'Sepal.Length' in the iris dataset

bivariate(iris,'Species','Sepal.Length')

## bin_Sepal.Length = 'Sepal.Length' variable has been binned into 4 equal intervals (original range is [4.3,7.9])
## for each interval of 'Sepal.Length', the number of samples from each category of 'Species' is shown 
## e.g. 39 of the 50 setosa samples have Sepal.Length in the range (4.3,5.2], and so on.
## the number of intervals (4 in this case) can be customized (see documentation)

Output
> bivariate(iris,'Species','Sepal.Length')

   bin_Sepal.Length setosa versicolor virginica
 1        (4.3,5.2]     39          5         1
 2        (5.2,6.1]     11         29        10
 3          (6.1,7]      0         16        27
 4          (7,7.9]      0          0        12
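A cross-tabulation of this kind can be approximated in base R with cut() and table(). The sketch below spells out the four interval boundaries explicitly and closes the first interval on the left so that the minimum value 4.3 is counted; whether bivariate() bins in exactly this way is an assumption.

```r
# Four equal-width intervals over the observed range [4.3, 7.9]
breaks <- c(4.3, 5.2, 6.1, 7.0, 7.9)

# include.lowest = TRUE keeps the minimum value in the first bin
bin_Sepal.Length <- cut(iris$Sepal.Length, breaks = breaks,
                        include.lowest = TRUE)

# Rows are the Sepal.Length bins, columns are the Species categories
tab <- table(bin_Sepal.Length, iris$Species)
print(tab)
```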

Plot()

## to plot all other variables against the 'Petal.Length' variable in the iris dataset

Plot(iris,'Petal.Length')

## some interesting patterns can be seen in the plots below and these insights can be used for predictive modeling
Output
> Plot(iris,'Petal.Length')
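For readers without xda installed, a rough base R approximation of "plot every other variable against the dependent variable" is sketched below. It is not the package's Plot() and its output will look different; the 2 × 2 layout and the variable names `dep` and `others` are assumptions.

```r
dep    <- "Petal.Length"
others <- setdiff(names(iris), dep)

# One panel per remaining variable, arranged in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
for (v in others) {
  # Numeric predictors give scatter plots; the factor Species gives box plots
  plot(iris[[v]], iris[[dep]],
       xlab = v, ylab = dep, main = paste(dep, "vs", v))
}
par(old_par)
```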