R小推车

《Modern Statistics for Modern Bi

2019-03-06  本文已影响26人  热衷组培的二货潜

《Modern Statistics for Modern Biology》Chapter 一: 离散数据模型的预测(1.1 - 1.3)

《Modern Statistics for Modern Biology》Chapter 一: 离散数据模型的预测(1.4 - 1.5)

《Modern Statistics for Modern Biology》Chapter 二: 统计建模(2.1-2.3)

《Modern Statistics for Modern Biology》Chapter 二: 统计建模(2.4-2.5)

从这章开始最开始记录一些markdown等小知识
$\hat{p}=\frac{1}{12}$\hat{p}=\frac{1}{12}
掌握R语言中的apply函数族
卡方检验
Hardy-Weinberg equilibrium( 哈迪-温伯格平衡 )

2.6 The χ2 distribution

卡方检验是一种用途很广的计数资料的假设检验方法。它属于非参数检验的范畴,主要是比较两个及两个以上样本率( 构成比)以及两个分类变量的关联性分析。其根本思想就是在于比较理论频数和实际频数的吻合程度或拟合优度问题

卡方检验的基本思想

卡方检验是以χ2分布为基础的一种常用假设检验方法,它的无效假设H0是:观察频数与期望频数没有差别。
该检验的基本思想是:首先假设H0成立,基于此前提计算出χ2值,它表示观察值与理论值之间的偏离程度。根据χ2分布及自由度可以确定在H0假设成立的情况下获得当前统计量及更极端情况的概率P。如果P值很小,说明观察值与理论值偏离程度太大,应当拒绝无效假设,表示比较资料之间有显著差异;否则就不能拒绝无效假设,尚不能认为样本所代表的实际情况和理论假设有差别。


Ai为i水平的观察频数,Ei为i水平的期望频数,n为总频数,pi为i水平的期望频率。i水平的期望频数Ei等于总频数n×i水平的期望概率pi,k为单元格数。当n比较大时,χ2统计量近似服从k-1(计算Ei时用到的参数个数)个自由度的卡方分布

2.6.1 Intermezzo: quantiles and the quantile-quantile plot

问题 2.10

qs = ppoints(100)
quantile(simulstat, qs)
quantile(qchisq(qs, df = 30), qs)
hist() 即可

问题 2.11 0.5分位数的另一个名称?中位数

问题 2.12

> qqplot(qchisq(ppoints(B), df = 30), simulstat, main = "",
+ xlab = expression(chi[nu == 30]^2),asp = 1, cex = 0.5, pch = 16)
> abline(a = 0, b = 1, col = "red" )
图 2.8
> 1 - pchisq(S1, df = 30)
[1] 4.74342e-05

2.7 Chargaff’s Rule

> load("ChargaffTable.RData")
> ChargaffTable
                  A    T    C    G
Human-Thymus   30.9 29.4 19.9 19.8
Mycobac.Tuber  15.1 14.6 34.9 35.4
Chicken-Eryth. 28.8 29.2 20.5 21.5
Sheep-liver    29.3 29.3 20.5 20.7
Sea Urchin     32.8 32.1 17.7 17.3
Wheat          27.3 27.1 22.7 22.8
Yeast          31.3 32.9 18.7 17.1
E.coli         24.7 23.6 26.0 25.7
 library(ggplot2)
 library(reshape2)
data <-  melt(ChargaffTable)
colnames(data) <-  c("organisms", "nucleotide", "frequencies")
p1 <- ggplot(data, aes(nucleotide, frequencies, fill = organisms)) + 
    geom_bar(stat = "identity", position = "dodge") + 
    facet_wrap(~organisms, ncol = 4) + 
    coord_flip() 
library(ggsci)
# devtools::install_github('bbc/bbplot')
library(bbplot)
p1 + bbc_style() + scale_fill_rickandmorty()

p1 + scale_fill_rickandmorty() +
    labs( x = NULL, y = NULL) +
    theme( strip.text = element_text( size = 12, color = "white", hjust = 0.5 ),
       strip.background = element_rect( fill = "#858585", color = NA ),    
       panel.background = element_rect( fill = "#efefef", color = NA ),
       panel.grid.major.x = element_blank(),
       panel.grid.minor.x = element_blank(),
       panel.grid.minor.y = element_blank(),
       panel.grid.major.y = element_line( color = "#b2b2b2" ),
       panel.spacing.x = unit( 1, "cm" ),
       panel.spacing.y = unit( 0.5, "cm" ),
       legend.position = "none" ) 

问题 2.13

> statChf = function(x){
+     sum((x[, "C"] - x[, "G"])^2 + (x[, "A"] - x[, "T"])^2)
+ }
> chfstat = statChf(ChargaffTable)
> permstat = replicate(100000,{
+     permuted = t(apply(ChargaffTable, 1, sample))
+     colnames(permuted) = colnames(ChargaffTable)
+     statChf(permuted)
+ })
> pChf = mean(permstat <= chfstat)
> pChf
[1] 0.00014
图 2.10

问题 2.14

2.7.1 两个分类变量

> HairEyeColor[,, "Female"]
       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8

问题 2.15

> str(HairEyeColor)
 'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
 - attr(*, "dimnames")=List of 3
  ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
  ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
  ..$ Sex : chr [1:2] "Male" "Female"

色盲与性别

> load("Deuteranopia.RData")
> Deuteranopia
          Men Women
Deute      19     2
NonDeute 1981  1998
> chisq.test(Deuteranopia)

        Pearson's Chi-squared test with Yates' continuity correction

data:  Deuteranopia
X-squared = 12.255, df = 1, p-value = 0.0004641

2.7.2 A special multinomial: Hardy-Weinberg equilibrium( 哈迪-温伯格平衡 )

> library("HardyWeinberg")
> data("Mourant")
> Mourant[214:216,]
    Population    Country Total  MM  MN  NN
214    Oceania Micronesia   962 228 436 298
215    Oceania Micronesia   678  36 229 413
216    Oceania     Tahiti   580 188 296  96

> nMM = Mourant$MM[216]
> nMN = Mourant$MN[216]
> nNN = Mourant$NN[216]
> loglik = function(p, q = 1 - p) {
+   2 * nMM * log(p) + nMN * log(2*p*q) + 2 * nNN * log(q)
+ }
> xv = seq(0.01, 0.99, by = 0.01)
> yv = loglik(xv)
> yv
 [1] -2894.4074 -2433.5668 -2166.0994 -1977.8341 -1832.9917 -1715.6356 -1617.2657 -1532.8083
 [9] -1458.9915 -1393.5815 -1334.9857 -1282.0282 -1233.8167 -1189.6578 -1149.0023 -1111.4076
[17] -1076.5123 -1044.0170 -1013.6717  -985.2648  -958.6162  -933.5714  -909.9967  -887.7758
[25]  -866.8071  -847.0012  -828.2793  -810.5714  -793.8153  -777.9555  -762.9425  -748.7316
[33]  -735.2828  -722.5601  -710.5310  -699.1662  -688.4393  -678.3264  -668.8060  -659.8587
[41]  -651.4671  -643.6157  -636.2904  -629.4788  -623.1701  -617.3546  -612.0242  -607.1718
[49]  -602.7917  -598.8792  -595.4307  -592.4440  -589.9177  -587.8516  -586.2467  -585.1050
[57]  -584.4297  -584.2254  -584.4975  -585.2531  -586.5005  -588.2495  -590.5114  -593.2992
[65]  -596.6278  -600.5139  -604.9767  -610.0376  -615.7205  -622.0527  -629.0646  -636.7904
[73]  -645.2687  -654.5430  -664.6624  -675.6828  -687.6674  -700.6888  -714.8299  -730.1866
[81]  -746.8698  -765.0091  -784.7568  -806.2937  -829.8357  -855.6445  -884.0403  -915.4211
[89]  -950.2893  -989.2922 -1033.2827 -1083.4165 -1141.3148 -1209.3531 -1291.2149 -1393.0722
[97] -1526.4973 -1717.4719 -2048.9053
> plot(x = xv, y = yv, type = "l", lwd = 2,
+      xlab = "p", ylab = "log-likelihood")
> imax = which.max(yv)
> abline(v = xv[imax], h = yv[imax], lwd = 1.5, col = "blue")
> abline(h = yv[imax], lwd = 1.5, col = "purple")
图 2.11
> phat  =  af(c(nMM, nMN, nNN))
> phat
[1] 0.5793103
> pMM   =  phat^2
> qhat  =  1 - phat
> pHW = c(MM = phat^2, MN = 2*phat*qhat, NN = qhat^2)
> sum(c(nMM, nMN, nNN)) * pHW
      MM       MN       NN 
194.6483 282.7034 102.6483 

Visual comparison to the Hardy-Weinberg equilibrium

> pops = c(1, 69, 128, 148, 192)
> genotypeFrequencies = as.matrix(Mourant[, c("MM", "MN", "NN")])
> HWTernaryPlot(genotypeFrequencies[pops, ],
+         markerlab = Mourant$Country[pops],
+         alpha = 0.0001, curvecols = c("red", rep("purple", 4)),
+         mcex = 0.75, vertex.cex = 1)

问题2.16
在上面的代码中绘制三元图,然后将其他数据点添加到其中,您注意到了什么?您可以使用HWChisq函数备份您的讨论。

> HWTernaryPlot(genotypeFrequencies[-pops, ], alpha = 0.0001,
+    newframe = FALSE, cex = 0.5)

问题2.17

> newgf = round(genotypeFrequencies / 50)
> HWTernaryPlot(newgf[pops, ],
+         markerlab = Mourant$Country[pops],
+         alpha = 0.0001, curvecols = c("red", rep("purple", 4)),
+         mcex = 0.75, vertex.cex = 1)

> HWTernaryPlot(newgf[-pops, ], alpha = 0.0001,
   newframe = FALSE, cex = 0.5)

2.7.3 Concatenating several multinomials: sequence motifs and logos (motif图绘制)

> library("seqLogo")
载入需要的程辑包:grid
Warning message:
程辑包‘seqLogo’是用R版本3.5.1 来建造的 
> load("kozak.RData")
> kozak
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
A 0.33 0.25  0.4 0.15 0.20    1    0    0 0.05
C 0.12 0.25  0.1 0.40 0.40    0    0    0 0.05
G 0.33 0.25  0.4 0.20 0.25    0    0    1 0.90
T 0.22 0.25  0.1 0.25 0.15    0    1    0 0.00
> pwm = makePWM(kozak)
> seqLogo(pwm, ic.scale = FALSE)
图 2.13

在前面的章节中,我们已经看到了我们在多项分布中遇到的不同的“盒子”是如何很少有相等的概率的。换句话说,参数p1,p2...通常是不同的,这取决于所建模的内容。具有不同频率的多项式的例子包括二十种不同的氨基酸、血型和发色

.......,啊啊啊, 越来越难~ 进度太慢了,呼气吸气。
上一篇下一篇

猜你喜欢

热点阅读