R小推车

《Modern Statistics for Modern Bi

2019-03-05  本文已影响16人  热衷组培的二货潜

《Modern Statistics for Modern Biology》Chapter 一: 离散数据模型的预测(1.1 - 1.3)

《Modern Statistics for Modern Biology》Chapter 一: 离散数据模型的预测(1.4 - 1.5)

《Modern Statistics for Modern Biology》Chapter 二: 统计建模(2.1-2.3)

从这章开始最开始记录一些markdown等小知识
$\hat{p}=\frac{1}{12}$\hat{p}=\frac{1}{12}
掌握R语言中的apply函数族

2.3.1 经典数据的经典统计

问题 2.6

2.4 二项分布与极大似然

2.4.1 举例

> cb <- c(rep(0,110),rep(1,10)) ## 由于这里没找到原始数据,所以就自己生成了一组结果
> table(cb)
cb
  0   1 
110  10 

问题 2.7

> probs  =  seq(0, 0.3, by = 0.005)   ## 从0开始到3结束,以0.005为间隔生成一组数据
> likelihood = dbinom(sum(cb), prob = probs, size = length(cb))
> plot(probs, likelihood, pch = 16, xlab = "probability of success",
+        ylab = "likelihood", cex=0.6)
> probs[which.max(likelihood)]
[1] 0.085

二项分布的似然

> loglikelihood = function(theta, n = 300, k = 40) {
+   115 + k * log(theta) + (n - k) * log(1 - theta)
+ }
> thetas = seq(0, 1, by = 0.001)
> plot(thetas, loglikelihood(thetas), xlab = expression(theta), 
+ ylab = expression(paste("log f(", theta, " | y)")), type = "l") ## expression表达式的使用值得好好学习

2.5 More boxes:multinomial data(理解是多个box的多项分布数据)


2.5.1 DNA计数模型:碱基对

2.5.2 Nucleotide bias(碱基偏差)

setwd("e://compute language/NGS/Modern Statistics for Modern Biology/data/")
> library(Biostrings)
> staph = readDNAStringSet("staphsequence.ffn.txt", "fasta")
> staph[1]
  A DNAStringSet instance of length 1
    width seq                                                                        names               
[1]  1362 ATGTCGGAAAAAGAAATTTGGGAAAAAGTGCTTGAA...TAGAGAATCTTGAAAAAGAAATAAGAAATGTATAA lcl|NC_002952.2_c...
> letterFrequency(staph[[1]], letters = "ACTG", OR = 0)
  A   C   T   G 
522 219 392 229 

问题 2.8 Why did we use double square brackets in the second line?

问题 2.9 按照类似于练习1.8的步骤,测试第一个基因的核苷酸是否在四个核苷酸之间均匀分布。

参数列表:

> letterFrq = vapply(staph, letterFrequency, FUN.VALUE = numeric(4), letters = "ACGT", OR = 0)
> colnames(letterFrq) = paste0("gene", seq(along = staph))
>computerProportions = function(x){x/sum(x)}
> prop10 = apply(tab10, 2, computerProportions) ## 2表示列
> round(prop10, digits = 2)
  gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10
A  0.38  0.36  0.35  0.37  0.35  0.33  0.33  0.34  0.38   0.27
C  0.16  0.16  0.13  0.15  0.15  0.15  0.16  0.16  0.14   0.16
G  0.17  0.17  0.23  0.19  0.22  0.22  0.20  0.21  0.20   0.20
T  0.29  0.31  0.30  0.29  0.27  0.30  0.30  0.29  0.28   0.36
> p0 = rowMeans(prop10)
> p0
        A         C         G         T 
0.3470531 0.1518313 0.2011442 0.2999714 
> cs = colSums(tab10)
> cs
 gene1  gene2  gene3  gene4  gene5  gene6  gene7  gene8  gene9 gene10 
  1362   1134    246   1113   1932   2661    831   1515   1287    696 
> expectedtab10 = outer(p0, cs, FUN = "*")  ## * 表示乘以,这里还可以用“+”、“-”表示
> round(expectedtab10)
  gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10
A   473   394    85   386   671   924   288   526   447    242
C   207   172    37   169   293   404   126   230   195    106
G   274   228    49   224   389   535   167   305   259    140
T   409   340    74   334   580   798   249   454   386    209

小插曲outer函数

> randomtab10 = sapply(cs, function(s) {rmultinom(1, s, p0)}) # n = 1表示看过程。 size = 1表示只看结果
> all(colSums(randomtab10) == cs)
[1] TRUE
> stat = function(obsvd, exptd = 20 * pvec){
+ sum((obsvd - exptd)^2 / exptd)
+ }
> B = 1000
> simulstat = replicate(B, {
+ randomtab10 = sapply(cs, function(s) {rmultinom(1, s, p0)})
+ stat(randomtab10, expectedtab10)
+ })
> S1 = stat(tab10, expectedtab10)
> S1
[1] 70.07533
> sum(simulstat >= S1)
[1] 0
> hist(simulstat, col = "lavender", breaks = seq(0, 75, length.out = 50))
> abline(v = S1, col = "red")
> abline(v = quantile(simulstat, probs = c(0.95, 0.99)),
+ col = c("darkgreen", "blue"), lty = 2)
图2.7

哇! 这一章好吃力,不过接近自己的生物学问题相关的统计还是蛮有亲切感的~

上一篇 下一篇

猜你喜欢

热点阅读