生物信息学与算法数据-R语言-图表-决策-Linux-PythonR小推车

《Modern Statistics for Modern Bi

2019-03-08  本文已影响49人  热衷组培的二货潜

《Modern Statistics for Modern Biology》Chapter 一: 离散数据模型的预测(1.1 - 1.3)

《Modern Statistics for Modern Biology》Chapter 一: 离散数据模型的预测(1.4 - 1.5)

《Modern Statistics for Modern Biology》Chapter 二: 统计建模(2.1-2.3)

《Modern Statistics for Modern Biology》Chapter 二: 统计建模(2.4-2.5)

《Modern Statistics for Modern Biology》Chapter 二 统计建模(2.6 - 2.7)

《Modern Statistics for Modern Biology》Chapter 二 统计建模(2.8 - 2.9)

从这章开始最开始记录一些markdown等小知识
$\hat{p}=\frac{1}{12}$\hat{p}=\frac{1}{12}
掌握R语言中的apply函数族
卡方检验
Hardy-Weinberg equilibrium( 哈迪-温伯格平衡 )
带你理解beta分布
简单介绍一下R中的几种统计分布及常用模型

  • 统计分布每一种分布有四个函数:d――density(密度函数),p――分布函数,q――分位数函数,r――随机数函数。比如,正态分布的这四个函数为dnorm,pnorm,qnorm,rnorm。下面我们列出各分布后缀,前面加前缀d、p、q或r就构成函数名:norm:正态t:t分布f:F分布chisq:卡方(包括非中心) unif:均匀exp:指数,weibull:威布尔,gamma:伽玛beta:贝塔 lnorm:对数正态,logis:逻辑分布,cauchy:柯西binom:二项分布geom:几何分布hyper:超几何nbinom:负二项pois:泊松 signrank:符号秩,wilcox:秩和tukey:学生化极差
    如何预测一条序列是否含有CpG岛

2.10 示例:基因组中出现核苷酸模式

library("Biostrings")

问题
通过浏览教程小片段,了解Biostring包中提供的一些有用的数据和函数。

函数 功能
length 返回序列长度
names 返回序列的名称
[ 从object中提取序列
head; tail 从object中提取第一条或者最后一条序列
rev 对序列进行反向排序
width, nchar 返回一个object中的所有序列的大小(比如碱基的数目)
==, != 两个object元素范围内的比较
match, %in% 匹配
duplicated, unique 去重复功能
sort, order 排序
relist, split, extractList
Table 2:Counting / tabulating. Table 3:Sequence transformation and editing. Table 4:String matching / alignments.

► 解

> GENETIC_CODE
TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT CTC CTA CTG CCT CCC CCA CCG CAT CAC CAA CAG 
"F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "*" "W" "L" "L" "L" "L" "P" "P" "P" "P" "H" "H" "Q" "Q" 
CGT CGC CGA CGG ATT ATC ATA ATG ACT ACC ACA ACG AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG 
"R" "R" "R" "R" "I" "I" "I" "M" "T" "T" "T" "T" "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" "A" "A" 
GAT GAC GAA GAG GGT GGC GGA GGG 
"D" "D" "E" "E" "G" "G" "G" "G" 
attr(,"alt_init_codons")
[1] "TTG" "CTG"
> IUPAC_CODE_MAP
     A      C      G      T      M      R      W      S      Y      K      V      H      D      B      N 
   "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT"  "ACG"  "ACT"  "AGT"  "CGT" "ACGT" 
> vignette(package = "Biostrings")
> vignette("BiostringsQuickOverview", package = "Biostrings")
> library("BSgenome")
> ag = available.genomes()
> length(ag) ## 可以看到有91个基因组的信息
[1] 91
> ag[1:2]
[1] "BSgenome.Alyrata.JGI.v1"               "BSgenome.Amellifera.BeeBase.assembly4"
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BSgenome.Ecoli.NCBI.20080805", version = "3.8")
library("BSgenome.Ecoli.NCBI.20080805")
> Ecoli
E. coli genome:
# organism: Escherichia coli (E. coli)
# provider: NCBI
# provider version: 2008/08/05
# release date: NA
# release name: NA
# 13 sequences:
#   NC_008253 NC_008563 NC_010468 NC_004431 NC_009801 NC_009800 NC_002655 NC_002695 NC_010498 NC_007946 NC_010473 NC_000913
#   AC_000091                                                                                                              
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator to access a given sequence)
> shineDalgarno = "AGGAGGT"
> ecoli = Ecoli$NC_010473
> window = 50000
> starts = seq(1, length(ecoli) - window, by = window)
> ends   = starts + window - 1
> numMatches = vapply(seq_along(starts), function(i) {
+   countPattern(shineDalgarno, ecoli[starts[i]:ends[i]],
+                max.mismatch = 0)
+   }, numeric(1))
> table(numMatches)
numMatches
 0  1  2  3  4 
48 32  8  3  2 
> numMatches
 [1] 0 1 0 1 4 0 0 0 1 0 0 1 0 0 1 0 0 0 0 3 0 0 4 0 1 0 2 2 1 0 1 0 1 1 1 0 0 0 0 1 3 1 1 1 1 0 0 0 0 1 0 0 1 0 2 1 2 1 1 0 0 0 0
[64] 0 1 0 1 1 0 1 2 0 0 0 0 0 0 0 0 1 1 0 3 1 0 1 2 1 2 1 2 0 1
> str(starts)
 num [1:93] 1 50001 100001 150001 200001 ...
> str(numMatches)
 num [1:93] 0 1 0 1 4 0 0 0 1 0 ...

问题 What distribution might this table fit ?

> sdMatches = matchPattern(shineDalgarno, ecoli, max.mismatch = 0)
> sdMatches 
  Views on a 4686137-letter DNAString subject
subject: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC...TGCTGCATGATATTGAAAAAAATATCACCAAATAAAAAACGCCTTAGTAAGTATTTTTC
views:
       start     end width
 [1]   56593   56599     7 [AGGAGGT]
 [2]  199644  199650     7 [AGGAGGT]
 [3]  202176  202182     7 [AGGAGGT]
 [4]  214433  214439     7 [AGGAGGT]
 [5]  217429  217435     7 [AGGAGGT]
 ...     ...     ...   ... ...
[61] 4438786 4438792     7 [AGGAGGT]
[62] 4498085 4498091     7 [AGGAGGT]
[63] 4536658 4536664     7 [AGGAGGT]
[64] 4546821 4546827     7 [AGGAGGT]
[65] 4611626 4611632     7 [AGGAGGT]
> betweenmotifs = gaps(sdMatches)
> betweenmotifs 
  Views on a 4686137-letter DNAString subject
subject: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC...TGCTGCATGATATTGAAAAAAATATCACCAAATAAAAAACGCCTTAGTAAGTATTTTTC
views:
       start     end  width
 [1]       1   56592  56592 [AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAA...TGAATAATTTCGCTACCTACCACGCTCCAGGTGTCAGAACCCGGCAGAC]
 [2]   56600  199643 143044 [AAAGCTACCGTTATCCAGAATGGTATAGCGGTTTTCGCCACGTTGTTTC...ACATGTTATGGGGCTATAGCTCAGCTGGGAGAGCGCCTGCTTTGCACGC]
 [3]  199651  202175   2525 [CTGCGGTTCGATCCCGCATAGCTCCACCATCTCTGTAGTGGTTAAATAA...TAAAGAGTAACGGAGGAGCACGAAGGTTGGCTAATCCTGGTCGGACATC]
 [4]  202183  214432  12250 [TAGTGCAATGGCATAAGCCAGCTTGACTGCGAGCGTGACGGCGCGAGCA...GATATTGGAAATATCTGATTTGCAAATTATCGTGTTATCGCCAGGCTTT]
 [5]  214440  217428   2989 [TAATAACATGGGCAGGATAAGCTCGGGAGGAATGATGTTTAAGGCAATA...CCGTAGCGAGAATACTCAAAATCATCATAACGAAAAGCCCCTTACTTGT]
 ...     ...     ...    ... ...
[62] 4438793 4498084  59292 [GGATTTAATCACGGTAACATTTTGCTGGCTTAAAGCATCTGCCAGACGC...GAGCAACAGGCGGCAGAGCAAGGTTGGGAGTCATTGCATCGTCAACTTC]
[63] 4498092 4536657  38566 [AGATCCGGTTGCGGCAGCAAGGATTCATCCAAATGATCCACAAAGGCTT...AGGATATAGTAAGCAAATTAAAACGGGGTAAATTTGAACTCCAAATACC]
[64] 4536665 4546820  10156 [GGAATTAAAGAATGCGATGGATGGTATATTTACGAAAAAATAATTGATG...GGGGTAAAACGTTTCCTGTAGCACCGTGAGTTATACTTTGTATAACTTA]
[65] 4546828 4611625  64798 [GCAGATGCGTATTACCATAAAAAGATGGGGGAACAGTGCAGGTATGGTC...ATGGGGCACCGACACACCGAGCGTGGTGGCGCGCGCATCGCCGAGTGCA]
[66] 4611633 4686137  74505 [CGAGATCGCGGCAAAAACTCAGGCTCAGCGGCAGAAATAAAATCATCAG...TATTGAAAAAAATATCACCAAATAAAAAACGCCTTAGTAAGTATTTTTC]
> library("Renext")
载入需要的程辑包:evd
Warning messages:
1: 程辑包‘Renext’是用R版本3.5.2 来建造的 
2: 程辑包‘evd’是用R版本3.5.2 来建造的 
> expplot(width(betweenmotifs), rate = 1/mean(width(betweenmotifs)),
+         labels = "fit")
图 2.23

2.10.1在依赖关系的情况下建模(Modeling in the case of dependencies)

> library("BSgenome.Hsapiens.UCSC.hg19")
> chr8  =  Hsapiens$chr8
> setwd("e://compute language/NGS/Modern Statistics for Modern Biology/data/")
> CpGtab = read.table("model-based-cpg-islands-hg19.txt",header = TRUE)
> nrow(CpGtab)
[1] 65699
> head(CpGtab)
    chr  start    end length CpGcount GCcontent pctGC obsExp
1 chr10  93098  93818    721       32       403 0.559  0.572
2 chr10  94002  94165    164       12        97 0.591  0.841
3 chr10  94527  95302    776       65       538 0.693  0.702
4 chr10 119652 120193    542       53       369 0.681  0.866
5 chr10 122133 122621    489       51       339 0.693  0.880
6 chr10 180265 180720    456       32       256 0.561  0.893
> irCpG = with(dplyr::filter(CpGtab, chr == "chr8"), IRanges(start = start, end = end))
> head(irCpG)
IRanges object with 6 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]     32437     33708      1272
  [2]     40354     40656       303
  [3]     44536     46203      1668
  [4]     99457    100721      1265
  [5]    155954    156761       808
  [6]    179033    179756       724
> grCpG = GRanges(ranges = irCpG, seqnames = "chr8", strand = "+")
> grCpG
GRanges object with 2855 ranges and 0 metadata columns:
         seqnames              ranges strand
            <Rle>           <IRanges>  <Rle>
     [1]     chr8         32437-33708      +
     [2]     chr8         40354-40656      +
     [3]     chr8         44536-46203      +
     [4]     chr8        99457-100721      +
     [5]     chr8       155954-156761      +
     ...      ...                 ...    ...
  [2851]     chr8 146126089-146128165      +
  [2852]     chr8 146175867-146176338      +
  [2853]     chr8 146228006-146228907      +
  [2854]     chr8 146232962-146233086      +
  [2855]     chr8 146276730-146278360      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> genome(grCpG) = "hg19"
> library("Gviz")
载入需要的程辑包:grid
Warning message:
程辑包‘Gviz’是用R版本3.5.1 来建造的 
> ideo = IdeogramTrack(genome = "hg19", chromosome = "chr8")
Warning messages:
1: In readChar(f, 1) : 在non-UTF-8 MBCS语言环境里只能读取字节
2: In readChar(f, 1) : 在non-UTF-8 MBCS语言环境里只能读取字节
> ideo
Ideogram track 'chr8' for chromosome 8 of the hg19 genome 
> plotTracks(
+   list(GenomeAxisTrack(),
+     AnnotationTrack(grCpG, name = "CpG"), ideo),
+     from = 2200000, to = 5800000,
+     shape = "box", fill = "#006400", stacking = "dense")
> CGIview    = Views(unmasked(Hsapiens$chr8), irCpG)
> CGIview 
  Views on a 146364022-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
           start       end width
   [1]     32437     33708  1272 [CGGGTGCCATCTCAGCAGCTCACGGTGTGGAAACTGCGACACTCACA...ATTGGAACTGTGCGTTTGCGGAAGAACACGGCGGGGGGGGCGGCGC]
   [2]     40354     40656   303 [CGATTAGCGGGAGCGCAAAGCGAAGGGGCGGCCTGCGACGTTTTCCT...ACGAGGAGGGAGCCGCGCACGCGCGGTCGGCTCGGCGAGGAGCCGG]
   [3]     44536     46203  1668 [CGGCGCTGCAGGAGAGGAGATGCCCAGGCCAGGCGGCCGGCGCACGT...TGTGATGTTGTGTTCTCGGTGTGAGTTCATGGGTGTGACGGGGCGT]
   [4]     99457    100721  1265 [CGGCTGGCCGGGCAGGGGGGCTGACCCCCCCCACCTCCCTCCTGGAC...GCTCCGCTGCCCGCTGAACTCCATCCTCCCGGCGGTCGGGCGGCGG]
   [5]    155954    156761   808 [CGGGAAAGATTTTATTCACCGTCGATGCGGCCCCGAGTTGTCCCAAA...GGCACGGCAGGGCTCTCTTGCTCGCAGTATACTGGCGGCACGCCGC]
   ...       ...       ...   ... ...
[2851] 146126089 146128165  2077 [CGGCAAGTAGCACACTCGACCAACTGCTGAAACGAAAGCAGCTCGTT...AAAGAGAAGGCTTTATTATTTCAAGGTCGACGGCCAAGGAGACCGG]
[2852] 146175867 146176338   472 [CGACAGGCGGGAACGCCACCAGCCTGTGGGCGGAGTCCTGGCTCGGC...TATCTGGGGCAGCGTAGCCTCTCTAGATGGCGGGAGGCCGGGGCGA]
[2853] 146228006 146228907   902 [CGCCACCCTCGCCCTGCCCGCGCCTCGGCCAACACAAAGCGGGCGCT...GGGAGAGGCCAGGTGCGACCCGGGCAGGTGAGGAGCACGGGCCCGG]
[2854] 146232962 146233086   125 [CGCCTCCAGCTGCGCAAGGGCAGGTGCGGGCAAGGAAGTCCCACTGG...TGGGTGAGGGTAGAGAGGATGCCTCAGTGCGTCCGGACACCACCGA]
[2855] 146276730 146278360  1631 [CGTCTCAGGAAACACCGCTTTCCACTCCTGTGTCCCCGAAAGGAATC...ATCCACGGGTGCGAATCCACGGGCACGTGTGATCTGGGGTCCGCGG]
> NonCGIview = Views(unmasked(Hsapiens$chr8), gaps(irCpG))
> NonCGIview 
  Views on a 146364022-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
           start       end   width
   [1]     33709     40353    6645 [TCTTCTCATCACGTGCTCTCAGGGGCCACGATGTCAACATGCCTCA...AATGCTGGGTGCCCGCAAGAACGCGGAGCAGCAGGCACTCATTCC]
   [2]     40657     44535    3879 [TCTCCAAGTGCCGCCAGCTGCGGGATTTCCTCTGCAAAAGACAAAC...CCAGGCCTGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGG]
   [3]     46204     99456   53253 [GTGCTGTGTGAGAACATGTGTGTAGTGTTCACATGTCCTCTGTGCG...AGTAGGGGCGGCCGGGCAGAGGCGCCCCTCACCTCCCGGACGGGG]
   [4]    100722    155953   55232 [CGGCTGCCCTAAGTGATCTTTTTAAGTAAAGGAGAAACTCACCAGG...AACCGTTCTAACTGGTCTCTGACCTTGATTATTAACGGCTGCAAC]
   [5]    156762    179032   22271 [CTGCTGGCAGCTAGGGACATTGCAGGGCCCTCTTGCTCACATTGTA...CCATGTCTGCATCCCGGGCGGTATGTACGCTCTGAGAGATACATG]
   ...       ...       ...     ... ...
[2850] 146125886 146126088     203 [GGGGGATGACTTTTATATATTTCTGACATGCCTGCATTAGAAGAAG...AGAAAAGACAGCATCGACAGCAAAATGGGGAGAGGATAAGACACC]
[2851] 146128166 146175866   47701 [AGGCATGGCTCAAATGTGGCTCCCACATGTGGGCTTAGGTCAAAGT...CCACCCCTTCACCCCGTCGCTCCGCCCGAGCTCGAATGGTGGCAC]
[2852] 146176339 146228005   51667 [CTCTGTGGTTCAGCCCCAGACCTGGTCCCACTAGCCCTGAGGCCAG...ACAGCCCGTAGACCTGCACGCCCCGCTGCCTCCGCTCGTGCCTCA]
[2853] 146228908 146232961    4054 [AGGGAGGACTTACGAGAATGGAAAGGAAAGCAGCCCGTCTTCAGTG...CTGAAGACAGCGCAGCTATCCGTAAGGATCACATGCGGCTTCCCT]
[2854] 146233087 146276729   43643 [GACCATAGTTCTCCAAGGGCACCTAGAGTGGATCCAGCTGAGTTCA...CCTTCGAGACCTTGCAGACCAGGCAGCTAACACTTCCCACCCCAC]
> seqCGI      = as(CGIview, "DNAStringSet")
> seqCGI 
  A DNAStringSet instance of length 2855
       width seq
   [1]  1272 CGGGTGCCATCTCAGCAGCTCACGGTGTGGAAACTGCGACACTCACACGGGTGCCATC...TCTCTGATTACATTGGAACTGTGCGTTTGCGGAAGAACACGGCGGGGGGGGCGGCGC
   [2]   303 CGATTAGCGGGAGCGCAAAGCGAAGGGGCGGCCTGCGACGTTTTCCTGTAAAGTTGGG...GCTCAGCACGGACGAGGAGGGAGCCGCGCACGCGCGGTCGGCTCGGCGAGGAGCCGG
   [3]  1668 CGGCGCTGCAGGAGAGGAGATGCCCAGGCCAGGCGGCCGGCGCACGTGGGTTCTCTGT...GAGTCCCTGTGTGTGATGTTGTGTTCTCGGTGTGAGTTCATGGGTGTGACGGGGCGT
   [4]  1265 CGGCTGGCCGGGCAGGGGGGCTGACCCCCCCCACCTCCCTCCTGGACGGGGCGGCTGG...AGACTGAGACAGCTCCGCTGCCCGCTGAACTCCATCCTCCCGGCGGTCGGGCGGCGG
   [5]   808 CGGGAAAGATTTTATTCACCGTCGATGCGGCCCCGAGTTGTCCCAAAGCCAGGCAGTG...GCTGGCGACGGGGCACGGCAGGGCTCTCTTGCTCGCAGTATACTGGCGGCACGCCGC
   ...   ... ...
[2851]  2077 CGGCAAGTAGCACACTCGACCAACTGCTGAAACGAAAGCAGCTCGTTTTCGTCTTGAC...AGTTTTTGAGAAAAGAGAAGGCTTTATTATTTCAAGGTCGACGGCCAAGGAGACCGG
[2852]   472 CGACAGGCGGGAACGCCACCAGCCTGTGGGCGGAGTCCTGGCTCGGCCCCTCGCGGCG...GCCGCTCGTTCTATCTGGGGCAGCGTAGCCTCTCTAGATGGCGGGAGGCCGGGGCGA
[2853]   902 CGCCACCCTCGCCCTGCCCGCGCCTCGGCCAACACAAAGCGGGCGCTCAGCACCCCGC...AGCGGAGTGAGGGGAGAGGCCAGGTGCGACCCGGGCAGGTGAGGAGCACGGGCCCGG
[2854]   125 CGCCTCCAGCTGCGCAAGGGCAGGTGCGGGCAAGGAAGTCCCACTGGGGCTTTTGGGG...GCTCGGGCGAGTGGGTGAGGGTAGAGAGGATGCCTCAGTGCGTCCGGACACCACCGA
[2855]  1631 CGTCTCAGGAAACACCGCTTTCCACTCCTGTGTCCCCGAAAGGAATCATCAGACATCT...GGAAGGGTGGAATCCACGGGTGCGAATCCACGGGCACGTGTGATCTGGGGTCCGCGG
> seqNonCGI   = as(NonCGIview, "DNAStringSet")
> dinucCpG    = sapply(seqCGI, dinucleotideFrequency)
> dinucNonCpG = sapply(seqNonCGI, dinucleotideFrequency)
> dinucNonCpG[, 1]
 AA  AC  AG  AT  CA  CC  CG  CT  GA  GC  GG  GT  TA  TC  TG  TT 
389 351 400 436 498 560 112 603 359 336 403 336 330 527 519 485 
> dinucCpG[, 1]
 AA  AC  AG  AT  CA  CC  CG  CT  GA  GC  GG  GT  TA  TC  TG  TT 
 49 127  55   8 132  44 116 112  54 129 107 100   4 104 112  18 
> NonICounts = rowSums(dinucNonCpG)
> IslCounts  = rowSums(dinucCpG)
> NonICounts 
      AA       AC       AG       AT       CA       CC       CG       CT       GA       GC       GG       GT       TA       TC 
14223322  7129981  9795572 11291315 10241505  6931732  1174927  9779692  8372468  5706958  6934755  7112531  9602902  8359398 
      TG       TT 
10221420 14197986 
> IslCounts 
    AA     AC     AG     AT     CA     CC     CG     CT     GA     GC     GG     GT     TA     TC     TG     TT 
 64103  83109 122131  44000 113346 193847 133549 122377 105186 177326 194517  86752  31210 106738 114592  65861 
> TI  = matrix( IslCounts, ncol = 4, byrow = TRUE)
> TI
       [,1]   [,2]   [,3]   [,4]
[1,]  64103  83109 122131  44000
[2,] 113346 193847 133549 122377
[3,] 105186 177326 194517  86752
[4,]  31210 106738 114592  65861
> TnI = matrix(NonICounts, ncol = 4, byrow = TRUE)
> dimnames(TI) = dimnames(TnI) =
+   list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))
> TI
       A      C      G      T
A  64103  83109 122131  44000
C 113346 193847 133549 122377
G 105186 177326 194517  86752
T  31210 106738 114592  65861
> TI
       A      C      G      T
A  64103  83109 122131  44000
C 113346 193847 133549 122377
G 105186 177326 194517  86752
T  31210 106738 114592  65861
> MI = TI /rowSums(TI)
> MI
           A         C         G         T
A 0.20457773 0.2652333 0.3897678 0.1404212
C 0.20128250 0.3442381 0.2371595 0.2173200
G 0.18657245 0.3145299 0.3450223 0.1538754
T 0.09802105 0.3352314 0.3598984 0.2068492
> MN = TnI / rowSums(TnI)
> MN
          A         C          G         T
A 0.3351380 0.1680007 0.23080886 0.2660524
C 0.3641054 0.2464366 0.04177094 0.3476871
G 0.2976696 0.2029017 0.24655406 0.2528746
T 0.2265813 0.1972407 0.24117528 0.3350027

问题

问题

> freqIsl = alphabetFrequency(seqCGI, baseOnly = TRUE, collapse = TRUE)[1:4]
> freqIsl / sum(freqIsl)
        A         C         G         T 
0.1781693 0.3201109 0.3206298 0.1810901 
> freqNon = alphabetFrequency(seqNonCGI, baseOnly = TRUE, collapse = TRUE)[1:4]
> freqNon / sum(freqNon)
        A         C         G         T 
0.3008292 0.1993832 0.1993737 0.3004139

问题

> alpha = log((freqIsl/sum(freqIsl)) / (freqNon/sum(freqNon)))
> alpha
         A          C          G          T 
-0.5238084  0.4734387  0.4751059 -0.5061665 
> beta  = log(MI / MN)
> beta
           A         C         G          T
A -0.4935945 0.4566418 0.5239611 -0.6390469
C -0.5927341 0.3342289 1.7365319 -0.4699321
G -0.4671646 0.4383576 0.3360277 -0.4967508
T -0.8379216 0.5303960 0.4002977 -0.4821485
> x = "ACGTTATACTACG"
> scorefun = function(x) {
+   s = unlist(strsplit(x, ""))
+   score = alpha[s[1]]
+   if (length(s) >= 2)
+     for (j in 2:length(s))
+       score = score + beta[s[j-1], s[j]]
+   score
+ }
> scorefun(x)
         A 
-0.2824623 
> generateRandomScores = function(s, len = 100, B = 1000) {
+   alphFreq = alphabetFrequency(s)
+   isGoodSeq = rowSums(alphFreq[, 5:ncol(alphFreq)]) == 0
+   s = s[isGoodSeq]
+   slen = sapply(s, length)
+   prob = pmax(slen - len, 0)
+   prob = prob / sum(prob)
+   idx  = sample(length(s), B, replace = TRUE, prob = prob)
+   ssmp = s[idx]
+   start = sapply(ssmp, function(x) sample(length(x) - len, 1))
+   scores = sapply(seq_len(B), function(i)
+     scorefun(as.character(ssmp[[i]][start[i]+(1:len)]))
+   )
+   scores / len
+ }
> scoresCGI    = generateRandomScores(seqCGI)
> scoresNonCGI = generateRandomScores(seqNonCGI)

> br = seq(-0.6, 0.7, length.out = 50)
> h1 = hist(scoresCGI,    breaks = br, plot = FALSE)
> h2 = hist(scoresNonCGI, breaks = br, plot = FALSE)
> plot(h1, col = rgb(0, 0, 1, 1/4), xlim = c(-0.5, 0.5), ylim=c(0,120))
> plot(h2, col = rgb(1, 0, 0, 1/4), add = TRUE)
图 2.25

2.11 Summary of this chapter

References

放在最后的话 : 这一章真的看得很难受,虽然给的代码中学习到了很多。

上一篇下一篇

猜你喜欢

热点阅读