数据科学与R语言R语言与统计分析R

手把手R入门(学习笔记)

2019-06-18  本文已影响1人  Jason数据分析生信教室
Hands-on

0618 15:00

故事:Slot Machine八青哥

https://rstudio-education.github.io/hopr/programs.html

没有过维加斯体验的人对这个故事的理解会有点费力。比如说本人。
一个机器玩一次会生成3个不同的图标组合。如果该图标组合符合中奖规则,就会赢钱。和21点和大转盘比起来,八青哥的赔率比较低,所以更受庄主欢迎。

9 Programs

play ( )

-- 1 随机产生三个图案
-- 2 计算奖金

get_symbols <- function() {
  wheel <- c("DD", "7", "BBB", "BB", "B", "C", "0")
  sample(wheel, size = 3, replace = TRUE, 
    prob = c(0.03, 0.03, 0.06, 0.1, 0.25, 0.01, 0.52))
}
具体如下 具体回报率

9.1 策略

尽可能把整个过程拆分,拆分到很简单的可以用R自带函数解决为止。

play <- function() {

  # step 1: generate symbols
  symbols <- get_symbols()

  # step 2: display the symbols
  print(symbols)

  # step 3: score the symbols
  score(symbols)
}

取得图案👉展示图案👉给图案组合评分

if ( # Case 1: all the same <1>) {
  prize <- # look up the prize <3>
} else if ( # Case 2: all bars <2> ) {
  prize <- # assign $5 <4>
} else {
  # count cherries <5>
  prize <- # calculate a prize <7>
}

# count diamonds <6>
# double the prize if necessary <8>

翻译成人话就是
1.test是不是三个图案都一样
2.test是不是所有的图案都含有B
3.考虑每个条件的奖励
4.给含B的5¥
5.计算C的数量
6.计算Diamonds的数量(DD)
7.根据C的数量计算奖励
8.计算Diamonds的奖金

完整结构

这里出现了一段蜜汁代码

length(unique(symbols)==1)

分解一下这个组合代码
unique( ) 回return在vector里出现的的独一无二的项目。
如果symble里包含了同样的东西,unique(symbols)回变成一个长度为1的vector。

那不对啊,不应该变成length(unique(symbols))== 1

> symbols<-c("7",'8','7')
> unique(symbols)
[1] "7" "8"

或者可以自己写函数

same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]

if (same) {
  prize <- # look up the prize
} else if ( # Case 2: all bars ) {
  prize <- # assign $5
} else {
  # count cherries
  prize <- # calculate a prize
}
# count diamonds
# double the prize if necessary
> symbols<-c('B',"B","BBB")
> all(symbols %in% c("B", "BB", "BBB"))
[1] TRUE
if (same) {
  symbol <- symbols[1]
  if (symbol == "DD") {
    prize <- 800
  } else if (symbol == "7") {
    prize <- 80
  } else if (symbol == "BBB") {
    prize <- 40
  } else if (symbol == "BB") {
    prize <- 5
  } else if (symbol == "B") {
    prize <- 10
  } else if (symbol == "C") {
    prize <- 10
  } else if (symbol == "0") {
    prize <- 0
  }
}

Lookup tables

> payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25, 
+              "B" = 10, "C" = 10, "0" = 0)
> payouts
 DD   7 BBB  BB   B   C   0 
100  80  40  25  10  10   0 
> payouts["DD"]
 DD 
100 
> payouts["B"]
 B 
10 
unname(payouts["DD"])
 100 

在此,payout是一种类型的lookup table,(查找表格?)
0619 22:59

same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]
bars <- symbols %in% c("B", "BB", "BBB")

if (same) {
  payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25, 
    "B" = 10, "C" = 10, "0" = 0)
  prize <- unname(payouts[symbols[1]])
} else if (all(bars)) {
  prize <- # assign $5
} else {
  # count cherries
  prize <- # calculate a prize
}

symbols指定了个啥?

> symbols<-c("C","DD","C")
> sum(symbols=="C")
[1] 2
> sum(symbols=="DD")
[1] 1

继续写下去就是

same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]
bars <- symbols %in% c("B", "BB", "BBB")

if (same) {
  payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25, 
    "B" = 10, "C" = 10, "0" = 0)
  prize <- unname(payouts[symbols[1]])
} else if (all(bars)) {
  prize <- 5
} else {
  cherries <- sum(symbols == "C")
  prize <- # calculate a prize
}
diamonds <- sum(symbols == "DD")
# double the prize if necessary
if (cherries == 2) {
  prize <- 5
} else if (cherries == 1) {
  prize <- 2
} else {}
  prize <- 0
}

还不如转一下脑子用

same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]
bars <- symbols %in% c("B", "BB", "BBB")

if (same) {
  payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25, 
    "B" = 10, "C" = 10, "0" = 0)
  prize <- unname(payouts[symbols[1]])
} else if (all(bars)) {
  prize <- 5
} else {
  cherries <- sum(symbols == "C")
  prize <- c(0, 2, 5)[cherries + 1]
}

diamonds <- sum(symbols == "DD")
# double the prize if necessary
prize * 2 ^ diamonds
same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]
bars <- symbols %in% c("B", "BB", "BBB")

if (same) {
  payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25, 
    "B" = 10, "C" = 10, "0" = 0)
  prize <- unname(payouts[symbols[1]])
} else if (all(bars)) {
  prize <- 5
} else {
  cherries <- sum(symbols == "C")
  prize <- c(0, 2, 5)[cherries + 1]
}

diamonds <- sum(symbols == "DD")
prize * 2 ^ diamonds
score <- function (symbols) {
  # identify case
  same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]
  bars <- symbols %in% c("B", "BB", "BBB")
  
  # get prize
  if (same) {
    payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25, 
      "B" = 10, "C" = 10, "0" = 0)
    prize <- unname(payouts[symbols[1]])
  } else if (all(bars)) {
    prize <- 5
  } else {
    cherries <- sum(symbols == "C")
    prize <- c(0, 2, 5)[cherries + 1]
  }
  
  # adjust for diamonds
  diamonds <- sum(symbols == "DD")
  prize * 2 ^ diamonds
}
play <- function() {
  symbols <- get_symbols()
  print(symbols)
  score(symbols)
}

应该是日语翻译的问题,其实静下心来看英文原版,一点丢不难嘛。明天继续S3

10. S3

0623 18:38

10.2 Attributes 属性

举个例子,data frame把行名字和列名字保存成了属性。

可以通过attributes( )来查看属性。

row.names(deck)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13"
## [14] "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
## [27] "27" "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39"
## [40] "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" "50" "51" "52"
or to change an attribute’s value:

row.names(deck) <- 101:152

或者赋予新的属性

levels(deck) <- c("level 1", "level 2", "level 3")

attributes(deck)
## $names
## [1] "face"  "suit"  "value"
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
## [18] 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134
## [35] 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151
## [52] 152
## 
## $levels
## [1] "level 1" "level 2" "level 3"
play <- function() {
  symbols <- get_symbols()
  print(symbols)
  score(symbols)
}

修改play( )使其回归到prize并包含有相关的symbols,删除多余的print(symbols)

play <- function() {
  symbols <- get_symbols()
  prize <- score(symbols)
  attr(prize, "symbols") <- symbols
  prize
}
play <- function() {
  symbols <- get_symbols()
  structure(score(symbols), symbols = symbols)
}

three_play <- play()
three_play
##  0
##  attr(,"symbols")
##  "0"  "BB" "B"
slot_display <- function(prize){

  # extract symbols
  symbols <- attr(prize, "symbols")

  # collapse symbols into single string
  symbols <- paste(symbols, collapse = " ")

  # combine symbol with prize as a character string
  # \n is special escape sequence for a new line (i.e. return or enter)
  string <- paste(symbols, prize, sep = "\n$")

  # display character string in console without quotes
  cat(string)
}

slot_display(one_play)
## B 0 B
## $0

10.3 专属(?)功能

0624 11:45
比如说print()
print()不是一个普通功能,而是一个专属功能。因为你可以让print()在不同的情况下做不一样的事情。
如下

num <- 1000000000
print(num)
1000000000
## and a different thing when we gave num a class:

class(num) <- c("POSIXct", "POSIXt")
print(num)
"2001-09-08 19:46:40 CST"

print()会根据数据不同的属性进行不同格式的输出。

10.4 Methods

> print
function (x, ...) 
UseMethod("print")
<bytecode: 0x103594e68>
<environment: namespace:base>

其中UseMethod会检查数据的归类,并根据数据的归类来决定输出的格式。再举个例子,当你把POSIXct属性的数据给print的时候,UseMethod会把print中所有的变量都转变成print.POSIXct。R就会运行print.POSIXct

print.POSIXct
## function (x, ...) 
## {
##     max.print <- getOption("max.print", 9999L)
##     if (max.print < length(x)) {
##         print(format(x[seq_len(max.print)], usetz = TRUE), ...)
##         cat(" [ reached getOption(\"max.print\") -- omitted", 
##             length(x) - max.print, "entries ]\n")
##     }
##     else print(format(x, usetz = TRUE), ...)
##     invisible(x)
## }
## <bytecode: 0x7fa948f3d008>
## <environment: namespace:base>

举个factor的例子

print.factor
 function (x, quote = FALSE, max.levels = NULL, width = getOption("width"), 
     ...) 
 {
     ord <- is.ordered(x)
     if (length(x) == 0L) 
         cat(if (ord) 
             "ordered"
 ...
        drop <- n > maxl
         cat(if (drop) 
             paste(format(n), ""), T0, paste(if (drop) 
             c(lev[1L:max(1, maxl - 1)], "...", if (maxl > 1) lev[n])
         else lev, collapse = colsep), "\n", sep = "")
     }
     invisible(x)
 }
 <bytecode: 0x7fa94a64d470>
 <environment: namespace:base>

print.POSIXctprint.factor被称为print的方法(methods)。R根据数据的class来分配methods,print()本身有差不多200个methods。

methods(print)
##   [1] print.acf*                                   
##   [2] print.anova                                  
##   [3] print.aov*                                   
##  ...                      
## [176] print.xgettext*                              
## [177] print.xngettext*                             
## [178] print.xtabs*
##
##   Nonvisible functions are asterisked

总结一下,generic functions, methods, and class-based的组合被称为S3系统。因为该系统起源于S语言的第三版。R语言的很多generic function都是S3,比方说summary,head

所以可以根据S3系统的特性来自己改编输出的格式。只需要给数据指定class就好。

10.4.1 Method Dispatch

每个S3 method的名字都是由两个部分组成的。第一部分是method所属的function,比方说printsummaryhead,第二部分是属性class。比如说,print.functionsummary.matrix

> class(one_play) <- "slots"
> print.slots <- function(x, ...) {
+   cat("I'm using the print.slots method")
+ }
> print(one_play)
I'm using the print.slots method
play <- function() {
  symbols <- get_symbols()
  structure(score(symbols), symbols = symbols)
}

修改后

play <- function() {
  symbols <- get_symbols()
  structure(score(symbols), symbols = symbols, class = "slots")
}

10.5 Classes

要创建class有三个步骤

  1. 选择class的名字
  2. 给每一个class指定attribute
  3. 为generic methods写新class对应的methods

11. Loop

故事:每台slot机器貌似对1美元有42美分的回报,但是厂家设定每刀回报率是92美分。来计算一下每台机器的回报率到底是多少。

wheel <- c("DD", "7", "BBB", "BB", "B", "C", "0")
combos <- expand.grid(wheel, wheel, wheel, stringsAsFactors = FALSE)
prob <- c("DD" = 0.03, "7" = 0.03, "BBB" = 0.06, "BB" = 0.1, "B" = 0.25, "C" = 0.01, "0" = 0.52)
combos$prob1 <- prob[combos$Var1]
combos$prob2 <- prob[combos$Var2]
combos$prob3 <- prob[combos$Var3]
combos$prob <- combos$prob1 * combos$prob2 * combos$prob3
head(combos,3)
sum(combos$prob)
> head(combos,3)
  Var1 Var2 Var3 prob1 prob2 prob3    prob
1   DD   DD   DD  0.03  0.03  0.03 2.7e-05
2    7   DD   DD  0.03  0.03  0.03 2.7e-05
3  BBB   DD   DD  0.06  0.03  0.03 5.4e-05
> sum(combos$prob)
[1] 1

0625 11:30

for Loop

for( i in c("my", "first","for","loop")){
  print(i)
}
> for( i in c("my", "first","for","loop")){
+   print(i)
+ }
[1] "my"
[1] "first"
[1] "for"
[1] "loop"

其实for loop中的计算结果必须要进行保存。要不然计算就没有任何意义。可以事先写安排一个空头vector或者list,然后把计算结果放进去。

chars<-vector(length=4)
words<-c("my","first","for","loop")
for(i in 1:4){
  chars[i]<-words[i]
}
for (i in 1:nrow(combos)) {
  symbols <- c(combos[i, 1], combos[i, 2], combos[i, 3])
  combos$prize[i] <- score(symbols)
}
score <- function(symbols) {
  
  diamonds <- sum(symbols == "DD")
  cherries <- sum(symbols == "C")
  
  # identify case
  # since diamonds are wild, only nondiamonds 
  # matter for three of a kind and all bars
  slots <- symbols[symbols != "DD"]
  same <- length(unique(slots)) == 1
  bars <- slots %in% c("B", "BB", "BBB")

  # assign prize
  if (diamonds == 3) {
    prize <- 100
  } else if (same) {
    payouts <- c("7" = 80, "BBB" = 40, "BB" = 25,
      "B" = 10, "C" = 10, "0" = 0)
    prize <- unname(payouts[slots[1]])
  } else if (all(bars)) {
    prize <- 5
  } else if (cherries > 0) {
    # diamonds count as cherries
    # so long as there is one real cherry
    prize <- c(0, 2, 5)[cherries + diamonds + 1]
  } else {
    prize <- 0
  }
  
  # double for each diamond
  prize * 2^diamonds
}

while loop

实战中,while用的比for少。
while不会回归结果,需要自己手动设定来保存。

plays_till_broke <- function(start_with) {
  cash <- start_with
  n <- 0
  while (cash > 0) {
    cash <- cash - 1 + play()
    n <- n + 1
  }
  n
}

plays_till_broke(100)
 260

repeat loop

plays_till_broke <- function(start_with) {
  cash <- start_with
  n <- 0
  repeat {
    cash <- cash - 1 + play()
    n <- n + 1
    if (cash <= 0) {
      break
    }
  }
  n
}

plays_till_broke(100)
 237

12 Speed 高速化

abs_loop <- function(vec){
  for (i in 1:length(vec)) {
    if (vec[i] < 0) {
      vec[i] <- -vec[i]
    }
  }
  vec
}
abs_sets <- function(vec){
  negs <- vec < 0
  vec[negs] <- vec[negs] * -1
  vec
}

第二段是向量编码

winnings <- vector(length = 1000000)
for (i in 1:1000000) {
  winnings[i] <- play()
}

mean(winnings)
 0.9366984

最后的最后 数据科学的三大技能树

  1. 后勤问题:数据储存,操作技能
  2. 战术问题 : 挖掘数据内部的信息的技能
  3. 战略问题: 在较大的层次上总结结论的技能
数据科学的三大技能树

190625 15:09 困死了

190629 16:26 2周目

需要记住,像背课文一样的记住

Chapter 9

关键语法 function(),sample(),unique(),if loop,unname(),all(),sum()

wheel<-c("DD","7","BBB","BB","B","C","0")
prob=c(0.03, 0.03, 0.06, 0.1, 0.25, 0.01, 0.52))

1.随机抽取三个字符,每个字符出现概率如上
2.如果三个字符都一样,就按照如下给prize

c("DD"=100, "7" = 80, "BBB" = 40, "BB" = 25,
"B" = 10, "C" = 10, "0" = 0)

3.如果三个字符不一样,但都是"B","BB","BBB"中的一个,那就给
prize<-5
4.如果三个字符不一样,但是出现了“C”,那就根据“C”的个数分配prize
1个C=2, 2个C=5
5.如果出现“DD”,那就按照“DD”个数分配
prize*2^"DD"个数

get_symbols<-function(){
  wheel<-c("DD","7","BBB","BB","B","C","0")
  sample(wheel,size=3,replace=TRUE,
         prob=c(0.03, 0.03, 0.06, 0.1, 0.25, 0.01, 0.52))
}

score<-function(symbols){
  bars<- symbols %in% c("B","BB","BBB")
  
  if(length(unique(symbols))==1){
   payouts<-c("DD"=100, "7" = 80, "BBB" = 40, "BB" = 25, 
   "B" = 10, "C" = 10, "0" = 0)
   prize<-unname(payouts[symbols[1]])
  } else if(all(bars)){
    prize<-5
  } else {
    cherries<-sum(symbols=="C")
    prize<- c(0,2,5)[cherries+1]
  }
   diamonds<- sum(symbols=="DD")
   prize*2^diamonds
}
   
play<-function(){
  symbols<-get_symbols()
  print(symbols)
  score(symbols)
}
上一篇 下一篇

猜你喜欢

热点阅读