手把手R入门(学习笔记)
0618 15:00
故事:Slot Machine八青哥
没有过维加斯体验的人对这个故事的理解会有点费力。比如说本人。
一个机器玩一次会生成3个不同的图标组合。如果该图标组合符合中奖规则,就会赢钱。和21点和大转盘比起来,八青哥的赔率比较低,所以更受庄主欢迎。
9 Programs
写play ( )
-- 1 随机产生三个图案
-- 2 计算奖金
- 随机产生三个图案的function
get_symbols <- function() {
wheel <- c("DD", "7", "BBB", "BB", "B", "C", "0")
sample(wheel, size = 3, replace = TRUE,
prob = c(0.03, 0.03, 0.06, 0.1, 0.25, 0.01, 0.52))
}
- Manitoba slot machines 回报机制
--赢钱条件
--1 三个相同的图案(除了0)
--2 任何带有B的组合
--3 至少一个以上的C
9.1 策略
尽可能把整个过程拆分,拆分到很简单的可以用R自带函数解决为止。
- 9.1.1 连续指令
play <- function() {
# step 1: generate symbols
symbols <- get_symbols()
# step 2: display the symbols
print(symbols)
# step 3: score the symbols
score(symbols)
}
取得图案👉展示图案👉给图案组合评分
-
9.1.2 平行指令
具体流程
举个例子的话就是score( )
,score()
要同时考虑三种情况,就是
1.三个一样的图案,2.带B的组合图案,3.除去1和2的其他情况。
然后Diamonds的情况最为复杂,可以暂时无视。
-
完整过程
完整过程
if ( # Case 1: all the same <1>) {
prize <- # look up the prize <3>
} else if ( # Case 2: all bars <2> ) {
prize <- # assign $5 <4>
} else {
# count cherries <5>
prize <- # calculate a prize <7>
}
# count diamonds <6>
# double the prize if necessary <8>
完整结构翻译成人话就是
1.test是不是三个图案都一样
2.test是不是所有的图案都含有B
3.考虑每个条件的奖励
4.给含B的5¥
5.计算C的数量
6.计算Diamonds的数量(DD)
7.根据C的数量计算奖励
8.计算Diamonds的奖金
这里出现了一段蜜汁代码
length(unique(symbols)==1)
分解一下这个组合代码
unique( )
回return在vector里出现的的独一无二的项目。
如果symble里包含了同样的东西,unique(symbols)
回变成一个长度为1的vector。
那不对啊,不应该变成
length(unique(symbols))== 1
吗
> symbols<-c("7",'8','7')
> unique(symbols)
[1] "7" "8"
或者可以自己写函数
same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]
if (same) {
prize <- # look up the prize
} else if ( # Case 2: all bars ) {
prize <- # assign $5
} else {
# count cherries
prize <- # calculate a prize
}
# count diamonds
# double the prize if necessary
-
all( )
函数
check在vector里的每一个成分。检查vector各成分是真还是假。
> symbols<-c('B',"B","BBB")
> all(symbols %in% c("B", "BB", "BBB"))
[1] TRUE
- 给各个条件制定奖金额
if (same) {
symbol <- symbols[1]
if (symbol == "DD") {
prize <- 800
} else if (symbol == "7") {
prize <- 80
} else if (symbol == "BBB") {
prize <- 40
} else if (symbol == "BB") {
prize <- 5
} else if (symbol == "B") {
prize <- 10
} else if (symbol == "C") {
prize <- 10
} else if (symbol == "0") {
prize <- 0
}
}
Lookup tables
> payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25,
+ "B" = 10, "C" = 10, "0" = 0)
> payouts
DD 7 BBB BB B C 0
100 80 40 25 10 10 0
> payouts["DD"]
DD
100
> payouts["B"]
B
10
unname(payouts["DD"])
100
在此,payout是一种类型的lookup table,(查找表格?)
0619 22:59
same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]
bars <- symbols %in% c("B", "BB", "BBB")
if (same) {
payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25,
"B" = 10, "C" = 10, "0" = 0)
prize <- unname(payouts[symbols[1]])
} else if (all(bars)) {
prize <- # assign $5
} else {
# count cherries
prize <- # calculate a prize
}
symbols指定了个啥?
- 如何计算有多少个“C”
> symbols<-c("C","DD","C")
> sum(symbols=="C")
[1] 2
> sum(symbols=="DD")
[1] 1
继续写下去就是
same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]
bars <- symbols %in% c("B", "BB", "BBB")
if (same) {
payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25,
"B" = 10, "C" = 10, "0" = 0)
prize <- unname(payouts[symbols[1]])
} else if (all(bars)) {
prize <- 5
} else {
cherries <- sum(symbols == "C")
prize <- # calculate a prize
}
diamonds <- sum(symbols == "DD")
# double the prize if necessary
- 计算C的奖励
可以用if循环,但是效率较低
比方说,
if (cherries == 2) {
prize <- 5
} else if (cherries == 1) {
prize <- 2
} else {}
prize <- 0
}
还不如转一下脑子用
same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]
bars <- symbols %in% c("B", "BB", "BBB")
if (same) {
payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25,
"B" = 10, "C" = 10, "0" = 0)
prize <- unname(payouts[symbols[1]])
} else if (all(bars)) {
prize <- 5
} else {
cherries <- sum(symbols == "C")
prize <- c(0, 2, 5)[cherries + 1]
}
diamonds <- sum(symbols == "DD")
# double the prize if necessary
- 最后一步是当出现diamond的时候把奖励翻倍,翻倍数和diamond数一致
也就是
prize * 2 ^ diamonds
same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]
bars <- symbols %in% c("B", "BB", "BBB")
if (same) {
payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25,
"B" = 10, "C" = 10, "0" = 0)
prize <- unname(payouts[symbols[1]])
} else if (all(bars)) {
prize <- 5
} else {
cherries <- sum(symbols == "C")
prize <- c(0, 2, 5)[cherries + 1]
}
diamonds <- sum(symbols == "DD")
prize * 2 ^ diamonds
- 整理一下score function就是
score <- function (symbols) {
# identify case
same <- symbols[1] == symbols[2] && symbols[2] == symbols[3]
bars <- symbols %in% c("B", "BB", "BBB")
# get prize
if (same) {
payouts <- c("DD" = 100, "7" = 80, "BBB" = 40, "BB" = 25,
"B" = 10, "C" = 10, "0" = 0)
prize <- unname(payouts[symbols[1]])
} else if (all(bars)) {
prize <- 5
} else {
cherries <- sum(symbols == "C")
prize <- c(0, 2, 5)[cherries + 1]
}
# adjust for diamonds
diamonds <- sum(symbols == "DD")
prize * 2 ^ diamonds
}
- 写完score function以后,就可以写完整的play function了。
play <- function() {
symbols <- get_symbols()
print(symbols)
score(symbols)
}
- 然后就可以不停的玩了。。。
应该是日语翻译的问题,其实静下心来看英文原版,一点丢不难嘛。明天继续S3
10. S3
0623 18:38
10.2 Attributes 属性
举个例子,data frame把行名字和列名字保存成了属性。
可以通过attributes( )
来查看属性。
row.names(deck)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13"
## [14] "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
## [27] "27" "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39"
## [40] "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" "50" "51" "52"
or to change an attribute’s value:
row.names(deck) <- 101:152
或者赋予新的属性
levels(deck) <- c("level 1", "level 2", "level 3")
attributes(deck)
## $names
## [1] "face" "suit" "value"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
## [18] 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134
## [35] 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151
## [52] 152
##
## $levels
## [1] "level 1" "level 2" "level 3"
- 练习题
play <- function() {
symbols <- get_symbols()
print(symbols)
score(symbols)
}
修改play( )
使其回归到prize并包含有相关的symbols,删除多余的print(symbols)
- 解决方案
创建新的play函数,给prize添加新的属性symbols
play <- function() {
symbols <- get_symbols()
prize <- score(symbols)
attr(prize, "symbols") <- symbols
prize
}
- 也可以使用structure( )一步到位
play <- function() {
symbols <- get_symbols()
structure(score(symbols), symbols = symbols)
}
three_play <- play()
three_play
## 0
## attr(,"symbols")
## "0" "BB" "B"
- 有何意义
可以直接输出属性
slot_display <- function(prize){
# extract symbols
symbols <- attr(prize, "symbols")
# collapse symbols into single string
symbols <- paste(symbols, collapse = " ")
# combine symbol with prize as a character string
# \n is special escape sequence for a new line (i.e. return or enter)
string <- paste(symbols, prize, sep = "\n$")
# display character string in console without quotes
cat(string)
}
slot_display(one_play)
## B 0 B
## $0
10.3 专属(?)功能
0624 11:45
比如说print()
print()
不是一个普通功能,而是一个专属功能。因为你可以让print()
在不同的情况下做不一样的事情。
如下
num <- 1000000000
print(num)
1000000000
## and a different thing when we gave num a class:
class(num) <- c("POSIXct", "POSIXt")
print(num)
"2001-09-08 19:46:40 CST"
print()
会根据数据不同的属性进行不同格式的输出。
10.4 Methods
> print
function (x, ...)
UseMethod("print")
<bytecode: 0x103594e68>
<environment: namespace:base>
其中UseMethod
会检查数据的归类,并根据数据的归类来决定输出的格式。再举个例子,当你把POSIXct属性的数据给print
的时候,UseMethod
会把print
中所有的变量都转变成print.POSIXct
。R就会运行print.POSIXct
print.POSIXct
## function (x, ...)
## {
## max.print <- getOption("max.print", 9999L)
## if (max.print < length(x)) {
## print(format(x[seq_len(max.print)], usetz = TRUE), ...)
## cat(" [ reached getOption(\"max.print\") -- omitted",
## length(x) - max.print, "entries ]\n")
## }
## else print(format(x, usetz = TRUE), ...)
## invisible(x)
## }
## <bytecode: 0x7fa948f3d008>
## <environment: namespace:base>
举个factor的例子
print.factor
function (x, quote = FALSE, max.levels = NULL, width = getOption("width"),
...)
{
ord <- is.ordered(x)
if (length(x) == 0L)
cat(if (ord)
"ordered"
...
drop <- n > maxl
cat(if (drop)
paste(format(n), ""), T0, paste(if (drop)
c(lev[1L:max(1, maxl - 1)], "...", if (maxl > 1) lev[n])
else lev, collapse = colsep), "\n", sep = "")
}
invisible(x)
}
<bytecode: 0x7fa94a64d470>
<environment: namespace:base>
print.POSIXct
和print.factor
被称为print
的方法(methods)。R根据数据的class来分配methods,print()
本身有差不多200个methods。
methods(print)
## [1] print.acf*
## [2] print.anova
## [3] print.aov*
## ...
## [176] print.xgettext*
## [177] print.xngettext*
## [178] print.xtabs*
##
## Nonvisible functions are asterisked
总结一下,generic functions, methods, and class-based的组合被称为S3系统。因为该系统起源于S语言的第三版。R语言的很多generic function都是S3,比方说
summary
,head
。
所以可以根据S3系统的特性来自己改编输出的格式。只需要给数据指定class就好。
10.4.1 Method Dispatch
每个S3 method的名字都是由两个部分组成的。第一部分是method所属的function,比方说print
,summary
,head
,第二部分是属性class。比如说,print.function
,summary.matrix
。
- 练习1
可以给数据指定任意的class,
> class(one_play) <- "slots"
> print.slots <- function(x, ...) {
+ cat("I'm using the print.slots method")
+ }
> print(one_play)
I'm using the print.slots method
- 练习2
修改play()
,给其添加slots属性,并在输出中显示。
play <- function() {
symbols <- get_symbols()
structure(score(symbols), symbols = symbols)
}
修改后
play <- function() {
symbols <- get_symbols()
structure(score(symbols), symbols = symbols, class = "slots")
}
10.5 Classes
要创建class有三个步骤
- 选择class的名字
- 给每一个class指定attribute
- 为generic methods写新class对应的methods
11. Loop
故事:每台slot机器貌似对1美元有42美分的回报,但是厂家设定每刀回报率是92美分。来计算一下每台机器的回报率到底是多少。
- 计算八青哥概率
wheel <- c("DD", "7", "BBB", "BB", "B", "C", "0")
combos <- expand.grid(wheel, wheel, wheel, stringsAsFactors = FALSE)
prob <- c("DD" = 0.03, "7" = 0.03, "BBB" = 0.06, "BB" = 0.1, "B" = 0.25, "C" = 0.01, "0" = 0.52)
combos$prob1 <- prob[combos$Var1]
combos$prob2 <- prob[combos$Var2]
combos$prob3 <- prob[combos$Var3]
combos$prob <- combos$prob1 * combos$prob2 * combos$prob3
head(combos,3)
sum(combos$prob)
> head(combos,3)
Var1 Var2 Var3 prob1 prob2 prob3 prob
1 DD DD DD 0.03 0.03 0.03 2.7e-05
2 7 DD DD 0.03 0.03 0.03 2.7e-05
3 BBB DD DD 0.06 0.03 0.03 5.4e-05
> sum(combos$prob)
[1] 1
0625 11:30
for Loop
for( i in c("my", "first","for","loop")){
print(i)
}
> for( i in c("my", "first","for","loop")){
+ print(i)
+ }
[1] "my"
[1] "first"
[1] "for"
[1] "loop"
其实for loop中的计算结果必须要进行保存。要不然计算就没有任何意义。可以事先写安排一个空头vector或者list,然后把计算结果放进去。
chars<-vector(length=4)
words<-c("my","first","for","loop")
for(i in 1:4){
chars[i]<-words[i]
}
- 添加并计算prize
for (i in 1:nrow(combos)) {
symbols <- c(combos[i, 1], combos[i, 2], combos[i, 3])
combos$prize[i] <- score(symbols)
}
score <- function(symbols) {
diamonds <- sum(symbols == "DD")
cherries <- sum(symbols == "C")
# identify case
# since diamonds are wild, only nondiamonds
# matter for three of a kind and all bars
slots <- symbols[symbols != "DD"]
same <- length(unique(slots)) == 1
bars <- slots %in% c("B", "BB", "BBB")
# assign prize
if (diamonds == 3) {
prize <- 100
} else if (same) {
payouts <- c("7" = 80, "BBB" = 40, "BB" = 25,
"B" = 10, "C" = 10, "0" = 0)
prize <- unname(payouts[slots[1]])
} else if (all(bars)) {
prize <- 5
} else if (cherries > 0) {
# diamonds count as cherries
# so long as there is one real cherry
prize <- c(0, 2, 5)[cherries + diamonds + 1]
} else {
prize <- 0
}
# double for each diamond
prize * 2^diamonds
}
while loop
实战中,while用的比for少。
while不会回归结果,需要自己手动设定来保存。
- 例子
计算玩到破产的次数
plays_till_broke <- function(start_with) {
cash <- start_with
n <- 0
while (cash > 0) {
cash <- cash - 1 + play()
n <- n + 1
}
n
}
plays_till_broke(100)
260
repeat loop
plays_till_broke <- function(start_with) {
cash <- start_with
n <- 0
repeat {
cash <- cash - 1 + play()
n <- n + 1
if (cash <= 0) {
break
}
}
n
}
plays_till_broke(100)
237
12 Speed 高速化
- 12.1 向量编码
比较以下两段代码
abs_loop <- function(vec){
for (i in 1:length(vec)) {
if (vec[i] < 0) {
vec[i] <- -vec[i]
}
}
vec
}
abs_sets <- function(vec){
negs <- vec < 0
vec[negs] <- vec[negs] * -1
vec
}
第二段是向量编码
- 事先创建向量代码
winnings <- vector(length = 1000000)
for (i in 1:1000000) {
winnings[i] <- play()
}
mean(winnings)
0.9366984
最后的最后 数据科学的三大技能树
- 后勤问题:数据储存,操作技能
- 战术问题 : 挖掘数据内部的信息的技能
- 战略问题: 在较大的层次上总结结论的技能
190625 15:09 困死了
190629 16:26 2周目
需要记住,像背课文一样的记住
Chapter 9
关键语法
function()
,sample()
,unique()
,if loop
,unname()
,all()
,sum()
wheel<-c("DD","7","BBB","BB","B","C","0")
prob=c(0.03, 0.03, 0.06, 0.1, 0.25, 0.01, 0.52))
1.随机抽取三个字符,每个字符出现概率如上
2.如果三个字符都一样,就按照如下给prize
c("DD"=100, "7" = 80, "BBB" = 40, "BB" = 25,
"B" = 10, "C" = 10, "0" = 0)
3.如果三个字符不一样,但都是"B","BB","BBB"中的一个,那就给
prize<-5
4.如果三个字符不一样,但是出现了“C”,那就根据“C”的个数分配prize
1个C=2, 2个C=5
5.如果出现“DD”,那就按照“DD”个数分配
prize*2^"DD"个数
get_symbols<-function(){
wheel<-c("DD","7","BBB","BB","B","C","0")
sample(wheel,size=3,replace=TRUE,
prob=c(0.03, 0.03, 0.06, 0.1, 0.25, 0.01, 0.52))
}
score<-function(symbols){
bars<- symbols %in% c("B","BB","BBB")
if(length(unique(symbols))==1){
payouts<-c("DD"=100, "7" = 80, "BBB" = 40, "BB" = 25,
"B" = 10, "C" = 10, "0" = 0)
prize<-unname(payouts[symbols[1]])
} else if(all(bars)){
prize<-5
} else {
cherries<-sum(symbols=="C")
prize<- c(0,2,5)[cherries+1]
}
diamonds<- sum(symbols=="DD")
prize*2^diamonds
}
play<-function(){
symbols<-get_symbols()
print(symbols)
score(symbols)
}