[R4DS reading notes] 19 Functions

2019-11-20 茶思饭

III. Program (programming techniques)

19 Functions

When should you write a function?

- Consider writing a function whenever you need to use the same code more than once.
- There are three key steps to writing a function:

Name: pick a name for the function.
Arguments: list the inputs, or arguments, to the function inside function().
Code: place the code you have developed in the body of the function, a { block that immediately follows function(...).

This is an important part of the “do not repeat yourself” (or DRY) principle.

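Writing a function instead of copying and pasting has three big advantages:

You can give the function an evocative name that makes your code easier to understand.
As requirements change, you only need to update the code in one place instead of many.
You eliminate the chance of making incidental mistakes while copying and pasting.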
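As a concrete case of the DRY principle, the rescale01() function that the exercises below refer to can be sketched like this. It follows the version in the R4DS chapter and is shown only as a minimal illustration.

```r
# Rescale a numeric vector so that it lies between 0 and 1.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)        # min and max, ignoring missing values
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5, 10))
#> [1] 0.0 0.5 1.0
```

Once the logic lives in one function, a change such as handling infinite values only has to be made in one place.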

19.2.1 Practice

Why is TRUE not a parameter to rescale01()? What would happen if x contained a single missing value, and na.rm was FALSE?
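TRUE is the value of na.rm, and callers never need to change it, so it is not a parameter. If x contained a single missing value and na.rm were FALSE, range() would return NA for both ends, so every element of the result would be NA.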
In the second variant of rescale01(), infinite values are left unchanged. Rewrite rescale01()so that -Inf is mapped to 0, and Inf is mapped to 1.

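rescale02 <- function(x) {
  # Compute the range from the finite values only
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  y <- (x - rng[1]) / (rng[2] - rng[1])
  # Map the infinite values after rescaling, so they do not distort the range
  y[which(x == -Inf)] <- 0
  y[which(x == Inf)] <- 1
  y
}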

Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?
```
mean(is.na(x))

x / sum(x, na.rm = TRUE)

sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
```
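One possible way to wrap the three snippets above is sketched here. The function names prop_na, prop_total and coef_variation are my own choices for illustration, not names given in the exercise.

```r
# Proportion of missing values in x
prop_na <- function(x) {
  mean(is.na(x))
}

# Each value as a share of the total (missing values ignored in the total)
prop_total <- function(x) {
  x / sum(x, na.rm = TRUE)
}

# Coefficient of variation: standard deviation divided by the mean
coef_variation <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

prop_na(c(1, NA, 3, NA))
#> [1] 0.5
prop_total(c(1, 1, 2))
#> [1] 0.25 0.25 0.50
coef_variation(c(2, 4, 6))
#> [1] 0.5
```

Each function needs only one argument, x. The na.rm = TRUE choice is kept fixed because that is what the original snippets do.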
Follow http://nicercode.github.io/intro/writing-functions.html to write your own functions to compute the variance and skew of a numeric vector.
Write both_na(), a function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.

both_na <- function(x, y) {
  # The two vectors must have the same length
  stopifnot(length(x) == length(y))
  # Count the positions where both x and y are NA
  sum(is.na(x) & is.na(y))
}

What do the following functions do? Why are they useful even though they are so short?

is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0

Read the complete lyrics to “Little Bunny Foo Foo”. There’s a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.

19.3 Functions are for humans and computers (readability)

Tips:The name of a function

Ideally, the name of your function will be short, but clearly evoke what the function does. It’s better to be clear than short.
Generally, function names should be verbs, and arguments should be nouns.
For names made of multiple words, use snake_case or camelCase, and be consistent.
Give a family of related functions a common prefix.
Avoid overriding existing functions and variables.
Use comments, lines starting with #, to explain the “why” of your code.
Use long lines of - and = to make it easy to spot the breaks between chunks of code.

19.3.1 Exercises

Read the source code for each of the following three functions, puzzle out what they do, and then brainstorm better names.

### Check whether each string starts with the given prefix
f1 <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}
### Drop the last element of a vector
f2 <- function(x) {
  if (length(x) <= 1) return(NULL)
  x[-length(x)]
}
### Repeat y until it has the same length as x
f3 <- function(x, y) {
  rep(y, length.out = length(x))
}

f1: prefix_check
f2: vector_del
f3: rep_as_length

Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.
Compare and contrast rnorm() and MASS::mvrnorm(). How could you make them more consistent?
norm_r and norm_mvr
Make a case for why norm_r(), norm_d() etc would be better than rnorm(), dnorm(). Make a case for the opposite.

19.4 Conditional execution

if

if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}

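condition
-- The condition must evaluate to a single TRUE or FALSE.
-- Use || (or) and && (and) to combine multiple logical expressions.
-- Never use | or & in an if statement: they are vectorised. If you do have a logical vector, you can use any() or all() to collapse it to a single value.
-- Be careful with ==: it is vectorised, so it can return more than one logical value.
-- identical() is a strict comparison that always returns a single TRUE or FALSE.
-- dplyr::near() compares numbers with a tolerance, which helps with floating point values.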
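The points about comparisons can be seen in a short base R sketch. To keep it free of package dependencies it uses all.equal() as a stand-in for the tolerance comparison; dplyr::near(x, 2) is the function the notes mention for the same job.

```r
x <- sqrt(2)^2

x == 2                    # exact comparison fails because of floating point error
#> [1] FALSE
identical(x, 2)           # strict comparison, always a single value
#> [1] FALSE
isTRUE(all.equal(x, 2))   # comparison with a tolerance
#> [1] TRUE

v <- c(1, 2, 3)
v == c(1, 2, 4)           # == is vectorised: one result per element
#> [1]  TRUE  TRUE FALSE

# Collapse a logical vector to a single value before using it in if
all(v == c(1, 2, 4))
#> [1] FALSE
any(v == c(1, 2, 4))
#> [1] TRUE
```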
Multiple conditions

if (this) {
  # do that
} else if (that) {
  # do something else
} else {
  # 
}

switch(): it allows you to evaluate selected code based on position or name.

function(x, y, op) {
  switch(op,
    plus = x + y,
    minus = x - y,
    times = x * y,
    divide = x / y,
    stop("Unknown op!")
  )
}

cut(). It’s used to discretise continuous variables.

19.4.3 Code style

Both if and function should (almost) always be followed by squiggly brackets ({}), and the contents should be indented by two spaces.
An opening curly brace ({) should never go on its own line and should always be followed by a new line.
A closing curly brace (}) should always go on its own line, unless it’s followed by else.

19.4.4 Exercises

What’s the difference between if and ifelse()? Carefully read the help and construct three examples that illustrate the key differences.
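1) if tests a single condition (one TRUE or FALSE) and then runs a whole block of code; the block can return any object, or nothing at all. ifelse() is vectorised: it tests every element of a logical vector and returns a vector of the same length as the test.
2) if can be chained with else if to handle several branches. A single ifelse() only chooses between two values for each element; more branches require nesting ifelse() calls.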
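The exercise asks for examples, so here is a small sketch of the differences described above. The helper name sign_label is hypothetical.

```r
x <- c(-2, 0, 3)

# 1. ifelse() is vectorised: one result per element of the test
ifelse(x > 0, "positive", "not positive")
#> [1] "not positive" "not positive" "positive"

# 2. if handles one condition at a time, but can chain several branches
sign_label <- function(x) {
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}
sign_label(3)
#> [1] "positive"

# 3. The shape of the result: ifelse() follows the test, if returns the whole value
ifelse(TRUE, c(1, 2, 3), 0)
#> [1] 1
if (TRUE) c(1, 2, 3) else 0
#> [1] 1 2 3
```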
Write a greeting function that says “good morning”, “good afternoon”, or “good evening”, depending on the time of day. (Hint: use a time argument that defaults to lubridate::now(). That will make it easier to test your function.)

greeting <- function(time = lubridate::now()) {
  # Taking the time as an argument makes the function easy to test
  hr <- lubridate::hour(time)
  if (hr >= 5 && hr < 12) {
    print("Good morning!")
  } else if (hr >= 12 && hr < 18) {
    print("Good afternoon!")
  } else {
    print("Good evening!")
  }
}

Implement a fizzbuzz function. It takes a single number as input. If the number is divisible by three, it returns “fizz”. If it’s divisible by five it returns “buzz”. If it’s divisible by three and five, it returns “fizzbuzz”. Otherwise, it returns the number. Make sure you first write working code before you create the function.

fizzbuzz <- function(x) {
  if (x %% 3 == 0 && x %% 5 == 0) {
    "fizzbuzz"
  } else if (x %% 3 == 0) {
    "fizz"
  } else if (x %% 5 == 0) {
    "buzz"
  } else {
    x
  }
}
### Using switch(): build a key from the two divisibility checks
fizzbuzz2 <- function(x) {
  a <- "a"
  if (x %% 3 == 0) a <- paste0(a, "b")
  if (x %% 5 == 0) a <- paste0(a, "c")
  switch(a,
    a = x,
    ab = "fizz",
    ac = "buzz",
    abc = "fizzbuzz"
  )
}

How could you use cut() to simplify this set of nested if-else statements?

if (temp <= 0) {
  "freezing"
} else if (temp <= 10) {
  "cold"
} else if (temp <= 20) {
  "cool"
} else if (temp <= 30) {
  "warm"
} else {
  "hot"
}

## Using cut(): intervals are closed on the right by default, which matches <=
cut(temp,
  breaks = c(-Inf, 0, 10, 20, 30, Inf),
  labels = c("freezing", "cold", "cool", "warm", "hot")
)

How would you change the call to cut() if I’d used < instead of <=? What is the other chief advantage of cut() for this problem? (Hint: what happens if you have many values in temp?)

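## With right = FALSE each interval includes its left edge and excludes its right edge, which matches <
## The other advantage: cut() is vectorised, so it labels every value in temp at once
cut(temp,
  breaks = c(-Inf, 0, 10, 20, 30, Inf),
  labels = c("freezing", "cold", "cool", "warm", "hot"),
  right = FALSE
)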

What happens if you use switch() with numeric values?
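When the first argument is numeric, switch() selects the alternative by position, so the alternatives do not need names (no = is required). If the number is out of range, switch() invisibly returns NULL.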
What does this switch() call do? What happens if x is “e”?
Experiment, then carefully read the documentation.

switch(x, 
  a = ,
  b = "ab",
  c = ,
  d = "cd"
)

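If x is "a" or "b" the call returns "ab", and if x is "c" or "d" it returns "cd": an empty alternative falls through to the next non-empty one. If x is "e", nothing matches and there is no default, so switch() invisibly returns NULL and nothing is printed.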
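Both switch() answers above can be checked with a short sketch. The wrapper name lookup is hypothetical and only there so the call can be reused.

```r
lookup <- function(x) {
  switch(x,
    a = ,          # empty: falls through to the next non-empty alternative
    b = "ab",
    c = ,
    d = "cd"
  )
}

lookup("a")
#> [1] "ab"
lookup("d")
#> [1] "cd"
is.null(lookup("e"))     # no match and no default: invisible NULL
#> [1] TRUE

# With a number, the alternative is chosen by position
switch(2, "first", "second", "third")
#> [1] "second"
is.null(switch(4, "first", "second", "third"))
#> [1] TRUE
```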

19.5 Function arguments

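Arguments broadly fall into two sets:

data: the information the function computes on.
details: the information that controls the details of the computation.

Data arguments should come first. Detail arguments should go at the end and should usually have default values.

The default value should almost always be the most common value. The few exceptions are for the sake of safety with data. For example, na.rm defaults to FALSE because missing values deserve attention, even though in ordinary use you often need to set na.rm = TRUE. Making na.rm = TRUE the default just for convenience is not a good idea.
When you call a function, the names of the data arguments can usually be omitted. Detail arguments can be left out when you use the default; when you override a default, refer to the argument by name.
You may refer to an argument by a unique prefix of its name (partial matching), but this is best avoided because it can cause confusion.
When calling a function, put spaces around = and a space after each comma to make the code easier to read.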
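A minimal sketch of the data-first, details-last layout is shown below. It follows the mean_ci() example from the R4DS chapter and uses a normal approximation for the interval; treat it as an illustration, not a recommended statistical method.

```r
# x is the data argument; conf is a detail argument with a default value.
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

x <- c(1, 2, 3, 4, 5)
mean_ci(x)                # data argument unnamed, default detail used
mean_ci(x, conf = 0.99)   # detail argument overridden by its full name
```

A higher conf gives a wider interval, so the second call returns bounds further from the mean than the first.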

19.5.1 Choosing names

Good argument names are easy to understand; balance readability against length.
There are a handful of very common, very short argument names. It’s worth memorising these:
- x, y, z: vectors.
- w: a vector of weights.
- df: a data frame.
- i, j: numeric indices (typically rows and columns).
- n: length, or number of rows.
- p: number of columns.
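Otherwise, consider matching the argument names used by existing functions. For example, use na.rm to determine whether missing values should be removed.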

19.5.2 Checking values

It’s good practice to check important preconditions, and throw an error (with stop()) if they are not true.
There’s a tradeoff between how much time you spend making your function robust, versus how long you spend writing it. For less important arguments the extra checks may not be worth the effort.
A useful compromise is the built-in stopifnot(): it checks that each argument is TRUE, and produces a generic error message if not.
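The checks described above might look like this in a weighted-mean helper. The sketch follows the wt_mean() example from the R4DS chapter.

```r
wt_mean <- function(x, w, na.rm = FALSE) {
  # Cheap precondition checks with generic error messages
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))

  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}

wt_mean(c(1, 2, 3), c(1, 1, 2))
#> [1] 2.25
# wt_mean(1:3, 1:2) stops with an error because the lengths differ
```

Note that with stopifnot() you assert what should be true, rather than testing for what is wrong as you would before calling stop().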

19.5.3 Dot-dot-dot (…)

Many functions in R take an arbitrary number of inputs. They rely on a special argument: ...
It’s useful because you can then send those ... on to another function.
It’s a very convenient technique: any arguments you don’t want to handle yourself can be handed on to another function.
But it does come at a price: any misspelled arguments will not raise an error. This makes it easy for typos to go unnoticed.

x <- c(1, 2)
sum(x, na.mr = TRUE)
#> [1] 4
### Can you see how the mistake arises?
## na.rm was misspelled as na.mr, so TRUE was absorbed by ... and summed as 1
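The convenience and the cost of ... can be put side by side in a small sketch. The function names commas_base and strict_sum are hypothetical; commas_base uses base R paste0() in place of the stringr::str_c() call used elsewhere in these notes.

```r
# Forward everything in ... to another function
commas_base <- function(...) {
  paste0(..., collapse = ", ")
}
commas_base(letters[1:3])
#> [1] "a, b, c"

# list(...) captures the arguments so they can be inspected
count_args <- function(...) {
  length(list(...))
}
count_args(1, "a", TRUE)
#> [1] 3

# A function without ... rejects a misspelled argument name
strict_sum <- function(x, na.rm = FALSE) {
  sum(x, na.rm = na.rm)
}
res <- try(strict_sum(c(1, 2), na.mr = TRUE), silent = TRUE)
inherits(res, "try-error")
#> [1] TRUE
```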

19.5.4 Lazy evaluation

Arguments in R are lazily evaluated: they’re not computed until they’re needed.
You can read more about lazy evaluation at http://adv-r.had.co.nz/Functions.html#lazy-evaluation.
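A tiny sketch makes lazy evaluation visible: the second argument below would raise an error if it were ever evaluated, yet the call succeeds because the function never uses it. The name double_first is hypothetical.

```r
double_first <- function(x, y) {
  x * 2      # y is never used, so it is never evaluated
}

double_first(10, stop("this is never evaluated"))
#> [1] 20
```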

19.5.5 Exercises

What does commas(letters, collapse = "-") do? Why?

commas(letters, collapse = "-")
# Error in stringr::str_c(..., collapse = ", ") :
##   formal argument "collapse" matched by multiple actual arguments

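The call fails because commas() already passes collapse = ", " to str_c(). Supplying collapse = "-" through ... means str_c() receives two collapse arguments, which raises the error.
One fix is to drop the hard-coded collapse and let the caller supply it: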

commas <- function(...) stringr::str_c(...)
commas(letters, collapse="-") 
[1] "a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z"

Note: if collapse = ", " is hard-coded in the call to str_c(), adding a collapse argument to commas() has no effect, because its value is never passed on to str_c().

commas <- function(..., collapse = ", ") stringr::str_c(..., collapse = ", ")
> commas(letters)
[1] "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z"
> commas(letters, collapse = "-")
[1] "a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z"

The argument has to be passed on to str_c() explicitly:

commas <- function(..., collapse = ", ") {
  stringr::str_c(..., collapse = collapse)
}
> commas(letters, collapse = "-")
[1] "a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z"

It’d be nice if you could supply multiple characters to the pad argument, e.g. rule("Title", pad = "-+"). Why doesn’t this currently work? How could you fix it?

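# In the original rule(), pad is repeated `width` times, so a two-character
# pad produces a line twice as wide as intended. Dividing the repeat count by
# the number of characters in pad keeps the total width the same.
rule <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  cat(title, " ", stringr::str_dup(pad, width %/% nchar(pad)), "\n", sep = "")
}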
rule("Important output",pad="+-")
Important output +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-

What does the trim argument to mean() do? When might you use it?
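trim is the fraction (0 to 0.5) of observations dropped from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.
A mean computed with trim is called a trimmed mean. In statistics a common choice is to drop the highest 5% and the lowest 5%. Other proportions can be used to suit the problem, but the same fraction is removed from the top and the bottom.
The main purpose is to stop a few extremely high or low values from distorting the mean, so that it better represents the data as a whole.
A typical example is gymnastics scoring at the Olympics: of all the judges’ scores, the single highest and the single lowest are dropped, and the average of the rest is the athlete’s final score.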
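The effect of trim is easy to see on a small vector with one extreme value. This is only an illustrative sketch with made-up numbers.

```r
x <- c(1, 2, 3, 4, 100)

mean(x)               # pulled up by the outlier
#> [1] 22
mean(x, trim = 0.2)   # drops one value (20% of 5) from each end, then averages 2, 3, 4
#> [1] 3
```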
The default value for the method argument to cor() is c("pearson", "kendall", "spearman"). What does that mean? What value is used by default?

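The default is a character vector listing every allowed value; cor() uses only the first one unless you choose another.
Pearson correlation coefficient: the simplest way to describe the relationship between two variables. It measures the strength of their linear association.
Spearman correlation coefficient: also called Spearman’s rank correlation coefficient. “Rank” means the position of each value when the data are sorted, so it is computed from the ranks of the original data.
Kendall correlation coefficient: also called Kendall’s rank correlation coefficient. It is another rank-based measure, built from counting concordant and discordant pairs of observations, and it is suited to ordinal data.
The default value is "pearson".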
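The same "vector of choices as a default" pattern can be used in your own functions with match.arg(), which picks the first choice when the argument is not supplied. The function center below is a hypothetical example, not part of cor().

```r
center <- function(x, type = c("mean", "median")) {
  type <- match.arg(type)   # first element is used when type is not supplied
  switch(type,
    mean = mean(x),
    median = median(x)
  )
}

x <- c(1, 2, 12)
center(x)                    # default: "mean"
#> [1] 5
center(x, type = "median")
#> [1] 2
# center(x, type = "mode") stops with an error: "mode" is not an allowed choice
```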

19.6 Return values

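The value a function returns is usually the whole reason you create it. There are two things to consider:

Does returning early make your function easier to read?
Can you make your function pipeable?

19.6.1 Explicit return statements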

The value returned by a function is usually the last statement it evaluates.
You can choose to return early by using return().
I think it’s best to save the use of return() to signal that you can return early with a simpler solution.
- A common reason to do this is because the inputs are empty.
- Another reason is because you have an if statement with one complex block and one simple block.
If the first block is very long, by the time you get to the else, you’ve forgotten the condition. One way to rewrite it is to use an early return for the simple case:

f <- function() {
  if (!x) {
    return(something_short)
  }

  # Do 
  # something
  # that
  # takes
  # many
  # lines
  # to
  # express
}
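A concrete version of the empty-input case might look like the sketch below. The function range_width is hypothetical; returning 0 for empty input is an assumption made for the example.

```r
range_width <- function(x) {
  # Simple case first: nothing to measure
  if (length(x) == 0) {
    return(0)
  }

  rng <- range(x, na.rm = TRUE)
  rng[2] - rng[1]
}

range_width(numeric(0))
#> [1] 0
range_width(c(3, 10, 4))
#> [1] 7
```

Without the early return, range() would warn on the empty vector and the function would return a non-finite value.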

19.6.2 Writing pipeable functions

Knowing the return value’s object type will mean that your pipeline will “just work”. For example, with dplyr and tidyr the object type is the data frame.
There are two basic types of pipeable functions: transformations and side-effects.
- transformations: an object is passed to the function’s first argument and a modified object is returned.
- side-effects: the passed object is not transformed. Instead, the function performs an action on the object, like drawing a plot or saving a file. Side-effects functions should “invisibly” return the first argument, so that while they’re not printed they can still be used in a pipeline.
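A side-effect function that stays pipeable can be sketched as follows, along the lines of the show_missings() example in the R4DS chapter. The data frame is made up for the illustration.

```r
show_missings <- function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")   # the side effect
  invisible(df)                                # hand the input back, unprinted
}

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# Called on its own, only the message is printed
show_missings(df)

# The data frame is still returned, so later steps can use it
out <- show_missings(df)
nrow(out)
#> [1] 3
```

Because the first argument comes back unchanged, the function can sit in the middle of a pipeline without breaking the chain.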

19.7 Environment

Environments are crucial to how functions work.
The environment of a function controls how R finds the value associated with a name.
R uses rules called lexical scoping to find the value associated with a name.
If a name such as y is used inside a function but is not defined there, R will look in the environment where the function was defined.
R places few limits on your power. You can do many things that you can’t do in other programming languages.
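The lexical scoping rule described above can be seen in a short sketch. The function name add_y is hypothetical; the pattern follows the example in the R4DS chapter.

```r
add_y <- function(x) {
  x + y      # y is not defined here, so R looks where add_y was defined
}

y <- 100
add_y(10)
#> [1] 110

y <- 1000    # the lookup happens at call time, so the new value is used
add_y(10)
#> [1] 1010
```

Relying on a variable outside the function like this is usually best avoided in real code, since the result depends on state the caller cannot see.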