Chapter 10: stringr exercise solutions

2020-05-15, by 热衷组培的二货潜

A PDF version (with bookmarks) is available as an attachment at:
https://www.yuque.com/erhuoqian/mudww7/mdsxrq

10.2.5

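1 In code that doesn't use stringr, you'll often see paste() and paste0(). What is the difference between the two functions? Which stringr functions correspond to them? How do these functions differ in their handling of NA?

paste() and paste0() differ only in their default separator: paste() joins its arguments with a space (sep = " "), while paste0() uses no separator (sep = ""). Both correspond to str_c(), whose default is sep = "" like paste0(). Passing sep to str_c() reproduces paste(), and collapse joins the elements of a single vector into one string. As for NA, paste() and paste0() treat NA as the string "NA", whereas in str_c() NA is contagious: any NA input makes the result NA. Wrap the input in str_replace_na() to get the paste() behaviour.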

> paste("a", "b", sep = "_")
[1] "a_b"
> str_c("a", "b", sep = "_")
[1] "a_b"
> paste0("a", "_", "b")
[1] "a_b"
> str_c(c("a", "b"), collapse = "_")
[1] "a_b"
> # NA handling: paste() and paste0() turn NA into the string "NA", while str_c() propagates NA.
> paste("a", NA, sep = "_")
[1] "a_NA"
> str_c("a", NA, sep = "_")
[1] NA
> str_c("a", str_replace_na(NA), sep = "_")
[1] "a_NA"
> paste0("a", "_", NA)
[1] "a_NA"
> str_c(c("a", NA), collapse = "_")
[1] NA
> str_c(str_replace_na(c("a", NA)), collapse = "_")
[1] "a_NA"

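2 In your own words, describe the difference between the sep and collapse arguments to str_c().

sep is the string inserted between the arguments when they are combined element by element; the result is still a vector.
collapse is the string used to join the elements of that resulting vector into a single string.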
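
To make the distinction concrete, here is a small sketch. The input vectors first and second are made up for illustration; the point is only that sep works across arguments while collapse works along the result.

```r
library(stringr)

first <- c("a", "b")
second <- c("x", "y")

# sep joins the arguments element by element; the result still has length 2
by_sep <- str_c(first, second, sep = "-")
print(by_sep)

# collapse then joins the elements of that result into one string
by_both <- str_c(first, second, sep = "-", collapse = "+")
print(by_both)
```

The first call gives "a-x" "b-y" and the second gives the single string "a-x+b-y".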

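3 Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

Some related rounding functions:
• floor: round down, i.e. the largest integer not greater than the number
• ceiling: round up, i.e. the smallest integer not less than the number
• trunc: keep the integer part (truncate toward zero)
• round: round to the given number of decimal places
• signif: round to the given number of significant digits, common with scientific notation

> test <- "abcde"
> median_p <- median(seq_len(str_length(test)))
> str_sub(test, median_p, median_p)
[1] "c"
> # Even length: there are two middle characters ("c" and "d" below).
> # ceiling() picks the right-hand one; floor() would pick the left-hand one.
> test1 <- "abcdes"
> median_p1 <- ceiling(median(seq_len(str_length(test1))))
> str_sub(test1, median_p1, median_p1)
[1] "d"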
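
The odd and even cases can be folded into one helper. This is only a sketch: middle_chars() is a name made up here, and returning both middle characters for an even-length string is one possible answer to the question, not the only one.

```r
library(stringr)

# Hypothetical helper: returns the middle character of an odd-length string,
# or the two middle characters of an even-length string.
middle_chars <- function(x) {
  n <- str_length(x)
  # For odd n both positions coincide; for even n they are neighbours.
  str_sub(x, ceiling(n / 2), floor(n / 2) + 1)
}

result <- middle_chars(c("abcde", "abcdes", "a"))
print(result)
```

For these inputs the helper returns "c", "cd" and "a".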

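4 What does str_wrap() do? When might you want to use it?

str_wrap() reflows text into lines of roughly width characters. It breaks only at the whitespace between words and replaces that whitespace with a newline \n, so words are never split. With width = 10, for example, each line holds as many whole words as fit in about ten characters (spaces included).
• width: target width of each line, a positive integer
• indent: non-negative integer, indentation of the first line of each paragraph
• exdent: non-negative integer, indentation of the following lines of each paragraph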

> ?str_wrap
> test <- "R would not be what it is today without the invaluable help of these\npeople outside of the R core team"
> cat(str_wrap(test, width = 30), "\n")
R would not be what it is
today without the invaluable
help of these people outside
of the R core team 
> cat(str_wrap(test, width = 30, indent = 2), "\n")
  R would not be what it is
today without the invaluable
help of these people outside
of the R core team 
> cat(str_wrap(test, width = 30, exdent = 2), "\n")
R would not be what it is
  today without the invaluable
  help of these people outside
  of the R core team

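Use case: when plotting GO or KEGG enrichment results, the term labels are often too long and need to be wrapped, which is where str_wrap() helps. For details, see the post "ggplot2画图，文本太长了怎么办？" (what to do when text is too long in a ggplot2 plot) on Y 叔's WeChat public account: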

p + 
  scale_x_discrete(labels=function(x) str_wrap(x, width=10))

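5 What does str_trim() do? What's the opposite of str_trim()?

Functions for removing whitespace:
• str_trim() removes whitespace from the start and end of a string
• str_squish() does the same and also reduces repeated whitespace inside the string to a single space
str_trim() has a side argument that chooses which side to trim ("left", "right" or "both"); str_squish() has no such argument.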

?str_trim()

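The opposite operation is str_pad(), which adds whitespace (or another padding character):
• width: minimum width of the padded string; a string already longer than width is returned unchanged
• side: which side to pad ("left", "right" or "both")
• pad: the padding character, a single space by default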

> str_pad("a", 10, pad = c("-", "_", " "))
[1] "---------a" "_________a" "         a"
> str_length(str_pad("hadley", 30, "left"))
[1] 30
A string longer than the requested width is left unchanged:
> str_pad("hadley", 3)
[1] "hadley"
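
As a quick side-by-side of the trimming and padding functions, here is a sketch using an invented input string x; the variable names are illustrative only.

```r
library(stringr)

x <- "  too   many  spaces  "

trimmed  <- str_trim(x)                 # both ends removed, inner runs kept
left     <- str_trim(x, side = "left")  # only leading whitespace removed
squished <- str_squish(x)               # ends removed, inner runs collapsed
padded   <- str_pad("abc", width = 7, side = "both", pad = "*")

print(c(trimmed, left, squished, padded))
```

Here trimmed is "too   many  spaces", squished is "too many spaces", and padded is "**abc**".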

6 Write a function that turns (e.g.) a vector c("a", "b", "c") into the string "a, b and c". Think carefully about what it should do if given a vector of length 0, 1, or 2.

Reference answer:
str_commasep <- function(x) {
  n <- length(x)
  if (n == 0) {
    NA
  }else if(n == 1){
    x
  }else if(n == 2){
    str_c(x[[1]], x[[2]], sep = " and ")
  }else{
    not_last <- str_c(x[seq_len(n - 1)], collapse = ", ")
    last <- str_c("and", x[[n]], sep = " ")
    str_c(c(not_last, last), collapse = " ")
  }
}
> str_commasep(c("a","b", "c"))
[1] "a, b and c"
> str_commasep(c("a","b"))
[1] "a and b"
> str_commasep(c("a"))
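[1] "a"
> str_commasep("")  # "" is a length-1 vector, so this takes the n == 1 branch
[1] ""
> str_commasep(character(0))  # a genuinely empty input
[1] NA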

10.3.2

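1 Explain why each of these strings doesn't match a backslash \: "\", "\\", "\\\".

"\": the backslash escapes the next character in the R string (here the closing quote), so this is not even a complete string.
"\\": resolves to a single \ in the regular expression, which then escapes the next character in the regular expression.
"\\\": the first two backslashes resolve to a literal backslash in the regular expression, and the third escapes the next character in the string. Matching one literal backslash takes four: "\\\\".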

2 How would you match the sequence "'\ ?

> writeLines("\"\'\\\\")
"'\\
> str_view("\"'\\", "\"\'\\\\")

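3 What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

\..\..\.. matches a literal dot followed by any character, three times over, for example .x.y.z. As an R string it is written "\\..\\..\\..".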
> str_view(c(".a.b.c", ".a.b", "....."), c("\\..\\..\\.."), match = TRUE)

10.3.4

1 How would you match the literal string `"$^$"`?

> str_view(c("$^$", "ab$^$c"), "^\\$\\^\\$$")

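2 Given the corpus of common words in stringr::words, create regular expressions that find all words that meet the following conditions.

Since this list is long, you might want to use the match argument to str_view() to show only the matching words (match = TRUE) or only the non-matching words (match = FALSE).

a. Start with y.

Words that start with y: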
> test <- stringr::words
> str_view(test, "^y", match = TRUE)

b. End with x.

Words that end with x:
> str_view(test, "x$", match = TRUE)

c. Are exactly three letters long. (Don't cheat by using str_length()!)

Words that are exactly three characters long (there are many matches):
> str_view(test, "^...$", match = TRUE)

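d. Have seven letters or more.

Words with seven or more characters. The pattern is unanchored, so it matches any word that contains at least seven characters:
> str_view(test, ".......", match = TRUE)
Or:
> reg1 <- str_c(rep(".", 7), collapse = "")
> str_view(test, reg1, match = TRUE)
Or, using a quantifier ({7} means exactly seven repetitions):
> str_view(test, ".{7}", match = TRUE)
{7,} means seven or more repetitions:
> str_view(test, ".{7,}", match = TRUE)
{7,9} means seven to nine repetitions.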

10.3.6

1 Create regular expressions to find all words that:

a. Start with a vowel.

> str_subset(stringr::words, "^[aeiou]")

b. Only contain consonants. (Hint: think about matching "not"-vowels.)

> str_subset(stringr::words, "^[^aeiou]+$")

c. End with ed, but not with eed.

> str_subset(stringr::words, "[^e]ed$")

d. End with ing or ize.

> str_subset(stringr::words, "i(ng|ze)$")

2 Empirically verify the rule "i before e except after c".

> length(str_subset(stringr::words, "ei"))
[1] 4
> length(str_subset(stringr::words, "ie"))
[1] 15
> length(str_subset(stringr::words, "(cei|[^c]ie)"))
[1] 14
> length(str_subset(stringr::words, "(cie|[^c]ei)"))
[1] 3
In this word list, 14 words follow the rule and 3 break it, so the rule holds most of the time but not always.

3 Is q always followed by a u?

> length(str_subset(stringr::words, "q[^u]"))
[1] 0
In this word list, yes: no word contains a q followed by a character other than u.

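4 Write a regular expression that matches a word if it's probably written in British English, not American English.

Reference: American and British English spelling differences

I copied this answer, since I am not very clear on the differences between British and American English.
British English tends to use:
"ou" instead of "o"
"ae" or "oe" instead of "a" or "o"
an "ise" ending instead of "ize"
a "yse" ending
Based on the differences listed above, the following pattern matches British spellings and excludes the American ones:
ou|ise$|ae|oe|yse$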

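5 Create a regular expression that will match telephone numbers as commonly written in your country.

Reference: international telephone number formats

Mobile numbers in China have eleven digits. I did not look into which leading digits are valid. Note that the pattern below is unanchored, so it also matches eleven digits inside a longer string.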
> str_view(c("18271883605", "1293123"), "\\d{11}")

10.3.8

1 Describe the equivalents of ?, + and * in {m,n} form.

? is equivalent to {0,1}
+ is equivalent to {1,}
* is equivalent to {0,}

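2 Describe in words what these regular expressions match (read carefully to see whether each one is a regular expression or a string that defines a regular expression).

a. ^.*$

b. "\\{.+\\}"

c. \d{4}-\d{2}-\d{2}

d. "\\\\{4}"

^.*$
Matches any string that contains no newline, including the empty string.
"\\{.+\\}"
Matches a pair of curly braces with at least one character between them, such as {a}.
\d{4}-\d{2}-\d{2}
Matches four digits, a hyphen, two digits, a hyphen and two digits, as in 1234-12-12.
"\\\\{4}"
Matches four backslashes in a row.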
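
These descriptions can be checked with str_detect(). The test strings below are invented for illustration, and the variable names are arbitrary; this is a sanity check, not part of the book's answer.

```r
library(stringr)

# "\\{.+\\}" is the regex \{.+\} : braces around at least one character
braces <- str_detect(c("{a}", "{}", "a"), "\\{.+\\}")
print(braces)

# \d{4}-\d{2}-\d{2} written as an R string
dates <- str_detect(c("2020-05-15", "20-05-15"), "\\d{4}-\\d{2}-\\d{2}")
print(dates)

# "\\\\{4}" is the regex \\{4} : four literal backslashes in a row
four  <- "a\\\\\\\\b"  # a, four backslashes, b
three <- "a\\\\\\b"    # a, three backslashes, b
slashes <- str_detect(c(four, three), "\\\\{4}")
print(slashes)
```

Only the first element of each test vector matches.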

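3 Create regular expressions to find all words that:

a. Start with three consonants.

b. Have three or more vowels in a row.

c. Have two or more vowel-consonant pairs in a row.

Words that start with three consonants:
> str_view(words, "^[^aeiou]{3}", match = TRUE)
Words that have three or more vowels in a row:
> str_view(words, "[aeiou]{3,}", match = TRUE)
Words that have two or more vowel-consonant pairs in a row (the vowel comes first in each pair):
> str_view(words, "([aeiou][^aeiou]){2,}", match = TRUE)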

4 Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

10.3.10

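1 Describe, in words, what these expressions will match:

a. (.)\1\1

b. "(.)(.)\\2\\1"

c. (..)\1

d. "(.).\\1.\\1"

e. "(.)(.)(.).*\\3\\2\\1"

Does this exercise remind you of biological sequences?

a. (.)\1\1
Matches the same character three times in a row, like aaa
> str_subset(c("aaa", "Abc"), "(.)\\1\\1")
[1] "aaa"
b. "(.)(.)\\2\\1"
Matches a pair of characters followed by the same pair in reverse order, like abba
> str_view(words, "(.)(.)\\2\\1", match = TRUE)

c. (..)\1
Matches any two characters repeated, like abab
> str_view(words, "(..)\\1", match = TRUE)

d. "(.).\\1.\\1"
Matches a character that appears three times with one arbitrary character between each occurrence, like a*a*a, where * is any single character
> str_view(words, "(.).\\1.\\1", match = TRUE)

e. "(.)(.)(.).*\\3\\2\\1"
Matches three characters, then any number of characters, then the same three characters in reverse order, like abc****cba, where the middle part can have any length
> str_view(words, "(.)(.)(.).*\\3\\2\\1", match = TRUE)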
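
Because few entries in words exercise every pattern, it can help to try the five expressions on a small hand-made vector. The vector x below is invented for this purpose; each pattern is expected to pick out exactly one of its elements.

```r
library(stringr)

# One made-up string per pattern, plus a string that matches none of them
x <- c("aaa", "abba", "abab", "abaca", "abcxyzcba", "plain")

triple   <- str_subset(x, "(.)\\1\\1")             # same character three times
mirror   <- str_subset(x, "(.)(.)\\2\\1")          # pair, then pair reversed
doubled  <- str_subset(x, "(..)\\1")               # two characters repeated
spaced   <- str_subset(x, "(.).\\1.\\1")           # a?a?a
reversed <- str_subset(x, "(.)(.)(.).*\\3\\2\\1")  # abc...cba

print(list(triple, mirror, doubled, spaced, reversed))
```

In order, the five results are "aaa", "abba", "abab", "abaca" and "abcxyzcba".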

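2 Construct regular expressions to match words that:

a. Start and end with the same character.

b. Contain a repeated pair of letters (e.g. church contains ch repeated twice).

c. Contain one letter repeated in at least three places (e.g. eleven contains three e's).

a. Words that start and end with the same character
The \\1?$ branch also covers single-character words
"^(.)((.*\\1$)|\\1?$)"
> str_view(words, "^(.)((.*\\1$)|\\1?$)", match = TRUE)

b. Words that contain a repeated pair of letters (e.g. church contains ch twice).
".*(.)(.).*\\1\\2"
This is not quite right, because . matches any character, digits included
> str_view(words, ".*(.)(.).*\\1\\2", match = TRUE)
Or, more strictly, restricted to letters:
> str_view(words, "([a-zA-Z][a-zA-Z]).*\\1", match = TRUE)

c. Words that contain one letter repeated in at least three places (e.g. eleven contains three e's).
> str_view(words, "([a-zA-Z]).*\\1.*\\1", match = TRUE)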

10.4.2

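1 For each of the following challenges, try solving it by using both a single regular expression and a combination of multiple str_detect() calls.

a. Find all words that start or end with x.

b. Find all words that start with a vowel and end with a consonant.

c. Are there any words that contain at least one of each different vowel?

d. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)

Find all words that start or end with x, using a single regular expression:
> words[str_detect(words, "(^x|x$)")]
[1] "box" "sex" "six" "tax"
> str_subset(words, "(^x|x$)")
[1] "box" "sex" "six" "tax"
> str_view(words, "(^x|x$)", match = TRUE)

Find all words that start with a vowel and end with a consonant:
> str_subset(words, "^[aeiou].*[^aeiou]$")
> words[str_detect(words, "^[aeiou].*[^aeiou]$")]
What word has the highest number of vowels? What word has the highest proportion of vowels?
Words with the highest number of vowels:
> max_number <- max(str_count(words, "[aeiou]"))
> temp1 <- words[which((str_count(words, "[aeiou]") == max_number))]
> temp1
[1] "appropriate" "associate"   "available"   "colleague"   "encourage"   "experience"  "individual"  "television" 
Words with the highest proportion of vowels. The denominator is the word length, and the ratio has to be computed over all words, not only over temp1:
> ratio <- str_count(words, "[aeiou]") / str_length(words)
> words[ratio == max(ratio)]
[1] "a"

Are there any words that contain all of the vowels?

> words[str_detect(words, "a") &
      str_detect(words, "e") &
        str_detect(words, "i") &
        str_detect(words, "o") &
        str_detect(words, "u")
      ]
character(0)

No, there are none.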
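
The exercise also asks for a version built from several str_detect() calls. A possible sketch for parts a and b follows; the variable names are made up, and the final checks simply confirm that both approaches select the same words.

```r
library(stringr)

words <- stringr::words

# a. Start or end with x: one regex versus two str_detect() calls joined by |
single_a <- str_subset(words, "^x|x$")
multi_a  <- words[str_detect(words, "^x") | str_detect(words, "x$")]

# b. Start with a vowel and end with a consonant: one regex versus two calls joined by &
single_b <- str_subset(words, "^[aeiou].*[^aeiou]$")
multi_b  <- words[str_detect(words, "^[aeiou]") & str_detect(words, "[^aeiou]$")]

print(identical(single_a, multi_a))
print(identical(single_b, multi_b))
```

Splitting the condition into several simple patterns is often easier to read and debug than one combined regular expression.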

10.4.4

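1 In the previous example, you might have noticed that the regular expression matched flickered, which is not a colour. Modify the regex to fix the problem.

The original pattern: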
> colors <- c(
  "red", "orange", "yellow", "green", "blue", "purple"
)
> color_match <- str_c(colors, collapse = "|")
> color_match
[1] "red|orange|yellow|green|blue|purple"
# Fix
## \b, covered in 10.3.3, matches a word boundary.
> color_match2 <- str_c("\\b(", str_c(colors, collapse = "|"), ")\\b")
> color_match2
[1] "\\b(red|orange|yellow|green|blue|purple)\\b"
> more2 <- sentences[str_count(sentences, color_match2) > 1]
> str_view_all(more2, color_match2, match = TRUE)

2 From the Harvard sentences data, extract:

a. The first word from each sentence.

b. All words ending in ing.

c. All plurals.

The first word from each sentence:
> str_extract(sentences, "[A-Za-z]+") %>% 
  head()
[1] "The"   "Glue"  "It"    "These" "Rice"  "The" 
All words ending in ing:
> pattern <- "\\b[A-Za-z]+ing\\b"
> sentences_with_ing <- str_detect(sentences, pattern)
> unique(unlist(str_extract_all(sentences[sentences_with_ing], pattern)))

All plurals.
This is the answer given, but it is only a heuristic: not every plural ends in s or es, and not every word ending in s is a plural (makes, for example, is a verb):
> unique(unlist(str_extract_all(sentences, "\\b[A-Za-z]{3,}s\\b"))) %>%
   head()
[1] "planks" "days"   "bowls"  "lemons" "makes"  "hogs"

10.4.6

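1 Find all words that come after a number word (one, two, three and so on). Pull out both the number and the word. Only the numbers one to ten are considered here.

\b: word boundary

\w: any word character

\W: any non-word character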

> numword <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\w+)"
> sentences[str_detect(sentences, numword)] %>%
  str_extract(numword)
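
str_extract() returns the whole match as one string. To pull the number and the following word out separately, str_match() can be used, since it returns one column per capture group. The sketch below uses a made-up sentence rather than the sentences data.

```r
library(stringr)

numword <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\w+)"

# Invented example sentence; only the first match is returned by str_match()
example <- "The two dogs ran past three cats."
m <- str_match(example, numword)

# Column 1 is the full match, column 2 the number, column 3 the following word
print(m)
```

For this sentence the row is "two dogs", "two", "dogs".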

2 Find all contractions. Separate out the pieces before and after the apostrophe.

> contraction <- "([A-Za-z]+)'([A-Za-z]+)"
> sentences[str_detect(sentences, contraction)] %>%
  str_extract(contraction) %>%
  str_split("'")
Here a contraction means two word parts joined by an apostrophe '.
The output is omitted here.

10.4.8

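1 Replace all forward slashes in a string with backslashes.

This one looks confusing at first. The replacement does work, but printing the result shows each backslash in escaped form as \\; writeLines() shows the actual contents of the strings.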
> test <- c("a/b", "a/b//v")
> writeLines(str_replace_all(test, "/", "\\\\"))
a\b
a\b\\v
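
A way to convince yourself that each slash really became a single backslash is to compare print(), writeLines() and str_length() on the result. The variable name out is arbitrary.

```r
library(stringr)

out <- str_replace_all(c("a/b", "a/b//v"), "/", "\\\\")

print(out)       # escaped representation, each backslash shown as \\
writeLines(out)  # raw contents, one backslash per original slash

# The lengths are unchanged, so every "/" was replaced by exactly one character
lengths_after <- str_length(out)
print(lengths_after)
```

The lengths stay at 3 and 6, the same as for the input strings.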

2 Implement a simple version of str_to_lower() using str_replace_all().

> LETTERS2letters <- letters
> names(LETTERS2letters) <- LETTERS
> str_replace_all(words, LETTERS2letters)
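
Since almost every entry in words is already lower case, the effect is easier to see on mixed-case input. In this sketch simple_lower() is a hypothetical wrapper name and the test strings are invented; the named vector maps each upper-case letter (the names) to its lower-case counterpart (the values).

```r
library(stringr)

# Names are the patterns, values are the replacements
upper_to_lower <- setNames(letters, LETTERS)

# Hypothetical wrapper around the named-vector form of str_replace_all()
simple_lower <- function(x) str_replace_all(x, upper_to_lower)

lowered <- simple_lower(c("Hello World", "R4DS"))
print(lowered)
```

This handles only the 26 ASCII letters, which is why it is a simple version of str_to_lower().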

3 Switch the first and last letters in words. Which of those strings are still words?

> test <- str_replace_all(words, "^([A-Za-z])(.*)([a-z])$", "\\3\\2\\1")
> intersect(test, words)
At first I did not understand what "still words" meant. After reading the answer I guessed that it simply meant words whose first and last letters are the same, although that still felt odd.

One of the 10.3.10 exercises finds words that start and end with the same letter:

> data <- c("abba", "abc", "a")
> str_view(data, "^(.)((.*\\1$)|\\1?$)", match = TRUE)

Applied to this exercise:
> words1 <- str_to_lower(words)
> test2 <- str_subset(words1, "^(.)((.*\\1$)|\\1?$)")
> test2

Comparing the two sets:
> test <- str_replace_all(words1, "^([A-Za-z])(.*)([a-z])$", "\\3\\2\\1")
> test_1 <- intersect(test, words1)
> setdiff(test_1, test2)
[1] "lead" "read" "god"  "dog"  "deal" "on"   "no"   "dear"
> setdiff(test2, test_1)
character(0)
So my guess was wrong: swapping the letters can also turn one word into a different word (lead and deal, god and dog, on and no).

10.4.10

1 Split up a string like "apples, pears, and bananas" into individual components.

> x <- c("apples, pears, and bananas")
> str_split(x, ", +(and +)?")[[1]]
[1] "apples"  "pears"   "bananas"
Make good use of ?, which means zero or one occurrence.

2 Why is it better to split up by boundary("word") than by " "?

> x <- "This is a sentence. This is another sentence."
> str_split(x, boundary("word"))[[1]]
[1] "This"     "is"       "a"        "sentence" "This"     "is"       "another"  "sentence"
Compare this with splitting on a single space:
> str_split(x, " ")[[1]]
[1] "This"      "is"        "a"         "sentence." "This"      "is"        "another"   "sentence."
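As the output shows, splitting on " " leaves punctuation such as the full stop attached to the words.

3 What does splitting with an empty string ("") do? Experiment, and then read the documentation.

Splitting on "" breaks the sentence into individual characters, spaces included.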

> x <- "This is a sentence. This is another sentence."
> str_split(x, "")
[[1]]
 [1] "T" "h" "i" "s" " " "i" "s" " " "a" " " "s" "e" "n" "t" "e" "n" "c" "e" "." " " "T" "h" "i" "s" " " "i" "s" " " "a" "n" "o"
[32] "t" "h" "e" "r" " " "s" "e" "n" "t" "e" "n" "c" "e" "."
This is equivalent to using boundary(), whose default type is "character":
> str_split(x, boundary())
[[1]]
 [1] "T" "h" "i" "s" " " "i" "s" " " "a" " " "s" "e" "n" "t" "e" "n" "c" "e" "." " " "T" "h" "i" "s" " " "i" "s" " " "a" "n" "o"
[32] "t" "h" "e" "r" " " "s" "e" "n" "t" "e" "n" "c" "e" "."

10.5

1 How would you find all strings containing \ with regex() vs. with fixed()?

> str_subset(c("a\\b", "ab"), "\\\\")
[1] "a\\b"
> str_subset(c("a\\b", "ab"), fixed("\\"))
[1] "a\\b"

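2 What are the five most common words in sentences?

The idea: extract all the words, unlist them, convert everything to lower case, then count.
This kind of word count is also what you would feed into a word cloud.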
> tibble(word = unlist(str_extract_all(sentences, boundary("word")))) %>%
  mutate(word = str_to_lower(word)) %>%
  count(word, sort = T) %>%
  slice(1:5)
# A tibble: 5 x 2
  word      n
  <chr> <int>
1 the     751
2 a       202
3 of      132
4 to      123
5 and     118

10.7

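1 Find the stringi functions that:

a. Count the number of words.

b. Find duplicated strings.

c. Generate random text.

Count the number of words:
> stringi::stri_count_words(sentences) %>%
  head(5)
[1] 8 8 9 9 7
Find duplicated strings.
Many R function names can be guessed from the English name of the task:
> test <- c("a", "b", "c", "a")
> test[stringi::stri_duplicated(test)]
[1] "a"
Generate random text (I had to look at the answer for this one):
> stringi::stri_rand_strings(4, 5)
[1] "auldB" "GBdbU" "UUZmd" "SisRN"
stri_rand_shuffle() shuffles the characters of a string, not the order of its words:
> stringi::stri_rand_shuffle("The brown fox jumped over the lazy cow.")
[1] " hwobeT oe dharjvl xpyuer eof.c zwom nt"
stri_rand_lipsum() generates lorem ipsum placeholder paragraphs:
> stringi::stri_rand_lipsum(1)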
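2 How can you control the language that stri_sort() uses for sorting?

Pass the locale directly through the locale argument, or pass a full set of collator options through opts_collator: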

> string1 <- c("hladny", "chladny")
> stringi::stri_sort(string1, locale = "pl_PL")
[1] "chladny" "hladny" 
> stringi::stri_sort(string1, locale = "sk_SK")
[1] "hladny"  "chladny"
> string2 <- c("number100", "number2")
> stringi::stri_sort(string2)
[1] "number100" "number2"  
> 
> stringi::stri_sort(string2, opts_collator = stringi::stri_opts_collator(numeric = TRUE))
[1] "number2"   "number100"