[R4DS reading notes] 14. Strings

2019-10-11, by 茶思饭

Read online:
R for Data Science
GitHub: https://github.com/hadley/r4ds

II Data Wrangle

Data wrangling has three steps: import, tidy, and transform.


This chapter covers string manipulation, mainly with the stringr package.

library(tidyverse)
library(stringr)

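14.2 String basics

R accepts strings wrapped in either double quotes " " or single quotes ' '; the two forms behave identically.
A string needs both its opening and its closing quote. If the closing quote is missing, the command cannot run and the console shows a + continuation prompt on the next line. Press Esc to cancel and retype.
To include a literal single or double quote in a string, escape it with \:

double_quote <- "\""  # or '"'
single_quote <- '\''  # or "'"

Alternatively, use the other kind of quote on the outside: ' inside " ", or " inside ' '.

A backslash in a string literal starts an escape sequence, so a literal backslash must be written as "\\".
print() shows a string with its escapes, which differs from the string's actual contents.
Use writeLines() to see the raw contents.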

x <- c("\"", "\\")
x
#> [1] "\"" "\\"
writeLines(x)
#> "
#> \
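Counting characters is a quick way to confirm that an escape is only part of the printed representation. A minimal sketch, assuming stringr is installed; the third example string is made up for illustration, and str_length() is described later in this section:

```r
library(stringr)

x <- c("\"", "\\", "a\tb")

# Each escape sequence stands for exactly one character
str_length(x)
#> [1] 1 1 3

# print() shows the escaped form, writeLines() shows the raw contents
print(x)
writeLines(x)
```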

Special characters:
\n newline
\t tab
\r carriage return
\b backspace
\a alert (bell)
\f form feed
\v vertical tab
\\ backslash \
\' ASCII apostrophe '
\" ASCII quotation mark "
\` ASCII grave accent (backtick) `
\nnn character with given octal code (1, 2 or 3 digits)
\xnn character with given hex code (1 or 2 hex digits)
\unnnn Unicode character with given code (1--4 hex digits)
\Unnnnnnnn Unicode character with given code (1--8 hex digits)
See ?'"' or ?"'" for the help page on special characters.

How the control characters below are displayed depends on the console or terminal; the outputs are what the author observed.

a <- "abc\\efg\r12456"   # "\r" is a carriage return, "\\" is a single \
a
#> [1] "abc\\efg\r12456"
writeLines(a)            # the cursor returns to the line start, so "12456" overwrites "abc\e" and the rest remains
#> 12456fg
a <- "abc\\efg\b12456"   # "\b" is a backspace: the character before it is erased
writeLines(a)
#> abc\ef12456
a <- "abc\\efg\a12456"   # "\a" is an alert (bell); the console showed a placeholder glyph between "g" and "1"
writeLines(a)
a <- "abc\\efg\f12456"   # "\f" is a form feed; the console cleared the page and left only "12456"
writeLines(a)
#> 12456
a <- "abc\\efg\v123456"  # "\v" is a vertical tab; the console again showed a placeholder glyph
writeLines(a)
a <- "abc\\efg\12456"    # "\124" is read as an octal character code and inserts the character T
writeLines(a)
#> abc\efgT56

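Base R also has many string functions, but they are often inconsistent with each other, so these notes use stringr, whose functions have more intuitive names. Every stringr function starts with str_, so typing str_ triggers autocomplete and lists all of them to choose from.
str_length() returns the length of a string.
str_c() combines strings; the sep argument sets the separator.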

str_c("x", "y")
#> [1] "xy"
str_c("x", "y", sep = ", ")
#> [1] "x, y"

str_c() is vectorised: shorter vectors are recycled to the length of the longest one:

str_c("prefix-", c("a", "b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

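Objects of length 0 (such as the NULL returned by an if without an else) are silently dropped by str_c(). This is particularly useful together with if: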

name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE

str_c(
  "Good ", time_of_day, " ", name,
  if (birthday) " and HAPPY BIRTHDAY",
  "."
)
#> [1] "Good morning Hadley."

str_replace_na() turns NA into the string "NA" so that it can be combined like any other string.

x <- c("abc", NA)
str_c("|-", x, "-|")
#> [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
#> [1] "|-abc-|" "|-NA-|"

To combine a vector of strings into a single string, use the collapse argument of str_c().

str_c(c("x", "y", "z"), collapse = ", ")
#> [1] "x, y, z"
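The difference between sep and collapse is easy to mix up, so here is a minimal sketch that uses both in one call. The vector name letters3 is made up for illustration:

```r
library(stringr)

letters3 <- c("a", "b", "c")

# sep joins the arguments element by element: the result still has 3 elements
pairs <- str_c(letters3, 1:3, sep = "-")
pairs
#> [1] "a-1" "b-2" "c-3"

# collapse then joins those elements into one string
joined <- str_c(letters3, 1:3, sep = "-", collapse = " + ")
joined
#> [1] "a-1 + b-2 + c-3"
```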

Subsetting strings
-- str_sub() takes start and end arguments that give the position of the substring.

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#> [1] "App" "Ban" "Pea"
# negative numbers count backwards from end
str_sub(x, -3, -1)
#> [1] "ple" "ana" "ear"

-- str_sub() does not fail when the string is too short; it returns as much as it can.

str_sub("a", 1, 5)
#> [1] "a"

-- str_sub() can also be used on the left-hand side of an assignment to modify a string.

str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple"  "banana" "pear"

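-- str_to_lower() converts to lower case.
-- str_to_upper() converts to upper case.
-- str_to_title() converts to title case: the first letter of every word is capitalised.
-- str_to_sentence() converts to sentence case: only the first letter of each sentence is capitalised.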
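The four case functions can be compared side by side on one input. A minimal sketch; the sample string is made up, and str_to_sentence() assumes stringr 1.4.0 or later:

```r
library(stringr)

s <- "the quick BROWN fox"

lower    <- str_to_lower(s)     # "the quick brown fox"
upper    <- str_to_upper(s)     # "THE QUICK BROWN FOX"
title    <- str_to_title(s)     # "The Quick Brown Fox"
sentence <- str_to_sentence(s)  # "The quick brown fox"

writeLines(c(lower, upper, title, sentence))
```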

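Locales
Writing conventions differ between locales, so to get the same result on computers set to different locales, specify the locale argument.
A locale is given as an ISO 639 language code, a two- or three-letter abbreviation.
Base R's order() and sort() use the current locale of the computer. For results that are identical across computers, use str_sort() and str_order(), which take a locale argument.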

x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en")  # English
#> [1] "apple"    "banana"   "eggplant"
str_sort(x, locale = "haw") # Hawaiian
#> [1] "apple"    "eggplant" "banana"

-- en English
-- zh Chinese
-- fr French
-- ja Japanese
-- de German
-- es Spanish
...

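str_wrap() treats each input string as one paragraph and breaks it into lines according to width (the line width), indent (indentation of the first line) and exdent (indentation of the following lines, i.e. a hanging indent). The result has one string per input, with the lines separated by "\n".
str_wrap(string, width = 80, indent = 0, exdent = 0)
str_trim() removes whitespace from the start and end of a string.
-- str_trim(string, side = c("both", "left", "right"))
str_squish(string) additionally reduces repeated whitespace inside the string to a single space.
str_pad() pads a string to a given width, with spaces by default, on the left, right or both sides.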
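The whitespace helpers above are easiest to tell apart on small inputs. A minimal sketch with made-up strings; the exact line breaks chosen by str_wrap() are not shown because they depend on the wrapping algorithm:

```r
library(stringr)

trimmed  <- str_trim("  abc  ")                  # "abc"
left     <- str_trim("  abc  ", side = "left")   # "abc  "
squished <- str_squish("  a   b  c ")            # "a b c"
padded   <- str_pad("abc", width = 6)            # "   abc" (pads on the left by default)
dashed   <- str_pad("abc", width = 6, side = "right", pad = "-")  # "abc---"

# str_wrap() returns one string per input, with "\n" between the lines
wrapped <- str_wrap("R for Data Science is a book about doing data science with R.",
                    width = 20)
writeLines(wrapped)
```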

Exercises:

Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. Think carefully about what it should do if given a vector of length 0, 1, or 2.

str_commasep <- function(x, delim = ",") {
  n <- length(x)
  x <-str_replace_na(x)
  if (n == 0) {
    ""
  } else if (n == 1) {
    x
  } else if (n == 2) {
    # no comma before and when n == 2
    str_c(x[[1]], "and", x[[2]], sep = " ")
  } else {
    # commas after all n - 1 elements
    not_last <- str_c(x[seq_len(n - 1)], delim)
    # prepend "and" to the last element
    last <- str_c("and", x[[n]], sep = " ")
    # combine parts with spaces
    str_c(c(not_last, last), collapse = " ")
  }
}
str_commasep(character())
#> [1] ""
str_commasep("a")
#> [1] "a"
str_commasep(c("a", "b"))
#> [1] "a and b"
str_commasep(c("a", "b", "c"))
#> [1] "a, b, and c"
str_commasep(c("a", "b", "c", "d"))
#> [1] "a, b, c, and d"
## Author: Richard_Zhou
## Link: https://www.jianshu.com/p/4790b00dc238

14.3 Matching patterns with regular expressions

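Pattern matching with regular expressions

str_view() shows simple matches.
"." matches any character (except a newline).
To match a literal ".", the regular expression needs \. so that the dot loses its special meaning. But \ is also the escape character in strings, so the regex \. has to be written as the string "\\.".
To match a literal \, write "\\\\": the string "\\\\" holds the regex \\, and the regex engine then treats the first \ as escaping the second.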

x <- "a\\b"
writeLines(x)
#> a\b

str_view(x, "\\\\")

^ matches the start of the string.
$ matches the end of the string.
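Anchoring both ends forces the pattern to match the whole string. A minimal sketch, with str_detect() and str_subset() (covered in 14.4) standing in for str_view() so the result is visible as plain output:

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

# Without anchors the pattern may match anywhere in the string
anywhere <- str_detect(x, "apple")
anywhere
#> [1] TRUE TRUE TRUE

# With ^ and $ the whole string has to be exactly "apple"
whole <- str_detect(x, "^apple$")
whole
#> [1] FALSE  TRUE FALSE

# $ alone pins the match to the end of the string
ends_with_pie <- str_subset(x, "pie$")
ends_with_pie
#> [1] "apple pie"
```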

14.3.2.1 Exercises

1.How would you match the literal string "$^$"?

str_view("$^$", "\\$\\^\\$")

Given the corpus of common words in stringr::words, create regular expressions that find all words that:
Start with “y”.
End with “x”
Are exactly three letters long. (Don’t cheat by using str_length()!)
Have seven letters or more.

str_view(stringr::words,"^y",match=T)
str_view(stringr::words, "x$",match=T)
str_view(stringr::words,"^...$",match=T)
str_view(stringr::words, "^.......",match=T)

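Other patterns that match more than one character:

\d: matches any digit.
\s: matches any whitespace (e.g. space, tab, newline).
[abc]: matches a, b, or c.
[^abc]: matches anything except a, b, or c.

Inside [] the characters $ . | ? * + ( ) [ { match themselves without a \. A few characters are special inside [] and must still be escaped with \: ] \ ^ and -.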
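The classes above can be tried out on small inputs. A minimal sketch with made-up strings; note that \d and \s are written "\\d" and "\\s" inside an R string:

```r
library(stringr)

# \d: one or more digits
digits <- str_extract("room 101", "\\d+")
digits
#> [1] "101"

# \s: any whitespace, here a space and a tab
underscored <- str_replace_all("a b\tc", "\\s", "_")
underscored
#> [1] "a_b_c"

# [.] matches a literal dot without a backslash
literal_dot <- str_detect(c("a.c", "abc"), "a[.]c")
literal_dot
#> [1]  TRUE FALSE

# [^aeiou] matches every character that is not a vowel
consonants <- str_extract_all("tidyverse", "[^aeiou]")[[1]]
consonants
#> [1] "t" "d" "y" "v" "r" "s"
```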

14.3.3.1 Exercises

1.Create regular expressions to find all words that:
Start with a vowel.
That only contain consonants. (Hint: thinking about matching “not”-vowels.)
End with ed, but not with eed.
End with ing or ise.
Empirically verify the rule “i before e except after c”.
Is “q” always followed by a “u”?

str_view(stringr::words, "^[aeiou]", match = TRUE)
str_view(stringr::words, "^[^aeiou]+$", match = TRUE)
str_view(stringr::words, "[^e]ed$", match = TRUE)
str_view(stringr::words, "(ing|ise)$", match = TRUE)
str_view(stringr::words, "[^c]ie|cei", match = TRUE)  # spellings that follow the rule
str_view(stringr::words, "cie|[^c]ei", match = TRUE)  # spellings that break the rule
str_view(stringr::words, "q[^u]", match = TRUE)  # no matches, so every "q" is followed by a "u"

2.Write a regular expression that matches a word if it’s probably written in British English, not American English.

str_view(stringr::words, "re$", match = TRUE)   # words ending in -re: British -re, American -er
str_view(stringr::words, "our$", match = TRUE)  # words ending in -our: British -our, American usually -or
str_view(stringr::words, "ise$", match = TRUE)  # verbs in -ize/-ise: British English accepts both, American English always uses -ize
str_view(stringr::words, "yse$", match = TRUE)  # verbs in -yse: British -yse, American always -yze
str_view(stringr::words, "ll(ed|ing)$", match = TRUE)  # verbs ending in vowel + l: British doubles the l before a vowel suffix, American does not
str_view(stringr::words, "ae|oe", match = TRUE)  # digraphs ae and oe: two letters in British English, usually a single e in American English
str_view(stringr::words, "ence$", match = TRUE)  # some nouns end in -ence in British English and -ense in American English
str_view(stringr::words, "ogue$", match = TRUE)  # nouns in -ogue: British -ogue, American -og or -ogue

Create a regular expression that will match telephone numbers as commonly written in your country.
"(0[0-9]{2,3})-"         # landline: the area code followed by a hyphen
"1([1-9]{2})([0-9]{8})"  # mobile: 11 digits starting with 1

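14.3.4 Repetition

?: 0 or 1
+: 1 or more
*: 0 or more
{n}: exactly n
{n,}: n or more
{,m}: at most m
{n,m}: between n and m

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "C{2,3}")   # greedy by default: matches the longest possible string
str_view(x, 'C{2,3}?')  # a trailing ? makes the match lazy: the shortest possible string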
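To see the greedy and lazy results as plain output, str_extract() (covered in 14.4.2) can stand in for str_view(). A minimal sketch using the Roman numeral string from the book:

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

greedy <- str_extract(x, "C{2,3}")    # takes as many C's as allowed
lazy   <- str_extract(x, "C{2,3}?")   # stops as soon as the minimum is met
greedy
#> [1] "CCC"
lazy
#> [1] "CC"

# The same applies to + and *
greedy_plus <- str_extract(x, "C[LX]+")
lazy_plus   <- str_extract(x, "C[LX]+?")
greedy_plus
#> [1] "CLXXX"
lazy_plus
#> [1] "CL"
```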

14.3.4.1 Exercises

Describe the equivalents of ?, +, * in {m,n} form.
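? = {0,1}
+ = {1,}
* = {0,}
Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
1. ^.*$ ## matches any string (.* is any character, repeated any number of times)
2. "\\{.+\\}" ## the regex \{.+\}: at least one character surrounded by curly braces
3. \d{4}-\d{2}-\d{2} ## four digits, a hyphen, two digits, a hyphen, two digits
4. "\\\\{4}" ## the regex \\{4}: four backslashes in a row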
Create regular expressions to find all words that:
1. Start with three consonants.
  str_view(stringr::words, "^[^aoeiu]{3}",match=T)
2. Have three or more vowels in a row.
  str_view(stringr::words, "[aoeiu]{3,}",match=T)
3. Have two or more vowel-consonant pairs in a row.
  str_view(stringr::words, "([aoeiu][^aoeiu]){2,}",match=T)
Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

14.3.5 Grouping and backreferences

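Backreferences in regular expressions
Backreferences are convenient because they let you repeat a pattern without writing it out again. \N, where N is a group number, refers to the text matched by an earlier group (the part in parentheses). For example, text that starts with abc, continues with xyz, and is followed by abc again can be matched by "abcxyzabc", or with a backreference by "(abc)xyz\\1", where \1 refers to the first group (abc). \2 refers to the second group, \3 to the third, and so on.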
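A backreference matches the same text that the group matched, not the group's pattern again. A minimal sketch with made-up inputs:

```r
library(stringr)

# \1 must repeat exactly what (abc) matched
repeated <- str_detect(c("abcxyzabc", "abcxyzabd"), "(abc)xyz\\1")
repeated
#> [1]  TRUE FALSE

# (..)\1 finds any two characters that are immediately repeated
pair <- str_extract("banana", "(..)\\1")
pair
#> [1] "anan"

# Two groups referenced in reverse order match a four-letter palindrome
palindrome <- str_detect(c("abba", "abab"), "(.)(.)\\2\\1")
palindrome
#> [1]  TRUE FALSE
```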

14.3.5.1 Exercises

Describe, in words, what these expressions will match:
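(.)\1\1 ## the same character three times in a row, e.g. aaa
"(.)(.)\\2\\1" ## a pair of characters followed by the same pair reversed, e.g. abba
(..)\1 ## any two characters repeated, e.g. abab
"(.).\\1.\\1" ## a character, any character, the first character again, any character, the first character again, e.g. abaca
"(.)(.)(.).*\\3\\2\\1" ## three characters, then any number of characters, then the same three in reverse order, e.g. abcxxcba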
Construct regular expressions to match words that:
Start and end with the same character.
str_view(stringr::words, "^(.).*\\1$",match=T)
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
str_view(stringr::words, "(..).*\\1",match=T)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)
str_view(stringr::words, "(.).*\\1.*\\1",match=T)

14.4 Tools


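A regular expression is passed to these functions as the pattern argument. stringr groups what a pattern can do with a string into four tasks:

Detect pattern: check whether the pattern is present
Locate pattern: return the start and end positions of the pattern
Extract pattern: return the text that the pattern matched
Replace pattern: substitute the matches and return the modified string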

14.4.1 Detect matches

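str_detect() checks whether each element of a character vector matches a pattern. It returns a logical vector of the same length as the input.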

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1]  TRUE FALSE  TRUE

str_count() tells you how many matches there are in each string.

str_count(x, "a")
#> [1] 1 3 1

Matches never overlap. In the example below the pattern is found 2 times, not 3.

str_count("abababa", "aba")
#> [1] 2
str_view_all("abababa", "aba")

Functions with the _all suffix return every match instead of only the first one.

14.4.2 Extract matches

str_subset() keeps only the elements that match the pattern.

str_subset(x,"e")
[1] "apple" "pear"

str_extract() ## returns the text of the first match
str_extract_all() ## returns all matches as a list

With simplify = TRUE, str_extract_all() returns the matches as a matrix.

str_extract(x,"e")
[1] "e" NA  "e"
## all matches, returned as a list
str_extract_all(x,"a")
[[1]]
[1] "a"

[[2]]
[1] "a" "a" "a"

[[3]]
[1] "a"
## all matches, returned as a matrix
str_extract_all(x,"a",simplify = T)
     [,1] [,2] [,3]
[1,] "a"  ""   ""  
[2,] "a"  "a"  "a" 
[3,] "a"  ""   ""

14.4.3 Grouped matches

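str_match() returns a character matrix.
The first column is the complete match, as str_extract() would return it. The following columns are the groups, one column for each pair of parentheses in the pattern. The example here has two groups.
If the data is in a tibble, tidyr::extract() is often more convenient. It works like str_match() but requires you to name the matches.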

str_match(x,"([aoeiu]).*([aoeiu])")
     [,1]    [,2] [,3]
[1,] "apple" "a"  "e" 
[2,] "anana" "a"  "a" 
[3,] "ea"    "e"  "a" 
tibble(x=x) %>%
 tidyr::extract(x,c("vowel1","vowel2"),"([aoeiu]).*([aoeiu])",
remove=FALSE)  ## keep the original column
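The matrix layout of str_match() is easiest to read with a pattern whose groups have an obvious meaning. A minimal sketch that splits a date into its parts; the input vector is made up for illustration:

```r
library(stringr)

dates <- c("2019-10-11", "no date here")

# Column 1 is the full match, columns 2 to 4 are the three groups
m <- str_match(dates, "(\\d{4})-(\\d{2})-(\\d{2})")
m
#>      [,1]         [,2]   [,3] [,4]
#> [1,] "2019-10-11" "2019" "10" "11"
#> [2,] NA           NA     NA   NA

# A row of NA marks an input without a match
years <- as.integer(m[, 2])
years
#> [1] 2019   NA
```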

14.4.4 Replacing matches

str_replace() and str_replace_all() replace matches with new strings.

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-"  "p--r"   "b-n-n-"

With str_replace_all() you can also perform multiple replacements at once by supplying a named vector:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house"    "two cars"     "three people"

Instead of replacing with a fixed string you can use backreferences to insert components of the match.

sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% ### swap the second and third words
  head(5) 
#> [1] "The canoe birch slid on the smooth planks." 
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."     
#> [4] "These a days chicken leg is a rare dish."   
#> [5] "Rice often is served in round bowls."
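The same technique works for any reordering of captured pieces. A minimal sketch that turns "first last" into "last, first"; the people vector is an assumption made up for this example:

```r
library(stringr)

people <- c("Hadley Wickham", "Garrett Grolemund")

# \\1 and \\2 in the replacement refer to the two captured words
reordered <- str_replace(people, "(\\w+) (\\w+)", "\\2, \\1")
reordered
#> [1] "Wickham, Hadley"    "Grolemund, Garrett"
```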

14.4.1.1 Exercises

1.For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

Find all words that start or end with x.

words[str_detect(words,"^x|x$")]

Find all words that start with a vowel and end with a consonant.

words[str_detect(words,"^[aoeiu].*[^aoeiu]$")]

Are there any words that contain at least one of each different vowel?

words[str_detect(words, "a") & str_detect(words, "e") & str_detect(words, "i") & str_detect(words, "o") & str_detect(words, "u")]

What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)

words[str_count(words,"[aoeiu]") == max(str_count(words,"[aoeiu]"))]
words[str_count(words,"[aoeiu]")/str_length(words) == max(str_count(words,"[aoeiu]")/str_length(words))]

14.4.2.1 Exercises

In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
#> [1] "red|orange|yellow|green|blue|purple"
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)


Modified regular expression, requiring each colour to be a whole word:

colour_match2 <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")
colour_match2
#> [1] "\\b(red|orange|yellow|green|blue|purple)\\b"
str_view_all(more, colour_match2)


From the Harvard sentences data, extract:
The first word from each sentence.

str_extract(sentences, "^[A-Za-z']+")

All words ending in ing.

str_extract_all(sentences, "\\b[A-Za-z]+ing\\b")

All plurals.

str_extract(sentences,"([a-z]+)(((s|x|sh|ch)es)|ies|[aoeiu]ys|ves|[^aeiu']s)[ .]")

14.4.3.1 Exercises

Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.

numb <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten")
number <- str_c("\\b(", str_c(numb, collapse = "|"), ") ([^ ]+)")
str_subset(sentences, number) %>% str_match(number)

Find all contractions. Separate out the pieces before and after the apostrophe.

str_subset(sentences,"'") %>% str_match("([^ ]+)'([^ ]+)")

14.4.4.1 Exercises

Replace all forward slashes in a string with backslashes.

x <- c("ab/c", "abbc/edf")
x
#[1] "ab/c"     "abbc/edf"
str_replace_all(x, "/", "\\\\")
#[1] "ab\\c"     "abbc\\edf"

Implement a simple version of str_to_lower() using str_replace_all().

paste0('"',LETTERS,'"',"=",'"',letters,'"') %>% str_c(collapse = ",") %>% writeLines()
#"A"="a","B"="b","C"="c","D"="d","E"="e","F"="f","G"="g","H"="h","I"="i","J"="j","K"="k","L"="l","M"="m","N"="n","O"="o","P"="p","Q"="q","R"="r","S"="s","T"="t","U"="u","V"="v","W"="w","X"="x","Y"="y","Z"="z"
str_replace_all(sentences,c("A"="a","B"="b","C"="c","D"="d","E"="e","F"="f","G"="g","H"="h","I"="i","J"="j","K"="k","L"="l","M"="m","N"="n","O"="o","P"="p","Q"="q","R"="r","S"="s","T"="t","U"="u","V"="v","W"="w","X"="x","Y"="y","Z"="z"))%>% head(5)
[1] "the birch canoe slid on the smooth planks."  "glue the sheet to the dark blue background."
[3] "it's easy to tell the depth of a well."      "these days a chicken leg is a rare dish."   
[5] "rice is often served in round bowls."       
c("A"="a","B"="b","C"="c","D"="d","E"="e","F"="f","G"="g","H"="h","I"="i","J"="j","K"="k","L"="l","M"="m","N"="n","O"="o","P"="p","Q"="q","R"="r","S"="s","T"="t","U"="u","V"="v","W"="w","X"="x","Y"="y","Z"="z")
##  A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z 
## "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

As the output shows, c() here merely builds a named character vector, so the following approach is simpler.

names(letters) <- LETTERS
letters
##   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z 
## "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" 
str_replace_all(sentences,letters) %>% head(5)
#[1] "the birch canoe slid on the smooth planks."               
#[2] "glue the sheet to the dark blue background."              
#[3] "it's easy to tell the depth of a well."                   
#[4] "these days a chicken leg is a rare dish."                 
#[5] "rice is often served in round bowls."

Switch the first and last letters in words. Which of those strings are still words?

str_replace_all(words,"(^.)(.*)(.$)","\\3\\2\\1")
str_replace_all(words,"(^.)(.*)(.$)","\\3\\2\\1") %>% str_subset(paste0("^",str_c(words,collapse = "$|^"),"$"))
 [1] "a"          "america"    "area"       "dad"        "dead"       "lead"       "read"       "depend"     "god"       
[10] "educate"    "else"       "encourage"  "engine"     "europe"     "evidence"   "example"    "excuse"     "exercise"  
[19] "expense"    "experience" "eye"        "dog"        "health"     "high"       "knock"      "deal"       "level"     
[28] "local"      "nation"     "on"         "non"        "no"         "rather"     "dear"       "refer"      "remember"  
[37] "serious"    "stairs"     "test"       "tonight"    "transport"  "treat"      "trust"      "window"     "yesterday"

14.4.5 Splitting

str_split() splits a string into pieces on a pattern, much like Text to Columns in Excel.
Because each element may yield a different number of pieces, the function returns a list of character vectors.
If you are splitting a single string, the easiest approach is to take [[1]], which gives a plain vector.
You can also set simplify = TRUE to get a matrix, and use n to cap the number of pieces (and so the number of columns).
Instead of splitting up strings by patterns, you can also use boundary() to split by character, line, sentence or word:
boundary(type = c("character", "line_break", "sentence", "word"), skip_word_none = NA, ...)

words <- c("These are   some words.")
str_count(words, boundary("word"))
[1] 4
str_split(words, " ")[[1]]
[1] "These"  "are"    ""       ""       "some"   "words."
str_split(words, boundary("word"))[[1]]
[1] "These" "are"   "some"  "words"
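The n and simplify arguments work well together for "key: value" text, where only the first separator should split. A minimal sketch; the fields vector is made up, and its third element contains the separator twice on purpose:

```r
library(stringr)

fields <- c("Name: Hadley", "Country: NZ", "Note: a: b")

# n = 2 stops after the first split, simplify = TRUE returns a matrix
parts <- str_split(fields, ": ", n = 2, simplify = TRUE)
parts
#>      [,1]      [,2]
#> [1,] "Name"    "Hadley"
#> [2,] "Country" "NZ"
#> [3,] "Note"    "a: b"
```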

str_split_fixed() returns a character matrix with n columns.

fruits <- c(
  "apples and oranges and pears and bananas",
  "pineapples and mangos and guavas"
)
str_split_fixed(fruits, " and ", 3)
     [,1]         [,2]      [,3]               
[1,] "apples"     "oranges" "pears and bananas"
[2,] "pineapples" "mangos"  "guavas"           
str_split_fixed(fruits, " and ", 4)
     [,1]         [,2]      [,3]     [,4]     
[1,] "apples"     "oranges" "pears"  "bananas"
[2,] "pineapples" "mangos"  "guavas" ""

14.4.5.1 Exercises

1.Split up a string like "apples, pears, and bananas" into individual components.

x <- "apples, pears, and bananas"
str_split(x, ", +(and +)?")[[1]]

2.Why is it better to split up by boundary("word") than " "?
boundary("word") splits on word boundaries, so repeated spaces, commas and other punctuation do not end up in the result.

What does splitting with an empty string ("") do? Experiment, and then read the documentation.
It splits the string into individual characters:

str_split(words,"")[[1]]
 [1] "T" "h" "e" "s" "e" " " "a" "r" "e" " " " " " " "s" "o" "m" "e" " " "w" "o" "r" "d" "s" "."

14.4.6 Find matches

str_locate() and str_locate_all() return the start and end positions of each match. Use str_locate() to find the position of a match, then str_sub() to extract or modify it.
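The locate-then-substring workflow can be sketched as follows. The string is made up for illustration:

```r
library(stringr)

x <- "apple pie"

# str_locate() returns a matrix with "start" and "end" columns
loc <- str_locate(x, "pie")
loc
#>      start end
#> [1,]     7   9

# Extract the located piece ...
piece <- str_sub(x, loc[1, "start"], loc[1, "end"])
piece
#> [1] "pie"

# ... or overwrite it in place
str_sub(x, loc[1, "start"], loc[1, "end"]) <- "tart"
x
#> [1] "apple tart"
```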

14.5 Other types of pattern 其他模式

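A pattern given as a plain string is automatically wrapped in a call to regex().
The arguments of regex() give finer control over matching:
- ignore_case = TRUE matches both upper-case and lower-case letters. The default is FALSE.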

bananas <- c("banana", "Banana", "BANANA")
str_extract(bananas, "banana")
[1] "banana" NA       NA      
str_extract(bananas, regex("banana", ignore_case = TRUE))
[1] "banana" "Banana" "BANANA"

multiline = TRUE makes ^ and $ match the start and end of each line instead of the start and end of the whole string.

x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
#> [1] "Line"
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
#> [1] "Line" "Line" "Line"

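comments = TRUE lets you add comments and whitespace to make a complex regular expression easier to read. Everything after # is treated as a comment and all whitespace is ignored. To match a literal space, escape it as "\\ ".

str_split(words, regex("\\ ", comments = TRUE))[[1]]  ## splits on spaces
# [1] "These"  "are"    ""       ""       "some"   "words."
str_split(words, regex(" ", comments = TRUE))[[1]]  ## the space in the pattern is ignored, so the string is split into single characters
# [1] ""  "T" "h" "e" "s" "e" " " "a" "r" "e" " " " " " " "s" "o" "m" "e" " " "w" "o" "r" "d" "s" "." ""
str_split(words, regex("[. ]", comments = TRUE))[[1]]  ## the space inside [] is ignored too, so this splits only on "."
# [1] "These are   some words" ""                      
str_split(words, regex("[.\\ ]", comments = TRUE))[[1]]  ## splits on "." or a space
# [1] "These" "are"   ""      ""      "some"  "words" ""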
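The main use of comments = TRUE is to spread a longer pattern over several annotated lines. A minimal sketch; the name date_rx and the input string are made up for illustration:

```r
library(stringr)

# Whitespace and everything after # are ignored inside the pattern
date_rx <- regex("
  (\\d{4})  # year
  -         # separator
  (\\d{2})  # month
  -         # separator
  (\\d{2})  # day
", comments = TRUE)

m <- str_match("Posted on 2019-10-11.", date_rx)
m
#>      [,1]         [,2]   [,3] [,4]
#> [1,] "2019-10-11" "2019" "10" "11"
```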

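dotall = TRUE lets . match any character, including \n.

Besides regex(), there are three other functions you can use in its place:

fixed(): matches the exact sequence of bytes and ignores all regular expression syntax, which makes it much faster. Take care with non-English text: the same character can often be represented in more than one way, and fixed() does not treat those representations as equal.
coll(): compares strings using standard collation rules. coll() takes a locale argument that selects which rules to use. It is comparatively slow.
boundary(): introduced with str_split() above, it can be used with the other functions too.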

 str_extract_all(words, boundary("word"))[[1]]
[1] "These" "are"   "some"  "words"
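The difference between regex(), fixed() and coll() shows up in two situations: regex metacharacters in the pattern, and characters with more than one Unicode representation. A minimal sketch; the second half follows the á example from the book:

```r
library(stringr)

# As a regex, "." matches any character; fixed() takes it literally
as_regex <- str_detect(c("a.c", "abc"), "a.c")
as_fixed <- str_detect(c("a.c", "abc"), fixed("a.c"))
as_regex
#> [1] TRUE TRUE
as_fixed
#> [1]  TRUE FALSE

# Two ways to write "á": one precomposed code point, or "a" plus a combining accent
a1 <- "\u00e1"
a2 <- "a\u0301"

# fixed() compares bytes, so the two forms differ; coll() treats them as equal
bytes_equal    <- str_detect(a1, fixed(a2))
collated_equal <- str_detect(a1, coll(a2))
bytes_equal
#> [1] FALSE
collated_equal
#> [1] TRUE
```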

14.5.1 Exercises

How would you find all strings containing \ with regex() vs. with fixed()?

x <- "Line 1\\\\Line 2\nLine 3"
writeLines(x)
#Line 1\\Line 2
#Line 3
str_view_all(x, regex("\\\\"))
str_view_all(x,fixed("\\"))

What are the five most common words in sentences?

str_split(sentences, boundary("word")) %>%  ## split into words
  unlist() %>% str_to_lower() %>%           ## flatten the list and convert to lower case
  table() %>% sort(decreasing = TRUE) %>%   ## count with table(), then sort
  head(5)                                   ## show the top 5
# .
# the   a  of  to and 
# 751 202 132 123 118

14.6 Other uses of regular expressions

apropos() searches all objects available from the global environment.
dir() lists all the files in a directory. The pattern argument takes a regular expression and only returns file names that match the pattern.
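Both functions are part of base R, so they can be tried without stringr. A minimal sketch that creates a few empty files in a temporary directory (the directory and file names are made up) and filters them with a regular expression:

```r
# Set up a scratch directory with three empty files
tmp <- file.path(tempdir(), "regex-demo")
dir.create(tmp, showWarnings = FALSE)
invisible(file.create(file.path(tmp, c("a.csv", "b.csv", "notes.txt"))))

# pattern is a regular expression, so the dot has to be escaped
csv_files <- dir(tmp, pattern = "\\.csv$")
csv_files
#> [1] "a.csv" "b.csv"

# glob2rx() converts a shell-style wildcard into the equivalent regex
glob_regex <- glob2rx("*.csv")
glob_regex
#> [1] "^.*\\.csv$"

# apropos() searches object names, here everything starting with "str"
head(apropos("^str"))
```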

14.7 stringi

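stringr is built on top of stringi. stringr covers the most common string operations with 46 functions, while stringi has 234 and is more comprehensive. For more complex string processing, turn to stringi: the two packages work very similarly, and the main difference is the prefix, str_ versus stri_.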

14.7.1 Exercises

Find the stringi functions that:
Count the number of words.
stri_count_words(sentences) %>% head()
Find duplicated strings.
stri_extract_first_regex(stringr::words, "(.)\\1")  ### finds doubled letters inside each word
stri_duplicated(c("a", "b", "a", NA, "a", NA))  ### flags strings that duplicate an earlier element
Generate random text.
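The notes leave this item unanswered. One possible answer, offered as a sketch rather than the author's solution, is the stri_rand_*() family in stringi:

```r
library(stringi)

set.seed(1)  # only to make the run repeatable

# 3 random strings of 8 characters each, drawn from [A-Za-z0-9] by default
ids <- stri_rand_strings(3, 8)

# 1 paragraph of lorem ipsum filler text
lipsum <- stri_rand_lipsum(1)

# A random permutation of the characters in a string
shuffled <- stri_rand_shuffle("stringr")

ids
shuffled
```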
How do you control the language that stri_sort() uses for sorting?

stri_sort() takes several sorting options. The decreasing = TRUE argument sorts in descending order:

 stri_sort(sample(LETTERS))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
stri_sort(sample(LETTERS),decreasing = TRUE) 
 [1] "Z" "Y" "X" "W" "V" "U" "T" "S" "R" "Q" "P" "O" "N" "M" "L" "K" "J" "I" "H" "G" "F" "E" "D" "C" "B" "A"

- The locale = "xx" argument sorts according to the rules of that language.

stri_sort(c("hladny", "chladny"), locale="pl_PL")
[1] "chladny" "hladny" 
stri_sort(c("hladny", "chladny"), locale="sk_SK")
[1] "hladny"  "chladny"

- The numeric = TRUE argument sorts digits by their numeric value.

stri_sort(c(1, 100, 2, 101, 11, 10))
[1] "1"   "10"  "100" "101" "11"  "2"  
stri_sort(c(1, 100, 2, 101, 11, 10), numeric=TRUE)
[1] "1"   "2"   "10"  "11"  "100" "101"
