String processing in R (2/3): splitting, extracting, replacing
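This article is reposted from the WeChat public account 一遇之见, from its post "R中字符串处理:函数实现". The original is long, so I am studying and digesting it in three parts.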
String splitting functions: strsplit, str_split, and str_split_fixed
The functions strsplit, str_split, and str_split_fixed can all split strings. strsplit and str_split return a list, whereas str_split_fixed returns a matrix.
fruits = c("Small Yellow Banana", " Red Apple", "Big Sweet Pear  ", "Sour PineApple") # the third element ends with two spaces
strsplit(fruits, " ")
# [[1]]
# [1] "Small" "Yellow" "Banana"
# [[2]]
# [1] "" "Red" "Apple"
# [[3]]
# [1] "Big" "Sweet" "Pear" "" # the string actually ends with two spaces
# [[4]]
# [1] "Sour" "PineApple"
library(stringr)
str_split(fruits, " ")
# [[1]]
# [1] "Small" "Yellow" "Banana"
# [[2]]
# [1] "" "Red" "Apple"
# [[3]]
# [1] "Big" "Sweet" "Pear" "" "" # this function picks up both trailing spaces
# [[4]]
# [1] "Sour" "PineApple"
str_split_fixed(fruits, " ", n = 3)
# [,1] [,2] [,3]
# [1,] "Small" "Yellow" "Banana"
# [2,] "" "Red" "Apple"
# [3,] "Big" "Sweet" "Pear  "
# [4,] "Sour" "PineApple" ""
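The difference in return types can be checked directly. The following is a minimal sketch, assuming stringr is installed; the variable names res_base, res_list, and res_fixed are chosen here for illustration only.

```r
library(stringr)

# Same data as above; the third element ends with two spaces.
fruits <- c("Small Yellow Banana", " Red Apple", "Big Sweet Pear  ", "Sour PineApple")

res_base  <- strsplit(fruits, " ")               # base R: list
res_list  <- str_split(fruits, " ")              # stringr: list
res_fixed <- str_split_fixed(fruits, " ", n = 3) # stringr: character matrix

is.list(res_base)    # TRUE
is.list(res_list)    # TRUE
is.matrix(res_fixed) # TRUE
dim(res_fixed)       # 4 rows (one per string), 3 columns (one per piece)

# A matrix column gives the same piece of every string in one step.
res_fixed[, 2]
```

With a list, the same lookup needs sapply or a loop, because each element may have a different length.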
The function unlist flattens the list returned by strsplit or str_split into a single vector.
unlist(strsplit(fruits, " "))
# [1] "Small" "Yellow" "Banana" "" "Red" "Apple"
# [7] "Big" "Sweet" "Pear" "" "Sour" "PineApple"
unlist(str_split(fruits, " "))
# [1] "Small" "Yellow" "Banana" "" "Red" "Apple"
# [7] "Big" "Sweet" "Pear" "" "" "Sour"
# [13] "PineApple"
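The empty strings in the results above come from the leading and trailing spaces. If they are unwanted, one possible approach (a sketch using base R only, not something the original article prescribes) is to filter them out after unlist, or to trim each string and split on runs of spaces. The names words, words_clean, and per_string are illustrative.

```r
fruits <- c("Small Yellow Banana", " Red Apple", "Big Sweet Pear  ", "Sour PineApple")

# Option 1: flatten, then drop the empty pieces.
words <- unlist(strsplit(fruits, " "))
words_clean <- words[words != ""]
length(words_clean) # 10 words remain

# Option 2: trim surrounding whitespace first, then split on one or more spaces
# (the pattern " +" is a regular expression).
per_string <- strsplit(trimws(fruits), " +")
lengths(per_string) # number of words in each string: 3 2 3 2
```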
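Of the three splitting functions, str_split_fixed returns a character matrix (not a data frame), which makes individual pieces easy to reference by row and column. Both str_split and str_split_fixed accept the parameter n. In str_split it is optional, and the result is a list either way; in str_split_fixed it is required. When n is smaller than the number of pieces a string could be split into, the remainder is left unsplit in the last piece. When n is larger, str_split_fixed pads the extra columns with empty strings, while str_split simply returns the pieces it found.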
str_split(fruits, " ", n = 2)
# [[1]]
# [1] "Small" "Yellow Banana"
# [[2]]
# [1] "" "Red Apple"
# [[3]]
# [1] "Big" "Sweet Pear  "
# [[4]]
# [1] "Sour" "PineApple"
str_split(fruits, " ", n = 5)
# [[1]]
# [1] "Small" "Yellow" "Banana"
# [[2]]
# [1] "" "Red" "Apple"
# [[3]]
# [1] "Big" "Sweet" "Pear" "" ""
# [[4]]
# [1] "Sour" "PineApple"
str_split_fixed(fruits, " ", n = 3)
# [,1] [,2] [,3]
# [1,] "Small" "Yellow" "Banana"
# [2,] "" "Red" "Apple"
# [3,] "Big" "Sweet" "Pear  "
# [4,] "Sour" "PineApple" ""
str_split_fixed(fruits, " ", n = 2)
# [,1] [,2]
# [1,] "Small" "Yellow Banana"
# [2,] "" "Red Apple"
# [3,] "Big" "Sweet Pear  "
# [4,] "Sour" "PineApple"
str_split_fixed(fruits, " ", n = 5)
# [,1] [,2] [,3] [,4] [,5]
# [1,] "Small" "Yellow" "Banana" "" ""
# [2,] "" "Red" "Apple" "" ""
# [3,] "Big" "Sweet" "Pear" "" ""
# [4,] "Sour" "PineApple" "" "" ""
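Because str_split_fixed always returns a rectangular matrix, it converts cleanly to a data frame when named columns are more convenient. This is a minimal sketch, assuming stringr is installed; the column names first, second, and third are hypothetical and not part of the original article.

```r
library(stringr)

fruits <- c("Small Yellow Banana", " Red Apple", "Big Sweet Pear  ", "Sour PineApple")

# Split into exactly three pieces per string (character matrix, 4 x 3).
parts <- str_split_fixed(fruits, " ", n = 3)

# Convert to a data frame and name the columns (names are illustrative).
df <- as.data.frame(parts, stringsAsFactors = FALSE)
names(df) <- c("first", "second", "third")

df$second              # second piece of every string
df[df$first != "", ]   # rows whose string did not start with a space
```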
String extraction
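- substr(x, start, stop): extracts the substring of x from position start to position stop.
- substring(text, first, last = 1000000L): extracts the substring of text from first to last. last defaults to 1000000 and can be omitted.
- str_sub(x, start = 1L, end = -1L), from the stringr package: extracts the substring of x from start to end. Both start and end have defaults and can be omitted.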
txt <- c("Hello, the World!","I'm Chinese", "I love China.", "I come from China!")
substr(txt, 1, 5)
# [1] "Hello" "I'm C" "I lov" "I com"
substring(txt, 1, 5)
# [1] "Hello" "I'm C" "I lov" "I com"
str_sub(txt, 1, 3)
# [1] "Hel" "I'm" "I l" "I c"
substr(txt[1], c(1,2,3,4), c(2,3,4,5)) # the result has the length of x, so only the first start/stop pair is used
# [1] "He"
substr(txt, c(1,2,3,4), c(2,3,4,5))
# [1] "He" "'m" "lo" "om"
substring(txt[1], c(1,2,3,4), c(2,3,4,5)) # the shorter argument is recycled and matched position by position
# [1] "He" "el" "ll" "lo"
substring(txt, c(1,2,3,4), c(2,3,4,5))
# [1] "He" "'m" "lo" "om"
str_sub(txt[1], c(1,2,3,4), c(2,3,4,5)) # the shorter argument is recycled and matched position by position
# [1] "He" "el" "ll" "lo"
str_sub(txt, c(1,2,3,4), c(2,3,4,5))
# [1] "He" "'m" "lo" "om"
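Two consequences of the behaviour shown above are worth a short sketch: substring recycles a single string over vectors of positions, and the defaults of substring and str_sub let you omit the end position. The example assumes stringr is installed; the variable names are illustrative.

```r
library(stringr)

s <- "Hello"

# Recycling one string over every position splits it into single characters.
chars <- substring(s, 1:nchar(s), 1:nchar(s))
chars      # "H" "e" "l" "l" "o"

# Omitting 'last' takes everything from 'first' to the end of the string.
rest <- substring("Hello, the World!", 8)
rest       # "the World!"

# In str_sub, negative positions count from the end of the string,
# which is why the default end = -1L means "up to the last character".
tail_sub <- str_sub("Hello, the World!", -6, -1)
tail_sub   # "World!"
```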
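- strtrim(x, width): truncates each string in x, from the start, to the given display width. width is recycled to the length of x. A Chinese character has a display width of 2, so width must be set to twice the number of characters wanted.
- word(string, start = 1L, end = start, sep = fixed(" ")), from the stringr package: extracts words from a sentence. string is a string or a character vector; start is a numeric vector giving the position of the first word to extract; end is a numeric vector giving the position of the last word; sep is the separator between words, a single space by default.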
txt <- c("Hello, the World!","I'm Chinese", "I love China.", "I come from China!")
strtrim(txt, 7)
# [1] "Hello, " "I'm Chi" "I love " "I come "
strtrim(txt, c(1,2,3,4)) # one width per string, matched position by position
# [1] "H" "I'" "I l" "I co"
word(txt, 2)
# [1] "the" "Chinese" "love" "come"
word(txt, c(1,2)) # start is recycled to c(1, 2, 1, 2) and matched position by position
# [1] "Hello," "Chinese" "I" "come"
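The remark about Chinese characters can be illustrated as follows. This is a sketch only: display width depends on the locale and platform, so no output is asserted here. The string is written with Unicode escapes so the example does not depend on the file encoding.

```r
# Four Chinese characters ("中文字符"), written as Unicode escapes.
x <- "\u4e2d\u6587\u5b57\u7b26"

# Number of characters versus display width. In a UTF-8 locale the width is
# typically twice the character count for such characters.
nchar(x, type = "chars")
nchar(x, type = "width")

# To keep the first two characters, ask for a width of 2 * 2 columns.
trimmed <- strtrim(x, 2 * 2)
trimmed
```

When the goal is to count characters rather than display columns, substr(x, 1, 2) avoids the width calculation altogether.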
String replacement
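Although sub and gsub, and str_replace and str_replace_all, can be used for string replacement, strictly speaking R has no function that replaces a string in place. R functions receive their arguments by value rather than by reference, so they return a modified copy and leave the original untouched.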
text = c("Hellow, Adam Adam!", "Hi, Paul Adam !", "How are you, Adam, Ava.")
sub(pattern = "Adam", replacement = "world", text)
# [1] "Hellow, world Adam!" "Hi, Paul world !" "How are you, world, Ava."
gsub(pattern = "Adam", replacement = "world", text)
# [1] "Hellow, world world!" "Hi, Paul world !" "How are you, world, Ava."
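As you can see, although this is called "replacement", the original string does not change; to change the original variable, the result must be assigned back to it. The difference between sub and gsub is that sub replaces only the first match in each string (however many matches there are), while gsub replaces every match.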
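A minimal base R sketch of that point: the call returns a new vector, and the variable only changes once the result is assigned back. The name replaced is illustrative.

```r
text <- c("Hellow, Adam Adam!", "Hi, Paul Adam !", "How are you, Adam, Ava.")

# gsub returns a modified copy; 'text' itself is untouched.
replaced <- gsub(pattern = "Adam", replacement = "world", text)
text[1]      # still "Hellow, Adam Adam!"
replaced[1]  # "Hellow, world world!"

# To change the variable, assign the result back to it.
text <- gsub(pattern = "Adam", replacement = "world", text)
text[1]      # "Hellow, world world!"
```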
The stringr package has a counterpart to sub, str_replace, which replaces the first match only, and a counterpart to gsub, str_replace_all, which replaces every match.
sub(c("H"), c("I"),c("HacHgd", "aeHfgH", "defg"))
# [1] "IacHgd" "aeIfgH" "defg"
str_replace(c("HacHgd", "aeHfgH", "defg"), c("H"), c("I"))
# [1] "IacHgd" "aeIfgH" "defg"
gsub(c("H"), c("I"),c("HacHgd", "aeHfgH", "defg"))
# [1] "IacIgd" "aeIfgI" "defg"
str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H"), c("I"))
# [1] "IacIgd" "aeIfgI" "defg"
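Unlike sub and gsub, the stringr functions str_replace and str_replace_all can not only search for and replace a single pattern, but can also apply several patterns position by position, one pattern per string. (In essence this is the same mechanism: the shorter character vector is recycled to complete the matching.) The outputs below were recorded with an older stringr release; newer releases (1.5.0 and later, which follow the tidyverse recycling rules) are expected to reject vectors of mismatched length instead of recycling them with a warning.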
sub(c("H","a"), c("I", "b"),c("HacHgd", "aeHfgH", "defg")) # only "H" takes part in the search and replace
## [1] "IacHgd" "aeIfgH" "defg"
## 1.Warning in sub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'pattern' has length > 1 and only the first element will be used
## 2.Warning in sub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'replacement' has length > 1 and only the first element will be used
str_replace(c("HacHgd", "aeHfgH", "defg"), c("H","a"), c("I", "b"))
## [1] "IacHgd" "beHfgH" "defg"
## Warning in stri_replace_first_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
gsub(c("H","a"), c("I", "b"),c("HacHgd", "aeHfgH", "defg"))
## [1] "IacIgd" "aeIfgI" "defg"
## Warning in gsub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'pattern' has length > 1 and only the first element will be used
## Warning in gsub(c("H", "a"), c("I", "b"), c("HacHgd", "aeHfgH", "defg")):
## argument 'replacement' has length > 1 and only the first element will be used
str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H","a"), c("I", "b"))
## [1] "IacIgd" "beHfgH" "defg"
## Warning in stri_replace_all_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
str_replace(c("HacHgd", "aeHfgH", "defg"), c("H","a","g", "d"), c("I", "b","H","e")) # the result now has length 4
## [1] "IacHgd" "beHfgH" "defH" "HacHge"
## Warning in stri_replace_first_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
str_replace_all(c("HacHgd", "aeHfgH", "defg"), c("H","a","g", "d"), c("I", "b","H","e")) # the result now has length 4
## [1] "IacIgd" "beHfgH" "defH" "HacHge"
## Warning in stri_replace_all_regex(string, pattern,
## fix_replacement(replacement), : longer object length is not a multiple of
## shorter object length
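Position-by-position replacement can also be written in base R by pairing each string with its own pattern and replacement through mapply. This is a sketch of one possible approach, not the article's method; it uses vectors of equal length so that no recycling is involved, and the names strings, patterns, replacements, and result are illustrative.

```r
strings      <- c("HacHgd", "aeHfgH", "defg")
patterns     <- c("H", "a", "d")
replacements <- c("I", "b", "e")

# mapply calls sub(patterns[i], replacements[i], strings[i]) for each i.
# USE.NAMES = FALSE stops the patterns from being used as result names.
result <- mapply(sub, patterns, replacements, strings, USE.NAMES = FALSE)
result  # "IacHgd" "beHfgH" "eefg"
```

Replacing sub with gsub in the call gives the replace-all variant.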
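In addition, str_replace_all can replace several different patterns in one call when given a named vector whose names are the patterns and whose values are the replacements (str_replace has no such feature).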
y = c(c("I", "b"))
names(y) = c("H","a")
str_replace_all(c("HacHgd", "aeHfgH", "defg"),y)
## [1] "IbcIgd" "beIfgI" "defg"
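The multi-pattern replacement of str_replace_all can sometimes give unexpected results, because the replacements are applied one after another, each to the output of the previous one. In the swap below, every "H" first becomes "a", and then every "a" becomes "H". mgsub::mgsub produces a different result because it applies the replacements simultaneously.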
y = c(c("a", "H"))
names(y) = c("H","a")
str_replace_all(c("HacHgd", "aeHfgH", "defg"),y)
## [1] "HHcHgd" "HeHfgH" "defg"
mgsub::mgsub(c("HacHgd", "aeHfgH", "defg"), c("H","a"),c(c("a", "H")))
## [1] "aHcagd" "Heafga" "defg"
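The two results can be reproduced in base R, which makes the difference between sequential and simultaneous replacement easier to see. This is a sketch for the single-character case only: chartr translates individual characters and is not a general substitute for mgsub::mgsub. The names sequential and simultaneous are illustrative.

```r
x <- c("HacHgd", "aeHfgH", "defg")

# Sequential: first H -> a, then a -> H on the already modified strings.
# The second step also converts the "a" characters produced by the first.
sequential <- gsub("a", "H", gsub("H", "a", x))
sequential    # "HHcHgd" "HeHfgH" "defg"

# Simultaneous: chartr maps H -> a and a -> H in a single pass.
simultaneous <- chartr("Ha", "aH", x)
simultaneous  # "aHcagd" "Heafga" "defg"
```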