Working with Strings in R Using the stringr Package
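stringr is one of the core packages of the tidyverse, the widely used collection of R packages for data processing. It is a convenient tool for working with strings, and it becomes especially powerful in combination with regular expressions.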
String length
stringr functions operate on vectors. The str_length() function returns the length, in characters, of each string in a vector.
> x <- c("why", "video", "cross", "extra", "deal", "authority")
> str_length(x)
#> [1] 3 5 5 5 4 9
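Written as below, however, the call fails. This is not a syntax error: str_length() accepts a single argument, so only "why" is taken as the input and the remaining strings are rejected as unused arguments. To measure several strings, combine them into one vector with c() first.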
> str_length("why", "video", "cross", "extra", "deal", "authority")
Error in str_length("why", "video", "cross", "extra", "deal", "authority") :
unused arguments ("video", "cross", "extra", "deal", "authority")
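The correct call can be sketched as follows, assuming stringr is installed. The variable names lens and na_len are illustrative only and do not come from the original example.

```r
library(stringr)

# Wrap the strings in c() so they arrive as a single character vector
words <- c("why", "video", "cross", "extra", "deal", "authority")
lens <- str_length(words)
lens
#> [1] 3 5 5 5 4 9

# A single string is already a length-one vector, so this also works
str_length("why")
#> [1] 3

# Missing values stay missing instead of being counted
na_len <- str_length(c("why", NA))
na_len
#> [1]  3 NA
```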
String concatenation
The str_c() function concatenates strings. Its main arguments are the character vectors to join, sep = "" and collapse = NULL.
> x = c('apple', 'banana','peach')
> y = c('one', 'two', 'three')
> str_c(x, y)
# [1] "appleone" "bananatwo" "peachthree"
> str_c(x, y, sep = '_')
# [1] "apple_one" "banana_two" "peach_three"
> str_c(x, y, collapse = "_")
# [1] "appleone_bananatwo_peachthree"
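Note the difference between sep and collapse in the examples above: sep is inserted between the corresponding elements of the input vectors, so the result is still a vector of several strings, whereas collapse joins the resulting elements into a single string.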
> case1 <- str_c(x, y, sep = '_')
> str_length(case1)
# [1] 9 10 11
> case2 <- str_c(x, y, collapse = "_")
> str_length(case2)
# [1] 29
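The two arguments can also be used together. The sketch below, which assumes stringr is installed and uses illustrative variable names (joined, ids), first joins element-wise with sep and then collapses the result; it also shows that a length-one input is recycled against a longer vector.

```r
library(stringr)

x <- c("apple", "banana", "peach")
y <- c("one", "two", "three")

# sep is applied element-wise first, then collapse merges the results
joined <- str_c(x, y, sep = "_", collapse = ", ")
joined
#> [1] "apple_one, banana_two, peach_three"
length(joined)
#> [1] 1

# A length-one input is recycled to match the longer vector
ids <- str_c("id_", 1:3)
ids
#> [1] "id_1" "id_2" "id_3"
```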
String splitting
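str_split() is the stringr function for splitting strings. It splits each string at the matches of a pattern, can optionally limit the number of pieces, and returns the pieces so that specific ones can be selected.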
# Build a character vector whose fields are separated by '_'
> x <- c('aajs_123_dkks', 'ahda_236_akdk', 'ahdj_178_ajdj', 'agsh_109_auqyr', 'qwp_2635_qnjx')
> str_split(x, pattern = '_')
[[1]]
[1] "aajs" "123" "dkks"
[[2]]
[1] "ahda" "236" "akdk"
[[3]]
[1] "ahdj" "178" "ajdj"
[[4]]
[1] "agsh" "109" "auqyr"
[[5]]
[1] "qwp" "2635" "qnjx"
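The main arguments are pattern, n = Inf and simplify = FALSE. By default the return value is a list; with simplify = TRUE it is a character matrix, whose class is reported as "matrix" "array" in R 4.0 and later.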
> class(str_split(x, pattern = '_'))
[1] "list"
> str_split(x, pattern = '_', simplify = TRUE)
[,1] [,2] [,3]
[1,] "aajs" "123" "dkks"
[2,] "ahda" "236" "akdk"
[3,] "ahdj" "178" "ajdj"
[4,] "agsh" "109" "auqyr"
[5,] "qwp" "2635" "qnjx"
> class(str_split(x, pattern = '_', simplify = TRUE))
# [1] "matrix" "array"
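After splitting a character vector, a specific column can be selected for later steps. In this example the numeric column is selected with ordinary matrix subsetting.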
> str_split(x, pattern = '_', simplify = TRUE)[,2]
[1] "123" "236" "178" "109" "2635"
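The n argument and the conversion of the selected column can be illustrated with a short sketch. It assumes stringr is installed; str_split_fixed() is a stringr function not used in the text above, and the names first_split, parts and nums are illustrative.

```r
library(stringr)

x <- c("aajs_123_dkks", "ahda_236_akdk", "ahdj_178_ajdj",
       "agsh_109_auqyr", "qwp_2635_qnjx")

# n = 2 stops after the first split, leaving the rest of the string intact
first_split <- str_split(x[1], pattern = "_", n = 2)
first_split
#> [[1]]
#> [1] "aajs"     "123_dkks"

# str_split_fixed() always returns a character matrix with n columns
parts <- str_split_fixed(x, pattern = "_", n = 3)

# The pieces are still character strings; convert them before doing arithmetic
nums <- as.integer(parts[, 2])
nums
#> [1]  123  236  178  109 2635
```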
String subsetting
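str_subset() keeps the strings in a vector that match a given pattern, which can be any regular expression. Its arguments include pattern and negate. negate defaults to FALSE; when it is set to TRUE, the selection is inverted and the strings that do not match are returned.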
> x <- c("why", "video", "cross", "extra", "deal", "authority")
> str_subset(x, pattern = 'o')
[1] "video" "cross" "authority"
> str_subset(x, pattern = 'o', negate = T)
[1] "why" "extra" "deal"
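The pattern is always interpreted as a regular expression, so the way it is written matters: 'oi' matches the literal sequence "oi", while the character class '[oi]' matches either "o" or "i", as the following example shows.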
> x <- c("why", "video", "cross", "extra", "deal", "authority")
> str_subset(x, pattern = 'oi')
# character(0)
> str_subset(x, pattern = '[oi]')
[1] "video" "cross" "authority"
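Because patterns are treated as regular expressions by default, anchors work and metacharacters such as "." have a special meaning. The sketch below assumes stringr is installed; the vector files and the result names are made up for illustration. It uses stringr's fixed() helper to request a literal match.

```r
library(stringr)

x <- c("why", "video", "cross", "extra", "deal", "authority")

# ^ anchors the match to the start of the string
starts_with_vowel <- str_subset(x, pattern = "^[aeiou]")
starts_with_vowel
#> [1] "extra"     "authority"

files <- c("a.b", "axb")

# As a regular expression, "." matches any character, so both strings match
regex_dot <- str_subset(files, pattern = ".")
regex_dot
#> [1] "a.b" "axb"

# fixed() compares the pattern literally, so only the real dot matches
literal_dot <- str_subset(files, pattern = fixed("."))
literal_dot
#> [1] "a.b"
```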
String replacement
str_replace() replaces text that matches a pattern. Its arguments include pattern, the pattern to be replaced, and replacement, the text to put in its place.
> x <- c("why", "video", "cross", "extra", "deal", "authority")
> str_replace(x, 'i', '@')
[1] "why" "v@deo" "cross" "extra" "deal" "author@ty"
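Regular-expression syntax changes the result: 'ie' looks for the sequence "ie", while '[ie]' matches a single "i" or "e":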
> str_replace(x, 'ie', '@@')
[1] "why" "video" "cross" "extra" "deal" "authority"
> str_replace(x, '[ie]', '@@')
[1] "why" "v@@deo" "cross" "@@xtra" "d@@al" "author@@ty"
Note that str_replace() replaces only the first match in each string; use str_replace_all() to replace every match:
> x <- c('apple', 'happy')
> x
[1] "apple" "happy"
> str_replace(string = x, pattern = 'p', replacement = '%')
[1] "a%ple" "ha%py"
> str_replace_all(string = x, pattern = 'p', replacement = '%')
[1] "a%%le" "ha%%y"
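When several different substitutions are needed, str_replace_all() also accepts a named character vector in which the names are patterns and the values are replacements. The following is a minimal sketch assuming stringr is installed; the replacement characters and the name swapped are arbitrary choices for illustration.

```r
library(stringr)

x <- c("apple", "happy")

# Names are the patterns, values are the replacements
swapped <- str_replace_all(x, c("p" = "%", "a" = "@"))
swapped
#> [1] "@%%le" "h@%%y"
```

The substitutions are applied one after another, so a replacement that produces text matched by a later pattern would be rewritten again; the patterns here do not interact.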
In addition, str_replace_na() replaces missing values (NA) with a string:
> x <- c('one', NA,'ten', NA, 'eleven',NA)
> x
[1] "one" NA "ten" NA "eleven" NA
> str_replace_na(string = x, replacement = '%')
[1] "one" "%" "ten" "%" "eleven" "%"
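One situation where this helps is concatenation, because str_c() returns NA whenever one of its inputs is NA. The sketch below assumes stringr is installed; the names default_fill, raw and labelled, and the replacement text "missing", are illustrative.

```r
library(stringr)

x <- c("one", NA, "ten")

# Without a replacement argument, NA becomes the string "NA"
default_fill <- str_replace_na(x)
default_fill
#> [1] "one" "NA"  "ten"

# A missing input makes the concatenated result missing as well
raw <- str_c("n=", x)
raw
#> [1] "n=one" NA      "n=ten"

# Replacing NA first keeps every element
labelled <- str_c("n=", str_replace_na(x, replacement = "missing"))
labelled
#> [1] "n=one"     "n=missing" "n=ten"
```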
String padding
The str_pad() function pads strings to a given width. Its arguments include string, width, side = c("left", "right", "both") and pad = " ". For example:
> str_pad(string = letters[1:7], width = 5, side = 'left', pad = '#')
[1] "####a" "####b" "####c" "####d" "####e" "####f" "####g"
> str_pad(string = letters[1:7], width = 5, side = 'both', pad = '#')
[1] "##a##" "##b##" "##c##" "##d##" "##e##" "##f##" "##g##"
> str_pad(string = letters[1:7], width = 5, side = 'right', pad = '#')
[1] "a####" "b####" "c####" "d####" "e####" "f####" "g####"
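A common use of padding is to give identifiers a uniform width. The sketch below assumes stringr is installed and uses made-up values; it also shows that a string already longer than width is returned unchanged rather than truncated.

```r
library(stringr)

# Left-pad with zeros so every ID has three characters (side defaults to "left")
ids <- str_pad(c("7", "42", "123"), width = 3, pad = "0")
ids
#> [1] "007" "042" "123"

# Strings longer than width are left as they are
long <- str_pad("abcdef", width = 3, pad = "#")
long
#> [1] "abcdef"
```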