生信R

R-tidyverse系列-stringr字符处理

2021-09-03  本文已影响0人  小贝学生信

stringr字符处理 - 简书 (jianshu.com)
dplyr表格操作 - 简书 (jianshu.com)

在正式学习stringrdplyr为代表的tidyrverse核心系列包时,有必要先了解下正则表达式以及管道符的相关知识。

正则表达式

创建包含\(例如\d\s...)的正则表达式,需要在字符串中对\进行转移,即\\d\\s。其实只要记住\\本质上代表\即可。

匹配特殊字符示例
如果只是想匹配字符串本身的含义,fixed(正则表达式)可将正则表达式相关字符当作纯文本字符看待。
fruit = c("apple","banana","pear","pineapple")
#查找具有ABAB模式的字符串
str_view(fruit,"(.)(.)\\1\\2", match = T)

注意:默认的正则匹配方式都是“贪婪的”,即正则表达式会在符合规则的前提下匹配尽量长的字符串。通过在正则表达式后面添加一个 ? ,你可以将匹配方式更改为“懒惰的”,即匹配尽量短的字符串。

管道符%>%

?magrittr::`%>%

stringr字符处理

1、字符串向量特征匹配
2、字符串特征匹配(取子集)
3、字符串的替换修改
4、字符串格式相关
5、字符串拼接与拆分

library(stringr)

https://github.com/rstudio/cheatsheets/blob/master/strings.pdf

1、字符串向量特征匹配

对于一个字符串向量,判断其中哪些字符串是否具有给定的模式特征

fruit <- c("apple", "banana", "pear", "pinapple")
str_detect(fruit, "a") #包含a的字符串
str_detect(fruit, "a", negate = T) #不包含a的字符串
str_detect(fruit, "^a") #以a开头的字符串
str_detect(fruit, "[aeiou]") #含有aeiou其中任意一个字符的字符串
fruit <- c("apple", "banana", "pear", "pinapple")
str_starts(fruit, "p")
str_ends(fruit, "e")
# str_which == which(str_detect(x, pattern))
# str_detect == x[str_detect(x, pattern)]
fruit <- c("apple", "banana", "pear", "pinapple")
str_subset(fruit, "a")
str_which(fruit, "a")
fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, "a")
str_locate_all(fruit, "a")
fruit <- c("apple", "banana", "pear", "pineapple")
str_count(fruit, "a")
str_count(fruit, c("a", "b", "p", "p")) #向量化操作

2、字符串特征匹配(取子集)

hw <- "Hadley Wickham"
str_sub(hw, 1, 6)
str_sub(hw, -3, -1)
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract(shopping_list, "\\d")  #提取数字
str_extract(shopping_list, "[a-z]+") #提取1至多个小写字母
str_extract(shopping_list, "[a-z]{1,4}") #贪婪匹配
str_extract(shopping_list, "[a-z]{1,4}?") #懒惰匹配
str_extract(shopping_list, "\\b[a-z]{1,4}\\b") #提取特征模式的单词
str_extract(shopping_list, fixed("\\b[a-z]{1,4}\\b")) #fixed可将正则表达式视为普通字符串

#str_extract 默认只提取字符串里第一个符合的模式
#str_extract_all可以提取全部的匹配
str_extract_all(shopping_list, "[a-z]+")
str_extract_all(shopping_list, "\\b[a-z]+\\b")
str_match("bacad","b(a)")  #提取前一个字符为b的a
str_match("bacad","[^b](a)") ##提取前一个字符不为b的a
#回溯引用
str_match("banana","(a)(.)(\\1\\2)") #提取aXaX的模式文本,其中X可以是任何文本

3、字符串的替换修改

?str_replace #默认只替换第一个模式的文本
?str_replace_all #全部
fruits = c("apple","banana","pear")
str_replace(fruits, "[aeiou]", "-")
#回溯应用,下面表示将a替换为aa,e替换为ee
str_replace(fruits, "([aeiou])", "\\1\\1")

#向量化一对一替换
str_replace(fruits, "[aeiou]", c("1", "2", "3"))
str_replace_all(fruits, "[aeiou]", c("1", "2", "3"))
str_replace(fruits, c("p", "e", "a"), "-") 
str_replace_all(fruits, c("p", "e", "a"), "-") 

#  多种特定模式的替换,仅限于str_replace_all
fruits %>%
  str_c(collapse = "---") %>%
  str_replace_all(c("one" = "1", "two" = "2", "three" = "3"))

4、字符串格式相关

str_length(c("i", "like", "programming", NA))
rbind(
  str_pad("hadley", 30, "left"),
  str_pad("hadley", 30, "right"),
  str_pad("hadley", 30, "both")
)
str_pad("a", 10, pad = c("-", "_", " "))
x <- "This string is moderately long"
rbind(
  str_trunc(x, 20, "right"),
  str_trunc(x, 20, "left"),
  str_trunc(x, 20, "center")
)
str_trim("  String with trailing and leading white space\t")
str_trim("\n\nString with trailing and leading white space\n\n")
str_squish("  String with trailing,  middle, and leading white space\t")
sentences[1]
cat(str_wrap(sentences[1],width = 10))
name <- "Fred"
age <- 50
str_glue("My name is {name}, my age next year is {age + 1}")
mtcars[1:3,] %>% str_glue_data("{rownames(.)} has {hp} hp")
dog <- "The quick brown dog"
##变大写
str_to_upper(dog)
##变小写
str_to_lower(dog)
##首字母大写
str_to_title(dog)

5、字符串拼接与拆分

str_c("Letter-", letters)
str_c("Letter", letters, sep = ": ")
str_c(letters, collapse = "")
fruit <- c("apple", "pear", "banana")
str_dup(fruit, 2)
str_dup(fruit, 1:3)
fruits <- c(
  "apples and oranges and pears and bananas",
  "pineapples and mangos and guavas"
)
str_split(fruits, " and ")
str_split(fruits, " and ", simplify=T)

#限定分割得到的字符串的数量(从左到右)
str_split(fruits, " and ", n = 2)
str_split(fruits, " and ", n = 3)
str_split(fruits, " and ", n = 5)
上一篇 下一篇

猜你喜欢

热点阅读