R语言学习R语言

正则表达式(stringr包)

2021-03-06  本文已影响0人  范垂钦_92be

前言

R语言的stringr包是由大名鼎鼎的Hadley Wickham开发,是对于stringi的进一步封装。这个人有多牛,因为他数据处理和可视化开发工具方面的突出贡献,获得专为统计计算而设立的约翰·钱伯斯奖,这在当年可是让一众统计学家大呼不满。Hadley Wickham通过开发ggplot2包让人们意识到原来R语言绘图可以这么简单美观,这可是为R语言争取了不少用户,因为觉得在数据处理不够便捷,大神就写了一个目前堪称数据处理的神器tidyverse,将众多的方法串联到一起,tidyverse是他把自己所写的包整理成了一整套数据处理的方法,包括ggplot2、readr、purrr、dplyr、tidyr、stringr、forcats、reshape2等。同时还专门写了一本书《R for Data Science》,中文书名是《R数据科学》。这本书里面也详细介绍了tidyverse的使用方法。这个大佬目前是Rstudio首席科学家,中国R语言的荣光谢益辉大神也在这个公司工作。而在这里,我们主要介绍R语言的stringr包。

定义和举例

正则表达式是什么?它是一种提取文本串特征、描述文本串的方法!

  1. 首先,我们想要查找的两个字符串刚好位于括号内。
pattern = "\\(\\),"
  1. 找到了括号,这个括号里面是要有东西的,内容就是大写英文字母、小写英文字母和数字,所以我们用[A-Za-z0-9]代替。A-Z代表从A-Z的大写英文字母集合,a-z代表从a-z的小写英文字母集合。0-9代表从0-9的数字的集合。
  2. 此外,括号内的内容是大写英文字母、小写英文字母和数字中的任意两部分且数量不限,所以我们用*符号代替。*代表0或者多个。
pattern = "\\([A-Za-z0-9]*\\),"
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> string
[1] "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> pattern <- "\\([A-Za-z0-9]*\\)"
> pattern
[1] "\\([A-Za-z0-9]*\\)"
> str_match(string,pattern)
     [,1]             
[1,] "(Chlamydomonas)"
> str_match_all(string,pattern)
[[1]]
     [,1]             
[1,] "(Chlamydomonas)"
[2,] "(IFT80)"        
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> pattern <- "\\([A-Za-z0-9]*\\)"
> a <- str_match_all(string,pattern) # str_match_all函数输出结果是列表
> pattern <- "\\(|\\)"
> str_remove_all(a[[1]], pattern)
[1] "Chlamydomonas" "IFT80"        

For str_match, a character matrix. First column is the complete match, followed by one column for each capture group. For str_match_all, a list of character matrices.

> pattern = "\\([A-Za-z0-9]*\\)"
> pattern2 = "\\(([A-Za-z0-9]*)\\)"
> str_match_all(string,pattern2)
[[1]]
     [,1]              [,2]           
[1,] "(Chlamydomonas)" "Chlamydomonas"
[2,] "(IFT80)"         "IFT80"        
> str_match_all(string,pattern2)[[1]][,2]
[1] "Chlamydomonas" "IFT80"

正则表达式

①.R中的正则表达式模式有三种:

1、扩展正则表达式:默认方式(第一种是最常用的);
2、Perl风格正则表达式:设置参数perl = TRUE;
3、字面意义正则表达式:设置参数fixed = TRUE。

②.R中的基本元字符如下:(这些字符的含义与Python一样)

. \ | ( ) [ ] ^ $ * + ?

.     表示任意字符,包括换行符;
\     表示对字符进行转义,即恢复它本来的含义。但在R中,\中也是字符,所以转义\是要用\\;
|     表示匹配,举个例子,A|B,表示对A或B其中一个匹配,A匹配成功则不匹配B;
()    字符组,括号中的模式作为一个整体进行匹配
[]    字符集合,括号内的任意字符将被匹配
^     匹配字符串开头。举个例子,^MT-表示匹配开头含有M、T这两个字母的字符串(常见于单细胞测序中线粒体基因的匹配)。但如果加了[]并且位于首位,则表示反义。例如[^6],则表示匹配所有不是6的字符
$     匹配字符串结尾。但将它置于[]内则消除了它的特殊含义。例如[akm$],表示匹配’a’,’k’,’m’或者’$’。

数量词:* + ? {m} {m,n} {m,}

*      前一个规则匹配0或无限次
+      前一个规则匹配1或无限次
?      前一个规则匹配0或1次,也常用语非贪婪模式中
{m}    前一个规则匹配m次
{m,n}  前一个规则匹配m~n次,尽可能多
{m,}   前一个规则匹配m次以上,尽可能多
③.R中的转义

如果我们想查找元字符本身,如?*,我们需要提前告诉编译系统,取消这些字符的特殊含义。这个时候,就需要用到转义字符\,即使用\?\.当然,如果我们要找的是\,则使用\\进行匹配。

④.R中预定义的字符组
代码 含义说明
[:digit:] 数字:0-9
[:lower:] 小写字母:a-z
[:upper:] 大写字母:A-Z
[:alpha:] 字母:a-z及A-Z
[:alnum:] 所有字母及数字
[:punct:] 标点符号,如. , ;
[:graph:] Graphical characters,即[:alnum:]和[:punct:]
[:blank:] 空字符,即:Space和Tab
[:space:] Space,Tab,newline,及其他space characters
[:print:] 可打印的字符,即:[:alnum:],[:punct:]和[:space:]
⑤.R中代表字符组的特殊符号
代码 含义说明
\w 字符串,等价于[:alnum:]
\W 非字符串,等价于[^[:alnum:]]
\s 空格字符,等价于[:blank:]
\S 非空格字符,等价于[^[:blank:]]
\d 数字,等价于[:digit:]
\D 非数字,等价于[^[:digit:]]
\b Word edge(单词开头或结束的位置)
\B No Word edge(非单词开头或结束的位置)
\< Word beginning(单词开头的位置)
\> Word end(单词结束的位置)

stringr包

①.stringr包安装
#第一种方法是从CRAN上安装发行版:
install.packages("stringr")
#第二种方法是从github上安装最新的版本,可测试最新的功能:
install.packages("devtools")
devtools::install_github("tidyerse/stringr")
#第三种方法是直接安装tidyverse包,它会顺便就把stringr包安装上
install.packages("tidyverse")
②.stringr包函数

stringr包里面的函数主要分为6大类,包括:

  1. 字符串匹配函数:str_detect、str_which、str_count、str_locate、str_locate_all、str_view、str_view_all
  2. 字符串截取函数:str_sub、str_subset、str_extract、str_extract_all、str_match、str_match_all
  3. 字符串长度控制函数:str_length、str_pad、str_trunc、str_trim、str_squish
  4. 字符串变化函数:str_replace、str_replace_all、str_replace_na、str_to_lower、str_to_upper、str_remove、str_remove_all
  5. 字符串拼接/切割函数:str_c、str_dup、str_split、str_split_fixed
  6. 字符串排序函数:str_sort、str_order

接下来,我们将逐个演示这些函数的使用方法。

1. 字符串匹配函数

str_detect可以检测pattern是否包括在某个字符串中,并返回TRUE和FALSE

> x <- c("apple","banana","pear")
> str_detect(x,"a")
[1] TRUE TRUE TRUE

str_count检测pattern是否包括在某个字符串中的数目

> x <- c("apple","banana","pear")
> str_count(x,"a")
[1] 1 3 1

str_which告诉pattern的索引位置

> x <- c("apple","banana","pear")
> str_which(x,"a")
[1] 1 2 3
> str_which(x,"ar")
[1] 3
> str_which(x,"an")
[1] 2
> str_which(x,"ap")
[1] 1

str_locatestr_locate_all返回pattern的开始和终止位置;
区别是str_locate只返回字符串里面的首个匹配到的pattern;
str_locate_all返回字符串里面的所有匹配到的pattern;

> x <- c("apple","banana","pear")
> str_locate(x,"a")
     start end
[1,]     1   1
[2,]     2   2
[3,]     3   3
> str_locate(x,"an")
     start end
[1,]    NA  NA
[2,]     2   3
[3,]    NA  NA
> str_locate_all(x,"a")
[[1]]
     start end
[1,]     1   1

[[2]]
     start end
[1,]     2   2
[2,]     4   4
[3,]     6   6

[[3]]
     start end
[1,]     3   3

> str_locate_all(x,"an")
[[1]]
     start end

[[2]]
     start end
[1,]     2   3
[2,]     4   5

[[3]]
     start end

str_viewstr_view_all函数都可以以可视化的方式,返回字符串中匹配到的pattern;

> x <- c("apple","banana","pear")
> str_view(x,"a")
image.png
> x <- c("apple","banana","pear")
> str_view_all(x,"a")
image.png
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> pattern <- "\\([A-Za-z0-9]*\\)"
> str_view_all(string,pattern)
image.png
2. 字符串截取函数

str_sub在给定起始和终止参数的基础上对字符串进行截取或者替换

> x <- c("apple","banana","pear")
> str_sub(x,1,3)
[1] "app" "ban" "pea"
> # 负号表示从后往前数
> str_sub(x,-3,-1)
[1] "ple" "ana" "ear"
> # "a"替换截取出来的字符串,此时原本的x会发生改变
> str_sub(x,1,3) <- "a"
> x
[1] "ale"  "aana" "ar"

str_subset返回pattern所在的字符串

> x <- c("apple","banana","pear")
> str_subset(x,"ap")
[1] "apple"
> str_subset(x,"an")
[1] "banana"
> str_subset(x,"a")
[1] "apple"  "banana" "pear"  
># negate = T时,返回不匹配的字符串
> str_subset(x,"ap",negate = T)
[1] "banana" "pear" 

str_extract函数返回每个字符串中首个匹配到的pattern
str_extract_all函数返回每个字符串中所有匹配到的patternstr_extract_all函数中simplify默认为False,默认返回list;当simplify为True,则返回matrix

> x <- c("apple","banana","pear")
> str_extract(x,"a")
[1] "a" "a" "a"
> str_extract_all(x,"a")
[[1]]
[1] "a"

[[2]]
[1] "a" "a" "a"

[[3]]
[1] "a"

> str_extract_all(x,"a",simplify = T)
     [,1] [,2] [,3]
[1,] "a"  ""   ""  
[2,] "a"  "a"  "a" 
[3,] "a"  ""   ""  

str_match函数返回每个字符串中首个匹配到的pattern,以matrix的形式呈现
str_match_all函数返回每个字符串中所有匹配到的pattern,以list的形式呈现

> x <- c("apple","banana","pear")
> str_match(x,"a")
     [,1]
[1,] "a" 
[2,] "a" 
[3,] "a" 
> str_match_all(x,"a")
[[1]]
     [,1]
[1,] "a" 

[[2]]
     [,1]
[1,] "a" 
[2,] "a" 
[3,] "a" 

[[3]]
     [,1]
[1,] "a" 
3. 字符串长度控制函数

str_length函数可以计算字符串的长度

> x <- c("apple","banana","pear")
> str_length(x)
[1] 5 6 4

str_pad函数可以填充字符

> str_pad(c("a", "abc", "abcdef"), 10,pad="a")
[1] "aaaaaaaaaa" "aaaaaaaabc" "aaaaabcdef"
> str_pad(c("a", "abc", "abcdef"), 10,pad="k")
[1] "kkkkkkkkka" "kkkkkkkabc" "kkkkabcdef"
> str_pad(c("a", "abc", "abcdef"), 10,side="left",pad="k")
[1] "kkkkkkkkka" "kkkkkkkabc" "kkkkabcdef"
> str_pad(c("a", "abc", "abcdef"), 10,side="right",pad="k")
[1] "akkkkkkkkk" "abckkkkkkk" "abcdefkkkk"
> str_pad(c("a", "abc", "abcdef"), 10,side="both",pad="k")
[1] "kkkkakkkkk" "kkkabckkkk" "kkabcdefkk"
> str_pad(c("a", "abc", "abcdef"), 5,pad="k")
[1] "kkkka"  "kkabc"  "abcdef"
> str_pad(c("a", "abc", "abcdef"), 1,pad="k")
[1] "a"      "abc"    "abcdef"
> str_pad(c("aa", "abc", "abcdef"), 1,pad="k")
[1] "aa"     "abc"    "abcdef"
> str_pad(c("a", "abc", "abcdef"), c(1, 2, 3),pad="k")
[1] "a"      "abc"    "abcdef"
> str_pad(c("a", "abc", "abcdef"), c(2, 4, 7),pad="k")
[1] "ka"      "kabc"    "kabcdef"
> str_pad(c("a", "abc", "abcdef"), c(2, 4, 7),pad=c("k","l","m"))
[1] "ka"      "labc"    "mabcdef"

str_trim函数去除字符串的空白部分

> str_trim("  String with trailing and leading white space\t")
[1] "String with trailing and leading white space"
> str_trim("\n\nString with trailing and leading white space\n\n")
[1] "String with trailing and leading white space"

str_squish函数作用和str_trim函数作用一致,但除了去除字符串前、后的空格,它还可以去除字符串中间出现的重复的空格。这一点上,str_trim函数无法办到。

> str_trim("\n\nString with excess,  trailing and leading white   space\n\n")
[1] "String with excess,  trailing and leading white   space"
> str_squish("\n\nString with excess,  trailing and leading white   space\n\n")
[1] "String with excess, trailing and leading white space"

str_trunc函数可以把字符串切割到指定长度

> x <- "This string is moderately long"
> str_trunc(x, 20, "right")
[1] "This string is mo..."
> str_trunc(x, 20, "left")
[1] "...s moderately long"
> str_trunc(x, 20, "center")
[1] "This stri...ely long"
4. 字符串变化函数

str_replace函数可以替换pattern为新的字符,仅限于第一个匹配到的
str_replace_all函数可以替换所有匹配到的pattern
str_replace_na 可以将缺失值替换成‘NA’,这样na.omit函数就无法将缺失值删除了

> x <- c("apple","banana","pear")
> # 把 a 替换成 k 
> str_replace(x,"a","k")
[1] "kpple"  "bknana" "pekr"  
> str_replace_all(x,"a","k")
[1] "kpple"  "bknknk" "pekr" 

> x <- c(NA, "abc", "def")
> x
[1] NA    "abc" "def"
> is.na(x)
[1]  TRUE FALSE FALSE
> table(is.na(x))
FALSE  TRUE 
    2     1
> na.omit(x)
[1] "abc" "def"
attr(,"na.action")
[1] 1
attr(,"class")
[1] "omit"
> str_replace_na(x)
[1] "NA"  "abc" "def"
> x <- str_replace_na(x)
> x
[1] "NA"  "abc" "def"
> na.omit(x)
[1] "NA"  "abc" "def"

str_replacestr_replace_all函数中,replacement可以用\1, \2中表示模式中的捕获

> str_replace_all(c("123,456", "011"), 
+                 "([[:digit:]]+),([[:digit:]]+)", "\\2,\\1")
[1] "456,123" "011"

str_to_upper函数可以将小写字母转成大写字母
str_to_lower函数可以将大写字母转成小写字母

> x <- c("apple","banana","pear")
> str_to_upper(x)
[1] "APPLE"  "BANANA" "PEAR"  
> str_to_lower(x)
[1] "apple"  "banana" "pear"

str_remove可以移除字符串中首个匹配到的pattern
str_remove_all可以移除字符串中所有匹配到的pattern

> fruits <- c("one apple", "two pears", "three bananas")
> str_remove(fruits, "[aeiou]")
[1] "ne apple"     "tw pears"     "thre bananas"
> str_remove_all(fruits, "[aeiou]")
[1] "n ppl"    "tw prs"   "thr bnns"
5. 字符串拼接/切割函数

str_c函数可以拼接多个字符串

> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
[22] "v" "w" "x" "y" "z"
> str_c(letters,letters,sep = "-")
 [1] "a-a" "b-b" "c-c" "d-d" "e-e" "f-f" "g-g" "h-h" "i-i" "j-j" "k-k" "l-l" "m-m" "n-n"
[15] "o-o" "p-p" "q-q" "r-r" "s-s" "t-t" "u-u" "v-v" "w-w" "x-x" "y-y" "z-z"
> str_c(letters,letters,collapse = "-")
[1] "aa-bb-cc-dd-ee-ff-gg-hh-ii-jj-kk-ll-mm-nn-oo-pp-qq-rr-ss-tt-uu-vv-ww-xx-yy-zz"

str_dup函数可以复制字符串

> fruit <- c("apple", "pear", "banana")
> str_dup(fruit, 2)
[1] "appleapple"   "pearpear"     "bananabanana"
> str_dup(fruit, 1:3)
[1] "apple"              "pearpear"           "bananabananabanana"
> str_c("ba", str_dup("na", 0:5))
[1] "ba"           "bana"         "banana"       "bananana"     "banananana"  
[6] "bananananana"

str_split按照pattern分割字符串

str_split_fixed按照pattern将字符串分割成指定个数

> fruits <- c(
+     "apples and oranges and pears and bananas",
+     "pineapples and mangos and guavas"
+ )
> 
> str_split(fruits, " and ")
[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    

> str_split(fruits, " and ", simplify = TRUE)
     [,1]         [,2]      [,3]     [,4]     
[1,] "apples"     "oranges" "pears"  "bananas"
[2,] "pineapples" "mangos"  "guavas" ""       
> 
> # Specify n to restrict the number of possible matches
> str_split(fruits, " and ", n = 3)
[[1]]
[1] "apples"            "oranges"           "pears and bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    

> str_split(fruits, " and ", n = 2)
[[1]]
[1] "apples"                        "oranges and pears and bananas"

[[2]]
[1] "pineapples"        "mangos and guavas"

> # If n greater than number of pieces, no padding occurs
> str_split(fruits, " and ", n = 5)
[[1]]
[1] "apples"  "oranges" "pears"   "bananas"

[[2]]
[1] "pineapples" "mangos"     "guavas"    

> # Use fixed to return a character matrix
> str_split_fixed(fruits, " and ", 3)
     [,1]         [,2]      [,3]               
[1,] "apples"     "oranges" "pears and bananas"
[2,] "pineapples" "mangos"  "guavas"           
> str_split_fixed(fruits, " and ", 4)
     [,1]         [,2]      [,3]     [,4]     
[1,] "apples"     "oranges" "pears"  "bananas"
[2,] "pineapples" "mangos"  "guavas" ""       
6. 字符串排序函数

str_order函数和str_sort函数都可以对字符串进行排序,两者之前的区别在于前者返回排序后的索引(下标),而后者返回排序后的实际值

> x <- c("a", "cc", "bbb", "dddd")
> str_sort(x)
[1] "a"    "bbb"  "cc"   "dddd"
> str_order(x)
[1] 1 3 2 4
stringr重要函数总结
字符串匹配函数:str_detect、str_which、str_view、str_view_all
字符串截取函数:str_subset、str_extract、str_extract_all、str_match、str_match_all
字符串变化函数:str_replace、str_replace_all、str_remove、str_remove_all
字符串拼接/切割函数:str_c、str_split、str_split_fixed
函数 功能说明 R Base中对应函数 是否允许正则表达式
str_detect 检测pattern是否包括在某个字符串中 grepl
str_which 告诉pattern的索引位置
str_view 可视化字符串中首个pattern匹配到的位置
str_view_all 可视化字符串中所有pattern匹配到的位置
str_subset 返回pattern所在的字符串
str_extract 返回每个字符串中首个匹配到的pattern regmatches
str_extract_all 返回每个字符串中所有匹配到的pattern regmatches
str_match 返回每个字符串中首个匹配到的pattern
str_match_all 返回每个字符串中所有匹配到的pattern
str_replace 替换每个字符串首个匹配到的pattern sub
str_replace_all 替换每个字符串所有匹配到的pattern gsub
str_remove 移除字符串中首个匹配到的pattern
str_remove_all 移除字符串中所有匹配到的pattern
str_c 可以拼接多个字符串 paste或paste0
str_split 按照pattern分割字符串 strsplit
str_split_fixed 按照pattern将字符串分割成指定个数
  1. 例如,使用str_split函数可以顺利将括号内的东西提取出来
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> str_split(string,"\\(|\\)",simplify = T)
     [,1]                                                [,2]            [,3] [,4]   
[1,] "Homo sapiens intraflagellar transport 80 homolog " "Chlamydomonas" " "  "IFT80"
     [,5]    
[1,] ", mRNA"
> str_split(string,"\\(|\\)",simplify = T)[,c(2,4)]
[1] "Chlamydomonas" "IFT80"        
  1. 例如,使用str_exact函数和str_remove_all函数可以顺利将括号内的东西提取出来
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> str_extract_all(string,"\\([A-Za-z0-9]*\\)")
[[1]]
[1] "(Chlamydomonas)" "(IFT80)"
> a <- str_extract_all(string,"\\([A-Za-z0-9]*\\)")
> str_remove_all(a[[1]], pattern)
[1] "Chlamydomonas" "IFT80"        
  1. 你甚至可以用srt_replace_all函数提取括号内的字符
> string <- "Homo sapiens intraflagellar transport 80 homolog (Chlamydomonas) (IFT80), mRNA"
> pattern <- "([:print:]+)(\\()([:print:]+)(\\))([:print:]+)(\\()([:print:]+)(\\))([:print:]+)"
> str_replace_all(string, pattern,"\\3")
[1] "Chlamydomonas"
> str_replace_all(string, pattern,"\\7")
[1] "IFT80"
参考:

R 正则表达式
R语言与正则表达式
原来是它!正则表达式揪出生信分析中没有报错的内鬼错误
R语言教程
R for Data Science

上一篇下一篇

猜你喜欢

热点阅读