String Processing (Basics + stringr & stringi)
2021-06-08
Hayley's Notes
String processing is closely tied to regular expressions; see: Regular Expressions in R
1. Basic string processing
Create a character vector
x <- c('huake','wuda')
1.1 nchar(): count the characters in each string
nchar(x)
# [1] 5 4
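⚠️ Note the difference between nchar() and length(): length(x) returns 2 (the vector holds two strings). str_length() from the stringr package, however, gives the same result as nchar()
length(x)
# [1] 2
library(stringr) # str_length() is provided by stringr
str_length(x)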
# [1] 5 4
1.2 Case conversion
toupper(): lowercase to uppercase
tolower(): uppercase to lowercase
toupper('huake')
# [1] "HUAKE"
tolower('WUDA')
# [1] "wuda"
1.3 paste() and paste0(): concatenating strings
paste()
stringa <- LETTERS[1:5]
STRINGB <- 1:5
paste(stringa,STRINGB)
# [1] "A 1" "B 2" "C 3" "D 4" "E 5"
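# the sep argument sets the separator placed between the pasted arguments
paste(stringa,STRINGB,sep='-')
# [1] "A-1" "B-2" "C-3" "D-4" "E-5"
# the collapse argument joins all resulting elements into a single string, using the given separator
paste(stringa,STRINGB,collapse ='-')
# [1] "A 1-B 2-C 3-D 4-E 5"
paste0() (the 0 means there is no separator between the pasted pieces)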
paste0(stringa,STRINGB)
# [1] "A1" "B2" "C3" "D4" "E5"
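# paste0() has no sep argument: sep='-' is treated as one more string to paste, so it lands at the end of each element rather than in the middle; collapse still joins the elements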
paste0(stringa,STRINGB,sep='-')
# [1] "A1-" "B2-" "C3-" "D4-" "E5-"
paste0(stringa,STRINGB,collapse ='-')
# [1] "A1-B2-C3-D4-E5"
Calling paste() with sep="" gives the same result as paste0()
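As a small illustration of how sep and collapse work together, here is a minimal sketch. The variable names `letters5` and `numbers5` are made up for this example. sep joins the pieces within each element, and collapse then joins the resulting elements into one string.

```r
letters5 <- LETTERS[1:5]
numbers5 <- 1:5

# sep is applied within each element, collapse between the elements
combined <- paste(letters5, numbers5, sep = "-", collapse = "+")
print(combined)
# [1] "A-1+B-2+C-3+D-4+E-5"

# paste() with sep = "" and paste0() give identical results
same <- identical(paste(letters5, numbers5, sep = ""),
                  paste0(letters5, numbers5))
print(same)
# [1] TRUE
```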
1.4 strsplit(): splitting strings
The result of a split is a list
stringC <- paste(stringa,STRINGB,sep='/')
stringC
# [1] "A/1" "B/2" "C/3" "D/4" "E/5"
M <- strsplit(stringC,split = '/')
M
# [[1]]
# [1] "A" "1"
# [[2]]
# [1] "B" "2"
# [[3]]
# [1] "C" "3"
# [[4]]
# [1] "D" "4"
# [[5]]
# [1] "E" "5"
class(M)
# [1] "list"
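Because strsplit() returns a list, a common follow-up is to pull one piece out of every element or to bind the pieces into a matrix. The sketch below shows one way to do that in base R; the vector `codes` is an assumption for illustration, and binding with rbind only makes sense when every element splits into the same number of pieces.

```r
codes <- c("A/1", "B/2", "C/3")
# fixed = TRUE treats "/" as a literal string rather than a regular expression
parts <- strsplit(codes, split = "/", fixed = TRUE)

# Take the first piece of every list element
first_piece <- sapply(parts, `[`, 1)
print(first_piece)
# [1] "A" "B" "C"

# Bind the pieces row by row into a character matrix
piece_matrix <- do.call(rbind, parts)
print(dim(piece_matrix))
# [1] 3 2
```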
1.5 substr(): extracting substrings
# extract characters 2 to 4
stringd <- c('python','java','ruby','php','huazhongda')
sub_str <- substr(stringd,start = 2,stop = 4)
sub_str
# [1] "yth" "ava" "uby" "hp" "uaz"
# Besides extracting, substr() also supports assignment: replace characters 2 to 4 with aaa
substr(stringd,start = 2,stop = 4) <- 'aaa'
stringd
# [1] "paaaon" "jaaa" "raaa" "paa" "haaahongda"
1.6 grep() and grepl()
Used for matching more complex strings
# Syntax
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
Create a vector
seq_names <- c('EU_FRA02_C1_S2008','AF_COM12_B0_2004','AF_COM17_F0_S2008',
'AS_CHN11_C3_2004','EU-FRA-C3-S2007','NAUSA02E02005',
'AS_CHN12_N0_05','NA_USA03_C2_S2007','NA USA04 A3 2004',
'EU_UK01_A0_2009','eu_fra_a2_s98','SA/BRA08/B0/1996')
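# The names mix upper and lower case, slashes and underscores, confirmed and uncertain years, and so on.
# Take the first one: EU is Europe, FRA is France, 02 is the second sequence from France, C1 is the sequence subtype, 2008 is the sample collection year, and S marks 2008 as an estimated rather than confirmed value.
Use grep() to pick out the French entries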
fra_seq <- grep(pattern = 'FRA|fra',x=seq_names)
fra_seq
# [1] 1 5 11
seq_names[fra_seq]
# [1] "EU_FRA02_C1_S2008" "EU-FRA-C3-S2007" "eu_fra_a2_s98"
# Alternatively, set value = TRUE to return the matching elements themselves
fra_seq <- grep(pattern = 'FRA|fra',x=seq_names,value = TRUE)
fra_seq
# [1] "EU_FRA02_C1_S2008" "EU-FRA-C3-S2007" "eu_fra_a2_s98"
# Set ignore.case = TRUE to ignore case (a single-case pattern is then enough)
grep(pattern = 'fra',x=seq_names,value = TRUE,ignore.case = TRUE)
# [1] "EU_FRA02_C1_S2008" "EU-FRA-C3-S2007" "eu_fra_a2_s98"
# The patterns used here are regular expressions
grepl() returns TRUE or FALSE for each element
grepl(pattern = 'FRA|fra',x=seq_names)
# [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
# Subset with []
seq_names[grepl(pattern = 'FRA|fra',x=seq_names)]
# [1] "EU_FRA02_C1_S2008" "EU-FRA-C3-S2007" "eu_fra_a2_s98"
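✨ Exercise: extract the sequences in the vector above that have a confirmed collection year.
(Approach: find the sequences with an uncertain year, i.e. those with s or S in front of the year, then negate.)
spe_seq <- seq_names[!grepl(pattern = '[sS][0-9]{2,4}\\b',seq_names)]
spe_seq
# [1] "AF_COM12_B0_2004" "AS_CHN11_C3_2004" "NAUSA02E02005" "AS_CHN12_N0_05"
# [5] "NA USA04 A3 2004" "EU_UK01_A0_2009" "SA/BRA08/B0/1996"
# In an R string, \\ stands for a single backslash, so \\b is the regex \b, which matches a word boundary; placed at the right-hand end, it requires the digits to end the word.
# [sS] matches either s or S (inside a character class no | is needed; [s|S] would also match a literal |), [0-9] matches one digit from 0 to 9, and {2,4} right after [0-9] means 2 to 4 digits.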
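A note on character classes: inside [], the | character is a literal, not an "or". The sketch below illustrates this and re-runs the year check on a shortened, three-element version of the vector (the name `ids` is an assumption for this example).

```r
# Inside a character class, | is matched literally
pipe_in_class <- grepl("[s|S]", "a|b")
print(pipe_in_class)
# [1] TRUE
pipe_not_in_class <- grepl("[sS]", "a|b")
print(pipe_not_in_class)
# [1] FALSE

ids <- c("EU_FRA02_C1_S2008", "AF_COM12_B0_2004", "eu_fra_a2_s98")
# TRUE where an s/S is followed by 2 to 4 digits that end the word
uncertain <- grepl("[sS][0-9]{2,4}\\b", ids)
print(uncertain)
# [1]  TRUE FALSE  TRUE
print(ids[!uncertain])
# [1] "AF_COM12_B0_2004"
```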
1.7 gsub() and sub()
# Syntax
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
money <- c('$1888','$2888','$3888')
# Because of the dollar sign, as.numeric() cannot be applied directly (it returns NA with a warning)
as.numeric(money)
# [1] NA NA NA
gsub()
# $ is a regex metacharacter (end of string), so it cannot be used as is; escape it with \\ first, then convert with as.numeric().
money1 <- gsub('\\$',replacement = '',money)
money1
# [1] "1888" "2888" "3888"
as.numeric(money1)
# [1] 1888 2888 3888
gsub() replaces every match it finds
sub() replaces only the first match in each string
sub('\\$',replacement = '',money)
# [1] "1888" "2888" "3888"
money <- c('$1888 $2888 $3888')
sub('\\$',replacement = '',money)
# [1] "1888 $2888 $3888"
gsub('\\$',replacement = '',money)
# [1] "1888 2888 3888"
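The same idea extends to amounts that also contain thousands separators. A minimal sketch with made-up values in `prices`: inside a character class, $ is literal and needs no escaping, so one gsub() call can strip both symbols before conversion.

```r
prices <- c("$1,888", "$2,888.50", "$3,888")  # made-up example values

# Remove every "$" and "," in one pass; inside [] the $ is a literal character
cleaned <- gsub("[$,]", "", prices)

values <- as.numeric(cleaned)
print(values)
# [1] 1888.0 2888.5 3888.0
```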
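1.8 regexpr(), gregexpr() and regexec()
The three work very similarly: regexpr() reports the first match in each string, gregexpr() reports all matches, and regexec() reports the first match together with any capture groups
# Syntax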
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
regexec(pattern, text, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
Taking regexpr() as an example:
# Find where pp occurs in each element of test_string
test_string <- c('happy','apple','application','apolotoc')
regexpr('pp',test_string)
# [1] 3 2 2 -1
# attr(,"match.length")
# [1] 2 2 2 -1
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
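# The returned 3 2 2 -1 means: in the first string pp starts at position 3; in the second and third it starts at position 2; in the last it was not found, so -1 is returned.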
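The match positions are usually a means to an end; base R's regmatches() turns them back into the matched text. The following is a small sketch of how the three functions can be paired with it (the vector `words` and the sample ID "EU_FRA02" are chosen for illustration).

```r
words <- c("happy", "apple", "application", "apolotoc")

# regexpr(): first match per element; regmatches() keeps only elements that matched
first_hits <- regmatches(words, regexpr("pp", words))
print(first_hits)
# [1] "pp" "pp" "pp"

# gregexpr(): every match; here the positions of each "p" in "apple"
all_p <- as.integer(gregexpr("p", "apple")[[1]])
print(all_p)
# [1] 2 3

# regexec(): the full match followed by the capture groups
groups <- regmatches("EU_FRA02",
                     regexec("([A-Z]+)_([A-Z]+)([0-9]+)", "EU_FRA02"))[[1]]
print(groups)
# [1] "EU_FRA02" "EU"       "FRA"      "02"
```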
1.9 agrep() and agrepl()
Taking agrep() as an example
string1 <- c('I need a favour','my favorite sport','you made an error')
agrep('favor',string1)
# [1] 1 2
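agrep() performs approximate (fuzzy) matching, so the British spelling favour and the American spelling favor are both matched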
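How fuzzy the match is can be controlled with the max.distance argument of agrep(). The sketch below, using base R only, first measures the edit distance between the two spellings with adist() and then compares a strict and a more tolerant setting; the variable names are assumptions for this example.

```r
phrases <- c("I need a favour", "my favorite sport", "you made an error")

# Edit distance between the two spellings: one inserted "u"
d <- adist("favor", "favour")[1, 1]
print(d)
# [1] 1

# max.distance = 0 allows no edits, so only an exact occurrence of "favor" matches
exact_only <- agrep("favor", phrases, max.distance = 0)
print(exact_only)
# [1] 2

# Allowing one edit also matches "favour"
one_edit <- agrep("favor", phrases, max.distance = 1)
print(one_edit)
# [1] 1 2
```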
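2. The stringr and stringi packages
stringr and stringi offer similar functionality (stringr is itself built on top of stringi); stringi is more powerful but relies more heavily on regular expressions.
# List the functions in the two packages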
library(stringr)
library(stringi)
ls('package:stringr')
ls('package:stringi')
# At the time of writing, stringr had 52 functions and stringi had 252; the counts depend on the installed versions.
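2.1 The stringr package ⚠️
Common functions | Purpose |
---|---|
str_split() / str_c() | Split and combine strings |
str_length() | Count the characters in a string |
str_sub() | Extract characters by position |
str_dup() | Repeat (duplicate) strings |
str_trim() | Remove whitespace from both ends of a string |
str_to_upper() / str_to_lower() / str_to_title() | Case conversion |
str_locate() | Locate a pattern in a string |
str_detect() | Pattern detection, returns logical values |
str_extract() / str_extract_all() | Extract matches |
str_remove() / str_remove_all() | Remove matches |
str_replace() / str_replace_all() | Replace matches |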
- 2.1.1 str_c() and str_split()
str_c() is similar to paste(); its default separator is the empty string, as in paste0()
library(stringr)
str_c('a','b')
# [1] "ab"
str_c('a','b',sep='-')
# [1] "a-b"
str_split()⚠️
x <- "The birch canoe slid on the smooth planks."
x
# [1] "The birch canoe slid on the smooth planks."
str_split(x," ") # returns a list
# [[1]]
# [1] "The" "birch" "canoe" "slid" "on"
# [6] "the" "smooth" "planks."
str_split(x," ")[[1]] # take the first list element to get a character vector
# [1] "The" "birch" "canoe" "slid" "on"
# [6] "the" "smooth" "planks."
y = c("john 150","mike 140","lucy 152")
str_split(y," ")
# [[1]]
# [1] "john" "150"
# [[2]]
# [1] "mike" "140"
# [[3]]
# [1] "lucy" "152"
str_split(y," ",simplify = TRUE) # simplify = TRUE returns a character matrix ⚠️
# [,1] [,2]
# [1,] "john" "150"
# [2,] "mike" "140"
# [3,] "lucy" "152"
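The matrix form is convenient when the pieces should become columns. A minimal sketch, assuming the stringr package is installed; the names `records`, `pieces` and `scores` are made up for this example. Note that the matrix is of type character, so numeric columns have to be converted explicitly.

```r
library(stringr)

records <- c("john 150", "mike 140", "lucy 152")
pieces <- str_split(records, " ", simplify = TRUE)  # 3 x 2 character matrix

# Build a data frame; the second column is converted from character to numeric
scores <- data.frame(name = pieces[, 1], score = as.numeric(pieces[, 2]))
print(mean(scores$score))
# [1] 147.3333
```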
- 2.1.2 str_length()
Counts the characters in each string, similar to nchar()
- 2.1.3 str_sub(): extracting characters by position
aaa <- 'huake tongji cardio'
str_sub(aaa,c(1,4,8),c(2,7,11))
# [1] "hu" "ke t" "ongj"
# The first piece is characters 1 to 2, the second is characters 4 to 7, the third is characters 8 to 11
- 2.1.4 str_dup
fruit <- c('apple','pear','banana')
str_dup(fruit,2) # 2 means each string is repeated twice
# [1] "appleapple" "pearpear" "bananabanana"
str_dup(fruit,2:4)
# [1] "appleapple" "pearpearpear" "bananabananabananabanana"
- 2.1.5 str_trim(): removing whitespace from both ends of a string
string <- c(' Huake is good ')
string
# [1] " Huake is good "
str_trim(string,side = 'both')
# [1] "Huake is good"
- 2.1.6 str_locate(): locating a pattern in a string
fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, "$")
# start end
# [1,] 6 5
# [2,] 7 6
# [3,] 5 4
# [4,] 10 9
str_locate(fruit, "a")
# start end
# [1,] 1 1
# [2,] 2 2
# [3,] 3 3
# [4,] 5 5
str_locate(fruit, c("a", "b", "p", "p"))
# start end
# [1,] 1 1
# [2,] 1 1
# [3,] 1 1
# [4,] 1 1
- 2.1.7 str_detect(): pattern detection ⚠️
fruit <- c("apple", "banana", "pear", "pinapple")
str_detect(fruit, "a")
# [1] TRUE TRUE TRUE TRUE
str_detect(fruit, "^a")
# [1] TRUE FALSE FALSE FALSE
str_detect(fruit, "a$")
# [1] FALSE TRUE FALSE FALSE
str_detect(fruit, "b")
# [1] FALSE TRUE FALSE FALSE
str_detect(fruit, "[aeiou]")
# [1] TRUE TRUE TRUE TRUE
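❗️ Combining str_detect() with ifelse() splits strings into two classes according to whether they contain a given pattern. This is commonly used in analyses of GEO data and the like, to decide from the sample name whether a sample is a normal sample or a disease (e.g. tumor) sample.
Usage:
ifelse(str_detect(colnames(a), 'tumor'), 'tumor', 'normal')
# For each column name of data frame a: returns 'tumor' if the name contains tumor, otherwise 'normal'.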
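A runnable version of this pattern might look like the sketch below. The data frame `expr` and its column names are hypothetical stand-ins for a real expression table with one column per sample; stringr is assumed to be installed.

```r
library(stringr)

# Hypothetical expression table: one column per sample
expr <- data.frame(tumor_1  = c(5.1, 3.2),
                   normal_1 = c(4.8, 3.0),
                   tumor_2  = c(6.0, 2.9))

# Label each sample by whether its column name contains "tumor"
group <- ifelse(str_detect(colnames(expr), "tumor"), "tumor", "normal")
print(group)
# [1] "tumor"  "normal" "tumor"
```

The resulting `group` vector can then be turned into a factor and used as the grouping variable in downstream analysis.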
- 2.1.8 str_extract() and str_extract_all()
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract(shopping_list, "\\d")
# [1] "4" NA NA "2"
str_extract(shopping_list, "[a-z]+")
# [1] "apples" "bag" "bag" "milk"
str_extract(shopping_list, "[a-z]{1,4}")
# [1] "appl" "bag" "bag" "milk"
str_extract(shopping_list, "\\b[a-z]{1,4}\\b")
# [1] NA "bag" "bag" "milk"
str_extract_all(shopping_list, "[a-z]+")
# [[1]]
# [1] "apples" "x"
# [[2]]
# [1] "bag" "of" "flour"
# [[3]]
# [1] "bag" "of" "sugar"
# [[4]]
# [1] "milk" "x"
- 2.1.9 str_remove() and str_remove_all()
fruits <- c("one apple", "two pears", "three bananas")
str_remove(fruits, "[aeiou]")
# [1] "ne apple" "tw pears" "thre bananas"
str_remove_all(fruits, "[aeiou]")
# [1] "n ppl" "tw prs" "thr bnns"
- 2.1.10 str_replace() and str_replace_all()
fruits <- c("one apple", "two pears", "three bananas")
str_replace(fruits, "[aeiou]", "-")
# [1] "-ne apple" "tw- pears" "thr-e bananas"
str_replace_all(fruits, "[aeiou]", "-")
# [1] "-n- -ppl-" "tw- p--rs" "thr-- b-n-n-s"
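Two further forms of replacement are worth knowing, shown here as a short sketch that assumes stringr is installed: a backreference in the replacement reuses the matched text, and a named character vector passed to str_replace_all() applies several pattern/replacement pairs at once.

```r
library(stringr)

fruits <- c("one apple", "two pears", "three bananas")

# \\1 refers to the first capture group: double the first vowel
doubled <- str_replace(fruits, "([aeiou])", "\\1\\1")
print(doubled)
# [1] "oone apple"     "twoo pears"     "threee bananas"

# A named vector maps each pattern (name) to its replacement (value)
numbered <- str_replace_all(fruits, c("one" = "1", "two" = "2", "three" = "3"))
print(numbered)
# [1] "1 apple"   "2 pears"   "3 bananas"
```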
2.2 The stringi package
- 2.2.1 stri_join: concatenating strings
stri_join(1:13, letters)
# [1] "1a" "2b" "3c" "4d" "5e" "6f" "7g" "8h" "9i" "10j" "11k"
# [12] "12l" "13m" "1n" "2o" "3p" "4q" "5r" "6s" "7t" "8u" "9v"
# [23] "10w" "11x" "12y" "13z"
stri_join(1:13, letters, sep=',')
# [1] "1,a" "2,b" "3,c" "4,d" "5,e" "6,f" "7,g" "8,h" "9,i" "10,j"
# [11] "11,k" "12,l" "13,m" "1,n" "2,o" "3,p" "4,q" "5,r" "6,s" "7,t"
# [21] "8,u" "9,v" "10,w" "11,x" "12,y" "13,z"
stri_join(1:13, letters, collapse='; ')
# [1] "1a; 2b; 3c; 4d; 5e; 6f; 7g; 8h; 9i; 10j; 11k; 12l; 13m; 1n; 2o; 3p; 4q; 5r; 6s; 7t; 8u; 9v; 10w; 11x; 12y; 13z"
- 2.2.2 stri_cmp_eq and stri_cmp_neq
stri_cmp_eq tests whether two strings are exactly equal
stri_cmp_neq tests whether two strings differ
stri_cmp_eq('AB','AB')
# [1] TRUE
stri_cmp_eq('AB','aB')
# [1] FALSE
stri_cmp_neq('AB','aB')
# [1] TRUE
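- 2.2.3 stri_cmp_lt and stri_cmp_gt
stri_cmp_lt: less than
stri_cmp_gt: greater than
Strings are compared character by character in collation (dictionary) order: among digits the larger digit is greater, and among letters the one later in the alphabet is greater. By default, strings of digits are compared as text, not by their numeric value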
stri_cmp_lt('121','221')
# [1] TRUE
stri_cmp_lt('a121','b221')
# [1] TRUE
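To illustrate that the default comparison is textual rather than numeric, here is a minimal sketch, assuming stringi is installed. The `numeric = TRUE` setting is a collator option documented in stri_opts_collator() that makes runs of digits compare by their numeric value.

```r
library(stringi)

# Default: "9" is compared with "1" first, so "9" is not less than "10"
text_order <- stri_cmp_lt("9", "10")
print(text_order)
# [1] FALSE

# With the numeric collator option, digit runs are compared as numbers
numeric_order <- stri_cmp_lt("9", "10", numeric = TRUE)
print(numeric_order)
# [1] TRUE
```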
- 2.2.4 stri_count
s <- 'Lorem ipsum dolor sit amet, consectetur adipisicing elit.'
stri_count(s, fixed='dolor')
# [1] 1
stri_count(s, regex='\\p{L}+')
# [1] 8
- 2.2.5 stri_dup
stri_dup('a', 1:5)
# [1] "a" "aa" "aaa" "aaaa" "aaaaa"
stri_dup(c('a', NA, 'ba'), 4)
# [1] "aaaa" NA "babababa"
stri_dup(c('abc', 'pqrst'), c(4, 2))
# [1] "abcabcabcabc" "pqrstpqrst"
- 2.2.6 stri_detect_fixed
stri_detect_fixed(c('stringi R', 'R STRINGI', '123'), c('i', 'R', '0'))
# [1] TRUE TRUE FALSE
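The search is vectorized: each string in the first argument is searched for the corresponding pattern in the second; the result is TRUE if it is found and FALSE otherwise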
- 2.2.7 stri_detect_regex
# Find the strings that start with ab, and those that end with t
stri_detect_regex(c('above','abort','about','abnormal','abandon'),'^ab')
# [1] TRUE TRUE TRUE TRUE TRUE
stri_detect_regex(c('above','abort','about','abnormal','abandon'),'t\\b')
# [1] FALSE TRUE TRUE FALSE FALSE
# case_insensitive=TRUE ignores case
stri_detect_regex(c('ABove','abort','About','aBnormal','abandon'),'^ab',case_insensitive=TRUE)
# [1] TRUE TRUE TRUE TRUE TRUE
- 2.2.8 stri_startswith_fixed: testing whether a string starts with a given pattern
stri_startswith_fixed(c('a1','a2','b3','a4','c5'),'a1')
# [1] TRUE FALSE FALSE FALSE FALSE
stri_startswith_fixed(c('abaDc','asdfh','abiude'),'ba',from=2)
# [1] TRUE FALSE FALSE
# from sets the character position at which matching starts
- 2.2.9 stri_endswith_fixed: testing whether a string ends with a given pattern
stri_endswith_fixed(c('abaDc','asdfh','abiudba'),'ba')
# [1] FALSE FALSE TRUE
stri_endswith_fixed(c('abaDc','asdfh','abiudba'),'ba',to=3)
# [1] TRUE FALSE FALSE
# to sets the character position at which the match must end
- 2.2.10 stri_extract_all
stri_extract_all('XaaaaX', regex=c('\\p{Ll}', '\\p{Ll}+', '\\p{Ll}{2,3}', '\\p{Ll}{2,3}?'))
# [[1]]
# [1] "a" "a" "a" "a"
# [[2]]
# [1] "aaaa"
# [[3]]
# [1] "aaa"
# [[4]]
# [1] "aa" "aa"
stri_extract_all('Bartolini', coll='i')
# [[1]]
# [1] "i" "i"
stri_extract_all('stringi is so good!', charclass='\\p{Zs}') # all white-spaces
# [[1]]
# [1] " " " " " "
- 2.2.11 stri_extract_all_fixed
The overlap=TRUE argument allows matches to overlap one another
stri_extract_all_fixed('abaBAba', 'Aba', case_insensitive=TRUE)
# [[1]]
# [1] "aba" "Aba"
stri_extract_all_fixed('abaBAba', 'Aba', case_insensitive=TRUE, overlap=TRUE)
# [[1]]
# [1] "aba" "aBA" "Aba"
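- 2.2.12 stri_extract_all_boundaries: splitting text at boundaries
In this example the text is split at the positions following whitespace, so the extracted pieces keep their trailing spaces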
stri_extract_all_boundaries('stringi: THE string processing package 123.48...')
# [[1]]
# [1] "stringi: " "THE " "string "
# [4] "processing " "package " "123.48..."
- 2.2.13 stri_extract_all_words: extracting words
stri_extract_all_words('stringi: THE string processing package 123.48...')
# [[1]]
# [1] "stringi" "THE" "string" "processing"
# [5] "package" "123.48"
- 2.2.14 stri_isempty: testing whether each string is empty
Note: a string consisting of a space does not count as empty
stri_isempty(letters[1:3])
# [1] FALSE FALSE FALSE
stri_isempty(c(',', '', 'abc', '123', '\u0105\u0104'))
# [1] FALSE TRUE FALSE FALSE FALSE
stri_isempty(character(1))
# [1] TRUE
- 2.2.15 stri_locate_all: locating matches, i.e. finding every position at which the pattern occurs in the string
stri_locate_all('Bartolini', fixed='i')
# [[1]]
# start end
# [1,] 7 7
# [2,] 9 9