R data cleaning, bioinformatics study

stringr: functions for working with strings

2022-01-30  萍智医信

Setup: install the R package and load the data

rm(list = ls())
if(!require(stringr))install.packages('stringr')
library(stringr)
x <- "The birch canoe slid on the smooth planks."

1. Checking string length

length(x)
str_length(x)
str_length(" ")
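length(x) returns 1 because x is a vector with a single element, whereas str_length(x) returns the number of characters in that element (42). The last line shows that a space also counts as one character.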
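To make the difference between the two functions concrete, here is a small sketch. The vector `words` is made up for illustration and is not part of the original data.

```r
library(stringr)

x <- "The birch canoe slid on the smooth planks."

# length() counts the elements of the vector, not the characters
length(x)          # 1
# str_length() counts the characters of each element, spaces included
str_length(x)      # 42

# Hypothetical extra vector: str_length() returns one value per element
words <- c("a", "bb", "", " ")
str_length(words)  # 1 2 0 1
```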

2. Splitting and joining strings

str_split(x," ")
class(str_split(x," "))

As the output shows, splitting turns the character vector into a list. The vector can be recovered by subsetting the list.

x2 = str_split(x," ")[[1]]
class(x2)
x2

With the following code, the split produces a matrix instead.

str_split(x," ",simplify = TRUE)
class(str_split(x," ",simplify = TRUE))
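Why a list by default? A likely reason is that the input may hold several strings, each yielding a different number of pieces. The sketch below uses a hypothetical two-element vector `s` to show both return shapes.

```r
library(stringr)

# Hypothetical two-element input, used only for illustration
s <- c("a b c", "d e")

parts <- str_split(s, " ")
class(parts)    # "list"
length(parts)   # 2, one list element per input string
parts[[2]]      # "d" "e"

# simplify = TRUE pads the shorter row with "" to form a character matrix
m <- str_split(s, " ", simplify = TRUE)
dim(m)          # 2 3
m[2, ]          # "d" "e" ""
```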

Next, we join the split pieces back together.

x2
str_c(x2,collapse = " ")
str_c(x2,1234,sep = "+")
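The two arguments do different jobs: `sep` is placed between the arguments element by element, while `collapse` joins the resulting elements into a single string. A minimal sketch, using a shortened, hypothetical version of `x2`:

```r
library(stringr)

# Shortened, hypothetical version of x2
x2 <- c("The", "birch", "canoe")

# sep: pastes the arguments together element by element
str_c(x2, 1234, sep = "+")
# "The+1234" "birch+1234" "canoe+1234"

# collapse: joins all elements into one string
str_c(x2, collapse = " ")
# "The birch canoe"

# Both together: first sep, then collapse
str_c(x2, 1234, sep = "+", collapse = "; ")
# "The+1234; birch+1234; canoe+1234"
```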

3. Extracting part of a string

x
str_sub(x,5,9)

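The result is "birch": positions 1 to 3 hold "The" and position 4 holds the space, so the space clearly counts as one character.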
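The positions can be checked one slice at a time. The sketch below also shows negative positions, which `str_sub()` counts from the end of the string.

```r
library(stringr)

x <- "The birch canoe slid on the smooth planks."

str_sub(x, 1, 3)    # "The"
str_sub(x, 4, 4)    # " "  (the space occupies position 4)
str_sub(x, 5, 9)    # "birch"

# Negative positions count backwards from the last character
str_sub(x, -7, -1)  # "planks."
```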

4. Case conversion

# convert everything to upper case
str_to_upper(x2)
# convert everything to lower case
str_to_lower(x2)
# capitalize the first letter of every word
str_to_title(x2)
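The same functions also work on the unsplit sentence `x`, which makes the effect of `str_to_title()` on each word easier to see. A minimal sketch:

```r
library(stringr)

x <- "The birch canoe slid on the smooth planks."

str_to_upper(x)  # "THE BIRCH CANOE SLID ON THE SMOOTH PLANKS."
str_to_lower(x)  # "the birch canoe slid on the smooth planks."
str_to_title(x)  # "The Birch Canoe Slid On The Smooth Planks."
```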

5. Sorting strings

x2
str_sort(x2)

The elements are sorted alphabetically, following the order of the 26 English letters.
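`str_sort()` also takes `decreasing` and `numeric` arguments. The sketch below uses two made-up vectors, `fruit` and `ids`, to show how they change the order.

```r
library(stringr)

# Hypothetical vectors, used only for illustration
fruit <- c("banana", "cherry", "apple")
ids   <- c("x10", "x2", "x1")

str_sort(fruit)                     # "apple" "banana" "cherry"
str_sort(fruit, decreasing = TRUE)  # "cherry" "banana" "apple"

# Plain alphabetical order compares digit by digit
str_sort(ids)                       # "x1" "x10" "x2"
# numeric = TRUE sorts embedded numbers by value
str_sort(ids, numeric = TRUE)       # "x1" "x2" "x10"
```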

6. Detecting patterns

str_detect(x2,"h")
str_starts(x2,"T")
str_ends(x2,"e")
Combined with sum and mean, str_detect gives the number and the proportion of matches.
str_detect(x2,"h")
sum(str_detect(x2,"h"))
mean(str_detect(x2,"h"))

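Why does mean(str_detect(x2,"h")) return 0.5? The logical vector returned by str_detect(x2,"h") is first coerced to a numeric vector, with TRUE as 1 and FALSE as 0. Four of the eight values are 1, so the mean is 4/8 = 0.5. In other words, TRUE makes up 50% of the vector: half of the elements of x2 contain an h.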
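The arithmetic can be traced step by step. This is a minimal sketch; `has_h` is a name introduced here for readability.

```r
library(stringr)

x  <- "The birch canoe slid on the smooth planks."
x2 <- str_split(x, " ")[[1]]

has_h <- str_detect(x2, "h")
has_h              # TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
as.numeric(has_h)  # 1 1 0 0 0 1 1 0

sum(has_h)                  # 4 elements contain an h
length(has_h)               # 8 elements in total
sum(has_h) / length(has_h)  # 0.5, the same value as mean(has_h)
```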

7. Extracting matching strings

x2
# Method 1
str_subset(x2,"h")
# Method 2
x2[str_detect(x2,"h")]
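A quick check that the two methods agree. The `negate = TRUE` argument used at the end assumes a reasonably recent stringr (1.4.0 or later); it returns the elements that do not match.

```r
library(stringr)

x  <- "The birch canoe slid on the smooth planks."
x2 <- str_split(x, " ")[[1]]

m1 <- str_subset(x2, "h")
m2 <- x2[str_detect(x2, "h")]
m1                 # "The" "birch" "the" "smooth"
identical(m1, m2)  # TRUE

# Assumes stringr >= 1.4.0: keep the elements WITHOUT an h
str_subset(x2, "h", negate = TRUE)  # "canoe" "slid" "on" "planks."
```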

8. Counting characters

x
str_count(x," ")

This counts the spaces in x; there are 7.

x2
str_count(x2,"o")

This gives the number of o's in each element of the x2 vector.

str_count(x)
length(x)
x
str_count(x2)
length(x2)
x2
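The last block calls `str_count()` without a pattern. In that case the default pattern is the empty string, so each character is counted, which gives the same numbers as `str_length()`. `length()`, by contrast, counts vector elements. A minimal sketch:

```r
library(stringr)

x  <- "The birch canoe slid on the smooth planks."
x2 <- str_split(x, " ")[[1]]

str_count(x)    # 42, the number of characters in x
length(x)       # 1, x has a single element

str_count(x2)   # 3 5 5 4 2 3 6 7, characters per word
length(x2)      # 8, the number of words

# With no pattern, str_count() agrees with str_length()
all(str_count(x2) == str_length(x2))  # TRUE
```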

9. String replacement

x2
str_replace(x2,"o","A")
str_replace_all(x2,"o","A")
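The difference between the two functions only shows up in elements with more than one match, such as "smooth": `str_replace()` changes the first match in each element, `str_replace_all()` changes every match. A minimal sketch:

```r
library(stringr)

# str_replace() replaces only the first "o"
str_replace("smooth", "o", "A")      # "smAoth"
# str_replace_all() replaces every "o"
str_replace_all("smooth", "o", "A")  # "smAAth"

# Elements without a match are returned unchanged
str_replace_all(c("canoe", "birch"), "o", "A")  # "canAe" "birch"
```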

------------------------------------------Exercise----------------------------------------

#Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community.
#1. Assign the sentence above to tmp as one long string
tmp = "Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community."
#2. Split it into a vector of words and assign it to tmp2 (watch the punctuation)
library(stringr)
tmp2 = tmp %>% 
  str_replace(","," ") %>%
  str_remove("[.]") %>% 
  str_split(" ")
tmp2 = tmp2[[1]]
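As an alternative sketch (not the original solution), the comma can be handled inside the split pattern itself: the regular expression `[ ,]+` splits on any run of spaces or commas. The name `tmp2_alt` is introduced here for illustration.

```r
library(stringr)

tmp <- "Bioinformatics is a new subject of genetic data collection,analysis and dissemination to the research community."

# Remove the full stop, then split on runs of spaces or commas
tmp2_alt <- tmp %>%
  str_remove_all("[.]") %>%
  str_split("[ ,]+")
tmp2_alt <- tmp2_alt[[1]]

length(tmp2_alt)  # 16 words
tmp2_alt[9:10]    # "collection" "analysis"
```

This should give the same 16-word vector as the step-by-step solution above.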

Reference: 生信技能树, teacher 小洁
