R String Processing, Part 1

2016-09-02  一刀YiDao

Counting and Conversion

1. Counting characters: nchar()

> x<-c("R","is","funny")
> nchar(x)
[1] 1 2 5
> length("")
[1] 1
> nchar("")
[1] 0
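
The transcript above contrasts length() and nchar() without comment. As a minimal sketch (the variable names are illustrative only): length() counts the elements of a vector, while nchar() counts the characters in each element, which is why an empty string has length 1 but 0 characters.

```r
x <- c("R", "is", "funny")

n_elements  <- length(x)       # number of elements in the vector
n_chars     <- nchar(x)        # number of characters in each element
total_chars <- sum(nchar(x))   # characters across all elements

empty_len   <- length("")      # "" is still a vector with one element
empty_chars <- nchar("")       # ...but that element has no characters

print(n_elements)
print(n_chars)
print(total_chars)
```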

2. Converting to lowercase: tolower()

> DNA <- "AtGCtttACC"
> tolower(DNA)
[1] "atgctttacc"

3. Converting to uppercase: toupper()

> DNA <- "AtGCtttACC"
> toupper(DNA)
[1] "ATGCTTTACC"

4. Character translation: chartr(old, new, x)

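chartr("A","B",x): replaces every A in the string x with B. The translation is character by character: the i-th character of the first argument is mapped to the i-th character of the second, so "Tt" -> "Bb" below turns T into B and t into b.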

> DNA <- "AtGCtttACC"
> chartr("Tt","Bb",DNA)
[1] "AbGCbbbACC"
> chartr("Tt","BB",DNA)
[1] "ABGCBBBACC"
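
Because chartr() applies all of its mappings in a single pass, it suits swaps that chained replacements would clobber. A small sketch, using DNA base complementing as an assumed use case (it is not part of the original example):

```r
dna <- "AtGCtttACC"

# Map each base to its complement in one pass, preserving case.
complement <- chartr("ATGCatgc", "TACGtacg", dna)

# Reverse the complement: split into characters, reverse, and rejoin.
reverse_complement <- paste(rev(strsplit(complement, "")[[1]]), collapse = "")

print(complement)
print(reverse_complement)
```

Doing the same with successive gsub() calls would go wrong: after replacing A with T, a later step replacing T with A would undo it.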

String Concatenation

5. String concatenation: paste()

> paste("Var",1:5,sep="")
[1] "Var1" "Var2" "Var3" "Var4" "Var5"

> x<-list(a='aaa',b='bbb',c="ccc")
> y<-list(d="163.com",e="qq.com")
> paste(x,y,sep="@")
[1] "aaa@163.com" "bbb@qq.com"  "ccc@163.com"

# Add the collapse argument to join the results into a single string, using the given separator
> paste(x,y,sep="@",collapse=';')
[1] "aaa@163.com;bbb@qq.com;ccc@163.com"
> paste(x,collapse=';')
[1] "aaa;bbb;ccc"
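
The roles of sep and collapse are easy to mix up, so here is a short sketch with made-up inputs: sep goes between the arguments of each element-wise combination, while collapse joins the resulting vector into one string. Shorter vectors are recycled, as in the e-mail example above.

```r
ids <- 1:3

with_sep   <- paste("Var", ids, sep = "")   # element-wise, no separator
with_paste0 <- paste0("Var", ids)           # paste0() is paste() with sep = ""

# The shorter vector is recycled to the length of the longer one.
recycled <- paste(c("a", "b", "c"), c("x", "y"), sep = "-")

# collapse joins the elements of the result into a single string.
collapsed <- paste(c("a", "b", "c"), collapse = "+")
both      <- paste(c("a", "b"), c("x", "y"), sep = "-", collapse = "+")

print(with_sep)
print(recycled)
print(collapsed)
print(both)
```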

String Splitting

6. Splitting strings: strsplit()

Syntax: strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

> text<-"today is a \nnice day!"
> text
[1] "today is a \nnice day!"

> strsplit(text," ")
[[1]]
[1] "today"  "is"     "a"      "\nnice" "day!"

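# \\s matches any whitespace character, including the newline \n; the empty string in the result comes from the space and the newline being adjacent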
> strsplit(text,'\\s')
[[1]]
[1] "today" "is"    "a"     ""      "nice"  "day!"
> class(strsplit(text, '\\s'))
[1] "list"

> strsplit(text,"")
[[1]]
 [1] "t"  "o"  "d"  "a"  "y"  " "  "i"  "s"  " "  "a"  " "  "\n" "n"  "i"  "c"  "e"  " "  "d"  "a" 
[20] "y"  "!"
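
Two details of strsplit() follow from the examples above and are sketched here under simple assumptions: the split argument is a regular expression unless fixed = TRUE, and the result is a list with one element per input string, so [[1]] is needed to get at the pieces of a single string.

```r
text <- "today is a \nnice day!"

# "\\s+" treats a run of whitespace as one separator, so no "" appears.
words <- strsplit(text, "\\s+")[[1]]

# "." as a regex matches every character, leaving only empty strings.
regex_dot <- strsplit("a.b.c", ".")[[1]]

# fixed = TRUE treats the split string literally.
literal_dot <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]

print(words)
print(regex_dot)
print(literal_dot)
```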

String Searching

7. Searching strings: grep(), grepl()

Syntax

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, 
     fixed = FALSE, useBytes = FALSE, invert = FALSE) 
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, 
      fixed = FALSE, useBytes = FALSE) 

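The difference between the two: grep() returns the indices of the matching elements, while grepl() returns a logical vector with one TRUE/FALSE per element. (files is a character vector of file names; its definition is not shown here.)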

> grep("\\.r$",files)
 [1]  3  5  8  9 10 11 12 16 18 20 22 24 25 26 29

> grepl("\\.r$",files)
 [1] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
[17] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
[33] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Both give the same result when used to subset the vector

> files[grep("\\.r$",files)]
 [1] "agricolae.r"             "cluster.r"               "gam_model.r"            
 [4] "GBM.r"                   "gbm_model.r"             "gbm_model1.r"           
 [7] "GBM1.r"                  "item-based CF推荐算法.r" "MASS_e107_rpart.r"      
[10] "PankRange.r"             "quantmod.r"              "recommenderlab.r"       
[13] "Rplot.r"                 "rules.r"                 "survival.r" 

> files[grepl("\\.r$",files)]
 [1] "agricolae.r"             "cluster.r"               "gam_model.r"            
 [4] "GBM.r"                   "gbm_model.r"             "gbm_model1.r"           
 [7] "GBM1.r"                  "item-based CF推荐算法.r" "MASS_e107_rpart.r"      
[10] "PankRange.r"             "quantmod.r"              "recommenderlab.r"       
[13] "Rplot.r"                 "rules.r"                 "survival.r"
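
The two functions stop being interchangeable when nothing matches and the result is used for exclusion. The sketch below uses a small hypothetical files vector (not the author's directory listing) to show value = TRUE, ignore.case = TRUE, and that pitfall.

```r
# Hypothetical file names, for illustration only.
files <- c("notes.txt", "model.r", "Plot.R", "data.csv", "utils.r")

idx   <- grep("\\.r$", files)                  # positions of the matches
flags <- grepl("\\.r$", files)                 # one logical per element
names_only <- grep("\\.r$", files, value = TRUE)
any_case   <- grep("\\.r$", files, ignore.case = TRUE, value = TRUE)

# Pitfall: with no match, grep() returns integer(0), and negative
# indexing with it drops everything instead of nothing.
no_hit <- grep("\\.py$", files)
wrong  <- files[-no_hit]                       # character(0)
right  <- files[!grepl("\\.py$", files)]       # all five names

print(idx)
print(any_case)
print(length(wrong))
print(length(right))
```

For exclusion, the logical form with !grepl() (or grep(..., invert = TRUE)) is the safer choice.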

8. Searching strings: regexpr(), gregexpr(), regexec()

> text<-c("Hello, Adam","Hi,Adam!","How are you,Adam")
> text
[1] "Hello, Adam"      "Hi,Adam!"         "How are you,Adam"
> regexpr("Adam",text)
[1]  8  4 13
attr(,"match.length")
[1] 4 4 4
attr(,"useBytes")
[1] TRUE
> gregexpr("Adam",text)
[[1]]
[1] 8
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 4
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 13
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE

> regexec("Adam",text)
[[1]]
[1] 8
attr(,"match.length")
[1] 4

[[2]]
[1] 4
attr(,"match.length")
[1] 4

[[3]]
[1] 13
attr(,"match.length")
[1] 4
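
In the transcript above every string contains "Adam" exactly once, so the three functions look alike. A sketch with slightly different, made-up input shows where they differ: regexpr() reports only the first match per string, gregexpr() reports all of them, and regexec() also reports capture groups. regmatches() turns the positions into the matched text.

```r
text <- c("Hello, Adam", "Hi,Adam!", "Adam met Adam")

first_pos <- regexpr("Adam", text)      # first match in each string
all_pos   <- gregexpr("Adam", text)     # every match, as a list

first_txt <- regmatches(text, first_pos)
all_txt   <- regmatches(text, all_pos)

# regexec() returns the full match followed by each parenthesised group.
groups <- regmatches("Hello, Adam", regexec("(\\w+), *(\\w+)", "Hello, Adam"))[[1]]

# A string without a match is reported as -1.
missing <- regexpr("Eve", "Hi,Adam!")

print(as.integer(first_pos))
print(as.integer(all_pos[[3]]))
print(groups)
```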

String Replacement

9. Replacing strings: sub(), gsub()

> text<-c("Hello, Adam","Hi,Adam!","How are you,Ava")
> sub(pattern="Adam",replacement="word",text)
[1] "Hello, word"     "Hi,word!"        "How are you,Ava"
> sub(pattern="Adam|Ava",replacement="word",text)
[1] "Hello, word"      "Hi,word!"         "How are you,word"
> gsub(pattern="Adam|Ava",replacement="word",text)
[1] "Hello, word"      "Hi,word!"         "How are you,word"
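
In the example above sub() and gsub() return the same result because each string contains at most one match. A minimal sketch with illustrative strings where they diverge: sub() replaces only the first match in each element, gsub() replaces all of them. It also shows backreferences and fixed = TRUE.

```r
first_only <- sub("a", "A", "banana")    # only the first "a" is replaced
every_one  <- gsub("a", "A", "banana")   # every "a" is replaced

# \\1 and \\2 refer to the parenthesised groups of the pattern.
swapped <- sub("(\\w+), (\\w+)", "\\2 \\1", "Adam, Hello")

# The pattern is a regex by default, so "." matches any character.
regex_dot   <- gsub(".", "-", "a.b")
literal_dot <- gsub(".", "-", "a.b", fixed = TRUE)

print(first_only)
print(every_one)
print(swapped)
print(regex_dot)
print(literal_dot)
```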

String Extraction

substr(x, start, stop) 
substring(text, first, last = 1000000L) 
> x <- "123456789" 
> substr(x, c(2,4), c(4,5,8)) 
[1] "234" 
> substring(x, c(2,4), c(4,5,8)) 
[1] "234"     "45"      "2345678" 

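Because x is a vector of length 1, substr returns only one substring:
of the second and third argument vectors, only the first pair is used, i.e. start position 2 and stop position 4.
In the substring call, the longest of the three arguments is c(4,5,8). Following the rule that shorter vectors are recycled, the first argument effectively becomes c(x,x,x)
and the second becomes c(2,4,2), so the extracted substrings cover positions 2-4, 4-5 and 2-8.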
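
The recycling behaviour described above makes substring() handy for cutting several pieces out of one string. The sketch below, with illustrative variable names, shows a sliding window, the default last argument, and the replacement form of substr().

```r
x <- "123456789"

# Start and stop vectors of equal length give one piece per pair.
windows <- substring(x, 1:3, 3:5)

# With last omitted, substring() runs to the end of the string.
tail_part <- substring(x, 7)

# substr() can also be used on the left-hand side to overwrite a range.
y <- x
substr(y, 2, 4) <- "abc"

print(windows)
print(tail_part)
print(y)
```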
