R Basics 5 (Strings)
2020-02-20
多啦A梦的时光机_648d
Missing values
R represents missing values with NA. Real data often contains NA values that need to be removed before analysis.
> a = c(NA,1:49)
> a
[1] NA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[19] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
[37] 36 37 38 39 40 41 42 43 44 45 46 47 48 49
> sum(a)
[1] NA
> mean(a)
[1] NA
> sum(a,na.rm = TRUE) ## many functions have an option for removing NA values
[1] 1225
- Check whether a dataset contains missing values (is.na())
> is.na(a)
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[28] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[46] FALSE FALSE FALSE FALSE FALSE
- Remove missing values from a dataset (na.omit())
> b = c(NA,1:7,NA,NA)
> c = na.omit(b)
> c
[1] 1 2 3 4 5 6 7
attr(,"na.action")
[1] 1 9 10
attr(,"class")
[1] "omit"
> b
[1] NA 1 2 3 4 5 6 7 NA NA
> is.na(c)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
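When na.omit() is applied to a data frame, it deletes every row that contains an NA.
Dropping rows this way can discard a lot of data, so many packages exist for handling missing values in other ways.
Handling missing values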
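The row-wise behaviour of na.omit() on a data frame can be sketched as follows. The data frame `df` and its columns are hypothetical, made up only for illustration; complete.cases() and colSums() are base R functions.

```r
# A small hypothetical data frame with NA in two different rows
df <- data.frame(id    = 1:4,
                 score = c(90, NA, 75, 60),
                 grade = c("A", "B", NA, "D"))

# na.omit() drops every row that contains at least one NA
clean <- na.omit(df)
nrow(df)      # 4
nrow(clean)   # 2 (only rows 1 and 4 survive)

# complete.cases() gives the logical mask of rows without any NA
complete.cases(df)   # TRUE FALSE FALSE TRUE

# Per-column NA counts help judge how much data would be lost
colSums(is.na(df))
```

Here two NA values in different columns cost half of the rows, which is why dropping rows is often too blunt for real data.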
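- Other special values
NaN: "not a number", an impossible value such as the result of 0/0; use is.nan() to detect it.
Inf: infinity, either positive (Inf) or negative (-Inf); use is.infinite() to detect it.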
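A minimal sketch of how these special values behave under the base R test functions. The vector `x` is just an illustrative example.

```r
# 0/0 gives NaN, 1/0 gives Inf, -1/0 gives -Inf
x <- c(1, NA, 0/0, 1/0, -1/0)

is.na(x)        # FALSE  TRUE  TRUE FALSE FALSE  (NaN also counts as NA)
is.nan(x)       # FALSE FALSE  TRUE FALSE FALSE  (NA is not NaN)
is.infinite(x)  # FALSE FALSE FALSE  TRUE  TRUE
is.finite(x)    #  TRUE FALSE FALSE FALSE FALSE
```

Note that is.na() is TRUE for NaN, while is.nan() is FALSE for NA, so the two tests are not interchangeable.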
Strings
Note: string literals must always be wrapped in quotes, either "" or ''.
- Counting characters
> nchar('hello world')
[1] 11
> month.name
[1] "January" "February" "March" "April" "May" "June"
[7] "July" "August" "September" "October" "November" "December"
> nchar(month.name)
[1] 7 8 5 5 3 4 4 6 9 7 8 8
length() returns the number of elements in a vector, whereas nchar() returns the number of characters in each element.
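The difference is easiest to see side by side. The vector `words` below is a made-up example.

```r
words <- c("R", "string", "")

length(words)          # 3: three elements in the vector
nchar(words)           # 1 6 0: characters in each element

length("hello world")  # 1: a single string is a vector of length one
nchar("hello world")   # 11: the space counts as a character
```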
- paste() joins strings
> paste('I','love','you',sep = '-')
[1] "I-love-you"
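When a vector is pasted with a single string, the string is not appended to the end of the vector; instead it is joined to each element of the vector in turn.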
> names = c('yao','jia','ling')
> paste(names,'shi pig')
[1] "yao shi pig" "jia shi pig" "ling shi pig"
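A short sketch of the related options, assuming the same kind of input: `sep` controls what goes between the pieces of each result, while `collapse` merges the whole result into one string. The vector name `people` is only illustrative; paste0() is the base R shorthand for paste(..., sep = "").

```r
people <- c('yao', 'jia', 'ling')

# sep is placed between the arguments, element by element
paste(people, 'pig', sep = '_')   # "yao_pig" "jia_pig" "ling_pig"

# collapse joins the elements of the result into a single string
paste(people, collapse = ', ')    # "yao, jia, ling"

# paste0() joins without any separator
paste0('id', 1:3)                 # "id1" "id2" "id3"
```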
- substr() extracts substrings
substr(string, start, stop) returns the characters from position start through position stop, inclusive.
> substr(month.name,1,3)
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep"
[10] "Oct" "Nov" "Dec"
toupper() converts to upper case and tolower() converts to lower case.
> temp = substr(month.name,1,3)
> toupper(temp)
[1] "JAN" "FEB" "MAR" "APR" "MAY" "JUN" "JUL" "AUG" "SEP"
[10] "OCT" "NOV" "DEC"
> tolower(temp)
[1] "jan" "feb" "mar" "apr" "may" "jun" "jul" "aug" "sep"
[10] "oct" "nov" "dec"
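- String replacement
To capitalize only the first letter, use sub() or gsub() to do the replacement.
sub() replaces only the first match; gsub() replaces all matches.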
> gsub('^(\\w)','\\U\\1',tolower(temp))
[1] "Ujan" "Ufeb" "Umar" "Uapr" "Umay" "Ujun" "Ujul"
[8] "Uaug" "Usep" "Uoct" "Unov" "Udec"
> gsub('^(\\w)','\\U\\1',tolower(temp),perl = T)
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep"
[10] "Oct" "Nov" "Dec"
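'^(\\w)': ^ anchors the match at the start of the string, and \\w is shorthand for a word character (letter, digit, or underscore); the parentheses capture it as group 1
'\\U\\1': replaces the captured group with its upper-case form; because of the ^ anchor there is only one match per string
tolower(temp): the strings to be modified
perl = T: enables Perl-style regular expressions, which \\U requires (without it, as in the first call, a literal U is inserted)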
> toupper(temp)
[1] "JAN" "FEB" "MAR" "APR" "MAY" "JUN" "JUL" "AUG" "SEP"
[10] "OCT" "NOV" "DEC"
> gsub('^(\\w)','\\L\\1',toupper(temp),perl = T)
[1] "jAN" "fEB" "mAR" "aPR" "mAY" "jUN" "jUL" "aUG" "sEP"
[10] "oCT" "nOV" "dEC"
\\L converts the replacement to lower case.
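The difference between sub() and gsub() only shows up when a pattern can match more than once. A minimal sketch, using made-up strings:

```r
s <- 'banana'

sub('a', 'A', s)    # "bAnana": only the first match is replaced
gsub('a', 'A', s)   # "bAnAnA": every match is replaced

# With a ^ anchor there is at most one match, so both give the same result
sub('^(\\w)', '\\U\\1', 'jan', perl = TRUE)    # "Jan"
gsub('^(\\w)', '\\U\\1', 'jan', perl = TRUE)   # "Jan"

# \\b matches at every word boundary, so gsub() capitalizes each word
gsub('\\b(\\w)', '\\U\\1', 'new york city', perl = TRUE)   # "New York City"
```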
- Searching strings (grep())
> a =c('a','A+','AC')
> grep('A+',a) ## + matches one or more of the preceding character, so both A+ and AC match
[1] 2 3
> a =c('a','A+','AC')
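> grep('A+',a,fixed = TRUE) ## fixed = TRUE searches for the literal text; fixed = FALSE (the default) treats the pattern as a regular expression. The return value is the indices of the matching elements.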
[1] 2
> grep('A+',a,fixed = F)
[1] 2 3
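As a sketch of the alternatives: escaping the + in the pattern gives the same literal match as fixed = TRUE, and the base R arguments and functions `value = TRUE` and grepl() return the matching strings or a logical vector instead of indices.

```r
a <- c('a', 'A+', 'AC')

# Escaping + makes it a literal character inside a regular expression
grep('A\\+', a)                # 2
grep('A+', a, fixed = TRUE)    # 2

# value = TRUE returns the matching elements rather than their indices
grep('A+', a, value = TRUE)    # "A+" "AC"

# grepl() returns one TRUE/FALSE per element
grepl('A+', a)                 # FALSE TRUE TRUE
```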
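- Matching strings (match())
match() does not support regular expressions; it looks for exact matches and returns the position of the first one, so it behaves somewhat like %in%.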
> match('A+',a)
[1] 2
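The relationship between match() and %in% can be sketched like this. The vector below is a variation on the one above with a duplicate added, purely for illustration.

```r
a <- c('a', 'A+', 'AC', 'A+')

# match() returns the position of the first exact match only
match('A+', a)              # 2

# Values that are not found give NA
match(c('AC', 'zz'), a)     # 3 NA

# %in% answers the same question as TRUE/FALSE
c('AC', 'zz') %in% a        # TRUE FALSE

# which() with == finds every position, not just the first
which(a == 'A+')            # 2 4
```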
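- Splitting strings with strsplit()
It takes two arguments, the string and the separator, and returns a list, which is convenient for storing the pieces of several strings.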
> path = '/usr/local/bin/R'
> strsplit(path,'/')
[[1]]
[1] "" "usr" "local" "bin" "R" ## the first element is empty because nothing precedes the leading slash before usr
> strsplit(c(path,path),'/') ## splitting two paths at once
[[1]]
[1] "" "usr" "local" "bin" "R"
[[2]]
[1] "" "usr" "local" "bin" "R"
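Because the result is a list, the pieces of one string are usually pulled out with [[1]] before further work. A minimal sketch; note that the separator is treated as a regular expression unless fixed = TRUE is given, which matters for characters such as the dot.

```r
path <- '/usr/local/bin/R'

# [[1]] extracts the character vector for the first (only) input string
parts <- strsplit(path, '/')[[1]]

# Drop the empty piece produced by the leading slash
parts <- parts[parts != '']
parts                          # "usr" "local" "bin" "R"

parts[length(parts)]           # "R": the last path component
paste(parts, collapse = '/')   # "usr/local/bin/R"

# A dot is a regex metacharacter, so split on it literally
strsplit('a.b.c', '.', fixed = TRUE)[[1]]   # "a" "b" "c"
```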
Tips
To generate every pairwise combination of two sets of strings, also known as the Cartesian product, use the outer() function.
> face = 1:13
> suit = c('spades','clubs','hearts','diamonds')
> outer(face,suit,FUN = paste)
[,1] [,2] [,3] [,4]
[1,] "1 spades" "1 clubs" "1 hearts" "1 diamonds"
[2,] "2 spades" "2 clubs" "2 hearts" "2 diamonds"
[3,] "3 spades" "3 clubs" "3 hearts" "3 diamonds"
[4,] "4 spades" "4 clubs" "4 hearts" "4 diamonds"
[5,] "5 spades" "5 clubs" "5 hearts" "5 diamonds"
[6,] "6 spades" "6 clubs" "6 hearts" "6 diamonds"
[7,] "7 spades" "7 clubs" "7 hearts" "7 diamonds"
[8,] "8 spades" "8 clubs" "8 hearts" "8 diamonds"
[9,] "9 spades" "9 clubs" "9 hearts" "9 diamonds"
[10,] "10 spades" "10 clubs" "10 hearts" "10 diamonds"
[11,] "11 spades" "11 clubs" "11 hearts" "11 diamonds"
[12,] "12 spades" "12 clubs" "12 hearts" "12 diamonds"
[13,] "13 spades" "13 clubs" "13 hearts" "13 diamonds"
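As a small extension of the example above, extra arguments given to outer() are passed on to FUN, and the resulting matrix can be flattened into a plain vector. This is a sketch; the separator ' of ' and the name `deck` are illustrative choices.

```r
face <- 1:13
suit <- c('spades', 'clubs', 'hearts', 'diamonds')

# sep is forwarded to paste() for every face/suit pair
deck <- outer(face, suit, FUN = paste, sep = ' of ')
dim(deck)        # 13 4
deck[1, 1]       # "1 of spades"
deck[13, 4]      # "13 of diamonds"

# as.vector() flattens the matrix column by column
cards <- as.vector(deck)
length(cards)    # 52
cards[14]        # "1 of clubs": the first entry of the second column
```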