R Language Basics (2): Data Types

2017-06-16, by 花若离枝rain

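You can only use, or look up, the right functions for processing data once you know what type the data is. Having covered R's five data structures, we now look at the data types (class) or modes (mode) that R can handle. The class() function reports the specific type of an object.
The is.* functions test whether an object is of a particular type.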

> is.character("red lorry, yellow lorry")
[1] TRUE

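This can also be written as is("red lorry, yellow lorry", "character"), but the form above is more efficient.
Likewise, when converting between types, the as.* functions are more efficient than as(x, "type").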

> x <- "123.456"
> as(x, "numeric")
[1] 123.456
> as.numeric(x)
[1] 123.456
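
As a small illustration of the is.* and as.* families, the sketch below converts a few hand-picked values (the variable names are made up for this example). It assumes nothing beyond base R; values that cannot be converted come back as NA.

```r
# Hypothetical inputs, chosen only for illustration
raw <- c("1", "2.5", "abc")

# "abc" cannot be parsed as a number, so it becomes NA
# (R also emits a warning, suppressed here to keep the output clean)
nums <- suppressWarnings(as.numeric(raw))
print(nums)
print(is.numeric(nums))   # still a numeric vector, even with an NA in it

# Converting doubles to integers truncates towards zero
ints <- as.integer(c(3.9, -3.9))
print(ints)

# Only TRUE/FALSE-like spellings are recognised; "yes" is not one of them
flags <- as.logical(c("TRUE", "yes", "F"))
print(flags)
```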

1. Numeric types

There are three main numeric types: numeric for floating point values, integer for integers, and complex for complex numbers. Note that all floating point values in R are double precision.

> class(3 + 1i)
[1] "complex" 
> class(1:5)
[1] "integer" 
> class(0.5:4.5)
[1] "numeric" 
> mode(0.5:4.5)
[1] "numeric"
> typeof(0.5:4.5)
[1] "double"

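The statement options(digits = n), where n is between 1 and 22, changes the number of significant digits R prints (the default is 7); it does not fix the number of decimal places.
Three functions deserve attention here: class(), mode() and typeof(). All three describe what kind of object a variable is, but at different levels of detail: class() returns the class the object belongs to, mode() returns its broad category, and typeof() returns R's internal storage type, the most fine-grained of the three.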
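
To make the difference between the three functions concrete, the following sketch tabulates class(), mode() and typeof() for a handful of sample objects. The list of sample values and the name info are choices made for this example only.

```r
# Sample objects of different kinds (illustrative choices)
values <- list(1L, 2.5, 3 + 1i, "a", TRUE, factor("a"), Sys.Date())

# One row per object; class() can return several names, so keep the first
info <- data.frame(
  class  = sapply(values, function(v) class(v)[1]),
  mode   = sapply(values, mode),
  typeof = sapply(values, typeof),
  stringsAsFactors = FALSE
)
print(info)
```

The factor and the Date are the interesting rows: their classes are "factor" and "Date", but underneath they are stored as an integer and a double respectively.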

2. Logical type (logical)

Logical values are simply TRUE and FALSE.

> class(TRUE)
[1] "logical"

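3. Character type (character)

R does not distinguish between strings and single characters: both have class character and both are handled the same way. "String" is just the informal name for one element of a character vector.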

> class(c("she", "sells", "seashells", "on", "the", "sea", "shore"))
[1] "character"

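R has a number of special characters (in R, \ is the escape character, not a path separator):
\t: tab
\n: newline, moves the cursor to the next line
\\: a literal backslash
\": a double quote inside a double-quoted string
\0: the null character, used to terminate strings in R's internal code; it cannot be included in an R string
\a: alarm (bell) character, which makes the computer beep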

> cat("foo\"bar")
foo"bar
> cat("foo\nbar")
foo
bar
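
Because the backslash is an escape character, a Windows-style path has to double each backslash. The sketch below, using a made-up path, shows that each escape sequence counts as a single character in the stored string.

```r
# A hypothetical Windows path: each "\\" in the source is ONE backslash in the string
path <- "C:\\data\\file.csv"
cat(path, "\n")             # prints the string with single backslashes

n_path <- nchar(path)       # counts characters in the actual string
n_tab  <- nchar("a\tb")     # "\t" is a single tab character
print(c(n_path, n_tab))

# writeLines() interprets the escapes when printing
writeLines("col1\tcol2\nx\ty")
```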

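4. Common string operations

4.1. Creating and joining

Text data is generally stored as character data. c() builds a character vector from strings. Strings are normally written in double quotes; if a string itself contains double quotes, either wrap it in single quotes or escape the inner quotes as \".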

> c(  "You should use double quotes most of the time", 
      'Single quotes are better for including " inside the string' ) 
[1] "You should use double quotes most of the time" 
[2] "Single quotes are better for including \" inside the string"

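paste() combines strings. The sep argument sets the separator placed between the pieces being combined (a space by default). The collapse argument controls whether the results are merged into a single string: by default (NULL) they are not; otherwise set it to the separator to join them with (a comma, a space, -, or even letters). paste0() is paste() with sep = "".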

paste (..., sep = " ", collapse = NULL)
paste0(..., collapse = NULL)

> paste(c("red", "yellow"), "lorry")  # paste0() would instead give "redlorry" "yellowlorry"
[1] "red lorry"    "yellow lorry" 
> paste(c("red", "yellow"), "lorry", sep = "-") 
[1] "red-lorry"    "yellow-lorry" 
> paste(c("red", "yellow"), "lorry", collapse = ", ") 
[1] "red lorry, yellow lorry"
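
The sketch below puts sep, collapse and paste0() side by side on the same input, to show that sep works between the arguments while collapse works across the resulting elements. The variable names are illustrative.

```r
colours <- c("red", "yellow")

spaced <- paste(colours, "lorry")    # default sep = " "
tight  <- paste0(colours, "lorry")   # same as sep = ""
# sep joins the arguments pairwise; collapse then merges the results into one string
joined <- paste(colours, "lorry", sep = "-", collapse = ", ")

print(spaced)
print(tight)
print(joined)
```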

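toString() is a variant of paste() that joins the elements of its input into one string separated by ", ". The width argument limits the length of the output; longer results are truncated and end in "....".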

toString(x, width = NULL, ...)

> x<-(1:4)^2
> x
[1]  1  4  9 16
> toString(x)
[1] "1, 4, 9, 16"
> toString(x,width=3)
[1] "1,...."

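cat() simply concatenates its arguments and prints them. noquote() marks a character vector so that it is printed without the surrounding double quotes.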

> cat("red", "yellow", "lorry") 
red yellow lorry 
> noquote(c("red", "yellow")) 
[1] red    yellow

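4.2. Formatting numbers

formatC() formats numbers using C-style format specifications. The digits argument sets the number of digits (digits after the decimal point for formats "e" and "f", significant digits for "g"), width sets the minimum field width of the output, and format sets the number form (for example "e" for scientific notation). The flag argument modifies the output: "-" left-justifies, "+" always prints a sign, "0" pads with leading zeros, " " puts a space before positive numbers, and "#" selects the alternate output form; without "-", the output is right-justified and padded with spaces up to width. The input is typically a numeric vector, matrix or array, and the output is the corresponding character vector, matrix or array.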

formatC(x, digits = NULL, width = NULL, format = NULL, flag = "")

> a <- matrix(1:12,nrow=3)
> a
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
> formatC(a,width=5)
     [,1]    [,2]    [,3]    [,4]   
[1,] "    1" "    4" "    7" "   10"
[2,] "    2" "    5" "    8" "   11"
[3,] "    3" "    6" "    9" "   12"
> formatC(a,width=5,flag="+")
     [,1]    [,2]    [,3]    [,4]   
[1,] "   +1" "   +4" "   +7" "  +10"
[2,] "   +2" "   +5" "   +8" "  +11"
[3,] "   +3" "   +6" "   +9" "  +12"

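sprintf() returns a character vector that combines text with formatted variable values. The fmt argument is a character vector of format strings defining the desired output (for example, %s stands for a string and %d for an integer); the remaining arguments are the values to substitute into fmt, which can be logical, integer, real or character vectors.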

sprintf(fmt, ...)

> sprintf("In my life, %s is %s",c("she","he"),c("everything","nothing"))
[1] "In my life, she is everything" "In my life, he is nothing"
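
Beyond %s, sprintf() is most often used to control how numbers are printed. The following sketch shows a few common format specifications; the values are arbitrary examples.

```r
two_dp  <- sprintf("%.2f", pi)        # fixed notation, two decimal places
padded  <- sprintf("%5d", 42L)        # integer right-justified in a field of width 5
zeroed  <- sprintf("%05d", 42L)       # same width, padded with zeros
percent <- sprintf("%d%%", 10L)       # "%%" prints a literal percent sign
left    <- sprintf("%-8s|", "left")   # "-" left-justifies within the field

print(c(two_dp, padded, zeroed, percent, left))
```

Note that %d requires an integer value such as 42L; passing a non-whole double like 42.5 to %d is an error.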

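format() differs slightly from formatC(). trim is a logical: if TRUE, the padding spaces in front of numbers are dropped; the default FALSE pads numeric and complex values with spaces to a common width, right-justified. digits is the number of significant digits, and nsmall is the minimum number of digits after the decimal point when real or complex numbers are formatted in non-scientific notation. justify applies to character vectors and defaults to left-justified.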

format(x, trim = FALSE, digits = NULL, nsmall = 0L, justify = c("left", "right", "centre", "none"), width = NULL)

> format(a,trim=F)
     [,1] [,2] [,3] [,4]
[1,] " 1" " 4" " 7" "10"
[2,] " 2" " 5" " 8" "11"
[3,] " 3" " 6" " 9" "12"
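
A short sketch of the trim, nsmall and justify arguments on small vectors (the inputs are arbitrary) may help to see what each one changes:

```r
padded_num  <- format(c(1, 10, 100))                 # common width, right-justified
trimmed_num <- format(c(1, 10, 100), trim = TRUE)    # no padding
two_places  <- format(2, nsmall = 2)                 # at least two decimal places

left_chr  <- format(c("a", "bbb"))                     # character: left-justified by default
right_chr <- format(c("a", "bbb"), justify = "right")  # right-justified instead

print(padded_num)
print(trimmed_num)
print(two_places)
print(left_chr)
print(right_chr)
```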

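prettyNum() tidies up numbers (possibly already formatted ones). big.mark is the separator inserted between groups of digits before the decimal point (a comma, a space and so on), and small.mark is the separator for digits after the decimal point. big.interval is the number of digits per group before the decimal point (3 by default), and small.interval is the number of digits per group after it (5 by default).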

prettyNum(x, big.mark = "",   big.interval = 3L, small.mark  = "", small.interval = 5L, decimal.mark = getOption("OutDec"), input.d.mark = decimal.mark, preserve.width = c("common", "individual", "none"), zero.print = NULL, drop0trailing = FALSE, is.cmplx = NA, ...)

> prettyNum(c(1e10,99,9,0.9,1e-10,0.0004), big.mark=" ", big.interval = 5L, small.mark=",", scientific=F, preserve.width = "individual")
[1] "1 00000 00000" "99"            "9"             "0.9"          
[5] "0.00000,00001" "0.0004"

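4.3. Searching strings

grep() searches the character vector x for pattern (a regular expression by default) and returns the indices of the elements of x that match.
regexpr() searches each string in text for pattern and returns the starting position of the first match.
gregexpr() searches each string in text for pattern and returns the starting positions of all matches.
Both regexpr() and gregexpr() also attach attributes such as match.length to their results, and gregexpr() returns a list; the outputs below show only the positions.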

grep(pattern, x)
regexpr(pattern, text)
gregexpr(pattern, text)

> grep("ap",c("apple","banana","pineapple"))
[1] 1 3
> regexpr("ap","apple banana pineapple")
[1] 1
> gregexpr("ap","apple banana pineapple")
[1]  1 18
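
The sketch below extends the same "ap" example with three closely related base R tools: grep(value = TRUE) to get the matching elements themselves, grepl() for a logical vector, and regmatches() to pull out the matched text located by gregexpr().

```r
words <- c("apple", "banana", "pineapple")

hits  <- grep("ap", words, value = TRUE)  # matching elements instead of indices
flags <- grepl("ap", words)               # one TRUE/FALSE per element

text <- "apple banana pineapple"
m <- gregexpr("ap", text)                 # a list with one entry per input string

starts <- as.integer(m[[1]])              # start positions of all matches
lens   <- attr(m[[1]], "match.length")    # length of each match
found  <- regmatches(text, m)[[1]]        # the matched substrings themselves

print(hits)
print(flags)
print(starts)
print(lens)
print(found)
```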

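4.4. Extracting, replacing and splitting strings

substring() extracts or replaces substrings: text is a character vector, first and last are integers, and the result is the substring from first to last.
substr() does the same: x is a character vector, start and stop are integers, and the result is the substring from start to stop.
The main difference is that substring() lets you omit last, in which case it extracts through to the end of the string, whereas substr() requires both the start and the stop position.
In the replacement forms, value is the character vector assigned to that part of the string.
strsplit() splits strings at given points; the split argument is the string (or regular expression) that marks where to split.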

substr(x, start, stop)
substring(text, first, last = 1000000L)
substr(x, start, stop) <- value
substring(text, first, last = 1000000L) <- value
strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

> substring("abcdef", 2)
[1] "bcdef"
> substr("abcdef", 2,4)
[1] "bcd"
> strsplit("something that is not important"," ")
[[1]]
[1] "something" "that"      "is"        "not"       "important"
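
The replacement form and the regular-expression behaviour of split are easy to overlook, so here is a brief sketch of both. The strings are arbitrary examples.

```r
s <- "abcdef"
substr(s, 2, 4) <- "XYZ"          # replacement form: overwrite characters 2 to 4
print(s)

# substring() is vectorised over first and last
pieces <- substring("abcdef", 1:3, 3:5)
print(pieces)

# split is a regular expression by default, so a character class works
by_class <- strsplit("a,b;c", "[,;]")[[1]]
# "." means "any character" in a regex; fixed = TRUE treats it literally
by_dot <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
print(by_class)
print(by_dot)
```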

4.5. Other string operations

toupper() converts a string to upper case.
tolower() converts a string to lower case.

> toupper("I'm Shouting") 
[1] "I'M SHOUTING" 
> tolower("I'm Whispering") 
[1] "i'm whispering"

nchar() returns the number of characters in a string.

> nchar("what is wrong")
[1] 13

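4.6. The stringr package

Many of R's base string functions have inconsistent arguments, which makes them harder to learn and use. The stringr package is a very convenient alternative for string handling; its commonly used functions all start with str_ (only a few are introduced here).

str_detect() tests for the presence of a pattern: string is the input vector, pattern is what to look for, and the result is a logical vector saying whether each element contains a match.
str_split() returns a list and does the same job as base strsplit(). pattern marks where to split, and n is the maximum number of pieces to return: once n pieces are reached, the remainder is left unsplit. simplify defaults to FALSE, which returns a list whose elements contain only as many pieces as were actually found; with simplify = TRUE the result is a character matrix in which missing pieces are padded with empty strings (""), the same as str_split_fixed().
str_split_fixed() returns a character matrix with n columns; when n exceeds the number of pieces, the remaining cells are filled with empty strings.
str_count() returns the number of times pattern occurs.
str_replace() replaces the first match of pattern.
str_replace_all() replaces all matches of pattern.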

str_detect(string, pattern)
str_split(string, pattern, n = Inf, simplify = FALSE)
str_split_fixed(string, pattern, n)
str_count(string, pattern = "")
str_replace(string, pattern, replacement)
str_replace_all(string, pattern, replacement)

> library(stringr)
> fruit <- c("apple", "banana", "pear", "pineapple")
> str_detect(fruit, "a")
[1] TRUE TRUE TRUE TRUE
> fruits <- c("apples and oranges and pears and bananas", "pineapples and mangos and guavas")
> str_count(fruit, c("a", "b", "p", "p"))
[1] 1 1 1 3
> str_replace(fruits, "[aeiou]", "-")
[1] "-pples and oranges and pears and bananas" "p-neapples and mangos and guavas"        
> str_replace_all(fruits, "[aeiou]", "-")
[1] "-ppl-s -nd -r-ng-s -nd p--rs -nd b-n-n-s" "p-n--ppl-s -nd m-ng-s -nd g--v-s"        
> str_split(fruits, " and ", n=3)
[[1]]
[1] "apples"            "oranges"           "pears and bananas"
[[2]]
[1] "pineapples" "mangos"     "guavas"    
> str_split(fruits, " and ", n=6)
[[1]]
[1] "apples"  "oranges" "pears"   "bananas"
[[2]]
[1] "pineapples" "mangos"     "guavas"    
> str_split(fruits, " and ", n=6,simplify = TRUE)
     [,1]         [,2]      [,3]     [,4]      [,5] [,6]
[1,] "apples"     "oranges" "pears"  "bananas" ""   ""  
[2,] "pineapples" "mangos"  "guavas" ""        ""   ""  
> str_split_fixed(fruits, " and ", n=6)
     [,1]         [,2]      [,3]     [,4]      [,5] [,6]
[1,] "apples"     "oranges" "pears"  "bananas" ""   ""  
[2,] "pineapples" "mangos"  "guavas" ""        ""   ""

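5. Factors (factor)

A factor is a special kind of variable for storing categorical data. It can sometimes be treated as a string and sometimes as an integer; in essence it is a classification, and the distinct categories are called levels.

5.1. Creating factors

factor() creates a factor. x is the input vector, and levels sets the levels the factor may contain; values not listed become NA. The argument name can be omitted when levels is passed positionally, and the examples below abbreviate it to level=, which R accepts through partial argument matching. With ordered = TRUE the result is an ordered factor whose levels are ordered as given in levels.
as.factor() converts its input directly to a factor.
gl() generates a factor by specifying the pattern of its levels: n is the number of levels (an integer), k is the number of times each level is repeated, length is the length of the result (if it exceeds n*k, the pattern is repeated until that length is reached), and labels gives the labels for the levels.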

factor(x = character(), levels, labels = levels, ordered = is.ordered(x))
gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)

> b <- c("a","b","c","d","e","f")
> factor(b)
[1] a b c d e f
Levels: a b c d e f
> factor(b,level=c("b","c"))
[1] <NA> b    c    <NA> <NA> <NA>
Levels: b c
> factor(b,level=c("c","b"),ordered=T)
[1] <NA> b    c    <NA> <NA> <NA>
Levels: c < b
> gl(3,2,labels = c("one","two","three"))
[1] one   one   two   two   three three
Levels: one two three
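
To show why the order of levels matters, the sketch below builds an ordered factor from made-up size labels, looks at the integer codes stored underneath, and compares the values against a level.

```r
# Hypothetical survey-style data
sizes <- c("small", "large", "medium", "small")

# Levels are given in their meaningful order, not alphabetically
f <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
print(f)

codes <- as.integer(f)        # position of each value in the level list
below_large <- f < "large"    # comparisons are allowed on ordered factors
counts <- table(f)            # counts follow the level order

print(codes)
print(below_large)
print(counts)
```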

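When a data frame is created in R versions before 4.0.0, text columns are treated as categorical data by default and converted to factors. From R 4.0.0 onwards the default is stringsAsFactors = FALSE, so this conversion has to be requested with stringsAsFactors = TRUE.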

> a <- 1:6
> b <- c("a","b","c","d","e","f")
> c <- data.frame(a,b,stringsAsFactors=TRUE)
> c
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
> class(c)
[1] "data.frame"
> class(c$b)
[1] "factor"
> c$b
[1] a b c d e f
Levels: a b c d e f

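Note that once the data frame has been created, the levels of such a column are fixed to the values present at that point. Assigning a value that is not an existing level does not succeed: R issues a warning and stores NA instead. Assigning an existing level works.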

> c$b[1] <- "g"
Warning message:
In `[<-.factor`(`*tmp*`, 1, value = c(NA, 2L, 3L, 4L, 5L, 6L)) : invalid factor level, NA generated
> c$b[1] <- "c"
> c
  a b
1 1 c
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
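
If a new value really is needed, one common approach is to extend the set of levels first and assign afterwards. A minimal sketch, using a small stand-alone factor rather than the data frame above:

```r
f <- factor(c("a", "b", "c"))

# Add "g" to the allowed levels, then the assignment succeeds without a warning
levels(f) <- c(levels(f), "g")
f[1] <- "g"

print(f)
print(levels(f))
```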

5.2. Querying levels and the number of levels

levels() shows the levels of a factor.
nlevels() returns the number of levels.

> levels(c$b)
[1] "a" "b" "c" "d" "e" "f"
> nlevels(c$b)
[1] 6

5.3. Changing the order of levels

relevel() reorders the levels of a factor by moving the specified level to the front.

> relevel(c$b,"d")
[1] a b c d e f
Levels: d a b c e f

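5.4. Dropping factor levels

droplevels(): if some levels in a data frame only occur in rows with NA values and those rows are simply deleted, the levels themselves remain. droplevels() removes such unused levels.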

> g <- data.frame(
+   mode = c("bike", "car", "bus", "car", "walk", "bike", "car", "bike", "car", "car"),
+   time_mins = c(25, 13, NA, 22, 65, 28, 15, 24, NA, 14),
+   stringsAsFactors = TRUE  # needed in R >= 4.0.0 so that mode becomes a factor
+ )
> g
   mode time_mins
1  bike        25
2   car        13
3   bus        NA
4   car        22
5  walk        65
6  bike        28
7   car        15
8  bike        24
9   car        NA
10  car        14
> h <- na.omit(g)
> h
   mode time_mins
1  bike        25
2   car        13
4   car        22
5  walk        65
6  bike        28
7   car        15
8  bike        24
10  car        14
> h$mode
[1] bike car  car  walk bike car  bike car 
Levels: bike bus car walk
> unique( h$mode)
[1] bike car  walk
Levels: bike bus car walk
> droplevels( h$mode)
[1] bike car  car  walk bike car  bike car 
Levels: bike car walk

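5.5. Ordering factor levels

ordered() creates a factor whose levels are ordered; this is common with survey-type data.
An ordered factor is still a factor, but an ordinary factor is not ordered.
as.ordered() converts a factor to an ordered factor directly, keeping its existing level order (alphabetical by default).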

> y <- ordered(c$b,c("f","a","c","e","b","d"))
> y
[1] a b c d e f
Levels: f < a < c < e < b < d
> is.ordered(y)
[1] TRUE
> is.ordered(c$b)
[1] FALSE
> is.factor(y)
[1] TRUE

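5.6. Grouping by factor

cut() divides a continuous variable into intervals and returns a factor. x is the numeric vector to convert, and breaks is either a numeric vector of cut points or a single number (at least 2) giving the number of intervals. The result gives, for each element of x, the interval it falls in.
tapply() splits a vector into groups by factor and calls a function on each group. X is the vector to operate on, INDEX is the factor used for grouping (several factors can be given, for example several columns of a data frame), and FUN is the function to apply to each group of X. With simplify = FALSE the result is always returned in list form, with one element per group.
split(), unlike tapply(), only divides the data into groups by factor and returns the list of groups. x is a vector or data frame, f is the factor, and drop says whether levels that do not occur should be dropped.
The replacement form of split() assigns values back group by group, so every element belonging to a given level can be overwritten with the same value.
unsplit() reverses split(): given the list produced by split() and the same factor f, it reassembles the elements in their original order.
by() is similar to tapply() but is designed for data frames: it splits the data frame by factor and applies the function to each subset. data is the input vector or data frame, and INDICES is the grouping factor.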

cut(x, breaks)
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
split(x, f, drop = FALSE, ...)
split(x, f, drop = FALSE, ...) <- value
unsplit(value, f, drop = FALSE)
by(data, INDICES, FUN, ..., simplify = TRUE)

> cut(1:9,c(0,2,5,8))
[1] (0,2] (0,2] (2,5] (2,5] (2,5] (5,8] (5,8] (5,8] <NA> 
Levels: (0,2] (2,5] (5,8]
> cut(1:9,3)
[1] (0.992,3.67] (0.992,3.67] (0.992,3.67] (3.67,6.33]  (3.67,6.33]  (3.67,6.33]  (6.33,9.01]  (6.33,9.01]  (6.33,9.01] 
Levels: (0.992,3.67] (3.67,6.33] (6.33,9.01]

> patientID <- c(1,2,3,4)
> age <- c(25,34,28,52)
> diabetes <- c("Type1","Type2","Type1","Type1")
> status <- c("Poor","Improved","Excellent","Poor")
> day <- c(33,28,23,55)
> patientdata <- data.frame(patientID,age,diabetes,status,day)
> patientdata
  patientID age diabetes    status day
1         1  25    Type1      Poor  33
2         2  34    Type2  Improved  28
3         3  28    Type1 Excellent  23
4         4  52    Type1      Poor  55
> tapply(patientdata$age,patientdata$diabetes,mean)
Type1 Type2 
   35    34 
> tapply(patientdata$age,patientdata[,c(3,4)],mean)
               status
diabetes Excellent Improved Poor
   Type1        28       NA 38.5
   Type2        NA       34   NA
> split(patientdata$age,patientdata$diabetes)
$Type1
[1] 25 28 52
$Type2
[1] 34
> x <- sample(1:100,25)
> x
 [1] 78 62  5 50 80 57 66 54 36 68 65 41 51 34 14 45  9  7 49 77 69  6 43  3 93
> y <- c("a","b","c","d","e")
> split(x,y)
$a
[1] 78 57 65 45 69
$b
[1] 62 66 41  9  6
$c
[1]  5 54 51  7 43
$d
[1] 50 36 34 49  3
$e
[1] 80 68 14 77 93
> unsplit(split(x,y),rep(y,5))
 [1] 78 62  5 50 80 57 66 54 36 68 65 41 51 34 14 45  9  7 49 77 69  6 43  3 93
> split(x,y) <- c(1,2,3,4,5)
> x
 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
> by(patientdata,patientdata$diabetes, function(patientdata) lm(patientdata[, 5] ~ patientdata[, 2]))
patientdata$diabetes: Type1
Call:
lm(formula = patientdata[, 5] ~ patientdata[, 2])
Coefficients:
     (Intercept)  patientdata[, 2]  
           1.521             1.014  
---------------------------------------------------------------------------------------------------------------
patientdata$diabetes: Type2
Call:
lm(formula = patientdata[, 5] ~ patientdata[, 2])
Coefficients:
     (Intercept)  patientdata[, 2]  
              28                NA
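
The functions in this section combine naturally. The sketch below reuses the age and day values from the patient data, with age bands and labels invented for this example: cut() creates the grouping factor, tapply() summarises by group, and split() followed by unsplit() makes a round trip back to the original vector.

```r
ages <- c(25, 34, 28, 52)
days <- c(33, 28, 23, 55)

# Hypothetical age bands: (0,30], (30,50], (50,Inf]
groups <- cut(ages, breaks = c(0, 30, 50, Inf),
              labels = c("young", "middle", "older"))
print(groups)

# Mean number of days per age band
means <- tapply(days, groups, mean)
print(means)

# split() makes one list element per level; unsplit() restores the original order
parts    <- split(days, groups)
restored <- unsplit(parts, groups)
print(parts)
print(restored)
```

The factor passed to unsplit() must have the same length as the original vector, which is why groups is reused here.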

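5.7. Combining factors

interaction() computes a factor representing the interaction of the given factors: each level of the result is a combination of one level from each input.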

> a <- gl(3,2,labels = c("one","two","three"))
> b <- gl(2,1,labels = c("apple","banana"))
> interaction(a,b)
[1] one.apple    one.banana   two.apple    two.banana   three.apple  three.banana
Levels: one.apple two.apple three.apple one.banana two.banana three.banana

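6. Date and time data

Base R has two kinds of time data: the POSIXct/POSIXlt classes and the Date class. The difference is that the former hold date, time and time zone information, while the latter holds only the date.

6.1. The POSIXct/POSIXlt classes

Both represent a specific date, time and time zone, but they are stored differently internally.
POSIXct: stores the number of seconds from 1970-01-01 00:00:00 UTC to the target time. "ct" stands for "calendar time".
POSIXlt: stores a list with components for seconds, minutes, hours, month and so on ("lt" stands for "local time"). Note that mon counts from 0 and year counts from 1900.
Sys.time() returns the current time as a POSIXct object.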

> a <- Sys.time()
> a
[1] "2017-04-27 15:17:29 CST"
> class(a)
[1] "POSIXct" "POSIXt" 
> unclass(a)
[1] 1493277450
> mode(a)
[1] "numeric"
> typeof(a)
[1] "double"
> b <- as.POSIXlt(a)
> b
[1] "2017-04-27 15:17:29 CST"
> class(b)
[1] "POSIXlt" "POSIXt" 
> mode(b)
[1] "list"
> typeof(b)
[1] "list"
> unclass(b)
$sec
[1] 29.68839
$min
[1] 17
$hour
[1] 15
$mday
[1] 27
$mon
[1] 3
$year
[1] 117
$wday
[1] 4
$yday
[1] 116
$isdst
[1] 0
$zone
[1] "CST"
$gmtoff
[1] 28800
attr(,"tzone")
[1] ""    "CST" "CDT"
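
To see the two storage schemes on a fixed, reproducible value, the sketch below uses the timestamp from the session above but pins it to UTC (an assumption made so that the numbers do not depend on the local time zone).

```r
# Fixed instant in UTC, so the results are the same on every machine
t_ct <- as.POSIXct("2017-04-27 15:17:29", tz = "UTC")
secs <- as.numeric(t_ct)     # seconds since 1970-01-01 00:00:00 UTC
print(secs)

# The same instant as a list of components
t_lt <- as.POSIXlt(t_ct)
hour  <- t_lt$hour
month <- t_lt$mon + 1        # mon is 0-based
year  <- t_lt$year + 1900    # year counts from 1900
wday  <- t_lt$wday           # 0 = Sunday, so 4 = Thursday

print(c(year = year, month = month, hour = hour, wday = wday))
```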

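6.2. The Date class

A Date holds a calendar date. Its internal storage is similar to POSIXct: the number of days from 1970-01-01 to the target date.
Sys.Date() returns the current date as a Date object.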

> Sys.Date()
[1] "2017-04-27"
> class(Sys.Date())
[1] "Date"
> unclass(Sys.Date())
[1] 17283

Note that date() returns a character string.

> date()
[1] "Thu Apr 27 15:48:29 2017"
> class(date())
[1] "character"
> unclass(date())
[1] "Thu Apr 27 15:48:46 2017"
> mode(Sys.Date())
[1] "numeric"
> typeof(Sys.Date())
[1] "double"

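6.3. Time zones

Sys.timezone() returns the current time zone.
OlsonNames() returns the names of all time zones known to the system.
To avoid time zone confusion when working with time data, you can convert everything to UTC.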
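
As a sketch of what converting to UTC looks like in base R, the example below creates a time in the Asia/Shanghai zone (chosen because the session output above shows CST; it is assumed to be available in the system's time zone database) and prints the same instant in UTC.

```r
# Check that the zone name is known before using it
zone_known <- "Asia/Shanghai" %in% OlsonNames()
print(zone_known)

t_local <- as.POSIXct("2017-04-27 15:17:29", tz = "Asia/Shanghai")

# Same instant, displayed in UTC (Shanghai is 8 hours ahead of UTC)
as_utc <- format(t_local, tz = "UTC", usetz = TRUE)
print(as_utc)
```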

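6.4. Converting strings to dates

Dates usually enter R as strings and then have to be converted into date variables that are stored numerically. There are four conversion functions:
as.Date() converts a string to a Date.
as.POSIXct() converts to POSIXct.
strptime(), short for "string parse time", converts to POSIXlt.
as.POSIXlt() converts to POSIXlt (this function ignores changes of time zone).

as.Date(x, format, ...): x is a character string and format is a string describing the layout of the input. If format is not given, R tries "%Y-%m-%d" first and then "%Y/%m/%d"; if neither matches, it raises an error.
as.Date(x, origin, ...): x is numeric and origin is a date that serves as day zero; the result is the date x days after origin.
as.Date(x, tz = "UTC", ...): x is a POSIXct object and tz is the name of the time zone.
(UTC is Coordinated Universal Time.)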

as.Date(x, format, ...)
as.Date(x, origin, ...)
as.Date(x, tz = "UTC", ...)
strptime(x, format, tz = "")
as.POSIXct(x, tz = "", ...)
as.POSIXlt(x, tz = "", ...)
as.POSIXlt(x, tz = "", format, ...)
as.POSIXlt(x, tz = "", origin, ...)

> as.Date("2017/4/27")
[1] "2017-04-27"
> as.Date("2017/4/27","%Y-%m-%d")
[1] NA
> as.Date("2017/4/27","%Y/%m/%d")
[1] "2017-04-27"
> as.Date("2017年4月27日","%Y年%m月%d日")
[1] "2017-04-27"
> as.Date(60,"2017/4/27")
[1] "2017-06-26"
> strptime("2017-04-27 15:17:29","%Y-%m-%d %H:%M:%S")
[1] "2017-04-27 15:17:29 CST"
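
The sketch below runs the conversion functions on fixed inputs. Time zones are pinned to UTC (an assumption, to keep the results machine-independent), and only numeric format codes are used so that the locale does not matter.

```r
# Character to Date with an explicit, non-default layout
d1 <- as.Date("27/04/2017", format = "%d/%m/%Y")

# Numeric to Date: 17283 days after the origin
d2 <- as.Date(17283, origin = "1970-01-01")

# A format that does not match the input gives NA rather than an error
bad <- as.Date("2017/4/27", format = "%Y-%m-%d")

# Character to date-time, as POSIXct and as POSIXlt
t1 <- as.POSIXct("2017-04-27 15:17:29", tz = "UTC")
t2 <- strptime("2017-04-27 15:17:29", "%Y-%m-%d %H:%M:%S", tz = "UTC")

print(d1)
print(d2)
print(bad)
print(class(t1))
print(class(t2))
```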

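Date format codes
By default, year, month and day are separated by / or -, and hours, minutes and seconds by :.

Format	Meaning	Example
%Y	Year, four digits	2017
%y	Year, two digits	17
%B	Full month name (English locale)	February
%b	Abbreviated month name (English locale)	Feb
%m	Month as a number (01~12)	04
%A	Full weekday name	Monday
%a	Abbreviated weekday name	Mon
%d	Day of the month as a number (01~31)	27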

6.5. Converting dates to strings

format(x, ...) works on Date objects.
strftime(x, format = "", tz = "", usetz = FALSE, ...), short for "string format time", takes x of class POSIXct or POSIXlt.

> strftime("2017-04-27 15:17:29","It is %Y-%m-%d %H:%M:%S")
[1] "It is 2017-04-27 15:17:29"
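
Going in the other direction, the same format codes control how a date is written out. A small sketch with a fixed date and a UTC timestamp (both chosen for this example), again restricted to numeric codes so that the output does not depend on the locale:

```r
d <- as.Date("2017-04-27")

day_first   <- format(d, "%d/%m/%Y")   # day/month/year
day_of_year <- format(d, "%j")         # day of the year, 001 to 366

t <- as.POSIXct("2017-04-27 15:17:29", tz = "UTC")
clock <- format(t, "%H:%M")            # hours and minutes only

print(c(day_first, day_of_year, clock))
```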

6.6. Calculations with time data

Addition: for the POSIX classes the number added is in seconds; for the Date class it is in days.

> Sys.time()+3600
[1] "2017-04-28 11:44:19 CST"
> Sys.Date()+3600
[1] "2027-03-07"

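Subtraction: computes the difference between two dates.
Because time data is stored as doubles underneath, two values can be subtracted directly.
You can also use difftime() to get the difference in seconds, minutes, hours, days or weeks.
as.difftime() converts a time string, measured from 0:00:00, into a difference in seconds, minutes, hours, days or weeks.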

time1 - time2
difftime(time1, time2, tz, units = c("auto", "secs", "mins", "hours", "days", "weeks"))
as.difftime(tim, format = "%X", units = "auto")

> as.Date("2017/4/28")-as.Date("2016/4/28")
Time difference of 365 days
> class(as.Date("2017/4/28")-as.Date("2016/4/28"))
[1] "difftime"
> unclass(as.Date("2017/4/28")-as.Date("2016/4/28"))
[1] 365
attr(,"units")
[1] "days"
> difftime(as.Date("2017/4/28"),as.Date("2016/4/28"),units="week")
Time difference of 52.14286 weeks
> as.difftime(c("0:3:20", "11:23:15"))
Time differences in mins
[1]   3.333333 683.250000
> as.difftime(c("3:20", "23:15", "2:"), format = "%H:%M") 
Time differences in hours
[1]  3.333333 23.250000        NA
> as.difftime(c("3:20", "23:15", "2:"), format = "%H:%M",units="secs") 
Time differences in secs
[1] 12000 83700    NA
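
A difftime object carries its unit with it, and the unit can be changed after the fact. The sketch below uses two made-up UTC timestamps to show this:

```r
start <- as.POSIXct("2017-04-27 08:00:00", tz = "UTC")
end   <- as.POSIXct("2017-04-28 20:30:00", tz = "UTC")

d <- difftime(end, start, units = "hours")
print(d)

hours <- as.numeric(d)                  # plain number in the current unit
mins  <- as.numeric(d, units = "mins")  # converted on the way out

units(d) <- "days"                      # change the unit of the object itself
days <- as.numeric(d)

print(c(hours = hours, mins = mins, days = days))
```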

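seq() can generate evenly spaced sequences of dates. from is the start date, to is the end date, by is the step, length.out is the desired length of the sequence, and along.with makes the result the same length as the object passed to it.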

seq(...)
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)), length.out = NULL, along.with = NULL, ...)
seq.int(from, to, by, length.out, along.with, ...)
seq_along(along.with)
seq_len(length.out)

> seq(as.Date("2016/4/28"),as.Date("2017/4/28"),by="50 days")
[1] "2016-04-28" "2016-06-17" "2016-08-06" "2016-09-25" "2016-11-14" "2017-01-03" "2017-02-22" "2017-04-13"
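
Besides a number of days, by also accepts calendar units such as "week" and "month", and length.out can replace the end date. A short sketch with an arbitrary start date:

```r
start <- as.Date("2017-01-01")

# Four dates, one calendar month apart
monthly <- seq(start, by = "month", length.out = 4)
# Three dates, one week apart
weekly  <- seq(start, by = "week", length.out = 3)

print(monthly)
print(weekly)
```

Starting on the first of the month keeps the monthly steps unambiguous; starting on the 31st would run into months that have no such day.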

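6.7. Other related functions

weekdays(), months() and quarters() return the corresponding weekday, month and quarter. Weekday and month names are printed in the language of the current locale (Chinese in the session below: 星期五 is Friday and 四月 is April).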

> weekdays(as.Date("2017/4/28"))
[1] "星期五"
> months(as.Date("2017/4/28"))
[1] "四月"
> quarters(as.Date("2017/4/28"))
[1] "Q2"

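6.8. The lubridate package

As with strings, the base R date functions take many varying arguments and are not very concise to use. Just as stringr makes string handling convenient, lubridate does the same for dates. lubridate has two main families of functions: one for instants and one for time spans.

6.8.1. Instants

now() returns the current date and time as a POSIXct object.
today() returns the current date as a Date object.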

> now()
[1] "2017-05-02 17:48:45 CST"
> today()
[1] "2017-05-02"
> unclass(today())
[1] 17288

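Operations on instants include parsing and extracting.
Parsing functions include dmy(), dym(), myd(), mdy(), ymd(), ydm() and ymd_hms().
They parse character input into date objects, recognizing the separators automatically: the date-only parsers such as ymd() return a Date by default, and ymd_hms() returns a POSIXct.
Extracting functions include second(), minute(), hour(), day(), yday(), mday(), wday(), week(), month(), year(), tz() and dst().
They pull out one component of a date; their input should be a date or date-time object rather than an arbitrary character string.
In addition, assigning to these functions modifies the time that a variable represents.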

second(x)
second(x) <- value

> ymd("17/5\\1...")
[1] "2017-05-01"
> mday(ymd("17/5\\1..."))
[1] 1
> yday(ymd("17/5\\1..."))
[1] 121
> x <- ymd("17/5\\1...")
> yday(x)<- 150
> x
[1] "2017-05-30"

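6.8.2. Time spans

There are three classes of time span: Duration, Period and Interval.
Durations and Periods carry no information about specific instants, only the length of the span.
A Duration discards the endpoints and measures the span as an exact number of seconds; it is also compatible with base difftime objects. It ignores leap years and leap seconds: in the lubridate version used here a duration year is a fixed 365 days (newer releases define it as 365.25 days).
Functions: is.duration(), as.duration(), duration(), dseconds(), dminutes(), dhours(), ddays(), dweeks() and dyears().
A Period measures the span in larger clock units. It takes leap years and leap seconds into account and suits long-term calculations: one period year can be 365 or 366 days.
Functions: is.period(), as.period(), period(), seconds(), minutes(), hours(), days(), weeks(), months() and years().
An Interval is the simplest time span object, made up of two instants: it starts at one specific moment and ends at another, so it keeps the complete information about the span and can be converted to either a Duration or a Period.
Functions: is.interval(), as.interval(), interval(), int_shift(), int_flip(), int_aligns(), int_overlaps() and %within%.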

duration(num = NULL, units = "seconds", ...)
period(num = NULL, units = "second", ...)
interval(start, end, tzone = attr(start, "tzone"))

> dyears(1:2)
[1] "31536000s (~52.14 weeks)" "63072000s (~2 years)"    
> dyears(1)
[1] "31536000s (~52.14 weeks)"
> years(1)
[1] "1y 0m 0d 0H 0M 0S"
> weeks(1)
[1] "7d 0H 0M 0S"
> a <- interval("2016/9/1",today())
> a
[1] 2016-09-01 UTC--2017-05-02 UTC
> is.duration(a)
[1] FALSE
> as.duration(a)
[1] "20995200s (~34.71 weeks)"
> as.period(a)
[1] "8m 1d 0H 0M 0S"
> period(c(3, 1, 2, 13, 1), c("second", "minute", "hour", "day", "week"))
[1] "20d 2H 1M 3S"
> period(second = 3, minute = 1, hour = 2, day = 13, week = 1)
[1] "20d 2H 1M 3S"
> period(c(1, -60), c("hour", "minute"), hour = c(1, 2), minute = c(3, 4))
[1] "1H -60M 0S" "1H 3M 0S"   "2H 4M 0S"  
> period("2hours 2minutes 1second")
[1] "2H 2M 1S"
> duration(second = 3, minute = 1.5, hour = 2, day = 6, week = 1)
[1] "1130493s (~1.87 weeks)"

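6.8.3. Time zones

lubridate provides three functions for handling time zones:
tz(): extracts the time zone of a date-time.
with_tz(time, tzone = ""): shows the same instant as it appears in another time zone.
force_tz(time, tzone = ""): keeps the clock time but forces the time zone to another one, so the underlying instant changes.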

> with_tz(now(),tzone = "UTC")
[1] "2017-05-02 11:13:30 UTC"
> force_tz(now(),tzone = "UTC")
[1] "2017-05-02 19:13:52 UTC"

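6.8.4. Miscellaneous

leap_year(date) checks whether the date falls in a leap year and returns TRUE or FALSE; the input is a date (or a year number).
decimal_date(date) converts a date to a decimal year; the input is a POSIXt or Date object.
round_date(x, unit = "second") rounds a date to the nearest unit.
floor_date(x, unit = "seconds") rounds a date down.
ceiling_date(x, unit = "seconds", change_on_boundary = NULL) rounds a date up.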

> leap_year(today())
[1] FALSE
> decimal_date(now())
[1] 2017.333
> round_date(now(), "hours")
[1] "2017-05-02 22:00:00 CST"
> floor_date(now(), "hours")
[1] "2017-05-02 21:00:00 CST"
> ceiling_date(now(), "hours")
[1] "2017-05-02 22:00:00 CST"