[R] Vectors: vector operations (R for Data Science)
Notes on Chapter 20, "Vectors", of R for Data Science
Reference: R for Data Science
Vector basics
There are two types of vectors:
- Atomic vectors, of which there are six types: logical, integer, double, character, complex, and raw. Integer and double vectors are collectively known as numeric vectors. (homogeneous)
- Lists, which are sometimes called recursive vectors because lists can contain other lists. (heterogeneous)
NULL is often used to represent the absence of a vector.
NA is used to represent the absence of a value in a vector.
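The distinction can be checked directly in base R. The following is a minimal sketch (the variable names with_null and with_na are illustrative only): NULL has length zero and vanishes when combined, whereas NA still occupies a position.

```r
# NULL is the absence of a vector: length 0, and it disappears inside c().
length(NULL)          #> [1] 0
with_null <- c(1, NULL, 3)
length(with_null)     #> [1] 2

# NA is a missing value inside a vector: it still takes up a slot.
length(NA)            #> [1] 1
with_na <- c(1, NA, 3)
length(with_na)       #> [1] 3
is.na(with_na)        #> [1] FALSE  TRUE FALSE
```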
- Every vector has two key properties:
- Its type, which you can determine with typeof().
typeof(letters)
#> [1] "character"
typeof(1:10)
#> [1] "integer"
- Its length, which you can determine with length().
x <- list("a", "b", 1:10)
length(x)
#> [1] 3
- Augmented vectors add extra attributes to these basic types:
- Factors are built on top of integer vectors.
- Dates and date-times are built on top of numeric vectors.
- Data frames and tibbles are built on top of lists.
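One way to see what each augmented vector is built on is to ask for its underlying type. This is a small base-R sketch; a data frame stands in for a tibble here so that no package is needed, and the object names are illustrative.

```r
# Each augmented vector is an ordinary vector plus attributes such as class.
f <- factor(c("a", "b", "a"))
typeof(f)        #> [1] "integer"
levels(f)        #> [1] "a" "b"

d <- as.Date("1970-01-02")
typeof(d)        #> [1] "double"

df <- data.frame(x = 1:2)
typeof(df)       #> [1] "list"
```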
Important types of atomic vector
- Logical
Logical vectors can take only three possible values: FALSE, TRUE, and NA.
(Note in particular that a plain NA is of type logical.)
c(TRUE, TRUE, FALSE, NA)
#> [1] TRUE TRUE FALSE NA
- Numeric
To make an integer, place an L after the number:
typeof(1)
#> [1] "double"
typeof(1L)
#> [1] "integer"
1.5L
#> [1] 1.5
# Range and precision limits of integer and double; not important
.Machine$integer.max
#> [1] 2147483647
.Machine$double.xmax
#> [1] 1.8e+308
.Machine$double.base
#> [1] 2
.Machine$double.digits
#> [1] 53
.Machine$double.exponent
#> [1] 11
.Machine$double.eps
#> [1] 2.22e-16
.Machine$double.neg.eps
#> [1] 1.11e-16
Differences between integer and double to keep in mind:
- Doubles are approximations.
x <- sqrt(2) ^ 2
x
#> [1] 2
x - 2
#> [1] 4.44e-16
x - 2 == 0
#> [1] FALSE
dplyr::near(x - 2, 0)
#> [1] TRUE
# How near() works: instead of testing exact equality, it checks whether the difference is within a tolerance
dplyr::near
# function (x, y, tol = .Machine$double.eps^0.5)
# {
# abs(x - y) < tol
# }
# <bytecode: 0x000002bd0ce7c7e8>
# <environment: namespace:dplyr>
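The same idea can be reproduced without dplyr. In the sketch below, near_sketch is a hypothetical helper written for illustration (it is not the dplyr function itself); base R's all.equal() is shown as an alternative that also compares within a tolerance.

```r
# Hypothetical helper mirroring the idea behind dplyr::near():
# compare within a tolerance instead of testing exact equality.
near_sketch <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

x <- sqrt(2)^2
x == 2                    #> [1] FALSE
near_sketch(x, 2)         #> [1] TRUE
isTRUE(all.equal(x, 2))   #> [1] TRUE
```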
- Integers have one special value, NA, while doubles have four: NA, NaN, Inf, and -Inf.
c(-1, 0, 1) / 0
#> [1] -Inf NaN Inf
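(In the book's table of is.finite(), is.infinite(), is.na() and is.nan(), an x marks the inputs for which each function returns TRUE.)
# Note that both the finite and the infinite check return FALSE for NA and NaN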
is.infinite(NA)
# [1] FALSE
is.finite(NA)
# [1] FALSE
# A more explicit example
x <- c(0, NA, NaN, Inf, -Inf)
is.finite(x)
#> [1] TRUE FALSE FALSE FALSE FALSE
!is.infinite(x)
#> [1] TRUE TRUE TRUE FALSE FALSE
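The four helper functions can be compared side by side on one vector. This is a minimal base-R sketch; it also shows why == is not a usable test for these special values.

```r
x <- c(0, Inf, NA, NaN)
is.finite(x)    #> [1]  TRUE FALSE FALSE FALSE
is.infinite(x)  #> [1] FALSE  TRUE FALSE FALSE
is.na(x)        #> [1] FALSE FALSE  TRUE  TRUE
is.nan(x)       #> [1] FALSE FALSE FALSE  TRUE

# == is not a reliable test for missing values: the comparison itself is NA.
NA == NA        #> [1] NA
NaN == NaN      #> [1] NA
```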
- double to integer
tibble(
x = c(
1.8, 1.5, 1.2, 0.8, 0.5, 0.2,
-0.2, -0.5, -0.8, -1.2, -1.5, -1.8
),
`Round down` = floor(x),
`Round up` = ceiling(x),
`Round towards zero` = trunc(x),
`Nearest, round half to even` = round(x)
)
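The rounding functions in the tibble above can also be tried on a plain vector. The sketch below uses base R only; as.integer() is added for comparison and truncates towards zero like trunc(), but returns an integer vector.

```r
x <- c(1.8, 1.5, 0.5, -0.5, -1.5, -1.8)
floor(x)       #> [1]  1  1  0 -1 -2 -2
ceiling(x)     #> [1]  2  2  1  0 -1 -1
trunc(x)       #> [1]  1  1  0  0 -1 -1
round(x)       #> [1]  2  2  0  0 -2 -2   (halves go to the even neighbour)
as.integer(x)  #> [1]  1  1  0  0 -1 -1
```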
- Character
R uses a global string pool.
This means that each unique string is only stored in memory once.
This reduces the amount of memory needed by duplicated strings.
x <- "This is a reasonably long string."
pryr::object_size(x)
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
#> 152 B
y <- rep(x, 1000)
pryr::object_size(y)
#> 8.14 kB
Reason:
A pointer is 8 bytes, so 1000 pointers take 8 * 1000 = 8000 B. The string itself is stored only once, so y is about 8.14 kB rather than 1000 * 152 B = 152 kB.
- Missing values
Note that each type of atomic vector has its own missing value:
NA # logical
#> [1] NA
NA_integer_ # integer
#> [1] NA
NA_real_ # double
#> [1] NA
NA_character_ # character
#> [1] NA
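Since all four print identically, typeof() is a simple way to tell them apart. The sketch below also shows that a plain NA is coerced to the type of the vector it is combined with, which is why the typed variants are rarely needed in practice.

```r
typeof(NA)             #> [1] "logical"
typeof(NA_integer_)    #> [1] "integer"
typeof(NA_real_)       #> [1] "double"
typeof(NA_character_)  #> [1] "character"

# Plain NA takes on the type of the vector it is combined with.
typeof(c(NA, 1L))      #> [1] "integer"
typeof(c(NA, "a"))     #> [1] "character"
```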
Using atomic vectors
- Test functions
Base R provides many functions like is.vector() and is.atomic(), but they often return surprising results.
Instead, it's safer to use the is_* functions provided by purrr, which the book summarises in a table.
- To check whether a value is a scalar, use the is_scalar_* variants:
x <- c(TRUE)
y <- c(TRUE, FALSE)
is_scalar_logical(x)
# [1] TRUE
is_scalar_logical(y)
# [1] FALSE
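For readers without purrr loaded, the scalar check can be approximated in base R. In the sketch below, is_scalar_lgl_sketch is a hypothetical helper, not a purrr function; the second half illustrates one of the surprising results of is.vector() mentioned above.

```r
# Hypothetical base-R helper: a logical vector of length exactly one.
is_scalar_lgl_sketch <- function(x) is.logical(x) && length(x) == 1

is_scalar_lgl_sketch(TRUE)            #> [1] TRUE
is_scalar_lgl_sketch(c(TRUE, FALSE))  #> [1] FALSE
is_scalar_lgl_sketch(1L)              #> [1] FALSE

# One of the surprises: is.vector() turns FALSE once an attribute
# other than names is attached.
v <- 1:3
before <- is.vector(v)                #> TRUE
attr(v, "greeting") <- "Hi!"
after <- is.vector(v)                 #> FALSE
```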
- Scalars and recycling rules
The vectorised functions in tidyverse will throw errors when you recycle anything other than a scalar.
tibble(x = 1:4, y = 1:2)
#> Error: Tibble columns must have consistent lengths, only values of length one are recycled:
#> * Length 2: Column `y`
#> * Length 4: Column `x`
tibble(x = 1:4, y = rep(1:2, 2))
#> # A tibble: 4 x 2
#> x y
#> <int> <int>
#> 1 1 1
#> 2 2 2
#> 3 3 1
#> 4 4 2
tibble(x = 1:4, y = rep(1:2, each = 2))
#> # A tibble: 4 x 2
#> x y
#> <int> <int>
#> 1 1 1
#> 2 2 1
#> 3 3 2
#> 4 4 2
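For contrast, base R arithmetic recycles any shorter vector, not only scalars. This minimal sketch shows the silent case and the case that triggers a warning; the warning text is captured with tryCatch() rather than quoted here.

```r
# Base R recycles the shorter vector silently when the lengths divide evenly.
1:4 + 1:2
#> [1] 2 4 4 6

# When they do not divide evenly, base R still recycles but raises a warning.
# tryCatch() turns that warning into its message text so it can be inspected.
res <- tryCatch(1:4 + 1:3, warning = function(w) conditionMessage(w))
res
```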
- Naming vectors
Two approaches: set the names inside c(), or use purrr::set_names().
c(x = 1, y = 2, z = 4)
#> x y z
#> 1 2 4
set_names(1:3, c("a", "b", "c"))
#> a b c
#> 1 2 3
- purrr::set_names() versus setNames()
setNames(1:4, c("a", "b", "c", "d"))
#> a b c d
#> 1 2 3 4
purrr::set_names(1:4, c("a", "b", "c", "d"))
#> a b c d
#> 1 2 3 4
# Names may also be passed as separate arguments, as long as their count matches the length of the data
purrr::set_names(1:4, "a", "b", "c", "d")
#> a b c d
#> 1 2 3 4
setNames(1:4, c("a", "b"))
#> a b <NA> <NA>
#> 1 2 3 4
# If the names and the data differ in length, set_names() raises an error
purrr::set_names(1:4, c("a", "b"))
#> `nm` must be `NULL` or a character vector the same length as `x`
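A third, base-R route is to assign names after the vector exists. The sketch below also inspects the names that setNames() produces when too few are supplied, which makes the silent NA padding easier to spot.

```r
x <- 1:3
names(x) <- c("a", "b", "c")   # assign names after creation
x[["b"]]                       #> [1] 2

# setNames() pads missing names with NA instead of raising an error.
short <- setNames(1:4, c("a", "b"))
names(short)                   #> [1] "a" "b" NA  NA
anyNA(names(short))            #> [1] TRUE
```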
- Subsetting
- By repeating a position, you can actually make a longer output than input:
# Positions may be repeated when subsetting
x <- c("one", "two", "three", "four", "five")
x[c(1, 1, 5, 5, 5, 2)]
#> [1] "one" "one" "five" "five" "five" "two"
- It’s an error to mix positive and negative values:
x[c(1, -1)]
#> Error in x[c(1, -1)]: only 0's may be mixed with negative subscripts
- The error message mentions subsetting with zero, which returns no values:
x[0]
#> character(0)
- Subsetting with a logical vector
x <- c(10, 3, NA, 5, 8, 1, NA)
x[x > 0]
# [1] 10 3 NA 5 8 1 NA
subset(x, x > 0)
# [1] 10 3 5 8 1
# subset() drops the NAs
[[ only ever extracts a single element, and always drops names.
- The difference between x[x >= 0] and x[-which(x < 0)]
x
# [1] 10 4 NA 5 8 1 NA
x[x >= 0]
# [1] 10 4 NA 5 8 1 NA
x[-which(x < 0)]
# numeric(0)
# When which() finds no match it returns integer(0), so the negative index selects nothing instead of removing nothing
y
# [1] 10 -4 NA 5 8 1 NA
y[y >= 0]
# [1] 10 NA 5 8 1 NA
y[-which(y < 0)]
# [1] 10 NA 5 8 1 NA
# When which() finds a match, the two results are the same
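One way to avoid the empty-index pitfall is to guard the negative subscript. In the sketch below, drop_negative is a hypothetical helper written for these notes, not a function from the book or any package.

```r
# Hypothetical helper: drop negative values, keep NAs, and cope with
# the case where nothing is negative.
drop_negative <- function(v) {
  idx <- which(v < 0)   # which() skips NA and may return integer(0)
  if (length(idx) == 0) v else v[-idx]
}

drop_negative(c(10, 4, NA, 5, 8, 1, NA))
#> [1] 10  4 NA  5  8  1 NA
drop_negative(c(10, -4, NA, 5, 8, 1, NA))
#> [1] 10 NA  5  8  1 NA
```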
Recursive vectors (lists)
Lists are a step up in complexity from atomic vectors, because lists can contain other lists.
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
#> List of 3
#> $ a: num 1
#> $ b: num 2
#> $ c: num 3
y <- list("a", 1L, 1.5, TRUE)
str(y)
#> List of 4
#> $ : chr "a"
#> $ : int 1
#> $ : num 1.5
#> $ : logi TRUE
# Nested lists
z <- list(list(1, 2), list(3, 4))
str(z)
#> List of 2
#> $ :List of 2
#> ..$ : num 1
#> ..$ : num 2
#> $ :List of 2
#> ..$ : num 3
#> ..$ : num 4
- Visualising lists
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
- Subsetting
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
str(a[1:4])
# List of 4
# $ a: int [1:3] 1 2 3
# $ b: chr "a string"
# $ c: num 3.14
# $ d:List of 2
# ..$ : num -1
# ..$ : num -5
str(a[2:3])
# List of 2
# $ b: chr "a string"
# $ c: num 3.14
str(a[4])
#> List of 1
#> $ d:List of 2
#> ..$ : num -1
#> ..$ : num -5
- Two operators for lists: [[ and $
(1) [[ extracts a single component from a list. It removes a level of hierarchy from the list.
str(a[4])
# List of 1
# $ d:List of 2
# ..$ : num -1
# ..$ : num -5
str(a[[4]])
# List of 2
# $ : num -1
# $ : num -5
(2) $ is a shorthand for extracting named elements of a list.
a$d
# [[1]]
# [1] -1
#
# [[2]]
# [1] -5
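The three operators can be contrasted on a single list. The sketch below assumes a is defined as the str() output above suggests (the value pi for element c is inferred from the printed 3.14).

```r
# Assumed definition, reconstructed from the str() output shown earlier.
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

is.list(a["d"])            #> [1] TRUE   ([ returns a smaller list)
length(a["d"])             #> [1] 1
length(a[["d"]])           #> [1] 2      ([[ returns the component itself)
identical(a$d, a[["d"]])   #> [1] TRUE   ($ is shorthand for [[ with a name)
a[["d"]][[1]]              #> [1] -1
```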
Attributes
x <- 1:10
attr(x, "greeting")
#> NULL
attr(x, "greeting") <- "Hi!"
attr(x, "farewell") <- "Bye!"
attributes(x)
#> $greeting
#> [1] "Hi!"
#>
#> $farewell
#> [1] "Bye!"
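Custom attributes like these are easy to lose. The following minimal sketch shows that subsetting with [ returns a vector without the greeting attribute.

```r
x <- 1:10
attr(x, "greeting") <- "Hi!"
names(attributes(x))   #> [1] "greeting"

# Subsetting with [ keeps names but drops custom attributes.
y <- x[1:3]
attributes(y)          #> NULL
```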
This introduces the concept of generic functions.
methods("as.Date")
#> [1] as.Date.character as.Date.default as.Date.factor
#> [4] as.Date.numeric as.Date.POSIXct as.Date.POSIXlt
#> [7] as.Date.vctrs_sclr* as.Date.vctrs_vctr*
#> see '?methods' for accessing help and source code
For example, if x is a character vector, as.Date() will call as.Date.character(); if it's a factor, it'll call as.Date.factor().
You can see the specific implementation of a method with getS3method():
getS3method("as.Date", "default")
#> function (x, ...)
#> {
#> if (inherits(x, "Date"))
#> x
#> else if (is.logical(x) && all(is.na(x)))
#> .Date(as.numeric(x))
#> else stop(gettextf("do not know how to convert '%s' to class %s",
#> deparse(substitute(x)), dQuote("Date")), domain = NA)
#> }
#> <bytecode: 0x4f30d48>
#> <environment: namespace:base>
getS3method("as.Date", "numeric")
#> function (x, origin, ...)
#> {
#> if (missing(origin))
#> stop("'origin' must be supplied")
#> as.Date(origin, ...) + x
#> }
#> <bytecode: 0x84fa058>
#> <environment: namespace:base>
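The dispatch mechanism can be imitated with a toy generic. In the sketch below, describe and its methods are hypothetical names invented for illustration; they show how UseMethod() picks a method based on the class of its first argument and falls back to the default method otherwise.

```r
# describe() is a hypothetical generic, not part of any package.
describe <- function(x, ...) UseMethod("describe")
describe.character <- function(x, ...) "a character vector"
describe.factor    <- function(x, ...) "a factor"
describe.default   <- function(x, ...) "something else"

describe("a")           #> [1] "a character vector"
describe(factor("a"))   #> [1] "a factor"
describe(1:3)           #> [1] "something else"
```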
Augmented vectors
- Factors
- Dates
x <- as.Date("1971-01-01")
unclass(x)
#> [1] 365
typeof(x)
#> [1] "double"
attributes(x)
#> $class
#> [1] "Date"
- Date-times
x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)
#> [1] 3600
#> attr(,"tzone")
#> [1] "UTC"
typeof(x)
#> [1] "double"
attributes(x)
#> $class
#> [1] "POSIXct" "POSIXt"
#>
#> $tzone
#> [1] "UTC"
If you find you have a POSIXlt, you should always convert it to a regular date-time with lubridate::as_datetime().
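The same structure can be inspected with base R alone. In this sketch, as.POSIXct() stands in for the lubridate parser and as.POSIXct() on a POSIXlt stands in for the lubridate conversion; both substitutions are made here only to keep the example free of package dependencies.

```r
d <- as.Date("1971-01-01")
unclass(d)                     #> [1] 365   (days since 1970-01-01)

ct <- as.POSIXct("1970-01-01 01:00", tz = "UTC")
as.numeric(ct)                 #> [1] 3600  (seconds since 1970-01-01 00:00 UTC)
typeof(ct)                     #> [1] "double"

# POSIXlt stores the same instant as a list of components.
lt <- as.POSIXlt(ct)
typeof(lt)                     #> [1] "list"
lt$hour                        #> [1] 1
as.numeric(as.POSIXct(lt))     #> [1] 3600
```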
- Tibbles
Tibbles are augmented lists: they have class "tbl_df" + "tbl" + "data.frame", and names (column) and row.names attributes.
- Q: Try and make a tibble that has columns with different lengths. What happens?
# A scalar is recycled; columns of unequal, non-scalar length cannot form a tibble
tibble(x = 1, y = 1:5)
#> # A tibble: 5 x 2
#> x y
#> <dbl> <int>
#> 1 1 1
#> 2 1 2
#> 3 1 3
#> 4 1 4
#> 5 1 5
tibble(x = 1:3, y = 1:4)
#> Tibble columns must have consistent lengths, only values of length one are recycled:
#> * Length 3: Column `x`
#> * Length 4: Column `y`