Learning R Notes, Chapter 7 (Part 2): Factors

2018-02-13 天火燎原天

Creating factors

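Before R 4.0.0, creating or reading in a data frame converted character columns to factors by default (stringsAsFactors = TRUE). Since R 4.0.0 the default is FALSE, so the conversion has to be requested explicitly. factor() converts a character vector to a factor by hand. levels() and nlevels() show a factor's levels and how many of them there are.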

> x <- iris$Species
> class(x)
[1] "factor"
> levels(x) ; nlevels(x)
[1] "setosa"     "versicolor" "virginica" 
[1] 3
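As a minimal sketch of the manual route, the snippet below builds a factor from a character vector with factor(). The vector sizes is hypothetical example data, not something from the book.

```r
# Hypothetical example data: a character vector of sizes.
sizes <- c("small", "large", "medium", "small", "large")

# factor() sorts the distinct values to obtain the levels
# unless a levels = argument says otherwise.
f <- factor(sizes)
levels(f)      # "large" "medium" "small"
nlevels(f)     # 3

# Internally a factor is an integer vector of codes plus a levels attribute.
as.integer(f)  # 3 1 2 3 1
```

The integer codes index into levels(f), which is why the order of the levels matters in the sections that follow.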

Manipulating factors

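To change the order of a factor's levels, set it in factor(..., levels = c(...)). Do not assign a reordered vector to levels(): that relabels the existing levels instead of reordering them, so the data are silently changed. relevel() is a safer alternative: it moves one level to the front as the reference level, which is useful in some regression analyses.
In fact, relevel() is a wrapper around factor().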

relevel(x, ref, ...)
> y <- sample(x,6) ; y
[1] versicolor setosa     virginica  setosa     virginica  versicolor
Levels: setosa versicolor virginica
> relevel(y , 'versicolor') 
[1] versicolor setosa     virginica  setosa     virginica  versicolor
Levels: versicolor setosa virginica # the function returns a new factor; y is not modified
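The difference between reordering and relabeling is easy to miss, so here is a small sketch on a hypothetical factor f. It compares assignment to levels(), factor(..., levels = ) and relevel().

```r
# Hypothetical example factor; levels default to sorted order.
f <- factor(c("low", "high", "medium", "low"))
levels(f)            # "high" "low" "medium"

# Pitfall: assigning to levels() renames the levels in place,
# so the underlying data change.
bad <- f
levels(bad) <- c("low", "medium", "high")
as.character(bad)    # "medium" "low" "high" "medium"

# Safe: rebuild with factor(..., levels = ); only the order changes.
good <- factor(f, levels = c("low", "medium", "high"))
as.character(good)   # "low" "high" "medium" "low"

# relevel() only moves the chosen reference level to the front.
levels(relevel(f, ref = "medium"))  # "medium" "high" "low"
```

In bad the values no longer match the original data, while good and the relevel() result keep every value and change only the level order.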

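If every value belonging to some level is removed during data cleaning, the factor is left with unused, "empty" levels. droplevels() trims them. It accepts a factor or a data frame and returns a new factor or data frame.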

## S3 method for class 'factor'
droplevels(x, exclude = if(anyNA(levels(x))) NULL else NA, ...)
## S3 method for class 'data.frame'
droplevels(x, except, exclude, ...)
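A short sketch of both methods, using a hypothetical factor f and data frame df: subsetting keeps the unused level until droplevels() is called.

```r
# Hypothetical example data.
f <- factor(c("a", "b", "c", "a"))

# Subsetting removes the values but keeps the now-unused level "c".
kept <- f[f != "c"]
levels(kept)               # "a" "b" "c"

# Factor method: returns a new factor without the unused level.
levels(droplevels(kept))   # "a" "b"

# Data frame method: applied to every factor column.
df  <- data.frame(g = f, v = 1:4)
sub <- droplevels(df[df$g != "c", ])
levels(sub$g)              # "a" "b"
```

Neither call modifies kept or df; the trimmed result has to be assigned to be kept.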

Building factors from continuous variables

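cut() converts a continuous variable into a factor of intervals. Its breaks argument is "either a numeric vector of two or more unique cut points or a single number". In other words, pass either the number of intervals or a vector that gives every boundary, including both ends. Supplying only the interior cut points turns values outside the outermost breaks into NA.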

cut(x, breaks, labels = NULL, 
    include.lowest = FALSE, right = TRUE, dig.lab = 3, # right = TRUE: left-open, right-closed intervals
    ordered_result = FALSE, ...)

> x=runif(5,0,10)
> x
[1] 3.2502069 3.7256012 8.8114966 9.6004756 0.8837793
> cut(x,c(3,6,9)) # neither the lower nor the upper end is covered
[1] (3,6] (3,6] (6,9] <NA>  <NA> 
Levels: (3,6] (6,9]
> cut(x,c(3,6,9,Inf)) # the lower end is not covered
[1] (3,6]   (3,6]   (6,9]   (9,Inf] <NA>   
Levels: (3,6] (6,9] (9,Inf]
> cut(x,c(-Inf,3,6,9,Inf)) # correct: both ends are covered
[1] (3,6]    (3,6]    (6,9]    (9, Inf] (-Inf,3]
Levels: (-Inf,3] (3,6] (6,9] (9, Inf]
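The other arguments in the signature can be sketched on fixed data, so the result does not depend on runif(). The vector scores and the label names are hypothetical.

```r
# Hypothetical example data.
scores <- c(12, 35, 58, 71, 94)

# Explicit breaks covering the whole range, with custom labels.
grade <- cut(scores, breaks = c(0, 50, 75, 100),
             labels = c("low", "mid", "high"))
as.character(grade)   # "low" "low" "mid" "mid" "high"

# A single number asks for that many equal-width intervals over the
# range of the data, so no value falls outside and none becomes NA.
binned <- cut(scores, breaks = 3)
nlevels(binned)       # 3
anyNA(binned)         # FALSE

# right = FALSE flips the intervals to left-closed, right-open,
# which changes where a value sitting on a break point lands.
as.character(cut(50, c(0, 50, 100)))                 # "(0,50]"
as.character(cut(50, c(0, 50, 100), right = FALSE))  # "[50,100)"
```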

A small data-cleaning trick

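Suppose a vector should be entirely numeric, but a problem in the input source has turned it into strings. What then?
For example, the vector x = c(4.645, 6.843, 2.187, 6.351, 7.338, 6.367) became c("4.645", "6..843", "2.187", "6.351", "7.338", "6.367") through a typo. If strings are also converted to factors when the data are read in (the default before R 4.0.0), what we actually hold is a factor y.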

> y
[1] 4.645  6..843 2.187  6.351  7.338  6.367 
Levels: 2.187 4.645 6..843 6.351 6.367 7.338

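The book recommends cleaning in the order factor, then string, then numeric. The R documentation recommends a slightly more efficient way: convert the factor's levels to numbers first, then index those numbers by the factor's integer codes (as.integer() on a factor returns the same codes that unclass() shows).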

> as.numeric(as.character(y))
[1] 4.645    NA 2.187 6.351 7.338 6.367
Warning message:
NAs introduced by coercion 
> as.numeric(levels(y))[as.integer(y)] # recommended approach
[1] 4.645    NA 2.187 6.351 7.338 6.367
Warning message:
NAs introduced by coercion
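Both conversions above go through the level labels. The reason is that as.numeric() applied directly to a factor returns the integer codes, not the values. The sketch below shows this on hypothetical data; factor_to_numeric is a made-up helper name for illustration and is not part of base R.

```r
# Hypothetical input: numbers stored as a factor, one entry mistyped.
y <- factor(c("30", "10", "oops", "20"))

# Pitfall: this returns the level codes (positions in levels(y)),
# not the numbers the labels spell out.
as.numeric(y)

# Hypothetical helper wrapping the levels-first conversion.
factor_to_numeric <- function(f) {
  # Convert each distinct level once, then expand by the factor's codes.
  # suppressWarnings() hides the "NAs introduced by coercion" warning.
  suppressWarnings(as.numeric(levels(f)))[as.integer(f)]
}

cleaned <- factor_to_numeric(y)
cleaned                 # 30 10 NA 20
which(is.na(cleaned))   # 3, the position of the mistyped entry
```

The NA positions point straight at the entries that need to be fixed by hand. Suppressing the warning is a choice made for this sketch; leaving it visible may be preferable in real cleaning work.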

Generating levels quickly (Generate Factor Levels)

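gl() is another shortcut for building a factor quickly: n is the number of levels, k is how many times each level is repeated in a row, and length is the total length of the result.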

gl(n, k, length = n*k, labels = seq_len(n), ordered = FALSE)
> gl(3,3,8,labels = LETTERS[1:3])
[1] A A A B B B C C
Levels: A B C
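A brief sketch of how k and length interact. The labels "ctrl" and "treat" are hypothetical.

```r
# Two levels, each repeated three times in a row.
blocks <- gl(2, 3, labels = c("ctrl", "treat"))
as.character(blocks)    # "ctrl" "ctrl" "ctrl" "treat" "treat" "treat"

# With k = 1 and a longer length, the pattern is recycled,
# which gives alternating levels.
alternating <- gl(2, 1, length = 6)
as.integer(alternating) # 1 2 1 2 1 2
```

This is handy for labeling balanced experimental designs, where the group pattern is regular.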

Interaction

interaction() crosses two factors to produce a new factor whose levels are the combinations of the original levels.

> x=gl(3,3,labels=LETTERS[1:3])
> y=gl(3,3,labels = LETTERS[24:26])
> interaction(x,y)
[1] A.X A.X A.X B.Y B.Y B.Y C.Z C.Z C.Z
Levels: A.X B.X C.X A.Y B.Y C.Y A.Z B.Z C.Z
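By default all nine combinations become levels, although only three of them occur in the data. The sketch below reuses the same two factors and shows the drop and sep arguments of interaction(), which remove unused combinations and change the separator.

```r
x <- gl(3, 3, labels = LETTERS[1:3])
y <- gl(3, 3, labels = LETTERS[24:26])

# Default: every combination of levels is kept, used or not.
nlevels(interaction(x, y))              # 9

# drop = TRUE keeps only the combinations that occur in the data.
levels(interaction(x, y, drop = TRUE))  # "A.X" "B.Y" "C.Z"

# sep changes the string placed between the two labels.
levels(interaction(x, y, drop = TRUE, sep = "_"))  # "A_X" "B_Y" "C_Z"
```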
