R 数据处理（二十二）— readxl

2021-02-02 本文已影响0人名本无名

单元格和列的类型

默认情况下， read_excel() 会自动推断列的类型，当然你也可以通过 col_types 显式设置类型。

col_types 参数的使用是非常灵活的，你可以将实际类型与 skip 和 guess 混合使用，如果是单一的类型，会被重复使用。

read_excel("yo.xlsx")
read_excel("yo.xlsx", col_types = "numeric")
read_excel("yo.xlsx", col_types = c("date", "skip", "guess", "numeric"))

1. 类型推测

前面我们介绍过 readr 的类型推测，但是 readxl 与 readr 有点不太一样

readr：根据数据猜测列类型
readxl：根据 Excel 单元格类型猜测列类型

Excel 电子表格中的每个单元格都有其自己的类型

empty < boolean < numeric < text

read_excel() 会将 guess_max 行内或遍历完数据之后选择最大的类型作为该列的类型。

例如

> readxl_example("deaths.xlsx") %>% 
+     read_excel(range = cell_rows(5:15))
# A tibble: 10 x 6                                                                                   
   Name               Profession   Age `Has kids` `Date of birth`     `Date of death`    
   <chr>              <chr>      <dbl> <lgl>      <dttm>              <dttm>             
 1 David Bowie        musician      69 TRUE       1947-01-08 00:00:00 2016-01-10 00:00:00
 2 Carrie Fisher      actor         60 TRUE       1956-10-21 00:00:00 2016-12-27 00:00:00
 3 Chuck Berry        musician      90 TRUE       1926-10-18 00:00:00 2017-03-18 00:00:00
 4 Bill Paxton        actor         61 TRUE       1955-05-17 00:00:00 2017-02-25 00:00:00
 5 Prince             musician      57 TRUE       1958-06-07 00:00:00 2016-04-21 00:00:00
 6 Alan Rickman       actor         69 FALSE      1946-02-21 00:00:00 2016-01-14 00:00:00
 7 Florence Henderson actor         82 TRUE       1934-02-14 00:00:00 2016-11-24 00:00:00
 8 Harper Lee         author        89 FALSE      1926-04-28 00:00:00 2016-02-19 00:00:00
 9 Zsa Zsa Gábor      actor         99 TRUE       1917-02-06 00:00:00 2016-12-18 00:00:00
10 George Michael     musician      53 FALSE      1963-06-25 00:00:00 2016-12-25 00:00:00

2. 类型比较

Excel	base R	col_types
anything	non-existent	"skip"
empty	logical, but all NA	you cannot request this
boolean	logical	"logical"
numeric	numeric	"numeric"
datetime	POSIXct	"date"
text	character	"text"
anything	list	"list"

对于前两行进行说明：

如果你不希望解析并加载某一列，可以使用 skip
如果某一列全部为空，会自动跳过该列

例如，使用 skip 和 guess

> readxl_example("deaths.xlsx") %>% 
+   read_excel(range = cell_rows(5:15),
+              col_types = c("guess", "skip", "guess", "skip", "skip", "skip"))
# A tibble: 10 x 2                                                                                   
   Name                 Age
   <chr>              <dbl>
 1 David Bowie           69
 2 Carrie Fisher         60
 3 Chuck Berry           90
 4 Bill Paxton           61
 5 Prince                57
 6 Alan Rickman          69
 7 Florence Henderson    82
 8 Harper Lee            89
 9 Zsa Zsa Gábor         99
10 George Michael        53

而对于 list 类型，会将该列解析为长度为 1 的 list。如果某一列包含不同的数据类型，使用这种解析方式将会很方便

我们使用 clippy.xlsx 文件进行说明，它的第二列包含了关于 Clippy 的信息，如果只使用一种类型，就很难存储这些信息

> (clippy <- readxl_example("clippy.xlsx") %>% 
+               read_excel(col_types = c("text", "list")))
# A tibble: 4 x 2                                                                                    
  name                 value     
  <chr>                <list>    
1 Name                 <chr [1]> 
2 Species              <chr [1]> 
3 Approx date of death <dttm [1]>
4 Weight in grams      <dbl [1]>

# 将包含两列的数据框转换为命名的向量或列表
> tibble::deframe(clippy)
$Name
[1] "Clippy"

$Species
[1] "paperclip"

$`Approx date of death`
[1] "2007-01-01 UTC"

$`Weight in grams`
[1] 0.9

> sapply(clippy$value, class)
[[1]]
[1] "character"

[[2]]
[1] "character"

[[3]]
[1] "POSIXct" "POSIXt" 

[[4]]
[1] "numeric"

工作流程

我们处理 Excel 文件是不是每次都要将其读取进来，然后通过一些列的操作将我们需要的数据从中提取出来呢？

其实不是的，我们完全可以将对我们有用的信息保存下来，而不是每次读取原始文件，然后再进行一般数据提取。

一般，我们可以将处理好的数据保存为 csv 文件，这样可以方便我们后续的读取与分析。

我们可以将 read_excel() 的输出直接传递到 readr::write_csv，如下所示

iris_xl <- readxl_example("datasets.xlsx") %>%
  read_excel(sheet = "iris") %>%
  write_csv("iris-raw.csv")

readr::write_csv() 是一个便捷的写入函数，它在完成写入之后会以不可见的方式返回输入的内容。

上面的代码从 datasets.xlsx 中读取 iris 表，并将其写入文件中。

> iris_xl
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows
> dir(pattern = "iris")
[1] "iris-raw.csv"

写入 CSV 的数据是否与我们从 Excel 导入的数据完全相同?

> (iris_alt <- read_csv("iris-raw.csv"))

─ Column specification ────────────────────────────────────────
cols(
  Sepal.Length = col_double(),
  Sepal.Width = col_double(),
  Petal.Length = col_double(),
  Petal.Width = col_double(),
  Species = col_character()
)

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows

# 删除上面的推测类型的信息
> attr(iris_alt, "spec") <- NULL
> identical(iris_xl, iris_alt)
[1] FALSE

合并数据表

要将 Excel 表中的所有数据表加载到列表中，主要分为两步

获取所有的数据表名称
使用 purrr::map 迭代读取数据表

> path <- readxl_example("datasets.xlsx")
> path %>%
+   excel_sheets() %>%
+   set_names() %>%
+   map(read_excel, path = path)
$iris                                                                                                
# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# … with 140 more rows

$mtcars
# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# … with 22 more rows

$chickwts
# A tibble: 71 x 2
   weight feed     
    <dbl> <chr>    
 1    179 horsebean
 2    160 horsebean
 3    136 horsebean
 4    227 horsebean
 5    217 horsebean
 6    168 horsebean
 7    108 horsebean
 8    124 horsebean
 9    143 horsebean
10    140 horsebean
# … with 61 more rows

$quakes
# A tibble: 1,000 x 5
     lat  long depth   mag stations
   <dbl> <dbl> <dbl> <dbl>    <dbl>
 1 -20.4  182.   562   4.8       41
 2 -20.6  181.   650   4.2       15
 3 -26    184.    42   5.4       43
 4 -18.0  182.   626   4.1       19
 5 -20.4  182.   649   4         11
 6 -19.7  184.   195   4         12
 7 -11.7  166.    82   4.8       43
 8 -28.1  182.   194   4.4       15
 9 -28.7  182.   211   4.7       35
10 -17.5  180.   622   4.3       19
# … with 990 more rows

如果我们想一次读取所有的数据表并同时缓存到 CSV 呢？

我们定义如下函数，用于将单个数据表写入 CSV 文件中

read_then_csv <- function(sheet, path) {
  pathbase <- path %>%
    basename() %>%
    tools::file_path_sans_ext()
  path %>%
    read_excel(sheet = sheet) %>%
    write_csv(paste0(pathbase, "-", sheet, ".csv"))
}

应用 map 函数将 datasets.xlsx 的所有数据表写入

> path <- readxl_example("datasets.xlsx")
> path %>%
+     excel_sheets() %>%
+     set_names() %>%
+     map(read_then_csv, path = path)

既然已经能够一次性读取所有的数据表了，那怎么将它们合并成一个数据框呢？

我们使用 readxl 附带的数据 deaths.xlsx，其中包含了 2016-2017 去世的著名人物。

还包含了两个名为 arts 和 other 的数据表，每个电子表格的布局都相同，并且数据表具有相同的变量，例如名称和死亡日期

purrr 中的 map_df 函数可以很容易的在数据表上迭代，并将数据合并

> path <- readxl_example("deaths.xlsx")
> deaths <- path %>%
+   excel_sheets() %>%
+   set_names() %>%
+   map_df(~ read_excel(path = path, sheet = .x, range = "A5:F15"), .id = "sheet")
                                                                                                     
> deaths
# A tibble: 20 x 7
   sheet Name               Profession   Age `Has kids` `Date of birth`     `Date of death`    
   <chr> <chr>              <chr>      <dbl> <lgl>      <dttm>              <dttm>             
 1 arts  David Bowie        musician      69 TRUE       1947-01-08 00:00:00 2016-01-10 00:00:00
 2 arts  Carrie Fisher      actor         60 TRUE       1956-10-21 00:00:00 2016-12-27 00:00:00
 3 arts  Chuck Berry        musician      90 TRUE       1926-10-18 00:00:00 2017-03-18 00:00:00
 4 arts  Bill Paxton        actor         61 TRUE       1955-05-17 00:00:00 2017-02-25 00:00:00
 5 arts  Prince             musician      57 TRUE       1958-06-07 00:00:00 2016-04-21 00:00:00
 6 arts  Alan Rickman       actor         69 FALSE      1946-02-21 00:00:00 2016-01-14 00:00:00
 7 arts  Florence Henderson actor         82 TRUE       1934-02-14 00:00:00 2016-11-24 00:00:00
 8 arts  Harper Lee         author        89 FALSE      1926-04-28 00:00:00 2016-02-19 00:00:00
 9 arts  Zsa Zsa Gábor      actor         99 TRUE       1917-02-06 00:00:00 2016-12-18 00:00:00
10 arts  George Michael     musician      53 FALSE      1963-06-25 00:00:00 2016-12-25 00:00:00
11 other Vera Rubin         scientist     88 TRUE       1928-07-23 00:00:00 2016-12-25 00:00:00
12 other Mohamed Ali        athlete       74 TRUE       1942-01-17 00:00:00 2016-06-03 00:00:00
13 other Morley Safer       journalist    84 TRUE       1931-11-08 00:00:00 2016-05-19 00:00:00
14 other Fidel Castro       politician    90 TRUE       1926-08-13 00:00:00 2016-11-25 00:00:00
15 other Antonin Scalia     lawyer        79 TRUE       1936-03-11 00:00:00 2016-02-13 00:00:00
16 other Jo Cox             politician    41 TRUE       1974-06-22 00:00:00 2016-06-16 00:00:00
17 other Janet Reno         lawyer        78 FALSE      1938-07-21 00:00:00 2016-11-07 00:00:00
18 other Gwen Ifill         journalist    61 FALSE      1955-09-29 00:00:00 2016-11-14 00:00:00
19 other John Glenn         astronaut     95 TRUE       1921-07-28 00:00:00 2016-12-08 00:00:00
20 other Pat Summit         coach         64 TRUE       1952-06-14 00:00:00 2016-06-28 00:00:00

在这里我们使用了 range = "A5:E15" 来提取矩形区域，因为表格顶部和底部包含了一些非数据行。

最后，我们可以将这些代码串联起来，完成一个完整的从读取到合并再写出的过程

path <- readxl_example("deaths.xlsx")
sheets <- path %>%
  excel_sheets() %>%
  set_names()
ranges <- list("A5:F15", cell_rows(5:15))
deaths <- map2_df(
  sheets,
  ranges,
  ~ read_excel(path, sheet = .x, range = .y),
  .id = "sheet"
) %>%
  write_csv("deaths.csv")

下面我们使用基础的 R 代码来实现同样的功能的，以便大家进行比较

缓存为 CSV

iris_xl <- read_excel(readxl_example("datasets.xlsx"), sheet = "iris")
write.csv(iris_xl, "iris-raw.csv", row.names = FALSE, quote = FALSE)
iris_alt <- read.csv("iris-raw.csv", stringsAsFactors = FALSE)
## coerce iris_xl back to a data.frame
identical(as.data.frame(iris_xl), iris_alt)

迭代 Excel 文件的所有数据表

path <- readxl_example("datasets.xls")
sheets <- excel_sheets(path)
xl_list <- lapply(excel_sheets(path), read_excel, path = path)
names(xl_list) <- sheets

迭代并缓存

read_then_csv <- function(sheet, path) {
  pathbase <- tools::file_path_sans_ext(basename(path))
  df <- read_excel(path = path, sheet = sheet)
  write.csv(df, paste0(pathbase, "-", sheet, ".csv"),
            quote = FALSE, row.names = FALSE)
  df
}
path <- readxl_example("datasets.xlsx")
sheets <- excel_sheets(path)
xl_list <- lapply(excel_sheets(path), read_then_csv, path = path)
names(xl_list) <- sheets

合并为数据框

path <- readxl_example("deaths.xlsx")
sheets <- excel_sheets(path)
xl_list <-
  lapply(excel_sheets(path), read_excel, path = path, range = "A5:F15")
xl_list <- lapply(seq_along(sheets), function(i) {
  data.frame(sheet = I(sheets[i]), xl_list[[i]])
})
xl_list <- do.call(rbind, xl_list)

串联整个流程

path <- readxl_example("deaths.xlsx")
sheets <- excel_sheets(path)
ranges <- list("A5:F15", cell_rows(5:15))
xl_list <- mapply(function(x, y) {
  read_excel(path = path, sheet = x, range = y)
}, sheets, ranges, SIMPLIFY = FALSE)
xl_list <- lapply(seq_along(sheets), function(i) {
  data.frame(sheet = I(sheets[i]), xl_list[[i]])
})
xl_list <- do.call(rbind, xl_list)
write.csv(xl_list, "deaths.csv", row.names = FALSE, quote = FALSE)