【R】Speed Up Your R Code
2022-05-31
caokai001
Background
I've recently been working on a project that involves merging SAS files scattered across different folders.
With so many files, the merge was painfully slow, so I asked colleagues about the speed-up tricks they commonly use.
Steps worth optimizing
1. Reading data
Selecting only the specific columns you need at read time, before any downstream work, gives some speed-up (the impact is modest).
read_sas(data_file, col_select = NULL)   # the default NULL reads every column; pass a selection to read fewer
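A minimal sketch with haven, where the file name and the Subject/Folder columns are hypothetical:

library(haven)
# Read only the columns needed downstream; col_select accepts tidyselect syntax
df <- read_sas("demo.sas7bdat", col_select = c(Subject, Folder))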
2. Loop optimization (for loops)
With a plain for loop, create the output container up front and write into it inside the loop, as in the sketch below;
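A minimal pre-allocation sketch (a toy computation, not the SAS task): reserve the container once, fill it by index, and combine at the end.

n <- 1e5
out <- vector("list", n)     # pre-allocate the full container up front
for (i in seq_len(n)) {
  out[[i]] <- i^2            # write into the reserved slot
}
result <- unlist(out)        # combine once, outside the loop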
- foreach + doParallel: a multi-core parallel scheme
library(foreach)
library(doParallel)
no_cores <- parallel::detectCores() - 1   # leave one core for the OS
# registerDoParallel(no_cores) also works (implicit cluster)
cl <- makeCluster(no_cores)
registerDoParallel(cl)
- foreach() needs the %dopar% operator to actually run iterations in parallel
# For vector output, set .combine = c
base <- 3
foreach(exponent = 1:5, .combine = c) %dopar% base^exponent
[1] 3 9 27 81 243
# For matrix output, set .combine = rbind
foreach(exponent = 1:5, .combine = rbind) %dopar% base^exponent
[,1]
result.1 3
result.2 9
result.3 27
result.4 81
result.5 243
# For list output, set .combine = list together with .multicombine = TRUE
foreach(exponent = 1:5, .combine = list, .multicombine=TRUE) %dopar% base^exponent
[[1]]
[1] 3
[[2]]
[1] 9
[[3]]
[1] 27
[[4]]
[1] 81
[[5]]
[1] 243
# For data frame output, set .combine = data.frame
foreach(exponent = 1:5, .combine = data.frame) %dopar% base^exponent
  result.1 result.2 result.3 result.4 result.5
1        3        9       27       81      243
# Shut down the cluster when finished
stopCluster(cl)   # use stopImplicitCluster() if you registered with registerDoParallel(no_cores)
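Applied to the original merging problem, a sketch under assumptions: the rawdata folder, the file listing, and bind_rows as the combiner are illustrative, not the author's exact code.

library(foreach)
library(doParallel)
library(haven)
library(dplyr)

cl <- makeCluster(4)
registerDoParallel(cl)

# Hypothetical folder of SAS files to merge
sas_files <- list.files("rawdata", pattern = "\\.sas7bdat$", full.names = TRUE)

# Each worker reads one file; .packages loads haven on the workers,
# and bind_rows stacks the pieces into one data frame
merged <- foreach(f = sas_files, .combine = bind_rows, .packages = "haven") %dopar% {
  read_sas(f)
}

stopCluster(cl)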
3. mclapply
Suited to a pipeline mindset, where you keep chaining operations downstream. (mclapply, from the parallel package, forks processes, so mc.cores > 1 is unavailable on Windows.)
library(parallel)
library(haven)
library(tidyverse)   # for %>%, set_names() and str_glue()

all_data <- mclapply(
  seq_along(all_ds_name) %>% set_names(all_ds_name),
  function(i) {
    read_sas(
      str_glue('{folder_rawdata_snap}/{all_ds_name[i]}.sas7bdat'),
      col_select = c(
        project, Site, StudySiteNumber, Subject, InstanceName, DataPageName, RecordPosition, MinCreated,
        instanceId, Folder
      )
    )
  },
  mc.cores = 8, mc.preschedule = FALSE
)
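For readers without the SAS data, the same pattern as a self-contained toy (squares on 2 cores):

library(parallel)
res <- mclapply(1:10, function(i) i^2, mc.cores = 2)
unlist(res)
[1]   1   4   9  16  25  36  49  64  81 100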
4. purrr::map family functions
- map(.x, .f, ...): returns a list
- map_lgl(), map_int(), map_dbl(), map_chr(): return an atomic vector of the matching type; with map_int(), beware automatic type promotion, since a computation on integers can silently produce doubles, which map_int() refuses (see the sketch after the example below)
- map_dfc(), map_dfr(): column-bind / row-bind the results into a data frame
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .x)) %>%
map_dfr(~ as.data.frame(t(as.matrix(coef(.)))))
# output:
(Intercept) wt
1 39.57120 -5.647025
2 28.40884 -2.780106
3 23.86803 -2.192438
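A minimal sketch of the map_int() caveat mentioned above: a fractional result always fails the lossless coercion check, so prefer map_dbl() when the result may not stay integer.

library(purrr)
map_dbl(1:3, ~ .x / 2)    # fine: 0.5 1.0 1.5
# map_int(1:3, ~ .x / 2)  # error: 0.5 cannot be losslessly coerced to integer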
5. Grouping and sorting
- group_by or arrange over many variables with many groups is slow; dtplyr can hand the work to data.table for a large speed-up
library(dplyr)
library(lubridate)
library(dtplyr)

query_posting_delay_query_detail <- query_detail %>%
  select(Study, SiteName, StudyEnvironmentSiteNumber, SubjectName, Folder, Form, Field, `Log#`, MarkingGroupName, QryOpenDate, Name) %>%
  filter(!(Name %in% c('Cancelled'))) %>%
  filter(!(MarkingGroupName %in% c('Site from System'))) %>%
  mutate(QryOpenDate = as_date(mdy_hms(QryOpenDate))) %>%
  # use data.table for the group summarise: 479.858 -> 0.112 seconds over 95008 groups
  lazy_dt() %>%
  group_by(Study, SiteName, StudyEnvironmentSiteNumber, SubjectName, Folder, Form, Field, `Log#`, MarkingGroupName) %>%
  summarise(QryOpenDate = min(QryOpenDate, na.rm = TRUE)) %>%
  ungroup() %>%
  as_tibble()
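The same lazy_dt() pattern as a self-contained toy on mtcars, for readers without query_detail (the grouping variables are illustrative only):

library(dplyr)
library(dtplyr)

mtcars %>%
  lazy_dt() %>%                 # switch to a data.table-backed lazy frame
  group_by(cyl, gear) %>%
  summarise(min_mpg = min(mpg, na.rm = TRUE)) %>%
  ungroup() %>%
  as_tibble()                   # materialize back as a tibble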
Comments and discussion welcome~
Reference:
https://www.jianshu.com/p/c498c9d4cfaf