Searching the data in a Youdao wordbook

2018-05-10 董八七

I have saved a number of looked-up words and phrases in Youdao and want to be able to search through them.

Data cleaning

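Export the wordbook as a text file and open it in notepad. The replacements below are regular expressions, so the editor's regex search mode must be on.
Merge lines #each entry is spread over 3-5 lines
- replace \r\n
- replace \n
Remove the entry numbers #every exported entry is prefixed with a sequence number
- replace \d{1,},
Replace runs of multiple spaces with a single space
Remove the phonetic transcriptions #they show up as garbled characters in R
- replace \[.*?\] #the brackets must be escaped: an unescaped [.*] is a character class that matches a single "." or "*". The lazy .*? keeps the match from running from the first "[" to the last "]" once the lines are merged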
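The editor steps above can also be sketched in R, which keeps the cleaning reproducible. This is only a sketch under assumptions: the function name `clean_vocab` and the sample lines are hypothetical (ASCII stand-ins for real entries, which contain IPA and Chinese definitions), and splitting the text at the entry numbers is my assumption about how entries are kept apart, since the steps above do not say what each match is replaced with.

```r
# Hypothetical sample of an exported wordbook; the real export layout may differ.
raw_lines <- c(
  "1, depend on  [di'pend]",
  "rely on; be determined by",
  "2, in terms   of [in t@:mz]",
  "with regard to"
)

# Hypothetical helper mirroring the manual search-and-replace steps.
clean_vocab <- function(lines) {
  # Merge all lines, so an entry no longer spans several lines
  text <- paste(lines, collapse = " ")
  # Remove phonetic transcriptions: escaped brackets, lazy match
  text <- gsub("\\[.*?\\]", "", text, perl = TRUE)
  # Split at the entry numbers ("1,", "2,", ...), which also removes them.
  # Assumption: digits followed by a comma never occur inside a definition.
  entries <- strsplit(text, "\\d+,")[[1]]
  # Collapse runs of whitespace and trim both ends
  entries <- trimws(gsub("\\s+", " ", entries))
  # Drop the empty piece that precedes the first entry number
  entries[nzchar(entries)]
}

cleaned <- clean_vocab(raw_lines)
print(cleaned)
```

The result is a character vector with one entry per element, which `writeLines()` can save as a one-entry-per-line file for the import step.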

Import into R

library(tidyverse)
# Read the cleaned file, one entry per line, into a one-column tibble
vocab <- tibble(value = readLines("input/vocabulary_youdao.txt", encoding = "UTF-8"))

# Extract the entries that match a pattern and save them as a CSV file
d_ex_vocab <- function(patt) {
  # Keep the rows whose text matches the regular expression `patt`
  dong_word_extract <- vocab[stringr::str_detect(vocab$value, patt), ]
  if (nrow(dong_word_extract) == 0)
    stop("No word extracted, please check the spelling!")
  # The output/ directory must already exist
  write.csv(dong_word_extract, paste0("output/", patt, ".csv"), quote = FALSE, row.names = FALSE)
  return(dong_word_extract)
}

ex_word <- d_ex_vocab("取决于")
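As a possible variation on the function above, the following sketch passes the data in as an argument, matches case-insensitively, and replaces characters that are not allowed in file names (a pattern such as "depend|terms" would otherwise end up in the CSV name). The name `search_vocab`, its parameters, and the sample data are hypothetical and not part of the original workflow. It uses base R only.

```r
# Hypothetical sample data; in the workflow above this would be `vocab`.
vocab_demo <- data.frame(
  value = c("depend on rely on", "in terms of with regard to", "Depend upon"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: filter entries by a regular expression and
# optionally write the hits to <out_dir>/<pattern>.csv.
search_vocab <- function(vocab, patt, out_dir = NULL, ignore_case = TRUE) {
  keep <- grepl(patt, vocab$value, ignore.case = ignore_case, perl = TRUE)
  hits <- vocab[keep, , drop = FALSE]
  if (nrow(hits) == 0) {
    warning("No entry matched '", patt, "'.")
    return(hits)
  }
  if (!is.null(out_dir)) {
    dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
    # Replace characters that are invalid in file names on common systems
    file_stem <- gsub('[\\\\/:*?"<>|]', "_", patt, perl = TRUE)
    write.csv(hits, file.path(out_dir, paste0(file_stem, ".csv")),
              quote = FALSE, row.names = FALSE)
  }
  hits
}

demo_dir <- file.path(tempdir(), "vocab_demo")
hits <- search_vocab(vocab_demo, "depend|terms", out_dir = demo_dir)
print(hits)
```

Returning an empty result with a warning, instead of stopping, is a design choice that lets several searches run in a loop without aborting at the first miss.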
