Data source: http://labfile.oss.aliyuncs.com/courses/375/summer.tar.gz
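Many English words do not really function as keywords. Articles such as the and personal pronouns such as he carry no meaning of their own; they are only grammatical components of a sentence. Words of this kind need to be removed. This is data cleaning: what gets cleaned out are the words that have no effect on keyword extraction.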
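As a minimal sketch of this idea, the base R snippet below drops function words from a token vector. The `tokens` vector and the short `stop_words` list are made up for illustration; a realistic stop-word list would be much longer.

```r
# Hypothetical tokens and a deliberately tiny stop-word list (both are assumptions).
tokens <- c("the", "course", "of", "true", "love", "never", "did", "run", "smooth")
stop_words <- c("the", "of", "a", "and", "he", "did")

# Keep only the tokens that are not in the stop-word list.
keywords <- tokens[!(tokens %in% stop_words)]
print(keywords)
```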

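> library(Rwordseg)
> # Read the text and drop empty lines
> text <- readLines('summer.txt')
> res <- text[text!=""]
> # Segment each line into words, then split any remaining space-separated pieces
> words <- unlist(lapply(X=res, FUN=segmentCN))
> word <- lapply(X=words, FUN=strsplit, " ")
> # Count each word and sort by descending frequency
> v <- table(unlist(word))
> v <- sort(v, decreasing=TRUE)
> # The table expands into two columns, freq.Var1 and freq.Freq
> datas <- data.frame(word=names(v), freq=v)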
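`segmentCN` is designed mainly for Chinese word segmentation. For English text, a similar frequency table can be sketched with base R alone. The snippet below is an alternative under stated assumptions, not the pipeline used above: `lines` is a made-up stand-in for the contents of summer.txt, and splitting on every non-letter character is an illustrative choice that breaks contractions such as "I'll" into "I" and "ll".

```r
# Stand-in for readLines('summer.txt'); these lines are made up for illustration.
lines <- c("I'll meet thee, love.", "", "Love looks not with the eyes")

# Drop empty lines, then split on runs of non-letter characters.
res <- lines[lines != ""]
tokens <- unlist(strsplit(res, "[^[:alpha:]]+"))
tokens <- tokens[tokens != ""]

# Count each token and sort by descending frequency.
v <- sort(table(tokens), decreasing = TRUE)
datas <- data.frame(word = names(v), freq = as.vector(v), stringsAsFactors = FALSE)
print(head(datas))
```

Wrapping the table in `as.vector` yields a single `freq` column instead of the `freq.Var1` and `freq.Freq` pair.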


> summary(datas$freq.Freq)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   5.353   3.000 475.000 
> length(datas$freq.Freq)
[1] 3362
> head(datas, 100)
         word freq.Var1 freq.Freq
1         the       the       475
2           I         I       469
3         and       and       369
4          to        to       277
5          of        of       256
6         you       you       237
7           a         a       235
8          in        in       211
9         And       And       204
10         is        is       186
11         me        me       180
12         my        my       176
13        not       not       160
14       with      with       149
15          d         d       139
16       that      that       137
17          s         s       126
18         it        it       116
19       this      this       114
20       your      your       114
21       will      will       111
22        for       for       109
23       love      love       103
24         as        as        99
25         be        be        97
26         do        do        97
27        her       her        97
28       thou      thou        94
29       have      have        91
30        The       The        86
31         so        so        85
32        his       his        84
33        all       all        83
34         he        he        78
35         on        on        65
36      shall     shall        65
37       thee      thee        64
38          O         O        63
39         we        we        63
40        but       but        62
41         no        no        62
42         To        To        62
43        But       But        61
44   LYSANDER  LYSANDER        61
45        are       are        60
47     HERMIA    HERMIA        57
48        thy       thy        57
49         am        am        55
50     BOTTOM    BOTTOM        55
51    THESEUS   THESEUS        55
52        she       she        54
53         by        by        53
54       here      here        53
55        our       our        52
56        him       him        51
57      night     night        49
58       That      That        49
59      their     their        49
60        man       man        47
61       from      from        46
62       PUCK      PUCK        46
63         or        or        45
64     QUINCE    QUINCE        45
65     HELENA    HELENA        44
66       more      more        44
67    Pyramus   Pyramus        44
68      sweet     sweet        44
69     Hermia    Hermia        43
70       must      must        42
71        now       now        42
72       eyes      eyes        41
73  Demetrius Demetrius        40
74   Lysander  Lysander        40
75       come      come        39
76       What      What        39
77      Enter     Enter        38
78        see       see        38
79        You       You        38
80         at        at        37
81        one       one        37
82       what      what        37
83       good      good        36
84     OBERON    OBERON        36
85       play      play        36
86       This      This        36
87     Thisby    Thisby        36
88        For       For        35
89       hath      hath        35
90         if        if        35
91          A         A        34
92     should    should        34
93      would     would        34
94        did       did        33
95       make      make        33
96       Exit      Exit        32
97         go        go        32
98         ll        ll        31
99       some      some        31
100   TITANIA   TITANIA        31

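The frequencies are far from evenly distributed. The median is 1, so at least half of the words occur only once, and words that occur once can be ignored. The third quartile is 3, so among the 3362 distinct words the keywords are mainly found among those with frequencies between 3 and 475. Low-frequency words therefore need to be removed from the data set.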
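This reasoning can be checked directly on a frequency vector. The sketch below uses a small made-up `freq` vector, not the real data, to show how the share of words seen once and a third-quartile cutoff could be computed.

```r
# Made-up frequencies for illustration only (not the values from summer.txt).
freq <- c(475, 103, 12, 5, 3, 2, 1, 1, 1, 1, 1, 1)

# Share of words that occur exactly once.
share_once <- mean(freq == 1)

# Third quartile, used here as the cutoff for "frequent enough" words.
q3 <- unname(quantile(freq, 0.75))

# Frequencies at or above the cutoff.
kept <- freq[freq >= q3]
print(share_once)
print(q3)
print(kept)
```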
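The 22 most frequent words are all articles, personal pronouns, conjunctions, prepositions, and similar words unrelated to the content of the text; content-related words only start to appear at rank 23 (love). These words, which merely supply the grammatical structure of English sentences, also need to be removed.
First, the subset function performs an initial filter that drops the low-frequency words; then the words unrelated to the content are removed.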

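> # Keep the word and frequency columns and rename the latter
> datas <- datas[, c(1, 3)]
> colnames(datas)[2] <- "freq"
> # Keep words that occur at least 3 times, then drop the 22 most frequent rows
> newdatas <- subset(datas, freq>=3)
> newdatas <- newdatas[-c(1:22), ]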
> nrow(newdatas)
[1] 990
> head(newdatas, 30)
        word freq
23      love  103
24        as   99
25        be   97
26        do   97
27       her   97
28      thou   94
29      have   91
30       The   86
31        so   85
32       his   84
33       all   83
34        he   78
35        on   65
36     shall   65
37      thee   64
38         O   63
39        we   63
40       but   62
41        no   62
42        To   62
43       But   61
44  LYSANDER   61
45       are   60
47    HERMIA   57
48       thy   57
49        am   55
50    BOTTOM   55
51   THESEUS   55
52       she   54
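The listing above still contains `The`, `To`, and `But`, capitalized variants of common function words that are counted separately from their lowercase forms. One way to merge such variants, sketched here on a few rows copied from the full frequency table, is to fold case before summing. Note that this would also fold all-uppercase words such as LYSANDER into `lysander`, which may or may not be wanted.

```r
# A few rows copied from the full frequency table shown earlier.
datas <- data.frame(
  word = c("the", "The", "to", "To", "but", "But", "love"),
  freq = c(475, 86, 277, 62, 62, 61, 103),
  stringsAsFactors = FALSE
)

# Fold case, then sum the counts of words that become identical.
datas$word <- tolower(datas$word)
merged <- aggregate(freq ~ word, data = datas, FUN = sum)
merged <- merged[order(-merged$freq), ]
print(merged)
```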

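After the first cleaning pass the data set is still fairly large, and the top 30 rows shown by head still contain many words unrelated to the content of the text, so the data is sampled. Setting those words aside, several all-uppercase words stand out: LYSANDER, HERMIA, BOTTOM, and THESEUS. These proper nouns are keywords worth attention; they are probably names of people or places.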
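All-uppercase words like these can be picked out with a regular expression. The sketch below is one possible approach, applied to a few words copied from the listing above; requiring at least two letters is an assumption made here so that single letters such as O are not flagged.

```r
# A few words copied from the listing above.
words <- c("love", "LYSANDER", "The", "HERMIA", "O", "BOTTOM", "THESEUS", "she")

# TRUE for words made up only of uppercase letters, at least two of them.
is_upper <- grepl("^[[:upper:]]{2,}$", words)
print(words[is_upper])
```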

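> set.seed(3000)
> # Draw 30 row numbers between 1 and 989 (repeats are possible)
> sample_1 <- floor(runif(30,min=1,max=990))
> # Add row 1 (love) to the sampled rows
> new_sample <- newdatas[c(sample_1,1),]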

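The sample size is 31 (30 + 1), the integer part of the square root of the population size 990 (about 31.46). Because love, ranked first in the cleaned data, is a keyword related to the content of the text, it is added to the sample explicitly. The sample therefore consists of the rows whose numbers were drawn with floor(runif(...)) plus row 1.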
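Note that `floor(runif(30, min=1, max=990))` can draw the same row more than once, can draw row 1 again, and never reaches row 990. A sketch of an alternative with `sample`, which draws without replacement by default, follows. Here `newdatas` is a made-up stand-in with 990 rows, and this is not the sampling used above, so it would select different rows.

```r
# Stand-in for the cleaned data set: 990 rows with made-up words and frequencies.
newdatas <- data.frame(word = paste0("w", 1:990), freq = 990:1,
                       stringsAsFactors = FALSE)

n <- nrow(newdatas)
k <- floor(sqrt(n))  # integer part of the square root of the population size

set.seed(3000)
# Always keep row 1, then draw the other k - 1 rows from rows 2..n without replacement.
idx <- c(1, sample(2:n, k - 1))
new_sample <- newdatas[idx, ]
print(nrow(new_sample))
```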

> library(wordcloud)
> wordcloud(words=new_sample$word,freq=new_sample$freq,scale=c(10,.5),colors=rainbow(length(new_sample$freq)))
