09 Drawing a Word Cloud with R
Data source: http://labfile.oss.aliyuncs.com/courses/375/summer.tar.gz
Data Cleaning
Many English words do not work as keywords: articles such as "the" and personal pronouns such as "he" point to no real meaning of their own and only serve as grammatical building blocks of a sentence. Words of this kind have to be stripped out; this is data cleaning, the removal of words that contribute nothing to keyword extraction.
The more often a word occurs in the text, the more weight it carries in the article; the high-frequency content words are its keywords.
> library(Rwordseg)
>
> text <- readLines('summer.txt')                 # read the play line by line
> res <- text[text != ""]                         # drop empty lines
> words <- unlist(lapply(X=res, FUN=segmentCN))   # segment each line into words
> word <- lapply(X=words, FUN=strsplit, " ")      # split any remaining space-separated tokens
> v <- table(unlist(word))                        # word frequency table
> v <- sort(v, decreasing=TRUE)                   # sort by frequency, highest first
> datas <- data.frame(word=names(v), freq=v)      # columns: word, freq.Var1, freq.Freq
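Rwordseg depends on rJava and is primarily a Chinese segmentation package, so as a fallback the same frequency table can be built with base R alone. This is only a sketch: the object names text2, tokens, v2 and datas2 are illustrative, and the counts may differ slightly from those produced by segmentCN.

> text2 <- readLines('summer.txt')                    # assumes summer.txt is in the working directory
> text2 <- text2[text2 != ""]
> tokens <- unlist(strsplit(text2, "[^A-Za-z']+"))    # split on anything that is not a letter or apostrophe
> tokens <- tokens[tokens != ""]
> v2 <- sort(table(tokens), decreasing=TRUE)
> datas2 <- data.frame(word=names(v2), freq=as.integer(v2))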
Data preprocessing is essentially the removal of this noise. Start by looking at how the frequencies are distributed:
> summary(datas$freq.Freq)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 5.353 3.000 475.000
> length(datas$freq.Freq)
[1] 3362
> head(datas, 100)
word freq.Var1 freq.Freq
1 the the 475
2 I I 469
3 and and 369
4 to to 277
5 of of 256
6 you you 237
7 a a 235
8 in in 211
9 And And 204
10 is is 186
11 me me 180
12 my my 176
13 not not 160
14 with with 149
15 d d 139
16 that that 137
17 s s 126
18 it it 116
19 this this 114
20 your your 114
21 will will 111
22 for for 109
23 love love 103
24 as as 99
25 be be 97
26 do do 97
27 her her 97
28 thou thou 94
29 have have 91
30 The The 86
31 so so 85
32 his his 84
33 all all 83
34 he he 78
35 on on 65
36 shall shall 65
37 thee thee 64
38 O O 63
39 we we 63
40 but but 62
41 no no 62
42 To To 62
43 But But 61
44 LYSANDER LYSANDER 61
45 are are 60
46 DEMETRIUS DEMETRIUS 58
47 HERMIA HERMIA 57
48 thy thy 57
49 am am 55
50 BOTTOM BOTTOM 55
51 THESEUS THESEUS 55
52 she she 54
53 by by 53
54 here here 53
55 our our 52
56 him him 51
57 night night 49
58 That That 49
59 their their 49
60 man man 47
61 from from 46
62 PUCK PUCK 46
63 or or 45
64 QUINCE QUINCE 45
65 HELENA HELENA 44
66 more more 44
67 Pyramus Pyramus 44
68 sweet sweet 44
69 Hermia Hermia 43
70 must must 42
71 now now 42
72 eyes eyes 41
73 Demetrius Demetrius 40
74 Lysander Lysander 40
75 come come 39
76 What What 39
77 Enter Enter 38
78 see see 38
79 You You 38
80 at at 37
81 one one 37
82 what what 37
83 good good 36
84 OBERON OBERON 36
85 play play 36
86 This This 36
87 Thisby Thisby 36
88 For For 35
89 hath hath 35
90 if if 35
91 A A 34
92 should should 34
93 would would 34
94 did did 33
95 make make 33
96 Exit Exit 32
97 go go 32
98 ll ll 31
99 some some 31
100 TITANIA TITANIA 31
The frequencies are very unevenly distributed. The median is 1, which means about half of the words appear only once; such words can safely be ignored. The third quartile is 3, so among the 3,362 distinct words the useful keywords lie in the frequency range 3 to 475. The low-frequency entries therefore need to be removed from the data set.
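As a quick check of the claim that about half of the words occur only once, the proportion can be computed directly (a one-line sketch, not part of the original session):

> mean(datas$freq.Freq == 1)                      # proportion of words with frequency 1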
The 22 most frequent words are all articles, personal pronouns, conjunctions and prepositions that say nothing about the content; content-bearing words only start to appear at rank 23 ("love"). These grammatical function words also have to be removed.
First use the subset function for an initial filter, dropping the low-frequency words, and then remove the content-free function words.
> datas <- datas[, c(1, 3)]                       # keep only the word and frequency columns
> colnames(datas)[2] <- "freq"                    # rename freq.Freq to freq
> newdatas <- subset(datas, freq >= 3)            # drop words that occur fewer than 3 times
> newdatas <- newdatas[-c(1:22), ]                # drop the 22 highest-ranked function words
> nrow(newdatas)
[1] 990
> head(newdatas, 30)
word freq
23 love 103
24 as 99
25 be 97
26 do 97
27 her 97
28 thou 94
29 have 91
30 The 86
31 so 85
32 his 84
33 all 83
34 he 78
35 on 65
36 shall 65
37 thee 64
38 O 63
39 we 63
40 but 62
41 no 62
42 To 62
43 But 61
44 LYSANDER 61
45 are 60
46 DEMETRIUS 58
47 HERMIA 57
48 thy 57
49 am 55
50 BOTTOM 55
51 THESEUS 55
52 she 54
The data set is still fairly large after this first pass, and head shows that many of the top 30 entries are still function words unrelated to the content, so instead of removing them one by one a sample will be drawn. Note the upper-case entries LYSANDER, HERMIA, BOTTOM and THESEUS: such proper nouns, most likely character or place names, are exactly the kind of keyword we want to keep.
> set.seed(3000)                                  # fix the random seed so the sample is reproducible
> sample_1 <- floor(runif(30, min=1, max=990))    # 30 random row numbers between 1 and 990
> new_sample <- newdatas[c(sample_1, 1), ]        # the sampled rows plus row 1 ("love")
The sample size of 31 (30 + 1) is roughly the square root of the population size 990 (√990 ≈ 31). Because the top-ranked word "love" is clearly a content keyword, it is added to the sample explicitly: the final sample consists of the rows whose numbers were drawn with floor, plus row 1.
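As an aside, instead of dropping the first 22 rows by position, the function words could be filtered against a ready-made stop-word list. A minimal sketch, assuming the tm package is installed (it is not used elsewhere in this lab; stop_en and newdatas2 are illustrative names):

> library(tm)
> stop_en <- stopwords("en")                      # standard English stop-word list
> newdatas2 <- subset(datas, freq >= 3 & !(tolower(word) %in% stop_en))

Because each word is lower-cased before the comparison, capitalised variants such as "The" and "To" are removed as well.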
Drawing the Word Cloud
> library(wordcloud)
>
> # scale sets the largest and smallest font sizes; colors supplies the colour palette
> wordcloud(words=new_sample$word, freq=new_sample$freq,
+           scale=c(10, .5), colors=rainbow(length(new_sample$freq)))
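To write the cloud to an image file instead of the plot window, the call can be wrapped in a graphics device. This is a sketch; the file name and dimensions are arbitrary:

> png("wordcloud.png", width=800, height=800)
> wordcloud(words=new_sample$word, freq=new_sample$freq,
+           scale=c(10, .5), colors=rainbow(length(new_sample$freq)))
> dev.off()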
