143. Generating word clouds in Python with the wordcloud package
1. Installing the wordcloud package in Anaconda
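- Download the wordcloud wheel that matches your operating system and the Python version of your Anaconda installation (in the file name, cp36 means CPython 3.6 and win_amd64 means 64-bit Windows).
Download page for Windows: http://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud
Then put the downloaded file in the directory you will run the install command from.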
- Install with pip
Open the Anaconda Prompt and first change to the folder that contains the .whl file. I put mine in C:\notebook; use whichever folder you saved your download to.
Then run: pip install wordcloud-1.4.1-cp36-cp36m-win_amd64.whl
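After installing, a quick way to confirm that the package is visible to the interpreter you are actually running is to ask Python whether it can locate the module. The sketch below uses only the standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if this interpreter can locate the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Prints True only when wordcloud is importable from the current environment.
    print("wordcloud installed:", is_installed("wordcloud"))
```

If this prints False inside Anaconda Prompt, the wheel was probably installed into a different Python environment than the one you are running.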
2. Generating a word cloud with the wordcloud package
I followed the blog post below, which explains how to use the wordcloud package in detail.
Link: http://blog.csdn.net/u010309756/article/details/67637930
Here is the text to be analyzed:
How the Word Cloud Generator Works
The layout algorithm for positioning words without overlap is available on GitHub under an open source license as d3-cloud. Note that this is the only the layout algorithm and any code for converting text into words and rendering the final output requires additional development.
As word placement can be quite slow for more than a few hundred words, the layout algorithm can be run asynchronously, with a configurable time step size. This makes it possible to animate words as they are placed without stuttering. It is recommended to always use a time step even without animations as it prevents the browser’s event loop from blocking while placing the words.
The layout algorithm itself is incredibly simple. For each word, starting with the most “important”:
Attempt to place the word at some starting point: usually near the middle, or somewhere on a central horizontal line. If the word intersects with any previously placed words, move it one step along an increasing spiral. Repeat until no intersections are found. The hard part is making it perform efficiently! According to Jonathan Feinberg, Wordle uses a combination of hierarchical bounding boxes and quadtrees to achieve reasonable speeds.
Glyphs in JavaScript
There isn’t a way to retrieve precise glyph shapes via the DOM, except perhaps for SVG fonts. Instead, we draw each word to a hidden canvas element, and retrieve the pixel data.
Retrieving the pixel data separately for each word is expensive, so we draw as many words as possible and then retrieve their pixels in a batch operation.
Sprites and Masks
My initial implementation performed collision detection using sprite masks. Once a word is placed, it doesn't move, so we can copy it to the appropriate position in a larger sprite representing the whole placement area.
The advantage of this is that collision detection only involves comparing a candidate sprite with the relevant area of this larger sprite, rather than comparing with each previous word separately.
Somewhat surprisingly, a simple low-level hack made a tremendous difference: when constructing the sprite I compressed blocks of 32 1-bit pixels into 32-bit integers, thus reducing the number of checks (and memory) by 32 times.
In fact, this turned out to beat my hierarchical bounding box with quadtree implementation on everything I tried it on (even very large areas and font sizes). I think this is primarily because the sprite version only needs to perform a single collision test per candidate area, whereas the bounding box version has to compare with every other previously placed word that overlaps slightly with the candidate area.
Another possibility would be to merge a word’s tree with a single large tree once it is placed. I think this operation would be fairly expensive though compared with the analagous sprite mask operation, which is essentially ORing a whole block.
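As an aside, the placement loop that the sample text describes (try a starting point near the middle, then step outward along a spiral until nothing overlaps) can be sketched in a few lines of Python. This only illustrates the quoted description; it is not the code of d3-cloud or of the Python wordcloud package. The function names, the spiral step sizes, and the use of plain bounding boxes instead of sprite masks are all assumptions made for the example.

```python
import math
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height); x, y is the top-left corner


def overlaps(a: Rect, b: Rect) -> bool:
    """Return True if two axis-aligned rectangles intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def place_words(sizes: List[Tuple[float, float]],
                canvas: Tuple[float, float] = (400.0, 200.0),
                max_steps: int = 5000) -> List[Optional[Rect]]:
    """Place boxes (most important first) by walking outward along a spiral."""
    cx, cy = canvas[0] / 2, canvas[1] / 2
    placed: List[Rect] = []
    result: List[Optional[Rect]] = []
    for w, h in sizes:
        spot = None
        for step in range(max_steps):
            angle = 0.1 * step    # hypothetical angular step
            radius = 0.5 * angle  # radius grows with the angle (Archimedean spiral)
            x = cx + radius * math.cos(angle) - w / 2
            y = cy + radius * math.sin(angle) - h / 2
            candidate = (x, y, w, h)
            inside = x >= 0 and y >= 0 and x + w <= canvas[0] and y + h <= canvas[1]
            if inside and not any(overlaps(candidate, p) for p in placed):
                spot = candidate
                break
        if spot is not None:
            placed.append(spot)
        result.append(spot)  # None means no free position was found
    return result


boxes = place_words([(100, 40), (80, 30), (60, 20), (60, 20)])
```

Each box is compared against every previously placed box, which is the slow approach the sample text contrasts with a single sprite-mask test per candidate position.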
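The code implementation follows:
First, import the required packages: [figure: 3.导入相关包.png]
(1) Generating a word cloud with a background image
The background image I used: [figure: 4.love.jpg]
The generated word cloud: [figures: 5.png, 6.生成词云图.png]
(2) Without a background image: [figure: 7.不使用背景图片.png]
Source code: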
# coding: utf-8
# # Generating word clouds in Python with the wordcloud package
# I followed the blog post below, which explains how to use the wordcloud package in detail.
#
# Link: [生成词云之python中WordCloud包的用法](https://blog.csdn.net/u010309756/article/details/67637930)
# The text to analyze is the "How the Word Cloud Generator Works" passage quoted above.
# ### Code implementation
# In[1]:
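# Import wordcloud, matplotlib, and the image-loading helpers
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

def imread(path):
    # scipy.misc.imread was removed in SciPy 1.2, so load the image with Pillow instead
    return np.array(Image.open(path))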
# In[6]:
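# Read the text file: copy the sample text above into a file named word.txt (use your own path)
# encoding='utf-8' assumes the file was saved as UTF-8; change it if yours was saved differently
with open('D:\\Python\\notebook\\word.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(text)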
# In[3]:
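# Generate a word cloud using a background image
'''
mask : nd-array or None (default=None) // If not None, a mask that defines where words may be drawn. When mask is set, width and height are ignored and the shape of mask is used instead.
Pure white (#FFFFFF) areas are not drawn on; words are placed in all other areas. Example: bg_pic = imread('some_image.png'). The image background must be pure white (#FFFFFF),
and the word cloud takes the shape of the non-white region. You can paste the shape you want onto a pure white canvas in an image editor such as Photoshop and save it.
background_color : color value (default="black") // Background color, e.g. background_color='black' gives a black background.
scale : float (default=1) // Scales the canvas, e.g. 1.5 makes both width and height 1.5 times the original.
generate(text) // Generate the word cloud from text
'''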
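# Load the background image
bg_pic = imread('D:\\Python\\notebook\\love.jpg')
# Generate the word cloud
wordcloud = WordCloud(mask=bg_pic, background_color='black', scale=1.5).generate(text)
# Build a color function from the background image
# (not applied below; to use it, show wordcloud.recolor(color_func=image_colors) instead)
image_colors = ImageColorGenerator(bg_pic)
# Display the word cloud
plt.imshow(wordcloud)
plt.axis('off')
plt.show()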
# In[7]:
# Generate a word cloud without a background image
# Generate the word cloud
wordcloud = WordCloud(background_color='black', scale=1.5).generate(text)
# Display the word cloud
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
# In[5]:
# Save the image (uncomment the next line to write the file)
#wordcloud.to_file('D:\\Python\\notebook\\test.jpg')
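`generate(text)` works from word frequencies: the text is split into words, common stop words are dropped, and more frequent words are drawn larger. A rough standard-library sketch of that counting step, handy for checking which words will dominate the image, might look like this. The tiny stop-word set and the function name `word_frequencies` are assumptions for illustration; the package ships its own, much larger `STOPWORDS` set, and its tokenization differs in the details.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set for this example only.
DEMO_STOPWORDS = {"the", "a", "an", "is", "it", "of", "to", "and", "as", "for", "in"}


def word_frequencies(text: str, stopwords=DEMO_STOPWORDS) -> Counter:
    """Count lowercase words in text, skipping stop words and one-character tokens."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if len(w) > 1 and w not in stopwords)


sample = "The layout algorithm itself is incredibly simple. The layout is fast."
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

A mapping like this can also be passed to `WordCloud.generate_from_frequencies` instead of calling `generate(text)`, which is useful when you want full control over the counts.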