sklearn: CountVectorizer Explained (Repost)
2020-05-27
快乐自由拉菲犬
![](https://img.haomeiwen.com/i3382609/7d25f22dc169b547.png)
![](https://img.haomeiwen.com/i3382609/1e2ebd1b8cfae763.png)
![](https://img.haomeiwen.com/i3382609/c195fd4310a16b3a.png)
![](https://img.haomeiwen.com/i3382609/0a8d765e76b3767c.png)
Setting a stop-word list to process Chinese documents
![](https://img.haomeiwen.com/i3382609/7c9ebbf37aef759d.png)
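Since the original example survives only as an image, here is a minimal sketch of the same idea. The documents and the stop-word list are hypothetical, and the text is assumed to be already segmented into space-separated words (for example by a word segmenter), because `CountVectorizer` splits on word boundaries and does not segment Chinese by itself.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, pre-segmented into words separated by spaces.
docs = [
    "我 喜欢 自然 语言 处理",
    "我 喜欢 机器 学习 的 方法",
]

# Hypothetical stop-word list: these tokens are excluded from the vocabulary.
stop_words = ["喜欢"]

vectorizer = CountVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(docs)

# Vocabulary terms ordered by their column index in X.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(X.toarray())
```

Note that the default `token_pattern` only keeps tokens of two or more characters, so single-character words such as "我" and "的" are dropped even though they are not in the stop-word list. If single-character words matter, a different `token_pattern` would be needed.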
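For the training set (that is, documents a and b), the term-frequency counts, the vocabulary list, and the vocabulary dictionary are: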
![](https://img.haomeiwen.com/i3382609/3719be0496498b28.png)
![](https://img.haomeiwen.com/i3382609/ceb37fd6a0a58dce.png)
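The three results mentioned above can be reproduced roughly as follows. The contents of `a` and `b` here are hypothetical stand-ins, since the original documents appear only in the screenshots.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training documents standing in for a and b.
a = "the cat sat on the mat"
b = "the dog sat"

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform([a, b])  # sparse document-term matrix

# Vocabulary dictionary: term -> column index.
vocab_dict = vectorizer.vocabulary_

# Vocabulary list: terms ordered by column index.
vocab_list = sorted(vocab_dict, key=vocab_dict.get)

print(vocab_list)        # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vocab_dict)
print(counts.toarray())  # [[1 0 1 1 1 2]
                         #  [0 1 0 0 1 1]]
```

Each row of the matrix is one document and each column is one vocabulary term, so the `2` in the first row records that "the" occurs twice in `a`.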
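This attribute (`stop_words_`) is mainly for introspection: it lets the programmer check whether the terms that were filtered out are the expected ones. Because it is not needed for transforming new documents, it is safe to set `stop_words_` to None before pickling.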
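A minimal sketch of clearing the attribute before pickling is shown below. The documents and the `max_df` value are hypothetical. The attribute is read with `getattr` as a precaution, on the assumption that some scikit-learn releases may not populate `stop_words_` after fitting.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; "apple" appears in every one of them.
docs = ["apple banana", "apple cherry", "apple date"]

# max_df=0.5 drops terms that occur in more than half of the documents.
vectorizer = CountVectorizer(max_df=0.5)
vectorizer.fit(docs)

# Inspect the dropped terms if this scikit-learn version records them.
dropped = getattr(vectorizer, "stop_words_", None)
print(dropped)

# Clear the attribute before pickling; transform() does not depend on it.
if dropped is not None:
    vectorizer.stop_words_ = None

restored = pickle.loads(pickle.dumps(vectorizer))
print(restored.transform(["apple banana date"]).toarray())
```

The restored vectorizer still ignores "apple", because the filtering is already reflected in `vocabulary_`.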
Compiled from the following references:
http://stackoverflow.com/questions/27488446/scikit-learn-countvectorizer
http://www.itkeyword.com/doc/4813494854317445586/TfidfVectorizer-sklearn-CountVectorizer
This post is excellent and is the main source used here:
https://blog.csdn.net/Datawhale/article/details/82317529
————————————————
(Reposted from: https://blog.csdn.net/weixin_38278334/article/details/82320307)