word2vec model training: avoiding out-of-memory errors
2018-08-22
布口袋_天晴了
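Use the fourth method listed below (PathLineSentences) to avoid running out of memory while loading the corpus. With that method, sentences is an iterable that streams the corpus from disk during training: only the part currently being processed is held in memory, never the whole dataset.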
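The streaming idea can be sketched in plain Python. StreamingCorpus is a hypothetical name used only for illustration and is not part of gensim. It is written as a class with `__iter__` rather than a one-shot generator, because training passes over the corpus more than once (once to build the vocabulary, then once per epoch), so the iterable must be restartable.

```python
import os
import tempfile


class StreamingCorpus:
    """Hypothetical helper: yields one tokenized sentence per line of a file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated
        # repeatedly; only the current line is held in memory.
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


def demo():
    # Write a tiny space-separated corpus to a temporary file.
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write("你好 , 好久不见 。\n今天 天气 真好\n")
    try:
        corpus = StreamingCorpus(path)
        first_pass = list(corpus)
        second_pass = list(corpus)  # works again: the iterable restarts
        return first_pass, second_pass
    finally:
        os.remove(path)
```

An object like this could be passed wherever a list of token lists is accepted, which is the same role the gensim corpus classes below play.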
Ways to load the corpus:
1) sentences = [['你好', ',', '好久不见', '。'], ['今天', '天气', '真好', ',', '我们', '出去', '玩', '吧', '。']]
![](https://img.haomeiwen.com/i6102062/e5d2c1cbd81d0d5a.png)
2) sentences = word2vec.Text8Corpus('corpus_name')
![](https://img.haomeiwen.com/i6102062/aa9ba58959d19b0c.png)
![](https://img.haomeiwen.com/i6102062/8271a5e726b509a7.png)
![](https://img.haomeiwen.com/i6102062/7af71ff295af610d.png)
![](https://img.haomeiwen.com/i6102062/a80ba832fcc0e8b9.png)
3) sentences = word2vec.LineSentence('corpus_name')
Python gensim.models.word2vec.LineSentence() Examples
![](https://img.haomeiwen.com/i6102062/639e77cbff9ef2a8.png)
![](https://img.haomeiwen.com/i6102062/fb0231de31608ce5.png)
4) sentences = word2vec.PathLineSentences('baikeData\\')
Reads the files in the given directory, processing them in alphabetical order of file name.
![](https://img.haomeiwen.com/i6102062/803851516269c05c.png)
![](https://img.haomeiwen.com/i6102062/33bc9b5ea13583f9.png)
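A minimal sketch of the directory-based approach follows. files_in_processing_order and train_from_directory are hypothetical helper names introduced here; the first only illustrates the sorted-by-name ordering described above and is not gensim's own code. The training call leaves the embedding size at its default, since that parameter's name differs between gensim versions.

```python
import os
import tempfile


def files_in_processing_order(directory):
    """List the files directly inside `directory`, sorted by path.

    Illustration of alphabetical processing order only.
    """
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    return sorted(p for p in paths if os.path.isfile(p))


def demo_order():
    # Create three small corpus files out of order and list them back.
    with tempfile.TemporaryDirectory() as d:
        for name in ["b.txt", "a.txt", "c.txt"]:
            with open(os.path.join(d, name), "w", encoding="utf-8") as f:
                f.write("今天 天气 真好\n")
        return [os.path.basename(p) for p in files_in_processing_order(d)]


def train_from_directory(directory):
    """Sketch: stream every file in `directory` into Word2Vec."""
    # Imported lazily so the helpers above work without gensim installed.
    from gensim.models import word2vec

    sentences = word2vec.PathLineSentences(directory)
    # min_count=1 keeps rare words, which only makes sense for toy data.
    return word2vec.Word2Vec(sentences=sentences, min_count=1, workers=2)
```

Calling train_from_directory('baikeData\\') would correspond to method 4 above, assuming gensim is installed and the directory contains text files in the expected format.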
Format of the .txt files: multiple lines, with the words on each line separated by spaces.
![](https://img.haomeiwen.com/i6102062/35f3358a73adb4e7.png)
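Because the words must already be separated by spaces, Chinese text has to be segmented before it is written to the corpus file. The sketch below assumes segmentation has already been done and only shows how a file in this format could be produced; write_corpus_file is a hypothetical helper name, not a gensim function.

```python
import os
import tempfile


def write_corpus_file(path, tokenized_sentences):
    """Write one sentence per line, with tokens joined by single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in tokenized_sentences:
            f.write(" ".join(tokens) + "\n")


def demo_format():
    # Already-segmented sentences (the segmentation step is assumed).
    sentences = [
        ["你好", ",", "好久不见", "。"],
        ["今天", "天气", "真好"],
    ]
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        write_corpus_file(path, sentences)
        with open(path, encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)
```

Files written this way, placed in one directory, would match the input that methods 3 and 4 expect.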
References:
【1】models.word2vec – Word2vec embeddings
【2】gensim