Pretrained Word Vectors: The Well-Known Datasets

2018-11-07  readilen

English corpora

Google word2vec

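Word vectors pretrained on the Google News dataset (about 100 billion words): 300-dimensional vectors for roughly 3 million words and phrases. Implementation paper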

download link | source link
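
The word2vec tools can write vectors in a compact binary format, and pretrained sets like this one are commonly shipped that way. The following is a minimal sketch of reading that format with only the standard library. It assumes the usual layout (a text header `vocab_size dim`, then for each entry the word, one space, and `dim` little-endian float32 values). The names `read_word2vec_binary` and `limit` are illustrative and not part of any library, and the sample bytes are made up for the demonstration.

```python
import io
import struct

def read_word2vec_binary(stream, limit=None):
    """Read vectors from a binary word2vec stream into a dict.

    Assumed layout: an ASCII header "vocab_size dim\n", then for each entry
    the UTF-8 word, one space, and dim little-endian float32 values.
    """
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    count = vocab_size if limit is None else min(limit, vocab_size)
    vectors = {}
    for _ in range(count):
        # Collect bytes up to the space that terminates the word.
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # some writers put a newline after each vector
                chars.extend(ch)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of file while reading a vector")
        word = chars.decode("utf-8", errors="replace")
        vectors[word] = struct.unpack("<%df" % dim, raw)
    return vectors

# Tiny in-memory file standing in for the real download (hypothetical data).
sample = (
    b"2 3\n"
    + b"king " + struct.pack("<3f", 1.0, 0.5, -2.0) + b"\n"
    + b"queen " + struct.pack("<3f", 0.25, 4.0, 8.0) + b"\n"
)
demo = read_word2vec_binary(io.BytesIO(sample))
```

With the real download, open the file in `"rb"` mode and pass a `limit`, since holding 3 million 300-dimensional vectors as Python tuples takes a lot of memory.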

Facebook fastText

1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).

download link | source link

1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).

download link | source link

2 million word vectors trained on Common Crawl (600B tokens).

download link | source link
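
fastText text vectors (`.vec` files) start with a header line giving the number of words and the dimension, followed by one line per word with its components. A minimal sketch of a loader under that assumption follows. The function name `load_vec`, the `limit` parameter, and the sample data are hypothetical.

```python
import io

def load_vec(stream, limit=None):
    """Parse fastText-style text vectors: header "count dim", then "word v1 ... vdim"."""
    count, dim = map(int, stream.readline().split())
    vectors = {}
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip("\n").split(" ")
        # Drop empty strings left by a trailing space at the end of a line.
        word, values = parts[0], [float(x) for x in parts[1:] if x]
        if len(values) != dim:
            raise ValueError("line %d has %d values, expected %d" % (i + 2, len(values), dim))
        vectors[word] = values
    return count, dim, vectors

# Made-up two-word file used only to exercise the parser.
sample = "2 3\nhello 0.1 0.2 0.3\nworld -1.0 0.0 2.5\n"
count, dim, vecs = load_vec(io.StringIO(sample))
```

For a real file, something like `open(path, encoding="utf-8", newline="\n", errors="ignore")` is a reasonable way to obtain the stream. Passing a `limit` is a cheap way to keep only the first entries of the file.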

Stanford GloVe

Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download)

download link | source link

Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download)

download link | source link

Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)

download link | source link

Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download)

download link | source link
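
GloVe text files have no header line: each line is a token followed by its vector components. The "uncased" sets were trained on lowercased text, so queries should be lowercased before lookup, while the "cased" set keeps capitalization. The sketch below illustrates both points. The helper names `load_glove` and `lookup` are hypothetical, and splitting from the right is a defensive choice that assumes some tokens may themselves contain spaces.

```python
import io

def load_glove(stream, dim):
    """Parse headerless "token v1 ... vdim" lines; dim must match the file."""
    vectors = {}
    for line in stream:
        # Split from the right so a token containing spaces stays in one piece.
        parts = line.rstrip("\n").rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def lookup(vectors, word, cased):
    """Return the vector for word, lowercasing the query for uncased sets."""
    return vectors.get(word if cased else word.lower())

# Made-up 2-dimensional sample; real files use 25 to 300 dimensions.
sample = "the 0.5 -0.25\nnew york 1.0 2.0\n"
glove = load_glove(io.StringIO(sample), dim=2)
```

Here `lookup(glove, "The", cased=False)` finds the entry for `the`, whereas the same query with `cased=True` returns `None` because the sample has no capitalized entry.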

Chinese corpora

word2vec

Wikipedia database, Vector Size 300, Corpus Size 1G, Vocabulary Size 50101, Jieba tokenizer

download link | source link

fastText

Trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. We used the Stanford word segmenter for tokenization.

download link | source link
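
To make "character n-grams of length 5" concrete, here is a minimal sketch of how such n-grams can be extracted. It assumes the fastText convention of wrapping each word in `<` and `>` boundary markers before slicing. The function name `char_ngrams` is hypothetical, and this is an illustration rather than the library's own code.

```python
def char_ngrams(word, n=5):
    """Return the character n-grams of word after adding < and > boundary markers."""
    wrapped = "<" + word + ">"
    # A wrapped word shorter than n yields an empty list.
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

grams = char_ngrams("where")  # ['<wher', 'where', 'here>']
short = char_ngrams("北京")    # [] because "<北京>" has only 4 characters
```

Short segmented words, which are common in Chinese, produce no n-grams of length 5 under this scheme. As far as I understand, fastText also learns a vector for the whole word itself, so such words would be represented by that vector alone.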

Appendix, processing method:
https://github.com/Kyubyong/wordvectors
