Generating Word Vectors with Word2Vec
Two simple examples of generating word vectors.
1. Word vectors for English words
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=[["cat", "say", "meow"], ["dog", "say", "woof"]],
    sg=0,            # 0 = CBOW, 1 = skip-gram
    window=2,        # context window size
    vector_size=3,   # dimensionality of the word vectors
    min_count=0,     # keep every word, even if it appears only once
    workers=4)
The model is now trained; print the word vectors:

model.wv.vectors

Output:

array([[-0.01787424,  0.00788105,  0.17011166],
       [ 0.3003091 , -0.31009832, -0.23722696],
       [ 0.21529575,  0.2990996 , -0.16718094],
       [-0.12544572,  0.24601682, -0.05111571],
       [-0.15122044,  0.21846838, -0.16200535]], dtype=float32)
There are 5 unique words in the vocabulary, each represented by a 3-dimensional vector. Print the vector for "cat":

model.wv.get_vector("cat")

Output:

array([-0.15122044,  0.21846838, -0.16200535], dtype=float32)
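With vectors like these, word similarity is measured as the cosine of the angle between them; gensim exposes this as model.wv.similarity. A minimal numpy sketch using two rows of the matrix printed above (the "cat" vector and the row just before it; the exact result depends on the random initialization of your own run):

```python
import numpy as np

# two rows copied from the model.wv.vectors output above
v1 = np.array([-0.15122044, 0.21846838, -0.16200535])  # the "cat" vector
v2 = np.array([-0.12544572, 0.24601682, -0.05111571])  # another row of the matrix

# cosine similarity: dot product divided by the product of the norms
sim = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(sim)  # ≈ 0.93
```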
2. Word vectors for Chinese text
Generate word vectors for the sentence "小猫说话喵喵, 小狗说话汪汪" ("the kitten says meow meow, the puppy says woof woof").
2.1 Word segmentation
# !pip install jieba
import jieba

s = "小猫说话喵喵, 小狗说话汪汪"       # the sentence to segment
seg_list = jieba.cut(s, cut_all=False)  # precise mode
f = " ".join(seg_list)
f

Output:

'小猫 说话 喵 喵 , 小狗 说话 汪汪'
2.2 Build the segmented words into a list
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt_tab')

data = []
# iterate through each sentence in the text
for i in sent_tokenize(f):
    temp = []
    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())
    data.append(temp)
data

Output:

[['小猫', '说话', '喵', '喵', ',', '小狗', '说话', '汪汪']]
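NLTK's sentence and word tokenizers earn their keep on real multi-sentence English text; for this single sentence, which jieba has already joined with spaces, a plain str.split() produces an equivalent token list. A quick check:

```python
# the space-joined jieba output from step 2.1
f = '小猫 说话 喵 喵 , 小狗 说话 汪汪'

# one sentence -> one inner list, matching the shape of data built above
data = [f.split()]
print(data)  # [['小猫', '说话', '喵', '喵', ',', '小狗', '说话', '汪汪']]
```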
2.3 Train the model and obtain the vector for each segmented word
model = Word2Vec(
    sentences=data,
    sg=0,
    window=2,
    vector_size=3,
    min_count=0,
    workers=4)

model.wv.get_vector("小猫")

Output:

array([-0.06053392,  0.09588599,  0.03306246], dtype=float32)

model.wv.get_vector("小狗")

Output:

array([-0.12544572,  0.24601682, -0.05111571], dtype=float32)

All word vectors:

model.wv.vectors

Output:

array([[-0.01787424,  0.00788105,  0.17011166],
       [ 0.3003417 , -0.31010646, -0.23725258],
       [ 0.21530104,  0.29910693, -0.16718504],
       [-0.12544572,  0.24601682, -0.05111571],
       [-0.15122044,  0.21846838, -0.16200535],
       [-0.06053392,  0.09588599,  0.03306246]], dtype=float32)
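get_vector does not compute anything new: it simply looks up the word's row in model.wv.vectors (rows follow the vocabulary order in model.wv.index_to_key). A small numpy check, assuming the outputs printed above, locates the rows that hold the "小猫" and "小狗" vectors:

```python
import numpy as np

# model.wv.vectors, copied from the output above
vectors = np.array([
    [-0.01787424,  0.00788105,  0.17011166],
    [ 0.3003417 , -0.31010646, -0.23725258],
    [ 0.21530104,  0.29910693, -0.16718504],
    [-0.12544572,  0.24601682, -0.05111571],
    [-0.15122044,  0.21846838, -0.16200535],
    [-0.06053392,  0.09588599,  0.03306246]], dtype=np.float32)

# the vectors returned by get_vector("小猫") and get_vector("小狗") above
xiaomao = np.array([-0.06053392,  0.09588599,  0.03306246], dtype=np.float32)
xiaogou = np.array([-0.12544572,  0.24601682, -0.05111571], dtype=np.float32)

# each get_vector result is exactly one row of the matrix
row_xiaomao = int(np.where((vectors == xiaomao).all(axis=1))[0][0])
row_xiaogou = int(np.where((vectors == xiaogou).all(axis=1))[0][0])
print(row_xiaomao, row_xiaogou)  # 5 3
```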