
GloVe Word Vectors: The Theory Explained, plus Training and Applying GloVe in Python

news 2025/10/3 22:02:35

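Contents

- 1. How GloVe Works
  - 1.1 Core Idea: The Co-occurrence Matrix
  - 1.2 Objective Function: From Co-occurrence Probabilities to Vector Space
  - 1.3 From Objective to Algorithm
  - 1.4 GloVe vs. Word2Vec
  - 1.5 Strengths and Limitations of GloVe
- 2. Hands-on in Python: Training Your Own GloVe Model
  - 2.1 Environment Setup
  - 2.2 Implementation
- 3. Using Pretrained GloVe Vectors
  - 3.1 Preparation
  - 3.2 Implementation
- 4. Application Examples
  - 4.1 Word Similarity
  - 4.2 Word Analogy
  - 4.3 Text Clustering
- 5. Summary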

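1. How GloVe Works

GloVe (Global Vectors for Word Representation) is a word-embedding method proposed by researchers at Stanford University in 2014. It combines the strengths of global statistics and local context windows:

Global statistics: a word-word co-occurrence matrix captures how words are distributed across the entire corpus.
Local context: co-occurrences are counted inside a context window, so the counts reflect how each word is used in its actual surroundings.

GloVe can be seen as an important complement to Word2Vec (CBOW and Skip-gram). Word2Vec is very successful, but it is essentially a local context-window model: each training step only sees a word and the few words around it. GloVe instead builds the global statistics of the whole corpus, in the form of the word-word co-occurrence matrix, directly into the model.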

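1.1 Core Idea: The Co-occurrence Matrix

Imagine we have a huge text corpus. We can count how many times each word appears together with every other word. For example, in the sentence "the cat sat on the mat", taking the whole sentence as the context:

"cat" co-occurs with "the" 2 times.
"cat" co-occurs with "sat" 1 time.
"cat" co-occurs with "on" 1 time.
...and so on.

If we collect the co-occurrence counts of all word pairs in the whole corpus, we get a huge matrix called the word-word co-occurrence matrix, written $X$. Here $X_{ij}$ is the total number of times word $j$ appears in the context of word $i$.
This matrix carries rich semantic information. For example, the entry in the row for ice and the column for solid will be high, and so will the entry in the row for steam and the column for gas. This is the foundation of GloVe: global word-word co-occurrence statistics are the key to capturing meaning.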
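The counting above can be sketched in a few lines of Python. This is only an illustration: the function name `cooccurrence_counts` and the `window` parameter are assumptions made for this sketch, and `window=None` stands for "the whole sentence is the context", which is the setting used in the example sentence.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=None):
    """Count co-occurrences X[(i, j)]; window=None uses the whole sentence."""
    counts = Counter()
    for i, center in enumerate(tokens):
        for j, context in enumerate(tokens):
            if i == j:
                continue  # a token is not its own context
            if window is not None and abs(i - j) > window:
                continue  # outside the context window
            counts[(center, context)] += 1
    return counts

tokens = "the cat sat on the mat".split()
X = cooccurrence_counts(tokens)
print(X[("cat", "the")])  # 2
print(X[("cat", "sat")])  # 1
print(X[("cat", "on")])   # 1
```

With a narrower window, for example `window=1`, "cat" would only see the first "the" and "sat", so the counts depend directly on the window size chosen.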

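1.2 Objective Function: From Co-occurrence Probabilities to Vector Space

Word2Vec slides a context window over the text and learns to predict the center word from its context (CBOW) or the context from the center word (Skip-gram). GloVe models the co-occurrence matrix directly. Its core idea: the dot product of two word vectors should approximate the logarithm of the co-occurrence probability of the two words.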
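Definitions:

$w_i$ is the word vector of word $i$.
$\tilde{w}_j$ is the context word vector of word $j$.
$P_{ij}$ is the probability that word $j$ appears in the context of word $i$, i.e. $P_{ij} = X_{ij} / \sum_{k} X_{ik}$.
$J$ is the loss function. It measures the gap between the dot product $w_i^T \tilde{w}_j$ and the log probability $\log P_{ij}$, summed over all word pairs in a vocabulary of size $V$.
The GloVe authors proposed a weighted least-squares loss which, leaving out the bias terms of the original formulation, can be written as:
$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j - \log P_{ij} \right)^2$
The formula looks a little complicated, so let's break it down:

$w_i^T \tilde{w}_j$: the "association" between the two words as predicted by the model.
$\log P_{ij}$: the "association" between the two words as observed in the data. Taking the logarithm compresses the very wide range of co-occurrence values. Pairs that never co-occur ($X_{ij} = 0$) would give $\log 0$, but they drop out of the sum because their weight $f(0)$ is 0.
$\left( w_i^T \tilde{w}_j - \log P_{ij} \right)^2$: the squared error between the predicted and the observed value.
$f(X_{ij})$: this is the key part. It is a weighting function that balances word pairs with different co-occurrence frequencies.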
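For reference, here is a short sketch of how this form relates to the objective as it is usually stated for GloVe. The symbols $X_i$, $b_i$ and $\tilde{b}_j$ do not appear elsewhere in this article and are introduced only for this derivation. Writing $X_i = \sum_{k} X_{ik}$, the target splits into two parts:
$\log P_{ij} = \log X_{ij} - \log X_i$
The term $\log X_i$ depends only on word $i$, so it can be absorbed into a learned bias $b_i$; a second bias $\tilde{b}_j$ is added for the context word to keep the model symmetric. The loss then becomes:
$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$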
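Why do we need the weighting function $f(X_{ij})$?
Because co-occurrence counts vary enormously. A pair like "the car" may occur tens of thousands of times, while "asteroid car" may occur only once or twice. Rare co-occurrences are noisy and carry little information, yet they typically make up most of the nonzero entries of $X$. At the other extreme, very frequent pairs should not dominate training just because their counts are huge. $f(X_{ij})$ is designed to handle both ends. A common definition is:
$f(x) = \begin{cases} (x/x_{\max})^\alpha & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$
Here $x_{\max}$ is a threshold (for example 100) and $\alpha$ is a hyperparameter (usually 0.75). Because $f$ grows with $x$ and is capped at 1, low-frequency pairs (small $X_{ij}$) receive a small weight, so their noisy counts influence the fit less. Every pair with $X_{ij} \ge x_{\max}$ receives the same weight 1, so extremely frequent pairs cannot dominate the training process.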
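To make the behaviour of the weighting function concrete, here is a minimal sketch in Python. The function names `glove_weight` and `pair_loss` are made up for this illustration; the defaults follow the values quoted above ($x_{\max} = 100$, $\alpha = 0.75$).

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): grows as (x / x_max)^alpha and is capped at 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

def pair_loss(dot, x_ij, x_i, x_max=100.0, alpha=0.75):
    """Weighted squared error of one word pair, with P_ij = x_ij / x_i."""
    target = math.log(x_ij / x_i)
    return glove_weight(x_ij, x_max, alpha) * (dot - target) ** 2

# Weights for a rare, a moderately frequent and two very frequent pairs
for count in (1, 10, 100, 50000):
    print(count, round(glove_weight(count), 4))
```

A pair seen once gets a weight of about 0.03, a pair seen 10 times about 0.18, and anything at or above `x_max` gets exactly 1, no matter how large its count is.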

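1.3 From Objective to Algorithm

With the objective function $J$ in hand, the remaining steps are the same as in conventional machine learning:

1. Initialize: randomly initialize all word vectors $w_i$ and context word vectors $\tilde{w}_j$.
2. Compute gradients: take the partial derivatives of $J$ with respect to each vector.
3. Gradient descent: update all vectors with stochastic gradient descent or a variant such as AdaGrad, so that the loss $J$ keeps decreasing.
4. Iterate: repeat steps 2-3 until the model converges.
After training, GloVe has two sets of vectors: $w_i$ and $\tilde{w}_j$. Both turn out to be useful, and the usual practice is to add them to obtain the final word representation:
$v_i = w_i + \tilde{w}_i$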
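The update in steps 2 and 3 can be sketched for a single word pair. For one pair, the gradient of the weighted squared error with respect to $w_i$ is $2 f(X_{ij}) (w_i^T \tilde{w}_j - \log P_{ij}) \tilde{w}_j$, and symmetrically for $\tilde{w}_j$. The function `adagrad_step` and all numbers below are illustrative assumptions, not values from a real corpus.

```python
import math

def adagrad_step(w, w_ctx, target, weight, acc_w, acc_ctx, lr=0.05):
    """One AdaGrad update for a single (i, j) pair; all lists are updated in place."""
    diff = sum(a * b for a, b in zip(w, w_ctx)) - target
    # Both gradients are computed from the old vectors before anything is changed
    grad_w = [2.0 * weight * diff * c for c in w_ctx]
    grad_ctx = [2.0 * weight * diff * a for a in w]
    for k in range(len(w)):
        acc_w[k] += grad_w[k] ** 2        # running sum of squared gradients
        acc_ctx[k] += grad_ctx[k] ** 2
        w[k] -= lr * grad_w[k] / math.sqrt(acc_w[k])
        w_ctx[k] -= lr * grad_ctx[k] / math.sqrt(acc_ctx[k])
    return weight * diff ** 2             # loss measured before the update

w, w_ctx = [0.1, -0.2], [0.3, 0.4]        # toy word and context vectors
acc_w, acc_ctx = [1.0, 1.0], [1.0, 1.0]   # AdaGrad accumulators
target = math.log(0.5)                    # toy value for log P_ij
losses = [adagrad_step(w, w_ctx, target, 1.0, acc_w, acc_ctx) for _ in range(200)]
print(losses[0], losses[-1])

final_vector = [a + b for a, b in zip(w, w_ctx)]  # v = w + w_tilde
```

A real trainer repeats this step over all nonzero entries of $X$ in random order, which is what the iteration in step 4 refers to.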

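1.4 GloVe vs. Word2Vec

Theoretical basis: Word2Vec builds on the distributional hypothesis and learns by predicting words from their local context. GloVe starts from the observation that global co-occurrence probabilities encode meaning, and fits those global statistics directly.
Information used: Word2Vec uses local information, one window at a time. GloVe uses global statistics that are themselves aggregated from local context windows.
Performance: on many standard tasks (word analogy, word similarity), GloVe performs on par with Word2Vec, and its authors report somewhat better results in some settings. Its weighting function also limits the influence of noisy low-frequency co-occurrences.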

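1.5 Strengths and Limitations of GloVe

1. Strengths:

Uses global information: makes full use of the statistics of the entire corpus
Efficient training: once the co-occurrence matrix is built, training iterates over its nonzero entries rather than over every window in the corpus
Preserves linear relations: captures linear relationships between words well, which is what analogy tasks rely on
Scalable: extends readily to large corpora

2. Limitations:

High memory use: a large co-occurrence matrix has to be stored
Static representation: each word has a single fixed vector, regardless of context
OOV problem: cannot produce vectors for words outside the training vocabulary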
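A quick back-of-the-envelope calculation shows why the memory point matters and why the matrix is kept in sparse form in practice. All numbers here are illustrative assumptions: a vocabulary of 400,000 words, 4 bytes per dense entry, a hypothetical one billion nonzero entries, and 16 bytes per sparse (row, column, count) triple.

```python
def dense_matrix_bytes(vocab_size, bytes_per_entry=4):
    """Memory for a full vocab_size x vocab_size matrix."""
    return vocab_size * vocab_size * bytes_per_entry

def sparse_triples_bytes(nonzero_entries, bytes_per_triple=16):
    """Memory when only nonzero (row, column, count) triples are stored."""
    return nonzero_entries * bytes_per_triple

vocab_size = 400_000            # assumed vocabulary size
nonzero = 1_000_000_000         # hypothetical number of nonzero co-occurrences

print(dense_matrix_bytes(vocab_size) / 1e9)   # 640.0 (GB)
print(sparse_triples_bytes(nonzero) / 1e9)    # 16.0 (GB)
```

Under these assumptions a dense matrix is out of reach for ordinary machines, while the sparse form is large but manageable.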

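2. Hands-on in Python: Training Your Own GloVe Model

In practice we usually just load a GloVe model that someone else has pretrained, because such models are trained on very large corpora and perform better. Still, training one yourself helps in understanding the process.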

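gensim does not ship a GloVe trainer (there is no gensim.models.GloVe class; gensim can only load GloVe vectors trained elsewhere). For real corpora, the reference C implementation from the Stanford GloVe project is the usual choice. Here we write a minimal trainer in NumPy that follows the objective from section 1.2 and the AdaGrad updates from section 1.3. It is meant for demonstration only.

2.1 Environment Setup

Only numpy is required.

pip install numpy

2.2 Implementation

import numpy as np
from collections import defaultdict
from types import SimpleNamespace

# 1. Prepare the data
# The trainer takes a list of sentences, each one a list of tokens.
# Here we use a tiny example corpus.
sentences = [
    ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'],
    ['the', 'lazy', 'dog', 'is', 'sleeping'],
    ['the', 'quick', 'fox', 'is', 'clever'],
    ['a', 'fast', 'fox', 'is', 'better', 'than', 'a', 'slow', 'dog'],
    ['natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'ai'],
]

# 2. Build the vocabulary and the co-occurrence counts X_ij
vocab = sorted({w for s in sentences for w in s})
glove = SimpleNamespace(
    word_to_idx={w: i for i, w in enumerate(vocab)},
    idx_to_word=dict(enumerate(vocab)),
)
window = 5  # context window size
X = defaultdict(float)
for s in sentences:
    ids = [glove.word_to_idx[w] for w in s]
    for pos, i in enumerate(ids):
        for j in ids[pos + 1:pos + 1 + window]:
            X[(i, j)] += 1.0  # count the pair in both directions
            X[(j, i)] += 1.0
row_sum = defaultdict(float)
for (i, _), x in X.items():
    row_sum[i] += x
# One tuple (i, j, X_ij, log P_ij) per nonzero entry
pairs = [(i, j, x, np.log(x / row_sum[i])) for (i, j), x in X.items()]

# 3. Train with AdaGrad
vector_size, learning_rate, epochs = 50, 0.05, 50
x_max, alpha = 100.0, 0.75  # parameters of the weighting function f(x)
rng = np.random.default_rng(42)
V = len(vocab)
model = SimpleNamespace(
    W=(rng.random((V, vector_size)) - 0.5) / vector_size,        # word vectors
    W_tilde=(rng.random((V, vector_size)) - 0.5) / vector_size,  # context vectors
)
acc_W = np.ones((V, vector_size))        # AdaGrad accumulators
acc_W_tilde = np.ones((V, vector_size))
print("Training the GloVe model...")
for epoch in range(epochs):
    total_loss = 0.0
    for k in rng.permutation(len(pairs)):
        i, j, x, log_p = pairs[k]
        weight = min(1.0, (x / x_max) ** alpha)
        diff = model.W[i] @ model.W_tilde[j] - log_p
        total_loss += weight * diff ** 2
        grad_w = 2 * weight * diff * model.W_tilde[j]
        grad_w_tilde = 2 * weight * diff * model.W[i]
        acc_W[i] += grad_w ** 2
        acc_W_tilde[j] += grad_w_tilde ** 2
        model.W[i] -= learning_rate * grad_w / np.sqrt(acc_W[i])
        model.W_tilde[j] -= learning_rate * grad_w_tilde / np.sqrt(acc_W_tilde[j])
print(f"Training finished, loss in the last epoch: {total_loss:.4f}")

# 4. Inspect the results
vectors = model.W + model.W_tilde  # final representation v_i = w_i + w~_i
word = 'fox'
if word in glove.word_to_idx:
    idx = glove.word_to_idx[word]
    print(f"\nVector of '{word}' (first 10 dims): {vectors[idx][:10]}")
    # Most similar words by cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    top = [(glove.idx_to_word[int(k)], float(sims[k]))
           for k in np.argsort(-sims) if k != idx][:3]
    print(f"\nWords most similar to '{word}': {top}")
else:
    print(f"\nThe word '{word}' is not in the vocabulary.")

# Save and reload the model
np.savez("glove_custom.npz", W=model.W, W_tilde=model.W_tilde, vocab=np.array(vocab))
loaded = np.load("glove_custom.npz")
# Check the reloaded model
dog = glove.word_to_idx['dog']
print("\nChecking the reloaded model:")
print(f"Vector of 'dog' (first 10 dims): {(loaded['W'][dog] + loaded['W_tilde'][dog])[:10]}")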

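Note: with a dataset this small, the trained vectors will be poor and may not capture any meaningful semantic relations. In a real project you need a large corpus of at least several million words.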

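3. Using Pretrained GloVe Vectors

This is the most common and most valuable way to use GloVe. We will use a pretrained model that was trained on Wikipedia and the Gigaword corpus (about 6 billion tokens in total).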

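3.1 Preparation

Download the pretrained model: get it from the official GloVe project page (the GloVe website). We use the relatively small glove.6B.100d as an example (6 billion tokens, 100 dimensions). After downloading and unzipping, you will have a file named glove.6B.100d.txt.
Install gensim and scikit-learn:
```
pip install gensim scikit-learn
```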
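It helps to know what the downloaded text file looks like. As far as the format is commonly described, each line holds a word followed by its vector components, separated by single spaces, with no header line. The sketch below parses that layout from an in-memory sample; the function name `read_glove_text` and the sample values are made up for illustration.

```python
import io

def read_glove_text(handle):
    """Parse lines of the form 'word v1 v2 ... vd' into a dict of float lists."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

# Tiny made-up sample standing in for a real file such as glove.6B.100d.txt
sample = io.StringIO("king 0.1 0.2 0.3\nqueen 0.1 0.25 0.35\n")
vectors = read_glove_text(sample)
print(len(vectors), len(vectors["king"]))
```

For the real file you would pass an open file handle (for example `open(path, encoding="utf-8")`) instead of the `StringIO` object.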

3.2 Implementation

import numpy as np
import gensim.downloader as api
from gensim.models import KeyedVectors

# 1. Load the pretrained GloVe model
print("Loading the GloVe model (this may take a while)...")
# Option A: load the text file downloaded in 3.1 (it must be in the working directory).
# GloVe text files have no header line, hence no_header=True (available in gensim 4.x).
# glove_file = 'glove.6B.100d.txt'
# model = KeyedVectors.load_word2vec_format(glove_file, no_header=True, binary=False)
# Option B: let gensim download the packaged 'glove-wiki-gigaword-100' vectors
# (listed at https://github.com/RaRe-Technologies/gensim-data).
print("Downloading 'glove-wiki-gigaword-100' from the gensim-data server...")
model = api.load("glove-wiki-gigaword-100")
print("Model loaded!")
# 2. Applications
# Case 1: word similarity
# Cosine similarity measures how well two vectors align in direction.
# It lies in [-1, 1]; the closer to 1, the more similar the words.
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    if norm_vec1 == 0 or norm_vec2 == 0:
        return 0.0
    return dot_product / (norm_vec1 * norm_vec2)

word1 = 'king'
word2 = 'queen'
word3 = 'car'
# Check that the words are in the model
if word1 in model and word2 in model:
    vec1 = model[word1]
    vec2 = model[word2]
    similarity = cosine_similarity(vec1, vec2)
    print("\nCase 1: word similarity")
    print(f"Cosine similarity of '{word1}' and '{word2}': {similarity:.4f}")
else:
    print(f"The word '{word1}' or '{word2}' is not in the model.")
if word1 in model and word3 in model:
    vec1 = model[word1]
    vec3 = model[word3]
    similarity = cosine_similarity(vec1, vec3)
    print(f"Cosine similarity of '{word1}' and '{word3}': {similarity:.4f}")
# Case 2: word analogy (the classic "king - man + woman = ?")
# This is one of the most striking uses of word vectors.
print("\nCase 2: word analogy")
try:
    # The model combines vector('king') - vector('man') + vector('woman')
    # and returns the vocabulary word closest to the result (query words excluded).
    result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
    print(f"king - man + woman ≈ {result[0][0]} (similarity: {result[0][1]:.4f})")
    result = model.most_similar(positive=['paris', 'germany'], negative=['france'], topn=1)
    print(f"paris - france + germany ≈ {result[0][0]} (similarity: {result[0][1]:.4f})")
    result = model.most_similar(positive=['walking', 'swam'], negative=['walk'], topn=1)
    print(f"walking - walk + swam ≈ {result[0][0]} (similarity: {result[0][1]:.4f})")
except KeyError as e:
    print(f"Analogy failed, possibly because a word is missing from the model: {e}")
# Case 3: simple text classification (sentiment analysis)
# A more advanced application that shows how word vectors feed a downstream task.
# Each sentence is represented by the average of its word vectors.
from sklearn.linear_model import LogisticRegression

print("\nCase 3: simple text classification")

def sentence_vector(sentence, model):
    words = sentence.split()
    word_vectors = [model[word] for word in words if word in model]
    if not word_vectors:
        # None of the words is in the model: return a zero vector
        return np.zeros(model.vector_size)
    # Average all word vectors
    return np.mean(word_vectors, axis=0)

# A few simple positive and negative sentences
positive_sentences = [
    "i love this movie it was fantastic",
    "this is a great day i am so happy",
    "the food was delicious and the service was excellent",
]
negative_sentences = [
    "i hate this film it was terrible",
    "this is a bad day i am so sad",
    "the food was awful and the service was poor",
]
# Sentence vectors
positive_vectors = [sentence_vector(s, model) for s in positive_sentences]
negative_vectors = [sentence_vector(s, model) for s in negative_sentences]
# Training data (1 for positive, 0 for negative)
X_train = positive_vectors + negative_vectors
y_train = [1] * len(positive_sentences) + [0] * len(negative_sentences)
# A simple classifier: logistic regression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
# Test on new sentences
test_sentence = "i feel very joyful and content"
test_vector = sentence_vector(test_sentence, model).reshape(1, -1)  # predict() expects a 2D array
prediction = classifier.predict(test_vector)
sentiment = "positive" if prediction[0] == 1 else "negative"
print(f"\nTest sentence: '{test_sentence}'")
print(f"Predicted sentiment: {sentiment}")
test_sentence_2 = "i am disappointed with the result"
test_vector_2 = sentence_vector(test_sentence_2, model).reshape(1, -1)
prediction_2 = classifier.predict(test_vector_2)
sentiment_2 = "positive" if prediction_2[0] == 1 else "negative"
print(f"Test sentence: '{test_sentence_2}'")
print(f"Predicted sentiment: {sentiment_2}")

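This example shows how word vectors can serve as features for more complex NLP tasks. The classifier is very simple, but it illustrates the core idea behind modern deep-learning NLP pipelines: convert text into dense, meaningful vector representations, then feed those vectors into a machine-learning or deep-learning model to make predictions.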

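4. Application Examples

The helpers in this part work on raw embedding matrices rather than on a gensim model. They expect a glove object with word_to_idx and idx_to_word mappings, and a model object with W and W_tilde arrays holding one row per vocabulary word.

4.1 Word Similarity

import numpy as np

def cosine_similarity(vec1, vec2):
    """Compute the cosine similarity of two vectors."""
    dot_product = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    return dot_product / (norm1 * norm2)

def find_similar_words(glove, model, word, top_k=5):
    """Find the words most similar to `word`."""
    if word not in glove.word_to_idx:
        return []
    word_idx = glove.word_to_idx[word]
    # Combine the word vector and the context vector
    word_vec = model.W[word_idx] + model.W_tilde[word_idx]
    similarities = []
    for i, other_word in glove.idx_to_word.items():
        if other_word == word:
            continue
        other_vec = model.W[i] + model.W_tilde[i]
        sim = cosine_similarity(word_vec, other_vec)
        similarities.append((other_word, sim))
    # Sort by similarity, highest first
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]

# Usage example
similar_words = find_similar_words(glove, model, "cat")
print("Words similar to 'cat':")
for word, sim in similar_words:
    print(f"{word}: {sim:.4f}")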

4.2 Word Analogy

def word_analogy(glove, model, word1, word2, word3, top_k=1):
    """Word analogy: word1 is to word2 as word3 is to ?"""
    # Check that all words exist
    for word in [word1, word2, word3]:
        if word not in glove.word_to_idx:
            return []
    # Look up the word vectors
    vec1 = model.W[glove.word_to_idx[word1]] + model.W_tilde[glove.word_to_idx[word1]]
    vec2 = model.W[glove.word_to_idx[word2]] + model.W_tilde[glove.word_to_idx[word2]]
    vec3 = model.W[glove.word_to_idx[word3]] + model.W_tilde[glove.word_to_idx[word3]]
    # Compute the analogy vector
    analogy_vec = vec2 - vec1 + vec3
    # Find the most similar words, skipping the three query words
    # (uses cosine_similarity as defined in 4.1)
    similarities = []
    for i, word in glove.idx_to_word.items():
        if word in [word1, word2, word3]:
            continue
        word_vec = model.W[i] + model.W_tilde[i]
        sim = cosine_similarity(analogy_vec, word_vec)
        similarities.append((word, sim))
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]

# Usage example
result = word_analogy(glove, model, "cat", "cats", "dog")
print(f"cat : cats :: dog : {result[0][0] if result else 'N/A'}")
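To see the vector arithmetic behind an analogy without any trained model, here is a toy sketch with hand-made 3-dimensional vectors. The vectors are invented so that one axis roughly means "male", one "female" and one "royal"; real GloVe dimensions are not interpretable like this, and the names `toy`, `cosine` and `analogy` are assumptions for this example only.

```python
import math

# Hand-made toy vectors: [male, female, royal]
toy = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.5, 0.5, -1.0],
}

def cosine(a, b):
    """Cosine similarity of two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(vectors, a, b, c):
    """a is to b as c is to ?  Returns the word nearest to b - a + c, excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy(toy, "man", "king", "woman"))  # queen
```

Here king - man + woman equals [0, 1, 1], which is exactly the toy vector for "queen". With real embeddings the match is only approximate, which is why the nearest neighbour by cosine similarity is used.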

4.3 Text Clustering

The example below clusters the vocabulary words by their vectors.

import numpy as np
from sklearn.cluster import KMeans

def cluster_words(glove, model, n_clusters=3):
    """Cluster the vocabulary words with K-means."""
    # Collect all word vectors
    word_vectors = []
    words = []
    for idx, word in glove.idx_to_word.items():
        vec = model.W[idx] + model.W_tilde[idx]
        word_vectors.append(vec)
        words.append(word)
    # K-means clustering
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    clusters = kmeans.fit_predict(np.array(word_vectors))
    # Group the words by cluster id
    clustered_words = {}
    for word, cluster_id in zip(words, clusters):
        clustered_words.setdefault(int(cluster_id), []).append(word)
    return clustered_words

# Usage example
clusters = cluster_words(glove, model, n_clusters=3)
for cluster_id, words in clusters.items():
    print(f"Cluster {cluster_id}: {', '.join(words)}")

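5. Summary

The core of GloVe: it uses global word-word co-occurrence statistics and a carefully designed weighted loss to learn word vectors whose dot products approximate log co-occurrence probabilities.
How to use it:
- Training: gensim has no GloVe trainer, so training your own vectors means using the Stanford reference implementation or a custom trainer. This is usually only worth doing when you have specific needs.
- Application: loading a pretrained model is the norm. With gensim's KeyedVectors object you can easily compute similarities and word analogies, and use the vectors as features for downstream tasks.
Value: together with Word2Vec, FastText and similar models, GloVe is one of the foundations of modern natural language processing and a useful prerequisite for understanding more advanced models such as BERT and GPT.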