
Modified n-gram Precision in BLEU

news Source: original 2025/6/14 8:21:07

For a full introduction to BLEU, see:

Machine translation metrics: BLEU - CSDN blog https://blog.csdn.net/qq_54708219/article/details/148635887?spm=1001.2014.3001.5502

1. Extract the n-grams of the machine translation:

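The first step is to split the sentence produced by the machine translation system (the candidate translation) into contiguous runs of words. The length of each run is n.

Unigram (1-gram): a single word. For example, the 1-grams of the sentence "the cat sat" are ["the"], ["cat"], ["sat"].
Bigram (2-gram): two consecutive words. For example: ["the", "cat"], ["cat", "sat"].
Trigram (3-gram): three consecutive words. For example: ["the", "cat", "sat"].
And so on (4-gram, 5-gram...).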
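The splitting step can be sketched in a few lines of Python. The helper name `ngrams` is an illustrative choice, not something the original text defines, and the sentence is assumed to be tokenized by whitespace.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the cat sat".split()

unigrams = ngrams(sentence, 1)   # [('the',), ('cat',), ('sat',)]
bigrams = ngrams(sentence, 2)    # [('the', 'cat'), ('cat', 'sat')]
trigrams = ngrams(sentence, 3)   # [('the', 'cat', 'sat')]
fourgrams = ngrams(sentence, 4)  # [] because the sentence has only 3 tokens
```

A sentence with k tokens yields k - n + 1 n-grams, and none at all when n exceeds k.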

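2. Proportion that appears in the reference translations

For each n-gram extracted from the machine translation, check whether it appears in at least one human-provided reference translation.
Compute a score: number of matched n-grams / total number of n-grams in the machine translation.
First idea (naive precision): treat every occurrence of an n-gram in the machine translation as a match as long as that n-gram appears somewhere in a reference translation, regardless of how many times the reference contains it. Sum the matches over all n-grams and divide by the total number of n-grams in the machine translation.
Problem: a machine translation that simply repeats a word (or phrase) found in the reference many times gets a very high score, even if the output is not fluent or leaves information out. For example, if the machine translation is just "the the the the" and the reference contains "the", every 1-gram "the" matches and the naive precision is 100%, yet this is clearly not a good translation.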
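The failure case can be reproduced with a minimal sketch of naive precision. The function names and the reference sentence are illustrative assumptions; the text only requires that the reference contain "the".

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def naive_precision(candidate, references, n):
    """Share of candidate n-gram occurrences found in any reference (no clipping)."""
    reference_ngrams = set()
    for reference in references:
        reference_ngrams.update(ngrams(reference, n))
    candidate_ngrams = ngrams(candidate, n)
    if not candidate_ngrams:
        return 0.0  # candidate shorter than n: nothing to score
    matched = sum(1 for gram in candidate_ngrams if gram in reference_ngrams)
    return matched / len(candidate_ngrams)


candidate = ["the", "the", "the", "the"]
references = [["the", "cat", "sat"]]  # assumed reference that contains "the" once

score = naive_precision(candidate, references, 1)
print(score)  # 1.0, although the candidate is useless as a translation
```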

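3. The key to the modification:

To solve the problem above, BLEU changes how matches are counted. For a particular n-gram in the machine translation:

Count how many times it occurs in the machine translation (Count_candidate(ngram)).
Find the maximum number of times it occurs in any single reference translation (Max_Ref_Count(ngram)).
(1) With several reference translations (generally recommended, since it makes the evaluation more reliable), count the n-gram in each reference separately; Max_Ref_Count(ngram) is the largest of those per-reference counts, not their sum.
(2) With only one reference translation, Max_Ref_Count(ngram) is simply the number of times the n-gram occurs in that reference.

The effective match count of this n-gram is therefore clipped to: min(Count_candidate(ngram), Max_Ref_Count(ngram)).

Core idea: an n-gram in the machine translation can be matched at most as many times as it occurs in the single reference translation that contains it most often. Repeating it within the candidate sentence cannot earn more matches than the references support.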
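The clipping rule can be sketched with `collections.Counter`, whose union operator `|` keeps the per-key maximum and so computes Max_Ref_Count directly. The two reference sentences below are made up for illustration; variable names are likewise assumptions.

```python
from collections import Counter

candidate = ["the", "the", "the", "the"]
references = [
    ["the", "cat", "sat"],                # "the" occurs once
    ["the", "cat", "saw", "the", "dog"],  # "the" occurs twice
]  # illustrative references, not from the original text

candidate_counts = Counter(candidate)  # Count_candidate: {'the': 4}

# Max_Ref_Count: per-word maximum over the references (not the sum).
max_ref_counts = Counter()
for reference in references:
    max_ref_counts |= Counter(reference)

# Effective count = min(Count_candidate, Max_Ref_Count) for each candidate word.
clipped = {word: min(count, max_ref_counts[word])
           for word, count in candidate_counts.items()}

modified_p1 = sum(clipped.values()) / sum(candidate_counts.values())
print(clipped, modified_p1)  # {'the': 2} 0.5
```

Clipping drops the score from the naive 100% to 50%, because only two of the four repetitions are supported by any single reference.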

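Comparison with naive precision: a common mistaken version of naive precision works on n-gram types. For each distinct n-gram type in the machine translation, it checks whether that type appears in a reference translation (1 if it does, 0 otherwise) and then divides by the number of distinct n-gram types in the machine translation.

Machine translation: the the the cat. It has only two bigram types: ["the", "the"] and ["the", "cat"].
Suppose the reference translation contains both: ["the", "the"] appears => count 1, ["the", "cat"] appears => count 1.
Matched types = 2, total types in the machine translation = 2.
Naive P₂ (incorrect type-based method) = 2 / 2 = 100% (a severe overestimate, because it ignores repetition)

BLEU's modification vs. the naive type-based method: BLEU's modified method checks whether each n-gram type matches and, more importantly, whether the number of times the n-gram token is repeated in the sentence exceeds the maximum count the reference translations can support. It penalizes "score farming" by simply repeating vocabulary that exists in the references.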
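The gap between the two methods can be made concrete with a small sketch. The text does not give the reference sentence for this example, so the one below is a hypothetical choice that contains each bigram exactly once.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


candidate = ["the", "the", "the", "cat"]
reference = ["the", "the", "cat"]  # hypothetical reference containing both bigram types

cand_counts = Counter(ngrams(candidate, 2))  # (the, the): 2, (the, cat): 1
ref_counts = Counter(ngrams(reference, 2))   # (the, the): 1, (the, cat): 1

# Type-based method: each distinct bigram scores 1 if it appears in the reference.
type_precision = sum(1 for gram in cand_counts if gram in ref_counts) / len(cand_counts)

# Modified method: each bigram's occurrences are clipped to the reference count.
clipped_matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
modified_precision = clipped_matches / sum(cand_counts.values())

print(type_precision)      # 1.0
print(modified_precision)  # 2/3, about 0.667
```

Under this assumed reference, the second ("the", "the") in the candidate goes unmatched, which is exactly the repetition the type-based method cannot see.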

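Summary: the formula for modified n-gram precision

Modified n-gram precision (Pₙ) = Σ (effective match count of each n-gram) / Σ (count of each n-gram in the machine translation)

where, for each n-gram:
effective match count = min(number of times the n-gram occurs in the machine translation, maximum number of times the n-gram occurs in any single reference translation)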
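The same formula can be written symbolically. The notation is introduced here for illustration only: C is the candidate, R the set of reference translations, G_n(C) the set of distinct n-grams in C, and Count_x(g) the number of times n-gram g occurs in sentence x.

```latex
P_n = \frac{\displaystyle\sum_{g \in G_n(C)} \min\Bigl(\mathrm{Count}_C(g),\ \max_{r \in R} \mathrm{Count}_r(g)\Bigr)}
           {\displaystyle\sum_{g \in G_n(C)} \mathrm{Count}_C(g)}
```

The denominator is the total number of n-gram occurrences in the candidate, so a candidate with k tokens has a denominator of k - n + 1.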

Example:

Let the reference translations (reference) be:

[['the', 'cat', 'is', 'on', 'the', 'mat'],  # reference 1 (6 words)
 ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]  # reference 2 (7 words)

and the candidate translation (candidate) be:

['the', 'the', 'the', 'cat', 'on', 'the', 'mat']  # 7 words

Task: compute the modified precision for 1-grams, 2-grams, 3-grams, and 4-grams.

(1) 1-gram precision (n=1)

Word	Candidate count	Reference 1 count	Reference 2 count	Max reference count	Effective count
the	4	2	1	2	min(4,2)=2
cat	1	1	1	1	min(1,1)=1
on	1	1	1	1	min(1,1)=1
mat	1	1	1	1	min(1,1)=1
Total	7	-	-	-	5

1-gram precision = 5/7 ≈ 0.7143

(2) 2-gram precision (n=2)

2-gram	Candidate count	In reference 1?	In reference 2?	Max reference count	Effective count
(the, the)	2	❌	❌	0	min(2,0)=0
(the, cat)	1	✅	❌	1	min(1,1)=1
(cat, on)	1	❌	✅	1	min(1,1)=1
(on, the)	1	✅	✅	1	min(1,1)=1
(the, mat)	1	✅	✅	1	min(1,1)=1
Total	6	-	-	-	4

2-gram precision = 4/6 ≈ 0.6667

(3) 3-gram precision (n=3)

3-gram	Candidate count	In reference 1?	In reference 2?	Max reference count	Effective count
(the, the, the)	1	❌	❌	0	0
(the, the, cat)	1	❌	❌	0	0
(the, cat, on)	1	❌	❌	0	0
(cat, on, the)	1	❌	✅	1	1
(on, the, mat)	1	✅	✅	1	1
Total	5	-	-	-	2

3-gram precision = 2/5 = 0.4

(4) 4-gram precision (n=4)

4-gram	Candidate count	In reference 1?	In reference 2?	Max reference count	Effective count
(the, the, the, cat)	1	❌	❌	0	0
(the, the, cat, on)	1	❌	❌	0	0
(the, cat, on, the)	1	❌	❌	0	0
(cat, on, the, mat)	1	❌	✅	1	1
Total	4	-	-	-	1

4-gram precision = 1/4 = 0.25
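The four hand calculations can be checked with a short, dependency-free sketch. The function names are illustrative assumptions; the data is the example's own. `Fraction` is used so the results stay exact, which means 4/6 is printed in reduced form as 2/3.

```python
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram matches divided by the number of candidate n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for reference in references:
        max_ref_counts |= Counter(ngrams(reference, n))  # per-n-gram maximum
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return Fraction(clipped, total) if total else Fraction(0)


references = [['the', 'cat', 'is', 'on', 'the', 'mat'],
              ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]
candidate = ['the', 'the', 'the', 'cat', 'on', 'the', 'mat']

precisions = {n: modified_precision(candidate, references, n) for n in range(1, 5)}
for n, p in precisions.items():
    print(f"{n}-gram: {p} = {float(p):.4f}")
# 1-gram: 5/7 = 0.7143
# 2-gram: 2/3 = 0.6667
# 3-gram: 2/5 = 0.4000
# 4-gram: 1/4 = 0.2500
```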

Code implementation:

from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'cat', 'is', 'on', 'the', 'mat'],
             ['there', 'is', 'a', 'cat', 'on', 'the', 'mat']]
candidate = ['the', 'the', 'the', 'cat', 'on', 'the', 'mat']

# Put all the weight on one n-gram order at a time to isolate that precision.
# The candidate (7 words) is as long as reference 2, so the brevity penalty is 1.
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

# Individual 1-gram: 0.714286
# Individual 2-gram: 0.666667
# Individual 3-gram: 0.400000
# Individual 4-gram: 0.250000
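Why one-hot weights return the bare precision can be sketched without NLTK, using the standard BLEU definition BP · exp(Σ wₙ · log Pₙ). This is a simplified illustration, not NLTK's actual code: the function names are assumptions, the precisions are copied from the worked example above, and smoothing and zero-precision handling are left out.

```python
import math


def brevity_penalty(candidate_len, reference_lens):
    """Standard BLEU brevity penalty against the closest reference length."""
    closest = min(reference_lens, key=lambda r: (abs(r - candidate_len), r))
    if candidate_len > closest:
        return 1.0
    return math.exp(1 - closest / candidate_len)


def bleu(precisions, weights, bp):
    # Assumes every precision with a nonzero weight is positive (no smoothing).
    log_sum = sum(w * math.log(p) for w, p in zip(weights, precisions) if w > 0)
    return bp * math.exp(log_sum)


precisions = [5 / 7, 4 / 6, 2 / 5, 1 / 4]  # P1..P4 from the worked example
bp = brevity_penalty(7, [6, 7])            # candidate has 7 words; references 6 and 7

one_hot = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
scores = ['%f' % bleu(precisions, w, bp) for w in one_hot]
print(bp, scores)
# 1.0 ['0.714286', '0.666667', '0.400000', '0.250000']
```

With a shorter candidate the brevity penalty would fall below 1, and the one-hot scores would no longer equal the modified precisions.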