
Knowledge Distillation - A KL-Divergence-Based Knowledge Distillation HelloWorld Example Using PyTorch's Built-in F.kl_div

news 2025/11/16 9:31:23

flyfish

kl_div is short for Kullback-Leibler divergence.
"KL" comes from the surname initials of the two researchers who introduced the concept (Kullback and Leibler), and "div" abbreviates "divergence".

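F.kl_div(logQ, P, reduction='sum') is equivalent to torch.sum(P * (torch.log(P) - logQ)), where logQ holds the student's log-probabilities and P holds the teacher's probabilities. The manual form assumes every entry of P is strictly positive, since torch.log(P) is not finite at zero.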
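The same formula can also be checked without PyTorch. The following is a minimal pure-Python sketch, assuming the same logits and temperature T = 2.0 as the script in this post; the helper names softmax and kl_divergence are illustrative and do not belong to any library. It computes sum(P * (log P - log Q)) per sample, averages over the batch, and applies the T**2 compensation.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a plain list of floats."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * (log p_i - log q_i); terms with p_i == 0 are skipped."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
student_logits = [[1.2, 2.1, 2.9], [3.8, 5.2, 6.1]]
T = 2.0

# One KL value per sample, then batch mean and temperature compensation
per_sample = [
    kl_divergence(softmax(t, T), softmax(s, T))
    for t, s in zip(teacher_logits, student_logits)
]
loss = (T ** 2) * sum(per_sample) / len(per_sample)
print(f"{loss:.6f}")
```

Up to floating-point precision, the printed value should agree with what the PyTorch script reports.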

import torch
import torch.nn.functional as F

# 1. Define example inputs (teacher and student logits)
teacher_logits = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=torch.float32)
student_logits = torch.tensor([[1.2, 2.1, 2.9], [3.8, 5.2, 6.1]], dtype=torch.float32)
T = 2.0  # temperature
batch_size = teacher_logits.size(0)

# 2. Soften the logits with the temperature
teacher_scaled = teacher_logits / T
student_scaled = student_logits / T

# 3. Compute the distributions
teacher_soft = F.softmax(teacher_scaled, dim=-1)  # teacher distribution P
student_log_soft = F.log_softmax(student_scaled, dim=-1)  # student log-distribution log Q

# 4. Compute the KL divergence in two ways
# Method 1: manual computation (original formula)
manual_kl = torch.sum(teacher_soft * (torch.log(teacher_soft) - student_log_soft)) / batch_size
manual_kl *= T**2  # temperature compensation

# Method 2: PyTorch's built-in F.kl_div
# Note: F.kl_div(input=logQ, target=P, reduction='sum') corresponds to sum(P*(logP - logQ))
torch_kl = F.kl_div(student_log_soft, teacher_soft, reduction='sum') / batch_size
torch_kl *= T**2  # temperature compensation

# 5. Compare the results
print("===== Teacher distribution P (after softmax) =====")
print(teacher_soft)
print("\n===== Student log-distribution logQ (after log_softmax) =====")
print(student_log_soft)
print("\n===== KL divergence results =====")
print(f"Manual computation: {manual_kl.item():.6f}")
print(f"F.kl_div computation: {torch_kl.item():.6f}")
print(f"Equivalent (error < 1e-6): {torch.allclose(manual_kl, torch_kl, atol=1e-6)}")

Output:

===== Teacher distribution P (after softmax) =====
tensor([[0.1863, 0.3072, 0.5065],
        [0.1863, 0.3072, 0.5065]])

===== Student log-distribution logQ (after log_softmax) =====
tensor([[-1.5909, -1.1409, -0.7409],
        [-1.8200, -1.1200, -0.6700]])

===== KL divergence results =====
Manual computation: 0.008507
F.kl_div computation: 0.008507
Equivalent (error < 1e-6): True
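Dividing the 'sum' reduction by the batch size, as the script does by hand, is what F.kl_div's reduction='batchmean' option computes. The sketch below wraps the steps into a small helper; the name kd_kl_loss and its signature are hypothetical and chosen only for illustration. It compares the helper's result against the sum-then-divide form on the same inputs.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, T=2.0):
    """Hypothetical helper: temperature-scaled KL distillation loss.

    Uses reduction='batchmean', which sums all elements and divides
    by the size of the first (batch) dimension.
    """
    log_q = F.log_softmax(student_logits / T, dim=-1)  # student log-probabilities
    p = F.softmax(teacher_logits / T, dim=-1)          # teacher probabilities
    return F.kl_div(log_q, p, reduction="batchmean") * (T ** 2)

teacher_logits = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
student_logits = torch.tensor([[1.2, 2.1, 2.9], [3.8, 5.2, 6.1]])
T = 2.0

loss = kd_kl_loss(student_logits, teacher_logits, T)

# Reference: reduction='sum' divided by the batch size, as in the script above
reference = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="sum",
) / teacher_logits.size(0) * (T ** 2)

print(f"batchmean: {loss.item():.6f}")
print(f"sum / batch_size: {reference.item():.6f}")
print(torch.allclose(loss, reference, atol=1e-6))
```

In a training loop, the teacher logits would normally be computed under torch.no_grad() so that gradients flow only through the student.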