
Rotary Position Embedding (RoPE): Formulas and Worked Examples

news 2025/9/9 21:26:03

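2.1.1. Core idea

The core idea of Rotary Position Embedding (RoPE) is to encode absolute positions in a way that makes the attention mechanism depend on relative positions.

The premise is that the positional contribution to the attention score between two tokens should depend only on their relative offset ($m - n$), not on their absolute positions ($m$ and $n$). RoPE achieves this by rotating the query and key vectors, treating each pair of components as a point in the complex plane.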

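2.1.2. Mathematical model: from complex numbers to vectors

RoPE is motivated by the geometric meaning of complex multiplication.

Step 1: Rotating a complex number

Consider a complex number $c = a + bi$, which can be represented by the vector $(a, b)$. In polar form it is $re^{i\theta}$, where $r$ is the modulus and $\theta$ is the angle.
Multiplying $c$ by a unit-modulus complex number $e^{i\phi} = \cos\phi + i\sin\phi$ gives:
$re^{i\theta} \cdot e^{i\phi} = re^{i(\theta + \phi)}$
This rotates the vector representing $c$ counterclockwise by the angle $\phi$ while leaving its modulus $r$ unchanged.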
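As a quick numerical illustration of this rotation property, here is a minimal sketch using Python's standard `cmath` module. The values of `c` and `phi` are arbitrary choices made for illustration.

```python
import cmath
import math

c = complex(2.0, 1.0)   # the complex number a + bi, i.e. the vector (a, b)
phi = math.pi / 2       # rotation angle in radians (arbitrary illustrative value)

# Multiply by the unit-modulus complex number e^{i * phi}.
rotated = c * cmath.exp(1j * phi)

# The modulus is unchanged and the argument grows by phi.
modulus_before, modulus_after = abs(c), abs(rotated)
angle_gain = cmath.phase(rotated) - cmath.phase(c)
print(modulus_before, modulus_after, angle_gain)
```

Note that `cmath.phase` returns values in the range from -pi to pi, so for other inputs the difference of phases may need to be wrapped back into that range before comparing it with `phi`.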

Step 2: Treating a vector as complex numbers

For a $d$-dimensional query or key vector $\mathbf{q}$ (with $d$ even), each consecutive pair of components can be treated as one complex number.
For example, $\mathbf{q} = [q_0, q_1, q_2, q_3, ..., q_{d-2}, q_{d-1}]$
can be grouped as $[q_0 + iq_1, q_2 + iq_3, ..., q_{d-2} + iq_{d-1}]$

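Step 3: Rotating each "complex number"

For the token at position $m$, the complex number formed by components $2i$ and $2i + 1$ of its query vector $\mathbf{q}_m$ is rotated by the angle $m\theta_i$, where $i = 0, 1, ..., d/2 - 1$. In these subscripts $i$ is the pair index, not the imaginary unit. Each $\theta_i$ is a fixed, predefined angle, usually chosen so that the frequency decreases as $i$ grows (similar to the sinusoidal positional encoding of the original Transformer):
$\theta_i = 10000^{-2i/d}$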
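The angle schedule can be sketched in a few lines. This is a minimal illustration that assumes the base 10000 from the formula above; the function name `rope_angles` and its `base` parameter are hypothetical and not taken from any library.

```python
def rope_angles(d, base=10000.0):
    """Return [theta_0, ..., theta_{d/2 - 1}] with theta_i = base ** (-2 * i / d)."""
    if d % 2 != 0:
        raise ValueError("d must be even so that components can be paired")
    return [base ** (-2 * i / d) for i in range(d // 2)]

angles_d4 = rope_angles(4)   # two pairs: theta_0 = 1, theta_1 is about 0.01
angles_d8 = rope_angles(8)   # four pairs, each angle smaller than the previous one
print(angles_d4)
print(angles_d8)
```

The first pair always gets the largest angle (1 radian per position), and later pairs rotate progressively more slowly.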

Rotation (formula):
For the query vector $\mathbf{q}$ at position $m$, the components of the rotated vector $\tilde{\mathbf{q}}_m$ are:
$\begin{aligned} \tilde{q}_{m, 2i} &= q_{m, 2i} \cos(m\theta_i) - q_{m, 2i+1} \sin(m\theta_i) \\ \tilde{q}_{m, 2i+1} &= q_{m, 2i+1} \cos(m\theta_i) + q_{m, 2i} \sin(m\theta_i) \end{aligned}$

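The rest of this step explains how the rotation formula is evaluated and illustrates it with a concrete example.

Worked example

Understanding the formula in Step 3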

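1. Understanding the formula: the geometry of complex rotation

The formula follows from the geometric meaning of complex multiplication. Treat the pair of components $(q_{2i}, q_{2i+1})$ as the complex number $z = q_{2i} + i \cdot q_{2i+1}$.

Multiply this complex number by the rotation factor $e^{im\theta_i} = \cos(m\theta_i) + i \sin(m\theta_i)$: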
$\begin{aligned} z' &= z \cdot e^{im\theta_i} \\ &= (q_{2i} + i \cdot q_{2i+1}) \cdot (\cos(m\theta_i) + i \sin(m\theta_i)) \\ &= q_{2i}\cos(m\theta_i) + i q_{2i}\sin(m\theta_i) + i q_{2i+1}\cos(m\theta_i) + i^2 q_{2i+1}\sin(m\theta_i) \\ &= [q_{2i}\cos(m\theta_i) - q_{2i+1}\sin(m\theta_i)] + i [q_{2i}\sin(m\theta_i) + q_{2i+1}\cos(m\theta_i)] \end{aligned}$

The real part is $\tilde{q}_{m, 2i}$ and the imaginary part is $\tilde{q}_{m, 2i+1}$.
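The equivalence between the complex product and the real-valued rotation formula can be checked numerically. The sketch below is illustrative only: the helper names `rotate_pair` and `rotate_pair_complex` and the sample values are assumptions made for this example.

```python
import cmath
import math

def rotate_pair(x, y, angle):
    """Rotate (x, y) by `angle` radians with the real-valued formula."""
    c, s = math.cos(angle), math.sin(angle)
    return x * c - y * s, y * c + x * s

def rotate_pair_complex(x, y, angle):
    """Same rotation, computed as (x + iy) * e^{i * angle}."""
    z = complex(x, y) * cmath.exp(1j * angle)
    return z.real, z.imag

m, theta_i = 5, 0.3   # arbitrary position and angle
real_form = rotate_pair(1.5, -0.7, m * theta_i)
complex_form = rotate_pair_complex(1.5, -0.7, m * theta_i)
print(real_form)
print(complex_form)   # agrees with real_form up to floating-point rounding
```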

2. A concrete calculation

Assumptions:

Position $m = 3$
Angle $\theta_i = 0.8$ (chosen to keep the arithmetic simple)
Original components: $q_{2i} = 2.0$, $q_{2i+1} = 1.0$

Calculation:

Sub-step 1: compute the rotation angle $m\theta_i = 3 \times 0.8 = 2.4 \text{ rad}$

Sub-step 2: evaluate the trigonometric functions $\cos(2.4) \approx -0.7374$, $\sin(2.4) \approx 0.6755$

Sub-step 3: compute the rotated components $\begin{aligned} \tilde{q}_{m, 2i} &= q_{m, 2i} \cos(m\theta_i) - q_{m, 2i+1} \sin(m\theta_i) \\ &= 2.0 \times (-0.7374) - 1.0 \times 0.6755 \\ &= -1.4748 - 0.6755 = -2.1503 \end{aligned}$

$\begin{aligned} \tilde{q}_{m, 2i+1} &= q_{m, 2i+1} \cos(m\theta_i) + q_{m, 2i} \sin(m\theta_i) \\ &= 1.0 \times (-0.7374) + 2.0 \times 0.6755 \\ &= -0.7374 + 1.3510 = 0.6136 \end{aligned}$

3. Geometric interpretation

Original vector: $(2.0, 1.0)$. Rotated vector: $(-2.1503, 0.6136)$.

Check that the modulus is preserved:

Original modulus: $\sqrt{2.0^2 + 1.0^2} = \sqrt{4 + 1} = \sqrt{5} \approx 2.236$
Rotated modulus: $\sqrt{(-2.1503)^2 + 0.6136^2} \approx \sqrt{4.624 + 0.376} = \sqrt{5} \approx 2.236$

Rotation angle: 2.4 rad ≈ 137.5°
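The hand calculation can be reproduced with a short script. It uses only the values from the example ($m = 3$, $\theta_i = 0.8$, components 2.0 and 1.0). At full floating-point precision the second component comes out at about 0.6135, which differs from the hand-computed 0.6136 in the last digit because the hand calculation rounds the cosine and sine to four decimals first.

```python
import math

m, theta = 3, 0.8   # position and angle from the example
x, y = 2.0, 1.0     # original pair of components

angle = m * theta   # 2.4 radians
c, s = math.cos(angle), math.sin(angle)
x_rot = x * c - y * s
y_rot = y * c + x * s

print(round(x_rot, 4), round(y_rot, 4))             # rotated components
print(math.hypot(x, y), math.hypot(x_rot, y_rot))   # moduli before and after
print(math.degrees(angle))                          # rotation angle in degrees
```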

4. Full-vector example

Suppose the 4-dimensional vector is $\mathbf{q}_3 = [2.0, 1.0, -1.0, 0.5]$, with $\theta_0 = 0.8$ and $\theta_1 = 0.4$ (illustrative values, not the ones given by the $10000^{-2i/d}$ formula).

Pair 1 ($i = 0$): $m\theta_0 = 3 \times 0.8 = 2.4$, $\cos(2.4) \approx -0.7374$, $\sin(2.4) \approx 0.6755$

$\begin{aligned} \tilde{q}_{3, 0} &= 2.0 \times (-0.7374) - 1.0 \times 0.6755 = -2.1503 \\ \tilde{q}_{3, 1} &= 1.0 \times (-0.7374) + 2.0 \times 0.6755 = 0.6136 \end{aligned}$

Pair 2 ($i = 1$): $m\theta_1 = 3 \times 0.4 = 1.2$, $\cos(1.2) \approx 0.3624$, $\sin(1.2) \approx 0.9320$

$\begin{aligned} \tilde{q}_{3, 2} &= -1.0 \times 0.3624 - 0.5 \times 0.9320 = -0.3624 - 0.4660 = -0.8284 \\ \tilde{q}_{3, 3} &= 0.5 \times 0.3624 + (-1.0) \times 0.9320 = 0.1812 - 0.9320 = -0.7508 \end{aligned}$

Final rotated vector: $\tilde{\mathbf{q}}_3 = [-2.1503, 0.6136, -0.8284, -0.7508]$
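The same pairwise rotation extends to a whole vector with a simple loop. The following is a minimal sketch rather than a reference implementation; the function name `apply_rope` is hypothetical, and the angles are passed in explicitly so that the example values 0.8 and 0.4 can be reused.

```python
import math

def apply_rope(vec, pos, thetas):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * thetas[i]."""
    if len(vec) != 2 * len(thetas):
        raise ValueError("need exactly one angle per pair of components")
    out = []
    for i, theta in enumerate(thetas):
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, y * c + x * s])
    return out

q3 = [2.0, 1.0, -1.0, 0.5]
q3_rot = apply_rope(q3, pos=3, thetas=[0.8, 0.4])
print([round(v, 4) for v in q3_rot])
```

The printed values match the hand calculation to about three decimal places; small differences in the fourth decimal come from the rounded cosine and sine values used by hand.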

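5. Key properties

Norm-preserving: the rotation does not change the length of the vector
Linear: the operation can be written as a matrix multiplication
Position-dependent: different positions get different rotation angles
Relative position: the attention score contains the relative-position term $(m-n)\theta_i$

This design lets the model capture the relative positions of tokens directly in the attention score, and it is an important innovation in modern Transformer architectures.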

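Notation used in the example above:

The subscripts in the formulas mean:

$\mathbf{q}_m$: the query vector at position $m$
$q_{m, 2i}$: component $2i$ of the query vector at position $m$

So:

$m = 3$: the position index of the token (position 3)
$q_{m, 2i} = 2.0$: the numerical value of component $2i$ of the query vector at position 3

The same example restated with explicit indices:

Assumptions:

The token is at position $m = 3$
Consider pair $i$ of the vector (here $i = 0$, the first pair)
The components of this pair are $q_{3,0} = 2.0$ (component 0) and $q_{3,1} = 1.0$ (component 1)
Rotation angle parameter: $\theta_0 = 0.8$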

Calculation: $\begin{aligned} \text{rotation angle} &= m \times \theta_0 = 3 \times 0.8 = 2.4 \text{ rad} \\ \cos(2.4) &\approx -0.7374 \\ \sin(2.4) &\approx 0.6755 \\ \\ \tilde{q}_{3,0} &= q_{3,0} \cdot \cos(2.4) - q_{3,1} \cdot \sin(2.4) \\ &= 2.0 \times (-0.7374) - 1.0 \times 0.6755 \\ &= -1.4748 - 0.6755 = -2.1503 \\ \\ \tilde{q}_{3,1} &= q_{3,1} \cdot \cos(2.4) + q_{3,0} \cdot \sin(2.4) \\ &= 1.0 \times (-0.7374) + 2.0 \times 0.6755 \\ &= -0.7374 + 1.3510 = 0.6136 \end{aligned}$

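Summary of the notation:

Subscript $m$: position index (which token)
Subscripts $2i$, $2i + 1$: component indices (which element of the vector)
$q_{m,2i}$: the value of component $2i$ of the vector at position $m$

So $q_{3,0} = 2.0$ means "in the query vector at position 3, component 0 has the value 2.0", which is fully consistent with $m = 3$.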

Likewise, for the key vector $\mathbf{k}$ at position $n$, the components of the rotated vector $\tilde{\mathbf{k}}_n$ are: $\begin{aligned} \tilde{k}_{n, 2i} &= k_{n, 2i} \cos(n\theta_i) - k_{n, 2i+1} \sin(n\theta_i) \\ \tilde{k}_{n, 2i+1} &= k_{n, 2i+1} \cos (n\theta_i) + k_{n, 2i} \sin(n\theta_i) \end{aligned}$

Step 4: Matrix form

The operations above can be written compactly as a matrix multiplication:
$\tilde{\mathbf{q}}_m = \mathbf{R}^d_{\Theta, m} \mathbf{q}_m$
where the rotation matrix $\mathbf{R}^d_{\Theta, m}$ is block-diagonal:
$\mathbf{R}^d_{\Theta, m} = \begin{pmatrix} \cos m\theta_0 & -\sin m\theta_0 & 0 & 0 & \cdots & 0 & 0 \\ \sin m\theta_0 & \cos m\theta_0 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m\theta_1 & -\sin m\theta_1 & \cdots & 0 & 0 \\ 0 & 0 & \sin m\theta_1 & \cos m\theta_1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2-1} & -\sin m\theta_{d/2-1} \\ 0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2-1} & \cos m\theta_{d/2-1} \\ \end{pmatrix}$

The rotation matrix for the key vector, $\mathbf{R}^d_{\Theta, n}$, has exactly the same form.
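To make the block-diagonal structure concrete, the sketch below builds the matrix explicitly as a list of rows and applies it to a vector. The helper names `rope_matrix` and `matvec` and the sample vector are assumptions made for illustration.

```python
import math

def rope_matrix(pos, thetas):
    """Build the d x d block-diagonal rotation matrix as a list of rows."""
    d = 2 * len(thetas)
    R = [[0.0] * d for _ in range(d)]
    for i, theta in enumerate(thetas):
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        # Fill the 2 x 2 rotation block on the diagonal for pair i.
        R[2 * i][2 * i], R[2 * i][2 * i + 1] = c, -s
        R[2 * i + 1][2 * i], R[2 * i + 1][2 * i + 1] = s, c
    return R

def matvec(M, v):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

thetas = [1.0, 0.01]
q = [0.5, -1.0, 2.0, 0.25]   # hypothetical query vector
R3 = rope_matrix(3, thetas)
q_rot = matvec(R3, q)
print(q_rot)
```

Because the matrix is mostly zeros, applying the pairwise formula directly gives the same result with work proportional to $d$ rather than $d^2$, so the explicit matrix is mainly useful for derivations.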

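2.1.3. Key property: relative position emerges

The elegant part of RoPE is that, when the attention score is computed, the positional information appears only as a relative offset.

The attention score $a_{m,n}$ is computed from the rotated query and key vectors: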
$\begin{aligned} a_{m,n} &= \langle \tilde{\mathbf{q}}_m, \tilde{\mathbf{k}}_n \rangle \\ &= (\mathbf{R}^d_{\Theta, m} \mathbf{q}_m)^T (\mathbf{R}^d_{\Theta, n} \mathbf{k}_n) \\ &= \mathbf{q}_m^T (\mathbf{R}^d_{\Theta, m})^T \mathbf{R}^d_{\Theta, n} \mathbf{k}_n \end{aligned}$

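Because rotation matrices are orthogonal ($\mathbf{R}^T\mathbf{R} = \mathbf{I}$, so the transpose of a rotation by an angle is the rotation by the negated angle) and rotations in the same plane compose by adding their angles, we have $(\mathbf{R}^d_{\Theta, m})^T \mathbf{R}^d_{\Theta, n} = \mathbf{R}^d_{\Theta, n-m}$, and the expression simplifies to: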
$a_{m,n} = \mathbf{q}_m^T \mathbf{R}^d_{\Theta, n-m} \mathbf{k}_n$
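To see why the product of the two rotation matrices collapses to a single rotation, it is enough to check one $2 \times 2$ block. Writing $\alpha = m\theta_i$ and $\beta = n\theta_i$ (shorthand introduced here for brevity), the angle-difference identities give:
$\begin{aligned} \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}^T \begin{pmatrix} \cos\beta & -\sin\beta \\ \sin\beta & \cos\beta \end{pmatrix} &= \begin{pmatrix} \cos\alpha\cos\beta + \sin\alpha\sin\beta & -(\sin\beta\cos\alpha - \cos\beta\sin\alpha) \\ \sin\beta\cos\alpha - \cos\beta\sin\alpha & \cos\alpha\cos\beta + \sin\alpha\sin\beta \end{pmatrix} \\ &= \begin{pmatrix} \cos(\beta - \alpha) & -\sin(\beta - \alpha) \\ \sin(\beta - \alpha) & \cos(\beta - \alpha) \end{pmatrix} \end{aligned}$
Since $\beta - \alpha = (n - m)\theta_i$ and every diagonal block behaves the same way, the full product is the block-diagonal matrix $\mathbf{R}^d_{\Theta, n-m}$.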

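The positions enter the result only through the relative offset $(n - m)$. In other words, for fixed query and key content, wherever the absolute positions $m$ and $n$ are, the attention score is the same as long as the offset $n - m$ is the same. This is how RoPE captures relative position information.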
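This shift invariance is easy to check numerically. The sketch below uses the complex-number view of RoPE, in which the real inner product of two rotated vectors equals the real part of the sum of $q_i \cdot \overline{k_i}$ over the complex pairs. The function names and the sample vectors are hypothetical.

```python
import cmath

def rope_complex(vec, pos, thetas):
    """View vec as d/2 complex numbers and multiply pair i by e^{i * pos * theta_i}."""
    return [complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * pos * t)
            for i, t in enumerate(thetas)]

def score(q, k, m, n, thetas):
    """Inner product of the rotated query (position m) and rotated key (position n)."""
    qz = rope_complex(q, m, thetas)
    kz = rope_complex(k, n, thetas)
    return sum((a * b.conjugate()).real for a, b in zip(qz, kz))

thetas = [1.0, 0.01]
q = [0.3, -1.2, 0.8, 0.5]    # hypothetical query content
k = [1.1, 0.4, -0.6, 0.9]    # hypothetical key content

# Same offset n - m = 2 at very different absolute positions.
same_offset = [score(q, k, m, m + 2, thetas) for m in (0, 1, 10, 100)]
# A different offset (n - m = 3) for comparison.
other_offset = score(q, k, 1, 4, thetas)
print(same_offset)
print(other_offset)
```

All entries of `same_offset` agree up to floating-point rounding, while `other_offset` differs, which is the behavior the derivation predicts.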

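2.1.4. Worked example

Suppose the vector dimension is d=4, so the components split into 2 complex pairs.
Base angles: $\theta_0 = 10000^{-0/4} = 1$, $\theta_1 = 10000^{-2/4} = 0.01$ (d=4 is chosen only to keep the example small)

Token A is at position m=1 with query vector $\mathbf{q}_1 = [q_0, q_1, q_2, q_3]$
Token B is at position n=3 with key vector $\mathbf{k}_3 = [k_0, k_1, k_2, k_3]$
Here $q_0, ..., q_3$ and $k_0, ..., k_3$ denote the individual components before rotation.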

Rotation of pair 1 (with $\theta_0 = 1$):

$\tilde{q}_{1, 0} = q_0 \cos(1 \cdot 1) - q_1 \sin(1 \cdot 1) = q_0 \cos(1) - q_1 \sin(1)$
$\tilde{q}_{1, 1} = q_1 \cos(1 \cdot 1) + q_0 \sin(1 \cdot 1) = q_1 \cos(1) + q_0 \sin(1)$
$\tilde{k}_{3, 0} = k_0 \cos(3 \cdot 1) - k_1 \sin(3 \cdot 1) = k_0 \cos(3) - k_1 \sin(3)$
$\tilde{k}_{3, 1} = k_1 \cos(3 \cdot 1) + k_0 \sin(3 \cdot 1) = k_1 \cos(3) + k_0 \sin(3)$

Rotation of pair 2 (with $\theta_1 = 0.01$):

$\tilde{q}_{1, 2} = q_2 \cos(1 \cdot 0.01) - q_3 \sin(1 \cdot 0.01) = q_2 \cos(0.01) - q_3 \sin(0.01)$
$\tilde{q}_{1, 3} = q_3 \cos(1 \cdot 0.01) + q_2 \sin(1 \cdot 0.01) = q_3 \cos(0.01) + q_2 \sin(0.01)$
$\tilde{k}_{3, 2} = k_2 \cos(3 \cdot 0.01) - k_3 \sin(3 \cdot 0.01) = k_2 \cos(0.03) - k_3 \sin(0.03)$
$\tilde{k}_{3, 3} = k_3 \cos(3 \cdot 0.01) + k_2 \sin(3 \cdot 0.01) = k_3 \cos(0.03) + k_2 \sin(0.03)$

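Now compute the attention score $a_{1,3}$:
$a_{1,3} = \tilde{q}_{1, 0}\tilde{k}_{3, 0} + \tilde{q}_{1, 1}\tilde{k}_{3, 1} + \tilde{q}_{1, 2}\tilde{k}_{3, 2} + \tilde{q}_{1, 3}\tilde{k}_{3, 3}$

By the earlier derivation, this equals the inner product of the original query with the original key rotated by the relative offset $3 - 1 = 2$, that is, $a_{1,3} = \langle \mathbf{q}_1, \mathbf{R}^4_{\Theta, 2} \mathbf{k}_3 \rangle$. The matrix $\mathbf{R}^4_{\Theta, 2}$ rotates the first pair by $2\theta_0 = 2$ rad and the second pair by $2\theta_1 = 0.02$ rad, so the positions affect the score only through the offset 2.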
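The identity for this d=4 example can be verified with concrete numbers. The component values below are made up for illustration (the text keeps them symbolic), and `apply_rope` and `dot` are hypothetical helper names.

```python
import math

def apply_rope(vec, pos, thetas):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * thetas[i]."""
    out = []
    for i, theta in enumerate(thetas):
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, y * c + x * s])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

d = 4
thetas = [10000 ** (-2 * i / d) for i in range(d // 2)]   # theta_0 = 1, theta_1 about 0.01

q1 = [0.3, -1.2, 0.8, 0.5]   # hypothetical query at position m = 1
k3 = [1.1, 0.4, -0.6, 0.9]   # hypothetical key at position n = 3

# Left side: rotate both vectors by their absolute positions, then take the inner product.
lhs = dot(apply_rope(q1, 1, thetas), apply_rope(k3, 3, thetas))
# Right side: leave the query alone and rotate the key by the offset n - m = 2.
rhs = dot(q1, apply_rope(k3, 2, thetas))
print(lhs, rhs)
```

The two values agree up to floating-point rounding, matching $a_{1,3} = \langle \mathbf{q}_1, \mathbf{R}^4_{\Theta, 2} \mathbf{k}_3 \rangle$.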