Qwen3 Embedding Test
Contents
- Environment Setup
- Initialization
- Example
- Another Example
Environment Setup
uv init
uv venv
.venv/Scripts/activate
uv pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
uv pip install transformers
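A quick sanity check (my own optional addition, not part of the original steps) to confirm the install and CUDA availability:

import torch
import transformers

print(torch.__version__)           # expect 2.5.1+cu124
print(torch.cuda.is_available())   # True if the CUDA build sees a GPU
print(transformers.__version__)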
Initialization
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
This section imports the PyTorch libraries and Hugging Face's Transformers library in preparation for using the Qwen3-Embedding model.
def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        print("left_padding")
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'
Two key functions are defined here:
- last_token_pool: extracts the representation of the last valid token from the model's hidden states. It handles two padding cases (see the toy sketch after this list):
  - Left padding: take the vector at the last position directly
  - Right padding: compute each sequence's actual length, then take the vector at that position
- get_detailed_instruct: builds an instruction-formatted query string by combining the task description with the query
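To make the two branches concrete, here is a small toy sketch (my own illustration with made-up tensors, not from the original) showing which position last_token_pool selects in each case:

# Toy batch: 2 sequences of length 4, hidden size 3.
hidden = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)

# Left padding: the last column of the mask is all ones,
# so the branch simply takes hidden[:, -1].
left_mask = torch.tensor([[0, 0, 1, 1],
                          [0, 1, 1, 1]])
print(last_token_pool(hidden, left_mask))   # rows hidden[0, 3] and hidden[1, 3]

# Right padding: real lengths are 3 and 4, so the vectors at
# indices (length - 1) are gathered per sequence.
right_mask = torch.tensor([[1, 1, 1, 0],
                           [1, 1, 1, 1]])
print(last_token_pool(hidden, right_mask))  # hidden[0, 2] and hidden[1, 3]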
Example
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'What is the capital of China?'),
    get_detailed_instruct(task, 'Explain gravity')
]
# No need to add instruction for retrieval documents
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun."
]
input_texts = queries + documents
input_texts
['Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:What is the capital of China?',
 'Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:Explain gravity',
 'The capital of China is Beijing.',
 'Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.']
This section prepares the test data:
- Defines a retrieval task description
- Creates two instruction-formatted queries
- Prepares two documents as retrieval targets
- Merges the queries and documents into a single input list
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Embedding-0.6B', padding_side='left')
model = AutoModel.from_pretrained('Qwen/Qwen3-Embedding-0.6B')
This loads the Qwen3-Embedding-0.6B model and its tokenizer:
- padding_side='left' selects left padding, which matters later when extracting the last token's representation
- The model is loaded in CPU mode (append .cuda() for GPU acceleration)
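For GPU use, a hedged variant of the load step might look like the sketch below; the torch_dtype=torch.float16 choice is my own assumption, not part of the original:

# Assumes a CUDA GPU is available; half precision roughly halves memory use.
model = AutoModel.from_pretrained(
    'Qwen/Qwen3-Embedding-0.6B',
    torch_dtype=torch.float16,
).cuda()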
max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.keys()
dict_keys(['input_ids', 'attention_mask'])
This section tokenizes the input texts:
- Sets the maximum length to 8192 (the context length Qwen3-Embedding supports)
- Enables padding and truncation so all sequences in the batch share one length
- Returns PyTorch tensors
- The output shows that batch_dict contains two keys, 'input_ids' and 'attention_mask'
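As an optional check (my own addition), the attention mask makes the left padding visible: padded positions appear as leading zeros in each row:

# Each row of the mask starts with zeros (padding) and ends with ones (real tokens).
print(batch_dict['input_ids'].shape)         # (4, longest sequence in the batch)
print(batch_dict['attention_mask'][0][:10])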
batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings.shape
This section runs inference and extracts the embeddings:
- Moves the input data to the model's device
- Passes it through the model to get the outputs
- Uses last_token_pool to extract one representation vector per sequence
- The printed "left_padding" confirms that left padding was used
- The embedding shape is [4, 1024]: 4 input texts, each with a 1024-dimensional embedding
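Since only the embeddings are needed, the forward pass can optionally be wrapped in torch.no_grad() to skip autograd bookkeeping (my own tweak; the original runs without it, which is why the score tensor below carries a grad_fn):

# Optional: disable gradient tracking during inference to save memory.
with torch.no_grad():
    outputs = model(**batch_dict)
    embeddings = last_token_pool(outputs.last_hidden_state,
                                 batch_dict['attention_mask'])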
# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
scores
tensor([[0.7646, 0.1414],
        [0.1355, 0.6000]], grad_fn=<MmBackward0>)
The final section computes similarity scores:
- L2-normalizes the embeddings so each has unit length
- Takes the dot product between the query embeddings (first 2) and the document embeddings (last 2) to get a similarity matrix
- Query 1 scores 0.7646 (high) against document 1 and 0.1414 (low) against document 2
- Query 2 scores 0.1355 (low) against document 1 and 0.6000 (high) against document 2
This shows the model matches each query to its relevant document: the "capital of China" query scores high against the document mentioning Beijing, and the "explain gravity" query scores high against the document describing gravity.
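Because the embeddings were L2-normalized, the dot product is exactly cosine similarity; a quick check (my own addition) confirms this:

# Dot product of unit vectors == cosine similarity.
cos = F.cosine_similarity(embeddings[0], embeddings[2], dim=0)
print(cos, scores[0, 0])  # the two values agree up to floating-point error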
Another Example
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'How does photosynthesis work?'),
    get_detailed_instruct(task, 'What are the benefits of exercise?')
]
# No need to add instruction for retrieval documents
documents = [
    "Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods with the help of chlorophyll. During this process, plants convert light energy into chemical energy, absorb carbon dioxide and release oxygen.",
    "Regular exercise offers numerous benefits including improved cardiovascular health, stronger muscles and bones, better weight management, enhanced mental health, reduced risk of chronic diseases, improved sleep quality, and increased energy levels."
]
input_texts = queries + documents
max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
scores
left_padding
tensor([[0.6436, 0.1156],
        [0.2306, 0.7621]], grad_fn=<MmBackward0>)
As in the first example, each query scores highest against its matching document: the photosynthesis query against the photosynthesis passage (0.6436), and the exercise query against the exercise passage (0.7621).
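To avoid repeating these steps, the whole flow could be wrapped in a helper function; the sketch below is my own (the name embed_texts is hypothetical), reusing the tokenizer, model, and last_token_pool defined above:

def embed_texts(texts: list[str]) -> Tensor:
    # Tokenize, run the model, pool the last token, and L2-normalize.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=8192, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**batch)
    emb = last_token_pool(out.last_hidden_state, batch['attention_mask'])
    return F.normalize(emb, p=2, dim=1)

# Usage: queries first, then documents, as in both examples above.
emb = embed_texts(input_texts)
print(emb[:2] @ emb[2:].T)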
Reference: https://github.com/QwenLM/Qwen3-Embedding?tab=readme-ov-file