
[Tencent Embraces Open Source] Youtu-Embedding: A Collaborative yet Distinct Framework Based on CoDiEmb for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity

🎯 Introduction

Youtu-Embedding is a state-of-the-art general-purpose text embedding model developed by Tencent Youtu Lab. It delivers excellent performance across a wide range of natural language processing tasks, including information retrieval (IR), semantic textual similarity (STS), clustering, reranking, and classification.

  • Top-ranked performance: As of September 2025, it tops the authoritative CMTEB (Chinese Massive Text Embedding Benchmark) leaderboard with a score of 77.46, demonstrating strong and robust text representation ability.

  • Innovative training framework: It adopts a collaborative-discriminative fine-tuning framework that mitigates the "negative transfer" problem in multi-task learning through a unified data format, task-differentiated loss functions, and a dynamic single-task sampling mechanism (illustrated in the sketch after the note below).

Note: You can easily fine-tune the model on your own data to adapt it to domain-specific tasks; see the training code for implementation details.
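The released training code is the definitive reference for this framework. Purely as an illustration of the two ideas named above, here is a minimal, hypothetical sketch: infonce_loss, cosent_loss, and training_step are names invented for this example, assuming an InfoNCE-style contrastive loss for IR and a CoSENT-style ranking loss for STS, with each batch drawn from a single task.

import random
import torch
import torch.nn.functional as F

def infonce_loss(q, p, temperature=0.05):
    # IR objective: each query should score highest against its own passage;
    # the other passages in the batch serve as in-batch negatives.
    logits = q @ p.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def cosent_loss(e1, e2, labels, scale=20.0):
    # STS objective (CoSENT-style): for any two pairs, the higher-rated pair
    # should have the higher cosine similarity.
    sims = scale * F.cosine_similarity(e1, e2)
    diff = sims.unsqueeze(0) - sims.unsqueeze(1)      # diff[i, j] = sims[j] - sims[i]
    mask = labels.unsqueeze(1) > labels.unsqueeze(0)  # pair i should outrank pair j
    zero = torch.zeros(1, device=sims.device)
    return torch.logsumexp(torch.cat([zero, diff[mask]]), dim=0)

def training_step(model, ir_batches, sts_batches):
    # Dynamic single-task sampling: every batch comes from exactly one task,
    # so IR and STS gradients never mix within a single optimizer step.
    if random.random() < 0.5:
        queries, passages = next(ir_batches)  # model(...) -> normalized embeddings
        return infonce_loss(model(queries), model(passages))
    s1, s2, labels = next(sts_batches)
    return cosent_loss(model(s1), model(s2), labels)

Because every optimizer step sees gradients from only one objective, the two tasks cannot pull a shared batch in conflicting directions, which is the intuition behind single-task sampling.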

🤗 Model Download

| Model | Parameters | Embedding Dimension | Sequence Length | Download |
|---|---|---|---|---|
| Youtu-Embedding | 2B | 2048 | 8K | Model |

🚀 Usage

1. Using transformers

📦 Installation

pip install transformers==4.51.3 liger_kernel==0.5.4

⚙️ Usage

import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer


class LLMEmbeddingModel():

    def __init__(self, model_name_or_path, batch_size=128, max_length=1024, gpu_id=0):
        self.model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="right")
        self.device = torch.device(f"cuda:{gpu_id}")
        self.model.to(self.device).eval()
        self.max_length = max_length
        self.batch_size = batch_size

        query_instruction = "Given a search query, retrieve passages that answer the question"
        if query_instruction:
            self.query_instruction = f"Instruction: {query_instruction} \nQuery:"
        else:
            self.query_instruction = "Query:"
        self.doc_instruction = ""
        print(f"query instruction: {[self.query_instruction]}\ndoc instruction: {[self.doc_instruction]}")

    def mean_pooling(self, hidden_state, attention_mask):
        # Average the token embeddings, ignoring padded (and masked-out) positions.
        s = torch.sum(hidden_state * attention_mask.unsqueeze(-1).float(), dim=1)
        d = attention_mask.sum(dim=1, keepdim=True).float()
        embedding = s / d
        return embedding

    @torch.no_grad()
    def encode(self, sentences_batch, instruction):
        inputs = self.tokenizer(
            sentences_batch,
            padding=True,
            truncation=True,
            return_tensors="pt",
            max_length=self.max_length,
            add_special_tokens=True,
        ).to(self.device)
        outputs = self.model(**inputs)
        last_hidden_state = outputs[0]

        # Zero out the attention mask over the instruction tokens so that
        # mean pooling only covers the actual query/passage text.
        instruction_tokens = self.tokenizer(
            instruction,
            padding=False,
            truncation=True,
            max_length=self.max_length,
            add_special_tokens=True,
        )["input_ids"]
        if len(np.shape(np.array(instruction_tokens))) == 1:
            inputs["attention_mask"][:, :len(instruction_tokens)] = 0
        else:
            instruction_length = [len(item) for item in instruction_tokens]
            assert len(instruction) == len(sentences_batch)
            for idx in range(len(instruction_length)):
                inputs["attention_mask"][idx, :instruction_length[idx]] = 0

        embeddings = self.mean_pooling(last_hidden_state, inputs["attention_mask"])
        embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
        return embeddings

    def encode_queries(self, queries):
        queries = queries if isinstance(queries, list) else [queries]
        queries = [f"{self.query_instruction}{query}" for query in queries]
        return self.encode(queries, self.query_instruction)

    def encode_passages(self, passages):
        passages = passages if isinstance(passages, list) else [passages]
        passages = [f"{self.doc_instruction}{passage}" for passage in passages]
        return self.encode(passages, self.doc_instruction)

    def compute_similarity_for_vectors(self, q_reps, p_reps):
        # Embeddings are L2-normalized, so the dot product is cosine similarity.
        if len(p_reps.size()) == 2:
            return torch.matmul(q_reps, p_reps.transpose(0, 1))
        return torch.matmul(q_reps, p_reps.transpose(-2, -1))

    def compute_similarity(self, queries, passages):
        q_reps = self.encode_queries(queries)
        p_reps = self.encode_passages(passages)
        scores = self.compute_similarity_for_vectors(q_reps, p_reps)
        scores = scores.detach().cpu().tolist()
        return scores


queries = ["What's the weather like?"]
passages = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.'
]

model_name_or_path = "tencent/Youtu-Embedding"
model = LLMEmbeddingModel(model_name_or_path)
scores = model.compute_similarity(queries, passages)
print(f"scores: {scores}")
2. Using sentence-transformers

📦 Installation

pip install sentence-transformers==5.1.0

⚙️ Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tencent/Youtu-Embedding", trust_remote_code=True)

queries = ["What's the weather like?"]
passages = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.'
]

queries_embeddings = model.encode_query(queries)
passages_embeddings = model.encode_document(passages)

similarities = model.similarity(queries_embeddings, passages_embeddings)
print(similarities)
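For retrieval over a larger corpus, the same embeddings work with sentence-transformers' built-in semantic search helper. A minimal sketch reusing the objects above (util.semantic_search is standard library API, but check the behavior on your installed version):

from sentence_transformers import util

# Return the top-2 passages for each query as dicts with 'corpus_id' and 'score'.
hits = util.semantic_search(queries_embeddings, passages_embeddings, top_k=2)
for hit in hits[0]:
    print(f"{hit['score']:.4f}  {passages[hit['corpus_id']]}")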
3. Using LangChain 🦜

You can easily integrate the model into your LangChain applications, such as RAG pipelines.

📦 Installation

pip install langchain==0.3.27 langchain-community==0.3.29 langchain-huggingface==0.3.1 sentence-transformers==5.1.0 faiss-cpu==1.11.0

⚙️ Usage

import torch
from langchain.docstore.document import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

model_name_or_path = "tencent/Youtu-Embedding"
device = "cuda" if torch.cuda.is_available() else "cpu"

model_kwargs = {
    'trust_remote_code': True,
    'device': device
}
embedder = HuggingFaceEmbeddings(
    model_name=model_name_or_path,
    model_kwargs=model_kwargs,
)

# The query instruction must be prepended manually; documents use no instruction.
query_instruction = "Instruction: Given a search query, retrieve passages that answer the question \nQuery:"
doc_instruction = ""

data = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
documents = [Document(page_content=text, metadata={"id": i}) for i, text in enumerate(data)]
vector_store = FAISS.from_documents(documents, embedder, distance_strategy="MAX_INNER_PRODUCT")

query = "Which planet is known as the Red Planet?"
instructed_query = query_instruction + query
results = vector_store.similarity_search_with_score(instructed_query, k=3)

print(f"Original Query: {query}\n")
print("Results:")
for doc, score in results:
    print(f"- Text: {doc.page_content} (Score: {score:.4f})")
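To use the store inside a larger LangChain RAG chain, it can be wrapped as a retriever. A minimal sketch; note that the retriever does not add the query instruction for you, so it must still be prepended to every question:

# Standard LangChain retriever interface over the FAISS store built above.
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
for doc in retriever.invoke(query_instruction + query):
    print(doc.page_content)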
4. Using LlamaIndex 🦙

This is ideal for integrating the model into your LlamaIndex search and retrieval systems.

📦 Installation

pip install llama-index==0.14.2 llama-index-embeddings-huggingface==0.6.1 sentence-transformers==5.1.0 llama-index-vector-stores-faiss==0.5.1

⚙️ Usage

import faiss
import torch
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import VectorStoreQuery
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model_name_or_path = "tencent/Youtu-Embedding"
device = "cuda" if torch.cuda.is_available() else "cpu"

embeddings = HuggingFaceEmbedding(
    model_name=model_name_or_path,
    trust_remote_code=True,
    device=device,
    query_instruction="Instruction: Given a search query, retrieve passages that answer the question \nQuery:",
    text_instruction=""
)

data = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

# Embed each text and index the vectors for inner-product search.
nodes = [TextNode(id_=str(i), text=text) for i, text in enumerate(data)]
for node in nodes:
    node.embedding = embeddings.get_text_embedding(node.get_content())

embed_dim = len(nodes[0].embedding)
store = FaissVectorStore(faiss_index=faiss.IndexFlatIP(embed_dim))
store.add(nodes)

query = "Which planet is known as the Red Planet?"
query_embedding = embeddings.get_query_embedding(query)
results = store.query(
    VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=3)
)

print(f"Query: {query}\n")
print("Results:")
for idx, score in zip(results.ids, results.similarities):
    print(f"- Text: {data[int(idx)]} (Score: {score:.4f})")
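If you prefer LlamaIndex's higher-level retriever interface to querying the vector store directly, the same store and embedding model can be wrapped in a VectorStoreIndex. A minimal sketch using the standard StorageContext pattern (verify the imports against your installed llama-index version):

from llama_index.core import StorageContext, VectorStoreIndex

# Wrap the FAISS store in an index and retrieve through the usual interface.
storage_context = StorageContext.from_defaults(vector_store=store)
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context, embed_model=embeddings)
retriever = index.as_retriever(similarity_top_k=3)
for result in retriever.retrieve(query):
    print(f"- Text: {result.node.get_content()} (Score: {result.score:.4f})")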

📊 CMTEB

| Model | Parameters | Mean (Task) | Mean (Type) | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS |
|---|---|---|---|---|---|---|---|---|---|
| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 79.30 | 68.28 | 73.73 | 55.19 |
| ritrieve_zh_v1 | 326M | 72.71 | 73.85 | 76.88 | 66.50 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |
| Conan-embedding-v2 | 1.4B | 74.24 | 75.99 | 76.47 | 68.84 | 92.44 | 74.41 | 78.31 | 65.48 |
| Seed1.6-embedding | - | 75.63 | 76.68 | 77.98 | 73.11 | 88.71 | 71.65 | 79.69 | 68.94 |
| QZhou-Embedding | 7B | 76.99 | 78.58 | 79.99 | 70.91 | 95.07 | 74.85 | 78.80 | 71.89 |
| Youtu-Embedding | 2B | 77.58 | 78.86 | 78.65 | 84.27 | 86.12 | 75.10 | 80.21 | 68.82 |

Note: The comparison scores were taken from the MTEB leaderboard on September 28, 2025.
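To reproduce scores like these on individual tasks, the mteb package can evaluate any sentence-transformers-compatible model. A minimal sketch; LCQMC is used here only as one example CMTEB task name, and the mteb interface differs between versions:

import mteb
from sentence_transformers import SentenceTransformer

# Run a single CMTEB task; results are written as JSON under ./results.
model = SentenceTransformer("tencent/Youtu-Embedding", trust_remote_code=True)
tasks = mteb.get_tasks(tasks=["LCQMC"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")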

🎉 Citation

@misc{zhang2025codiemb,
  title={CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity},
  author={Zhang, Bowen and Song, Zixin and Chen, Chunquan and Zhang, Qian-Wen and Yin, Di and Sun, Xing},
  year={2025},
  eprint={2508.11442},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2508.11442},
}