[Tencent Open Source] Youtu-Embedding: Unified Representation Learning for Information Retrieval and Semantic Textual Similarity, Built on CoDiEmb, a Collaborative yet Distinct Framework
🎯 Introduction
Youtu-Embedding is a state-of-the-art general-purpose text embedding model developed by Tencent Youtu Lab. It delivers excellent performance across a wide range of natural language processing tasks, including information retrieval (IR), semantic textual similarity (STS), clustering, reranking, and classification.
- Top-tier performance: As of September 2025, it ranked first on the authoritative CMTEB (Chinese Massive Text Embedding Benchmark) with a score of 77.46, demonstrating strong and robust text representation capabilities.
- Innovative training framework: A collaborative yet distinct discriminative fine-tuning framework that combines a unified data format, task-differentiated loss functions, and dynamic single-task sampling to effectively mitigate "negative transfer" in multi-task learning (a minimal sketch of the sampling idea follows below).

Note: You can easily fine-tune the model on your own data to adapt it to domain-specific tasks; see the training code for implementation details.
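To make the training-framework bullet concrete: dynamic single-task sampling means each optimization step sees a batch drawn from exactly one task, so the IR and STS objectives never share a gradient update. The sketch below only illustrates that idea on assumed toy data; the task names and record layouts are hypothetical, and this is not the released training code.

```python
import random

# Hypothetical per-task pools: IR uses (query, positive, negative) triplets,
# STS uses graded sentence pairs. Each batch is drawn from exactly one pool.
tasks = {
    "ir":  [{"query": f"q{i}", "pos": f"d{i}", "neg": f"d{i+1}"} for i in range(1000)],
    "sts": [{"text_a": f"a{i}", "text_b": f"b{i}", "score": 0.8} for i in range(500)],
}

def sample_single_task_batch(tasks, batch_size=32):
    # Pick a task (weighted by pool size), then sample the whole batch from it,
    # so a task-specific loss can be applied to the batch as a unit.
    name = random.choices(list(tasks), weights=[len(v) for v in tasks.values()])[0]
    return name, random.sample(tasks[name], k=min(batch_size, len(tasks[name])))

task_name, batch = sample_single_task_batch(tasks)
print(task_name, len(batch))
```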
🤗 Model Download
| Model Name | Parameters | Dimension | Sequence Length | Download |
|---|---|---|---|---|
| Youtu-Embedding | 2B | 2048 | 8K | Model |
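If you prefer to fetch the weights ahead of time rather than on first use, the standard Hugging Face Hub helper works; this is a generic, optional step rather than something the model requires:

```python
from huggingface_hub import snapshot_download

# Downloads the full model repository into the local HF cache and returns its path.
local_dir = snapshot_download("tencent/Youtu-Embedding")
print(local_dir)
```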
🚀 Usage
1. Using transformers
📦 Installation
```bash
pip install transformers==4.51.3 liger_kernel==0.5.4
```
⚙️ Usage
```python
import torch
import numpy as np
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer


class LLMEmbeddingModel():

    def __init__(self,
                 model_name_or_path,
                 batch_size=128,
                 max_length=1024,
                 gpu_id=0):
        self.model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="right")

        self.device = torch.device(f"cuda:{gpu_id}")
        self.model.to(self.device).eval()

        self.max_length = max_length
        self.batch_size = batch_size

        query_instruction = "Given a search query, retrieve passages that answer the question"
        if query_instruction:
            self.query_instruction = f"Instruction: {query_instruction} \nQuery:"
        else:
            self.query_instruction = "Query:"

        self.doc_instruction = ""
        print(f"query instruction: {[self.query_instruction]}\ndoc instruction: {[self.doc_instruction]}")

    def mean_pooling(self, hidden_state, attention_mask):
        # Average the token embeddings, ignoring padded (and masked-out) positions.
        s = torch.sum(hidden_state * attention_mask.unsqueeze(-1).float(), dim=1)
        d = attention_mask.sum(dim=1, keepdim=True).float()
        embedding = s / d
        return embedding

    @torch.no_grad()
    def encode(self, sentences_batch, instruction):
        inputs = self.tokenizer(
            sentences_batch,
            padding=True,
            truncation=True,
            return_tensors="pt",
            max_length=self.max_length,
            add_special_tokens=True,
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            last_hidden_state = outputs[0]

            # Zero out the instruction tokens so they do not contribute to mean pooling.
            instruction_tokens = self.tokenizer(
                instruction,
                padding=False,
                truncation=True,
                max_length=self.max_length,
                add_special_tokens=True,
            )["input_ids"]
            if len(np.shape(np.array(instruction_tokens))) == 1:
                inputs["attention_mask"][:, :len(instruction_tokens)] = 0
            else:
                instruction_length = [len(item) for item in instruction_tokens]
                assert len(instruction) == len(sentences_batch)
                for idx in range(len(instruction_length)):
                    inputs["attention_mask"][idx, :instruction_length[idx]] = 0

            embeddings = self.mean_pooling(last_hidden_state, inputs["attention_mask"])
            embeddings = torch.nn.functional.normalize(embeddings, dim=-1)
        return embeddings

    def encode_queries(self, queries):
        queries = queries if isinstance(queries, list) else [queries]
        queries = [f"{self.query_instruction}{query}" for query in queries]
        return self.encode(queries, self.query_instruction)

    def encode_passages(self, passages):
        passages = passages if isinstance(passages, list) else [passages]
        passages = [f"{self.doc_instruction}{passage}" for passage in passages]
        return self.encode(passages, self.doc_instruction)

    def compute_similarity_for_vectors(self, q_reps, p_reps):
        if len(p_reps.size()) == 2:
            return torch.matmul(q_reps, p_reps.transpose(0, 1))
        return torch.matmul(q_reps, p_reps.transpose(-2, -1))

    def compute_similarity(self, queries, passages):
        q_reps = self.encode_queries(queries)
        p_reps = self.encode_passages(passages)
        scores = self.compute_similarity_for_vectors(q_reps, p_reps)
        scores = scores.detach().cpu().tolist()
        return scores


queries = ["What's the weather like?"]
passages = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.'
]

model_name_or_path = "tencent/Youtu-Embedding"
model = LLMEmbeddingModel(model_name_or_path)
scores = model.compute_similarity(queries, passages)
print(f"scores: {scores}")
```
2. Using sentence-transformers
📦 Installation
```bash
pip install sentence-transformers==5.1.0
```
⚙️ Usage
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tencent/Youtu-Embedding", trust_remote_code=True)

queries = ["What's the weather like?"]
passages = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.'
]

queries_embeddings = model.encode_query(queries)
passages_embeddings = model.encode_document(passages)

similarities = model.similarity(queries_embeddings, passages_embeddings)
print(similarities)
```
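For larger corpora, the ranking step can also be done with the library's built-in semantic search helper; the snippet below is a minimal sketch reusing the embeddings computed above (`top_k=2` is an arbitrary choice):

```python
from sentence_transformers import util

# Rank all passages for each query by cosine similarity and keep the top 2 hits.
hits = util.semantic_search(queries_embeddings, passages_embeddings, top_k=2)
for hit in hits[0]:
    print(passages[hit["corpus_id"]], round(hit["score"], 4))
```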
3. Using LangChain 🦜
Easily integrate the model into your LangChain applications, such as RAG pipelines.
📦 Installation
```bash
pip install langchain==0.3.27 langchain-community==0.3.29 langchain-huggingface==0.3.1 sentence-transformers==5.1.0 faiss-cpu==1.11.0
```
⚙️ Usage
```python
import torch
from langchain.docstore.document import Document
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

model_name_or_path = "tencent/Youtu-Embedding"
device = "cuda" if torch.cuda.is_available() else "cpu"

model_kwargs = {
    'trust_remote_code': True,
    'device': device
}

embedder = HuggingFaceEmbeddings(
    model_name=model_name_or_path,
    model_kwargs=model_kwargs,
)

query_instruction = "Instruction: Given a search query, retrieve passages that answer the question \nQuery:"
doc_instruction = ""

data = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

documents = [Document(page_content=text, metadata={"id": i}) for i, text in enumerate(data)]
vector_store = FAISS.from_documents(documents, embedder, distance_strategy="MAX_INNER_PRODUCT")

query = "Which planet is known as the Red Planet?"
instructed_query = query_instruction + query
results = vector_store.similarity_search_with_score(instructed_query, k=3)

print(f"Original Query: {query}\n")
print("Results:")
for doc, score in results:
    print(f"- Text: {doc.page_content} (Score: {score:.4f})")
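To plug the store into a RAG chain, it can be exposed as a LangChain retriever. The lines below are a minimal sketch of that step; note that `HuggingFaceEmbeddings` applies no instruction itself, so the query instruction still has to be prepended manually, as above:

```python
# Expose the FAISS store as a retriever for downstream RAG chains.
retriever = vector_store.as_retriever(search_kwargs={"k": 3})
docs = retriever.invoke(query_instruction + query)
for doc in docs:
    print(doc.page_content)
```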
4. Using LlamaIndex 🦙
This is ideal for integrating the model into your LlamaIndex search and retrieval systems.
📦 Installation
```bash
pip install llama-index==0.14.2 llama-index-embeddings-huggingface==0.6.1 sentence-transformers==5.1.0 llama-index-vector-stores-faiss==0.5.1
```
⚙️ Usage
```python
import faiss
import torch
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import VectorStoreQuery
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model_name_or_path = "tencent/Youtu-Embedding"
device = "cuda" if torch.cuda.is_available() else "cpu"

embeddings = HuggingFaceEmbedding(
    model_name=model_name_or_path,
    trust_remote_code=True,
    device=device,
    query_instruction="Instruction: Given a search query, retrieve passages that answer the question \nQuery:",
    text_instruction=""
)

data = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]

nodes = [TextNode(id_=str(i), text=text) for i, text in enumerate(data)]
for node in nodes:
    node.embedding = embeddings.get_text_embedding(node.get_content())

embed_dim = len(nodes[0].embedding)
store = FaissVectorStore(faiss_index=faiss.IndexFlatIP(embed_dim))
store.add(nodes)

query = "Which planet is known as the Red Planet?"
query_embedding = embeddings.get_query_embedding(query)

results = store.query(
    VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=3)
)

print(f"Query: {query}\n")
print("Results:")
for idx, score in zip(results.ids, results.similarities):
    print(f"- Text: {data[int(idx)]} (Score: {score:.4f})")
```
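If you prefer to let LlamaIndex handle embedding and retrieval end to end, the same FAISS backend can power a `VectorStoreIndex`; the following is a minimal sketch under that assumption (it builds a fresh index over the same `data`):

```python
from llama_index.core import Document, StorageContext, VectorStoreIndex

# Build a fresh FAISS-backed index; the embed model applies the query/text
# instructions automatically during indexing and retrieval.
vector_store = FaissVectorStore(faiss_index=faiss.IndexFlatIP(embed_dim))
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    [Document(text=t) for t in data],
    storage_context=storage_context,
    embed_model=embeddings,
)
retriever = index.as_retriever(similarity_top_k=3)
for hit in retriever.retrieve("Which planet is known as the Red Planet?"):
    print(hit.get_content(), hit.score)
```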
📊 CMTEB
| Model | Parameters | Mean (Task) | Mean (Type) | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS |
|---|---|---|---|---|---|---|---|---|---|
| bge-multilingual-gemma2 | 9B | 67.64 | 68.52 | 75.31 | 59.30 | 79.30 | 68.28 | 73.73 | 55.19 |
| ritrieve_zh_v1 | 326M | 72.71 | 73.85 | 76.88 | 66.50 | 85.98 | 72.86 | 76.97 | 63.92 |
| Qwen3-Embedding-4B | 4B | 72.27 | 73.51 | 75.46 | 77.89 | 83.34 | 66.05 | 77.03 | 61.26 |
| Qwen3-Embedding-8B | 8B | 73.84 | 75.00 | 76.97 | 80.08 | 84.23 | 66.99 | 78.21 | 63.53 |
| Conan-embedding-v2 | 1.4B | 74.24 | 75.99 | 76.47 | 68.84 | 92.44 | 74.41 | 78.31 | 65.48 |
| Seed1.6-embedding | - | 75.63 | 76.68 | 77.98 | 73.11 | 88.71 | 71.65 | 79.69 | 68.94 |
| QZhou-Embedding | 7B | 76.99 | 78.58 | 79.99 | 70.91 | 95.07 | 74.85 | 78.80 | 71.89 |
| Youtu-Embedding | 2B | 77.58 | 78.86 | 78.65 | 84.27 | 86.12 | 75.10 | 80.21 | 68.82 |
Note: The comparison scores were taken from the MTEB leaderboard on September 28, 2025.
🎉 Citation
```bibtex
@misc{zhang2025codiemb,
  title={CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity},
  author={Zhang, Bowen and Song, Zixin and Chen, Chunquan and Zhang, Qian-Wen and Yin, Di and Sun, Xing},
  year={2025},
  eprint={2508.11442},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2508.11442},
}
```