
RAG for Beginners: A Hands-On, End-to-End Code Walkthrough

Introduction to RAG

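Retrieval-Augmented Generation (RAG) is a hybrid approach that combines information retrieval with a generative model. By supplying the language model with external knowledge, it improves the accuracy and factual correctness of the model's output.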

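Implementation steps:

  • Data Ingestion: load and preprocess the text data.
  • Chunking: split the data into smaller chunks to improve retrieval performance.
  • Embedding Creation: use an embedding model to convert the text chunks into numerical representations.
  • Semantic Search: retrieve the chunks relevant to the user's query.
  • Response Generation: use a language model to generate a response from the retrieved text.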
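
The five steps can be sketched in miniature before bringing in any real model. The snippet below is only an illustration: the word-count `embed` function, the three sample sentences, and the query are all made up for the example and stand in for a real embedding model and a real document.

```python
import math
from collections import Counter


def embed(text):
    # Toy embedding: word counts (a real pipeline would call an embedding model)
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two word-count vectors
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Steps 1-3: ingest the data, treat each sentence as a chunk, embed every chunk
documents = [
    "RAG combines retrieval with generation",
    "Cats sleep for most of the day",
    "Embeddings map text to vectors",
]
index = [(doc, embed(doc)) for doc in documents]

# Step 4: semantic search, pick the chunk most similar to the query
query = "what is retrieval augmented generation"
best_doc, _ = max(index, key=lambda item: cosine(embed(query), item[1]))

# Step 5: the retrieved chunk becomes the context handed to the language model
prompt = f"Context:\n{best_doc}\n\nQuestion: {query}"
print(prompt)
```

The rest of the article replaces each toy piece with a real one: PyMuPDF for ingestion, a character-based chunker, an embedding API, and a chat model for the final answer.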

Setting Up the Environment

import fitz
import os
import numpy as np
import json
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

Extracting Text from a PDF File

Use the PyMuPDF library to extract text from the PDF file.

def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract text from the page
        all_text += text  # Append the extracted text to the all_text string

    return all_text  # Return the extracted text

Chunking the Extracted Text

Split the text into smaller, overlapping chunks to improve retrieval accuracy.


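def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
        text (str): The text to be chunked.
        n (int): The number of characters in each chunk.
        overlap (int): The number of characters shared by consecutive chunks.

    Returns:
        List[str]: A list of text chunks.
    """
    if overlap >= n:
        # A non-positive step would make range() fail or yield nothing
        raise ValueError("overlap must be smaller than n")

    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks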
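
To see how the overlap behaves, here is a small worked example on a ten-character string. The function is a compact copy of the sliding-window logic above so that the snippet runs on its own; the sample string and the sizes are made up for illustration.

```python
def chunk_text(text, n, overlap):
    # Same sliding-window logic as above: step forward by (n - overlap) characters
    return [text[i:i + n] for i in range(0, len(text), n - overlap)]


sample = "abcdefghij"
chunks = chunk_text(sample, 4, 2)
print(chunks)
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk repeats the last two characters of the one before it, and the final chunk is shorter than `n` because the text runs out. In this example the trailing `'ij'` is already fully contained in the previous chunk, which is a side effect of stepping all the way to the end of the string.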

Setting Up the OpenAI API Client

Initialize the OpenAI client used to generate embeddings and responses.

# Any OpenAI-compatible API offered by a provider in China will work.
# I use Volcano Engine (火山引擎); set the matching base_url and api_key.
client = OpenAI(
    base_url=os.getenv("LLM_BASE_URL"),
    api_key=os.getenv("LLM_API_KEY")
)
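
Because `os.getenv` returns `None` for a variable that is not set, a missing `.env` entry may only show up later as a less obvious error. A small check up front can make that easier to spot. This is a minimal sketch: `missing_env_vars` is a hypothetical helper, not part of any library, and the list contains the four variable names this article reads.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# The four variables read via os.getenv in this article
REQUIRED = ["LLM_BASE_URL", "LLM_API_KEY", "EMBEDDING_MODEL_ID", "LLM_MODEL_ID"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```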

Extracting and Chunking Text from the PDF File

Load the PDF, extract its text, and split it into chunks.

# PDF file
pdf_path = "../../data/AI_Information.en.zh-CN.pdf"

# Extract the text
extracted_text = extract_text_from_pdf(pdf_path)

# Split the text into chunks of 500 characters with an overlap of 100
text_chunks = chunk_text(extracted_text, 500, 100)

# Number of text chunks
print("Number of text chunks:", len(text_chunks))

# First text chunk
print("\nFirst text chunk:")
print(text_chunks[0])

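Creating Embeddings for the Text Chunks

Embeddings convert text into numerical vectors, which makes efficient similarity search possible.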

# from sentence_transformers import SentenceTransformer, util
# from typing import List
# from pathlib import Path
#
#
# def create_embeddings(text: List[str], model_path: str = "../rag_naive/model/gte-base-zh") -> List[List[float]]:
#     """
#     Creates embeddings for the given text using the local-embedding model.
#     eg: modelscope gte-base-zh
#     """
#     # Create embeddings for the input text using the specified model
#
#     st_model = SentenceTransformer(model_name_or_path=model_path)
#     st_embeddings = st_model.encode(text, normalize_embeddings=True)
#     response = [embedding.tolist() for embedding in st_embeddings]
#
#     return response


def create_embeddings(text):
    # Create embeddings for the input text using the specified model
    response = client.embeddings.create(
        model=os.getenv("EMBEDDING_MODEL_ID"),
        input=text
    )
    return response  # Return the response containing the embeddings
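
Some embedding endpoints cap the number of inputs accepted per request; whether yours does is something to check in the provider's documentation. If it does, the chunks can be sent in slices. The sketch below is an assumption-laden illustration: `embed_in_batches`, the `batch_size` of 64, and the stand-in `fake_embed` function are all hypothetical and exist only so the snippet runs without network access.

```python
def embed_in_batches(texts, embed_fn, batch_size=64):
    """Call embed_fn on successive slices of texts and concatenate the resulting vectors."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))  # embed_fn returns one vector per input text
    return vectors


# Stand-in embedding function for demonstration only:
# maps each text to [length, number of vowels]
def fake_embed(batch):
    return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in batch]


vectors = embed_in_batches(["alpha", "beta", "gamma"], fake_embed, batch_size=2)
print(vectors)
```

With the client from this article, `embed_fn` could be something like `lambda batch: [item.embedding for item in create_embeddings(batch).data]`. That returns plain vectors, so any code expecting objects with an `.embedding` attribute would need adjusting.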

Embedding Vectors for the Text Chunks

response = create_embeddings(text_chunks)

Semantic Search

Implement cosine similarity to find the text chunks most relevant to the user's query.

def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors.

    Args:
        vec1 (np.ndarray): The first vector.
        vec2 (np.ndarray): The second vector.

    Returns:
        float: The cosine similarity between the two vectors.
    """
    # Compute the dot product of the two vectors and divide by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


def semantic_search(query, text_chunks, embeddings, k=5):
    """
    Performs semantic search on the text chunks using the given query and embeddings.

    Args:
        query (str): The query for the semantic search.
        text_chunks (List[str]): A list of text chunks to search through.
        embeddings (list): Embedding objects for the text chunks, each exposing an `.embedding` attribute.
        k (int): The number of top relevant text chunks to return. Default is 5.

    Returns:
        List[str]: A list of the top k most relevant text chunks based on the query.
    """
    # Create an embedding for the query
    query_embedding = create_embeddings(query).data[0].embedding
    similarity_scores = []  # Initialize a list to store similarity scores

    # Calculate similarity scores between the query embedding and each text chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))
        similarity_scores.append((i, similarity_score))  # Append the index and similarity score

    # Sort the similarity scores in descending order
    similarity_scores.sort(key=lambda x: x[1], reverse=True)

    # Get the indices of the top k most similar text chunks
    top_indices = [index for index, _ in similarity_scores[:k]]

    # Return the top k most relevant text chunks
    return [text_chunks[index] for index in top_indices]
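
The same ranking can also be computed for all chunks at once with a single matrix-vector product instead of a Python loop. The following is a sketch of that alternative, not the approach used in this article: `top_k_by_cosine` is a hypothetical helper, the two-dimensional vectors are toy data, and it assumes no vector is all zeros (which would cause a division by zero).

```python
import numpy as np


def top_k_by_cosine(query_vec, chunk_vecs, k=5):
    """Return (index, score) pairs for the k rows of chunk_vecs most similar to query_vec."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(chunk_vecs, dtype=float)
    # Cosine similarity of every row against the query in one matrix-vector product
    scores = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]


# Toy data: the first chunk points the same way as the query, the second is orthogonal
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = top_k_by_cosine([1.0, 0.0], toy_chunks, k=2)
print(result)
```

Here the chunk at index 0 ranks first with a similarity of 1, followed by index 2 at roughly 0.707, while the orthogonal chunk at index 1 scores 0 and is left out.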

Running Semantic Search on the Extracted Chunks

# Load the validation data from a JSON file
with open('../../data/val.json', encoding="utf-8") as f:
    data = json.load(f)

# Extract the first query from the validation data
query = data[0]['question']

# Perform semantic search to find the top 2 most relevant text chunks for the query
top_chunks = semantic_search(query, text_chunks, response.data, k=2)

# Print the query
print("Query:", query)

# Print the top 2 most relevant text chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")

Generating a Response from the Retrieved Chunks

# Define the system prompt for the AI assistant.
# English gloss: "You are an AI assistant that answers strictly from the given context. If the answer
# cannot be derived directly from the provided context, reply: 'I do not have enough information to
# answer this question.'"
system_prompt = "你是一个AI助手,严格根据给定的上下文进行回答。如果无法直接从提供的上下文中得出答案,请回复:'我没有足够的信息来回答这个问题。'"


def generate_response(system_prompt, user_message):
    """
    Generates a response from the AI model based on the system prompt and user message.

    Args:
        system_prompt (str): The system prompt to guide the AI's behavior.
        user_message (str): The user's message or query.

    Returns:
        str: The text content of the model's reply.
    """
    response = client.chat.completions.create(
        model=os.getenv("LLM_MODEL_ID"),
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ],
        temperature=0.1,
        top_p=0.8,
        presence_penalty=1.05,
        max_tokens=4096,
    )
    return response.choices[0].message.content


# Create the user prompt based on the top chunks
# ("上下文内容" means "context content"; "问题" means "question")
user_prompt = "\n".join([f"上下文内容 {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\n问题: {query}"

# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)
print(ai_response)

Evaluating Response Quality

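# Define the system prompt for the evaluation system.
# English gloss: "You are an intelligent evaluation system that assesses the AI assistant's answers.
# If the answer is very close to the true answer, score it 1. If it is wrong or does not match the
# true answer, score it 0. If it partially matches the true answer, score it 0.5."
evaluate_system_prompt = "你是一个智能评估系统,负责评估AI助手的回答。如果AI助手的回答与真实答案非常接近,则评分为1。如果回答错误或与真实答案不符,则评分为0。如果回答部分符合真实答案,则评分为0.5。"

# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt
# ("用户问题" means "user question"; "AI回答" means "AI answer")
evaluation_prompt = f"用户问题: {query}\nAI回答:\n{ai_response}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and evaluation prompt
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)
print(evaluation_response)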
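
The evaluator replies in free text, so averaging scores over many questions would require pulling the number out of that reply. One possible way to do this is sketched below. `parse_score` is a hypothetical helper, and it assumes the reply mentions the score as a standalone 0, 0.5, or 1 before any other such number; a reply that breaks this assumption would be parsed incorrectly.

```python
import re

# Matches a standalone 0, 0.5, or 1 (optionally written 0.0 or 1.0),
# but not digits that are part of a longer number such as 10 or 1.05
SCORE_PATTERN = re.compile(r"(?<![\d.])(0\.5|1(?:\.0)?|0(?:\.0)?)(?!\d|\.\d)")


def parse_score(evaluation_text):
    """Return the first standalone 0, 0.5, or 1 found in the text as a float, else None."""
    match = SCORE_PATTERN.search(evaluation_text)
    return float(match.group(1)) if match else None


print(parse_score("评分: 0.5"))        # 0.5
print(parse_score("The score is 1."))  # 1.0
print(parse_score("no number here"))   # None
```

Returning `None` instead of guessing keeps unparseable replies visible, so they can be reviewed by hand rather than silently counted as a score.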