Building a Book Recommendation System with Gemini, LangChain, and Gradio (Part 2)

Building the Vector Embedding Database

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain.docstore.document import Document
from langchain_chroma.vectorstores import Chroma
import vertexai
from vertexai.language_models import TextEmbeddingModel

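These imports pull in the modules needed to build an embedding-based retrieval system, combining LangChain, the Chroma vector database, and Google Vertex AI's text embedding model.

TextLoader is imported from the LangChain community package and loads plain-text data from a .txt file.
Purpose: load a local text file as LangChain Document objects.

CharacterTextSplitter is one of the text splitters LangChain provides.
Purpose: split long text into smaller pieces (for example, by character count) for later embedding or retrieval-based question answering.

Document is imported from langchain.docstore.document.
Purpose: represent a document made of text content plus optional metadata, so it can be passed along and processed throughout the chain.

Chroma is one of the vector stores LangChain supports, built on ChromaDB.
Purpose: store embedded text in a database that can then be used for similarity search and vector retrieval.

TextEmbeddingModel is imported from Vertex AI's language models module.
Purpose: call one of Google's pretrained embedding models to convert text into embedding vectors, which are later stored in the vector database or used for similarity computation.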

# Set the GCP project parameters
project_id = ""
location = "us-central1"  # make sure this region supports Gemini
import pandas as pd
books = pd.read_csv("books_cleaned_new.csv")
books["tagged_description"]
0       9780002005883 A NOVEL THAT READERS and critics...
1       9780002261982 A new 'Christie for Christmas' -...
2       9780006178736 A memorable, mesmerizing heroine...
3       9780006280897 Lewis' work on the nature of lov...
4       9780006280934 "In The Problem of Pain, C.S. Le...
                              ...
5192    9788172235222 On A Train Journey Home To North...
5193    9788173031014 This book tells the tale of a ma...
5194    9788179921623 Wisdom to Create a Life of Passi...
5195    9788185300535 This collection of the timeless ...
5196    9789027712059 Since the three volume edition o...
Name: tagged_description, Length: 5197, dtype: object
books["tagged_description"].to_csv("new_tagged_description.txt",sep = "\n",index = False,header = False)

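This saves the books["tagged_description"] column to a new text file, new_tagged_description.txt, with one value per line and without the index or column name. Because to_csv applies CSV quoting, a description that itself contains double quotes is written wrapped in quotes, which is why a few entries later start with a " character.

sep="\n": separates elements with a newline, so each value sits on its own line.

index=False: does not write the DataFrame's row index.

header=False: does not write the column name (only the text content is saved).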

raw_documents = TextLoader("new_tagged_description.txt", encoding="utf-8").load()
text_splitter = CharacterTextSplitter(chunk_size=0, chunk_overlap=0, separator="\n")
documents = text_splitter.split_documents(raw_documents)

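encoding="utf-8": makes sure the file is read as UTF-8.

CharacterTextSplitter creates a text splitter that uses the newline character \n as the split point and breaks the long text into multiple chunks.

chunk_size=0: a trick rather than a real size limit. Combined with separator="\n", every line already exceeds the 0-character limit, so the splitter never merges adjacent lines and each line becomes its own chunk. This is also why the splitter logs the "Created a chunk of size ..., which is longer than the specified 0" warnings shown below.

chunk_overlap=0: no overlap between chunks.

separator="\n": splits on the newline character.

split_documents cuts the long document loaded in the previous step into smaller Document instances, each representing one line of text.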
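
The effect of this configuration can be approximated in plain Python. The following is only a sketch of the resulting behavior, not LangChain's implementation: SimpleDocument and split_into_line_documents are hypothetical names introduced here, and the sample text is shortened placeholder data.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_into_line_documents(text: str, source: str) -> list[SimpleDocument]:
    """Return one SimpleDocument per non-empty line of `text`."""
    return [
        SimpleDocument(page_content=line.strip(), metadata={"source": source})
        for line in text.split("\n")
        if line.strip()
    ]


# Shortened placeholder lines in the same "ISBN description" shape as the file.
sample = (
    "9780002005883 A NOVEL THAT READERS and critics...\n"
    "9780002261982 A new 'Christie for Christmas'...\n"
)
line_docs = split_into_line_documents(sample, "new_tagged_description.txt")
print(len(line_docs))
print(line_docs[0].page_content.split()[0])
```

Each line ends up as its own document carrying the source file name as metadata, which is the shape the rest of the walkthrough relies on.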

Created a chunk of size 1168, which is longer than the specified 0
Created a chunk of size 1214, which is longer than the specified 0
Created a chunk of size 373, which is longer than the specified 0
Created a chunk of size 309, which is longer than the specified 0
Created a chunk of size 483, which is longer than the specified 0
Created a chunk of size 482, which is longer than the specified 0
Created a chunk of size 960, which is longer than the specified 0
Created a chunk of size 188, which is longer than the specified 0
Created a chunk of size 843, which is longer than the specified 0
Created a chunk of size 296, which is longer than the specified 0
Created a chunk of size 197, which is longer than the specified 0
Created a chunk of size 881, which is longer than the specified 0
Created a chunk of size 1088, which is longer than the specified 0
Created a chunk of size 1189, which is longer than the specified 0
Created a chunk of size 304, which is longer than the specified 0
Created a chunk of size 270, which is longer than the specified 0
Created a chunk of size 211, which is longer than the specified 0
Created a chunk of size 214, which is longer than the specified 0
Created a chunk of size 513, which is longer than the specified 0
Created a chunk of size 752, which is longer than the specified 0
Created a chunk of size 388, which is longer than the specified 0
Created a chunk of size 263, which is longer than the specified 0
Created a chunk of size 253, which is longer than the specified 0
Created a chunk of size 306, which is longer than the specified 0
Created a chunk of size 728, which is longer than the specified 0
...
Created a chunk of size 1655, which is longer than the specified 0
Created a chunk of size 387, which is longer than the specified 0
Created a chunk of size 763, which is longer than the specified 0
Created a chunk of size 1032, which is longer than the specified 0
documents[0]
Document(metadata={'source': 'new_tagged_description.txt'}, page_content='9780002005883 A NOVEL THAT READERS and critics have been eagerly anticipating for over a decade, Gilead is an astonishingly imagined story of remarkable lives. John Ames is a preacher, the son of a preacher and the grandson (both maternal and paternal) of preachers. It’s 1956 in Gilead, Iowa, towards the end of the Reverend Ames’s life, and he is absorbed in recording his family’s story, a legacy for the young son he will never see grow up. Haunted by his grandfather’s presence, John tells of the rift between his grandfather and his father: the elder, an angry visionary who fought for the abolitionist cause, and his son, an ardent pacifist. He is troubled, too, by his prodigal namesake, Jack (John Ames) Boughton, his best friend’s lost son who returns to Gilead searching for forgiveness and redemption. Told in John Ames’s joyous, rambling voice that finds beauty, humour and truth in the smallest of life’s details, Gilead is a song of celebration and acceptance of the best and the worst the world has to offer. At its heart is a tale of the sacred bonds between fathers and sons, pitch-perfect in style and story, set to dazzle critics and readers alike.')
vertexai.init(project=project_id, location=location)
embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-005")

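The first line initializes the Vertex AI client so that AI models on Google Cloud can be called later.

Vertex AI is Google Cloud's machine learning platform. vertexai.init connects to your cloud project and region so you can use the models available on Vertex AI (embedding models, LLMs, AutoML, and so on).

The second line loads a pretrained text embedding model from Vertex AI.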

# Embedding function (batched)
def get_gemini_embeddings(texts: list[str]) -> list[list[float]]:
    embeddings = embedding_model.get_embeddings(texts)
    return [e.values for e in embeddings]

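This defines a function that generates embedding vectors in batches, converting multiple texts into numeric representations (vectors).

The return value is a two-dimensional list: each text maps to one vector, and each vector is a list of floats (for example, a 768-dimensional vector).

The return statement is a list comprehension that pulls the .values attribute out of each embedding object to build the final two-dimensional list.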

# Generate embeddings in batches
BATCH_SIZE = 50
all_texts = [doc.page_content for doc in documents]
all_metadatas = [doc.metadata for doc in documents]
batched_docs = []
batched_vectors = []
for i in range(0, len(all_texts), BATCH_SIZE):
    batch_texts = all_texts[i:i+BATCH_SIZE]
    batch_metadatas = all_metadatas[i:i+BATCH_SIZE]
    batch_vectors = get_gemini_embeddings(batch_texts)
    for text, metadata, vector in zip(batch_texts, batch_metadatas, batch_vectors):
        batched_docs.append(Document(page_content=text, metadata=metadata))
        batched_vectors.append(vector)

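Embedding models such as Vertex AI's text-embedding-005 cannot process too many texts in one call, both for performance reasons and because of API limits. Embeddings are therefore usually generated in batches.

Two lists are extracted from documents: page_content, the text content, and metadata, the metadata associated with each text (such as file name or page number).

batched_docs: holds the Document objects with text and metadata
batched_vectors: holds the vector (list of floats) for each text

The loop advances in steps of BATCH_SIZE and handles one batch of texts per iteration.
Each batch is passed to the get_gemini_embeddings function defined earlier to generate its embedding vectors.

Each text, its metadata, and its vector are paired up as a Document object plus a vector.
They are appended to the two lists for later use (for example, building the vector index).

Example
Suppose you have 120 texts and each needs a vector. The code runs like this:
Batch 1: items 0-49 → embed → append to the results
Batch 2: items 50-99 → embed → append to the results
Batch 3: items 100-119 → embed → append to the results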
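
The 120-text example can be checked with a few lines of plain Python. This is a minimal sketch of the slicing logic only: iter_batches is a hypothetical helper introduced here, and the texts are placeholders, so no embedding call is made.

```python
from typing import Iterator


def iter_batches(items: list[str], batch_size: int) -> Iterator[tuple[int, list[str]]]:
    """Yield (start_index, batch) pairs that cover `items` in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


texts = [f"text {n}" for n in range(120)]  # placeholder texts

# Record the first and last index covered by each batch.
batch_ranges = [
    (start, start + len(batch) - 1)
    for start, batch in iter_batches(texts, batch_size=50)
]
print(batch_ranges)  # [(0, 49), (50, 99), (100, 119)]
```

The last batch is simply shorter than the others; slicing past the end of a list does not raise an error, so no special case is needed.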

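Next, Chroma is used to create a persistent vector database that stores the texts together with their embedding vectors, wrapped with LangChain for later question answering or retrieval.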

from chromadb import PersistentClient
from langchain_chroma.vectorstores import Chroma
# Create the Chroma client first
client = PersistentClient(path="./new_chroma_books")
# Create the vector store collection
collection = client.get_or_create_collection(name="books")
# Insert the data (make sure the number of vectors matches the number of texts)
collection.add(
    documents=[doc.page_content for doc in batched_docs],
    embeddings=batched_vectors,
    metadatas=[doc.metadata for doc in batched_docs],
    ids=[f"doc_{i}" for i in range(len(batched_docs))],
)
# Wrap the vector store with LangChain
db_books = Chroma(
    client=client,
    collection_name="books",
    embedding_function=lambda x: batched_vectors,  # Note: ideally replace this with a function that embeds its input
)

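chromadb: the Python client for the Chroma vector database

langchain_chroma.vectorstores.Chroma: LangChain's adapter for Chroma, which makes it easy to use the vector store inside LangChain

PersistentClient: creates a client for a persistent, local vector database

Chroma: LangChain's wrapper class for the vector database interface

PersistentClient(path="./new_chroma_books") creates or opens a local vector database under ./new_chroma_books

All data is saved in this folder and can be loaded again on the next run

get_or_create_collection creates a collection named books (or opens it if it already exists), similar to a table in a database.

collection.add inserts the following into the books collection:
documents: the original text content (list of strings)
embeddings: the vector for each text (two-dimensional list of floats)
metadatas: the metadata for each text (list of dicts)
ids: a unique ID for each record, such as doc_0, doc_1, ...

The last statement wraps everything as db_books, a vector store object LangChain can use
It receives the current client and the collection name
embedding_function: the embedding function. Here it is a lambda that always returns batched_vectors regardless of its input, so it is only a placeholder. That is tolerable in this walkthrough because queries are embedded explicitly and searched with similarity_search_by_vector, which takes a ready-made vector.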
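
One way to replace the placeholder lambda is a small adapter object. The sketch below assumes that LangChain's Chroma wrapper calls embed_documents and embed_query on whatever is passed as embedding_function (the interface of LangChain's Embeddings base class). CallableEmbeddings and fake_embed_batch are hypothetical names introduced here; in the real pipeline you would pass get_gemini_embeddings instead of the fake function.

```python
from typing import Callable


class CallableEmbeddings:
    """Hypothetical adapter exposing embed_documents / embed_query on top of
    a batch embedding function such as get_gemini_embeddings."""

    def __init__(
        self,
        embed_batch: Callable[[list[str]], list[list[float]]],
        batch_size: int = 50,
    ) -> None:
        self._embed_batch = embed_batch
        self._batch_size = batch_size

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Embed in slices so a single call never exceeds batch_size texts.
        vectors: list[list[float]] = []
        for start in range(0, len(texts), self._batch_size):
            vectors.extend(self._embed_batch(texts[start:start + self._batch_size]))
        return vectors

    def embed_query(self, text: str) -> list[float]:
        # A query is a batch of one; return its single vector.
        return self._embed_batch([text])[0]


def fake_embed_batch(texts: list[str]) -> list[list[float]]:
    """Offline stand-in for a real embedding call (assumption for the demo)."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


embeddings = CallableEmbeddings(fake_embed_batch, batch_size=2)
doc_vectors = embeddings.embed_documents(["a b", "abcd", "x"])
query_vector = embeddings.embed_query("hello world")
print(doc_vectors)
print(query_vector)
```

With an object like this passed as embedding_function, text-based methods such as similarity_search(query, k=10) could embed the query themselves instead of requiring a precomputed vector.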

Next, generate an embedding vector for a query and look up the 10 most similar documents in the vector database.

# Example query
query = "A book to teach children about nature"
query_embedding = get_gemini_embeddings([query])[0]
docs = db_books.similarity_search_by_vector(query_embedding, k=10)

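The get_gemini_embeddings function defined earlier converts the query into a vector (its embedding representation).
get_gemini_embeddings returns a list because it works on batches, so [0] takes the first element, which is the vector for this query.

db_books (the wrapped vector database) then runs a similarity search on the query vector.
similarity_search_by_vector(query_embedding, k=10) returns the 10 documents closest to that vector.

The docs variable holds the description text and metadata of the 10 books most relevant to the query.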
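
Conceptually, a vector search ranks stored vectors by how close they are to the query vector. The sketch below illustrates this with brute-force cosine similarity on toy two-dimensional vectors. It is not how Chroma works internally: Chroma uses an approximate index, and its distance metric depends on how the collection is configured. cosine_similarity and top_k_by_vector are hypothetical helpers introduced here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_vector(
    query_vector: list[float], stored: dict[str, list[float]], k: int
) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda doc_id: cosine_similarity(query_vector, stored[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors; real embeddings have hundreds of dimensions.
toy_vectors = {
    "doc_0": [1.0, 0.0],
    "doc_1": [0.7, 0.7],
    "doc_2": [0.0, 1.0],
}
nearest = top_k_by_vector([0.9, 0.1], toy_vectors, k=2)
print(nearest)  # ['doc_0', 'doc_1']
```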

[Document(id='doc_3751', metadata={'source': 'new_tagged_description.txt'}, page_content='9780786808717 A very special puddle sets Violet the mouse off on her latest nature discovery. It is through this puddle that Violet observes the effect rain has on the world around her. A Mylar puddle on the last page offers children a chance to see their reflection in a puddle, just like Violet!'),Document(id='doc_3747', metadata={'source': 'new_tagged_description.txt'}, page_content='9780786808069 Children will discover the exciting world of their own backyard in this introduction to familiar animals from cats and dogs to bugs and frogs. The combination of photographs, illustrations, and fun facts make this an accessible and delightful learning experience.'),Document(id='doc_442', metadata={'source': 'new_tagged_description.txt'}, page_content='"9780067575208 First published more than three decades ago, this reissue of Rachel Carson\'s award-winning classic brings her unique vision to a new generation of readers. Stunning new photographs by Nick Kelsh beautifully complement Carson\'s intimate account of adventures with her young nephew, Roger, as they enjoy walks along the rocky coast of Maine and through dense forests and open fields, observing wildlife, strange plants, moonlight and storm clouds, and listening to the ""living music"" of insects in the underbrush. ""If a child is to keep alive his inborn sense of wonder."" Writes Carson, ""he needs the companionship of at least one adult who can share it, rediscovering with him the joy, excitement and mystery of the world we live in."" The Sense of Wonder is a refreshing antidote to indifference and a guide to capturing the simple power of discovery that Carson views as essential to life. In her insightful new introduction, Linda Lear remembers Rachel Carson\'s groundbreaking achievements in the context of the legendary environmentalist\'s personal commitment to introducing young and old to the miracles of nature. 
Kelsh\'s lush photographs inspire sensual, tactile reactions: masses of leaves floating in a puddle are just waiting to be scooped up and examined more closely. An image of a narrow path through the trees evokes the earthy scent of the woods after a summer rain. Close-ups of mosses and miniature lichen fantasy-lands will spark innocent\'as well as more jaded\'imaginations. Like a curious child studying things underfoot and within reach, Kelsh\'s camera is drawn to patterns in nature that too often elude hurried adults\'a stand of beech trees in the springtime, patches of melting snow and the ripples from a pebble tossed into a slow-moving stream. The Sense of Wonder is a timeless volume that will be passed on from children to grandchildren, as treasured as the memory of an early-morning walk when the song of a whippoorwill was heard as if for the first time."'),Document(id='doc_3442', metadata={'source': 'new_tagged_description.txt'}, page_content='9780744578263 Washed up on the beach during a storm, the sea-thing child clings fearfully to the shore until he discovers his true destiny. Suggested level: primary.'),Document(id='doc_3797', metadata={'source': 'new_tagged_description.txt'}, page_content='9780789458209 Photographs and text explore the anatomy and life cycle of trees, examining the different kinds of bark, seeds, and leaves, the commercial processing of trees to make lumber, the creatures that live in trees, and other aspects.'),Document(id='doc_1639', metadata={'source': 'new_tagged_description.txt'}, page_content='9780374422080 This Newbery Honor Book tells the story of 11 -year-old Primrose, who lives in a small fishing village in British Columbia. 
She recounts her experiences and all she learns about human nature and the unpredictability of life after her parents are lost at sea.'),Document(id='doc_3750', metadata={'source': 'new_tagged_description.txt'}, page_content="9780786808397 Introduce your baby to birds, cats, dogs, and babies through fine art, illustration, and photographs. These books are a rare opportunity to exopse little ones to a range of images on a single subject, from simple child's drawings and abstract art to playful photos. A brief text accompanies each image, introducing baby to some basic -- and sometimes playful -- information about the subjects."),Document(id='doc_3748', metadata={'source': 'new_tagged_description.txt'}, page_content="9780786808373 Introducing your baby to birds, cats, dogs, and babies through fine art, illsutration and photographs. These books are a rare opportunity to expose little ones to a range of images on a single subject, from simple child's drawings and abstract art to playful photos. A brief text accompanies each image, introducing baby to some basic -- and sometimes playful -- information on the subjects."),Document(id='doc_3522', metadata={'source': 'new_tagged_description.txt'}, page_content='9780753459645 What is a leap year? Why are bees busy in summer? Who eats the moon? Why does it get dark at night? In I Wonder Why the Sun Rises by Brenda Walpole children will find out the answers to these and many more questions about time and seasons.'),Document(id='doc_3749', metadata={'source': 'new_tagged_description.txt'}, page_content="9780786808380 Introduce your babies to birds, cats, dogs, and babies through fine art, illustration, and photographs. These books are a rare opportunity to expose little ones to a range of images on a single subject, from simple child's drawings and abstract art to playful photos. A brief text accompanies each image, introducing the baby to some basic -- and sometimes playful -- information about the subjects.")]
books[books["isbn13"] == int(docs[0].page_content.split()[0].strip())]

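This takes the text content of the most similar document, splits it on whitespace, and takes the first token, which is the ISBN-13 associated with that document. strip() then removes any leading or trailing whitespace from that string (not really necessary for an ISBN, but harmless).
The string is converted to an integer because books["isbn13"] is an integer column.
Using the ISBN from the text of docs[0], the lookup returns that book's full record from the original books table (title, author, rating, categories, and so on).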
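
Some stored descriptions begin with a double quote left over from CSV quoting (doc_442 in the output above is one), and calling int() on such a first token would raise an error. A slightly more defensive parser might look like the following sketch. extract_isbn13 is a hypothetical helper introduced here, and the sample strings are shortened.

```python
def extract_isbn13(page_content: str) -> int:
    """Parse the leading ISBN-13 from an 'ISBN description' string.

    Surrounding whitespace and double quotes are removed first, then the
    first whitespace-separated token is validated and converted to int.
    """
    cleaned = page_content.strip().strip('"')
    tokens = cleaned.split()
    if not tokens:
        raise ValueError("page_content is empty")
    first_token = tokens[0]
    if not (first_token.isdigit() and len(first_token) == 13):
        raise ValueError(f"expected a 13-digit ISBN, got {first_token!r}")
    return int(first_token)


plain = extract_isbn13("9780786808717 A very special puddle sets Violet...")
quoted = extract_isbn13('"9780067575208 First published more than three decades ago..."')
print(plain, quoted)
```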

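The next function recommends books that are semantically related to a natural-language query and returns a DataFrame with the details of those books.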

def retrieve_semantic_recommendations(
    query: str,
    top_k: int = 10,
) -> pd.DataFrame:
    query_embedding = get_gemini_embeddings([query])[0]
    recs = db_books.similarity_search_by_vector(query_embedding, k=50)
    books_list = []
    for i in range(0, len(recs)):
        # Drop a surrounding CSV quote, then take the first token (the ISBN-13)
        books_list += [int(recs[i].page_content.strip('"').split()[0])]
    # Keep at most top_k of the matching rows
    return books[books["isbn13"].isin(books_list)].head(top_k)

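query: the user's natural-language query, for example "Books about space exploration for kids".
top_k: the number of books to return

The query is embedded with get_gemini_embeddings so it can be compared by semantic similarity.
The Chroma vector database built earlier is then searched for the 50 books most similar to that embedding.

recs is a list of Document objects. Each Document's page_content holds a string such as "9780316015844 This book is about...".

The loop extracts each recommended book's ISBN (the first token in page_content), converts it to int, and appends it to books_list.

Finally, the books DataFrame is filtered to the records whose ISBN appears in books_list, returning a table with metadata such as title, rating, page count, and description.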
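
One detail worth knowing: filtering with isin keeps the rows in the DataFrame's own order, not in the similarity order returned by the search. If the ranking matters, the lookup can walk the ranked ISBN list instead. The sketch below shows the idea with a plain dictionary standing in for the books table. rank_preserving_lookup and the toy catalog are hypothetical and introduced only for illustration.

```python
def rank_preserving_lookup(
    ranked_isbns: list[int], catalog: dict[int, dict], top_k: int
) -> list[dict]:
    """Return up to top_k catalog records in the order of ranked_isbns.

    Duplicate ISBNs and ISBNs missing from the catalog are skipped.
    """
    if top_k <= 0:
        return []
    seen: set[int] = set()
    results: list[dict] = []
    for isbn in ranked_isbns:
        if isbn in seen or isbn not in catalog:
            continue
        seen.add(isbn)
        results.append(catalog[isbn])
        if len(results) == top_k:
            break
    return results


# Toy catalog keyed by ISBN-13; titles are placeholders.
toy_catalog = {
    9780000000001: {"isbn13": 9780000000001, "title": "Book A"},
    9780000000002: {"isbn13": 9780000000002, "title": "Book B"},
    9780000000003: {"isbn13": 9780000000003, "title": "Book C"},
}
# Search results ranked best-first, with one duplicate and one unknown ISBN.
ranked = [9780000000003, 9780000000001, 9780000000003, 9780000000009, 9780000000002]
top_titles = [rec["title"] for rec in rank_preserving_lookup(ranked, toy_catalog, top_k=2)]
print(top_titles)  # ['Book C', 'Book A']
```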

retrieve_semantic_recommendations("A book to teach children about nature")
