
Milvus: Scalar Indexes in Detail (Part 13)

news 2025/11/5 8:21:27

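1. Bitmap Index (BITMAP)

1.1 Core Concepts

A bitmap index is built for low-cardinality scalar fields. Its main advantage is that it filters matching records with fast bitwise operations.

Cardinality: the number of distinct values in a field. For example, a "user status" field with only three values (active / inactive / suspended) is low-cardinality.
Typical use cases: fast filtering on boolean, enum, and low-cardinality integer or string fields, especially multi-condition queries (such as "VIP users whose status is active").
How it works: each distinct field value gets its own bitmap, in which each bit corresponds to one record: 1 means the record holds that value, 0 means it does not. A query combines bitmaps with bitwise operations such as AND and OR to obtain the positions of the matching records.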

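1.2 Index Structure and Workflow

1.2.1 Core Components

Component	Description	Role	Practical notes
Key	A distinct value of the indexed field (e.g. "free", "premium", "vip")	Identifies a field value and labels its bitmap	Each key has its own bitmap, so every value maps exactly to the records that hold it
Bitmap	A sequence of 0 and 1 bits whose length equals the number of records in the collection	Records where each key occurs across all records	Bitwise AND/OR is very cheap, so filtering millions of records typically takes milliseconds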

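1.2.2 Worked Example

Suppose a collection holds 5 documents. The Tech bitmap of the "category" field is [1,0,1,0,0] (documents 1 and 3 belong to Tech), and the Public bitmap of the "visibility" field is [1,0,0,1,0] (documents 1 and 4 are public):

To find "documents that are Tech and public", AND the two bitmaps: [1,0,1,0,0] & [1,0,0,1,0] = [1,0,0,0,0].
Only bit 1 is set in the result, which points straight to document 1 without scanning every record.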
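The AND step above can be reproduced in a few lines of plain Python. This is only a minimal sketch of the idea, not how Milvus stores bitmaps internally: each bitmap is held in a Python integer, and the values "Food" and "Private" are made-up fillers for the documents that are not Tech or not public.

```python
def build_bitmaps(values):
    """Return {value: bitmap}, where bit i of the integer is set if record i holds the value."""
    bitmaps = {}
    for i, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps


def matching_docs(bitmap, n_records):
    """Translate set bits back into 1-based document numbers."""
    return [i + 1 for i in range(n_records) if (bitmap >> i) & 1]


# Documents 1..5; "Food" and "Private" are hypothetical filler values.
categories = ["Tech", "Food", "Tech", "Food", "Food"]
visibility = ["Public", "Private", "Private", "Public", "Private"]

cat = build_bitmaps(categories)
vis = build_bitmaps(visibility)

both = cat["Tech"] & vis["Public"]    # AND: Tech and public
either = cat["Tech"] | vis["Public"]  # OR: Tech or public

print(matching_docs(both, 5))    # [1]
print(matching_docs(either, 5))  # [1, 3, 4]
```

One bitwise operation handles all records at once, which is why combining several low-cardinality conditions stays cheap.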

1.3 Complete Example, Explained Line by Line

import random
from pymilvus import MilvusClient, DataType, CollectionSchema, FieldSchema

# 1. Connect to Milvus.
# uri is the server address (19530 is the default local port); pass a token argument if authentication is enabled.
client = MilvusClient(uri="http://localhost:19530")

# 2. Define the collection structure: field types and index-related settings.
collection_name = "user_profiles_bitmap"
fields = [
    # Primary key: auto-generated INT64 that uniquely identifies each record.
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="user_id", dtype=DataType.INT64),  # user ID
    # Low-cardinality status: 0 = inactive, 1 = active, 2 = suspended (3 values, a good fit for BITMAP).
    FieldSchema(name="user_status", dtype=DataType.INT8),
    # Low-cardinality type: only 3 distinct values.
    FieldSchema(name="user_type", dtype=DataType.VARCHAR, max_length=20),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128),  # vector field for hybrid search
]

# 3. Create the collection from the field definitions.
schema = CollectionSchema(fields=fields, description="User profiles with bitmap index")
client.create_collection(collection_name=collection_name, schema=schema)
print("✅ Collection created")

# 4. Generate test data with low-cardinality fields.
user_statuses = [0, 1, 2]
user_types = ["free", "premium", "vip"]
data = []
for i in range(1000):
    data.append({
        "user_id": i + 1000,  # user IDs start at 1000
        "user_status": random.choice(user_statuses),
        "user_type": random.choice(user_types),
        "vector": [random.random() for _ in range(128)],  # random 128-dim vector
    })

# 5. Insert in batches so that no single request is large enough to time out.
batch_size = 200
for i in range(0, len(data), batch_size):
    client.insert(collection_name=collection_name, data=data[i:i + batch_size])
print("✅ Test data inserted")

# 6. Create the indexes.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="user_status",         # must be a scalar field defined in the schema
    index_type="BITMAP",
    index_name="bitmap_user_status",  # must be unique; used for later management
)
index_params.add_index(field_name="user_type", index_type="BITMAP", index_name="bitmap_user_type")
# The vector field needs an index too, otherwise the collection cannot be loaded for query/search.
index_params.add_index(field_name="vector", index_type="AUTOINDEX", metric_type="COSINE", index_name="vector_idx")
client.create_index(collection_name=collection_name, index_params=index_params)
client.load_collection(collection_name=collection_name)  # query and search require a loaded collection
print("✅ Bitmap indexes created")

# 7. Query examples.
print("\n🔍 Queries using the bitmap index:")

# 7.1 Single condition: active users (user_status == 1).
results = client.query(
    collection_name=collection_name,
    filter="user_status == 1",  # the bitmap index is used automatically
    output_fields=["user_id", "user_status", "user_type"],
    limit=5,
)
print(" Active users:")
for result in results:
    print(f" UserID: {result['user_id']}, Status: {result['user_status']}, Type: {result['user_type']}")

# 7.2 Combined conditions: VIP users who are active (bitmaps are ANDed).
results = client.query(
    collection_name=collection_name,
    filter="user_type == 'vip' and user_status == 1",
    output_fields=["user_id", "user_status", "user_type"],
    limit=5,
)
print("\n📊 Active VIP users:")
for result in results:
    print(f" UserID: {result['user_id']}, Status: {result['user_status']}, Type: {result['user_type']}")

# 7.3 Vector search + scalar filter: nearest vectors among suspended users.
query_vector = [[random.random() for _ in range(128)]]
search_results = client.search(
    collection_name=collection_name,
    data=query_vector,
    filter="user_status == 2",  # bitmap index filters suspended users
    limit=3,                    # top 3 hits
    output_fields=["user_id", "user_status", "user_type"],
)
print("\n🎯 Vector search + bitmap filter:")
for hit in search_results[0]:
    print(f" UserID: {hit['entity']['user_id']}, Status: {hit['entity']['user_status']}, Similarity: {hit['distance']:.4f}")

# 8. Clean up. Releasing first keeps drop_index working on versions that require it.
client.release_collection(collection_name=collection_name)
client.drop_index(collection_name=collection_name, index_name="bitmap_user_status")
client.drop_index(collection_name=collection_name, index_name="bitmap_user_type")
client.drop_collection(collection_name=collection_name)
print("\n🧹 Cleanup finished")

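1.4 Key Limits and Recommendations

1.4.1 Supported Data Types

Data type	Support	When to use
BOOL	✅ Supported	Binary fields (e.g. "is public", "is paid")
INT8/16/32/64	✅ Supported	Low-cardinality integers (e.g. "user level", "order status code")
VARCHAR	✅ Supported	Low-cardinality strings (e.g. "user type", "product category")
ARRAY	✅ Supported	Elements must be one of the supported types above and overall cardinality must be low (e.g. a "user tags" array drawn from only 3-5 tags)
FLOAT/DOUBLE	❌ Not supported	Floating-point fields are usually high-cardinality (e.g. "price", "rating"), so a bitmap index brings no benefit
JSON	❌ Not supported	The structure is complex, and the value distribution is hard to represent as bitmaps

1.4.2 Choosing by Cardinality

Best case: field cardinality below 500, where index storage is small and queries are fastest.
Threshold: once cardinality exceeds 1000, the space efficiency and query performance of a bitmap index drop noticeably.
Do not use: high-cardinality fields (such as "phone number" or "national ID number"). The index grows very large and ends up slowing queries down.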
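Before choosing BITMAP, it helps to measure the cardinality of a sample of the field. The helper below is a hypothetical sketch (the function name and return labels are invented for illustration); the 500 and 1000 thresholds are simply the guideline numbers from this section, not limits enforced by Milvus.

```python
def bitmap_suitability(values, optimal_below=500, degrade_above=1000):
    """Classify a field sample by its number of distinct values.

    Thresholds follow the guideline in this section and are not enforced by Milvus.
    """
    cardinality = len(set(values))
    if cardinality < optimal_below:
        verdict = "optimal"
    elif cardinality <= degrade_above:
        verdict = "borderline"
    else:
        verdict = "avoid"
    return cardinality, verdict


statuses = [0, 1, 2] * 100               # 3 distinct values
phone_numbers = list(range(5000))        # every value distinct

print(bitmap_suitability(statuses))       # (3, 'optimal')
print(bitmap_suitability(phone_numbers))  # (5000, 'avoid')
```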

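2. Inverted Index (INVERTED)

2.1 Core Concepts and Use Cases

The inverted index is the most general-purpose scalar index in Milvus. It maps each value to the IDs of the records that contain it, which answers the question "which records contain this value?" quickly.

Use cases:

Exact-value filtering (e.g. "category = electronics");
Text field lookups (e.g. keyword matching on a VARCHAR field);
JSON field queries (e.g. "brand in the JSON = brand_5");
Numeric range queries (e.g. "price ≥ 100 and ≤ 200");
Array containment queries (e.g. "tags contain 'sale'").

Performance: compared with scanning the whole collection, which can take seconds on a large dataset, an inverted-index lookup can respond in milliseconds. The gap widens as the dataset grows.

2.2 How It Works

The core structure of an inverted index is the posting table, which stores, for each field value, the list of all record IDs that hold it:

Build phase: scan every record in the collection and, for each value of the scalar field, collect the IDs of the records that contain it, producing a value → ID list mapping.
Query phase:

Single condition: look up the ID list for the field value and return the associated records;
Multiple conditions: intersect (AND) or union (OR) the ID lists of the individual conditions to narrow down the target records.

Example: posting table for a product category field

Field value	Record ID list
electronics	[1,3,7,9]
books	[2,5,8,11]
clothing	[4,6,10,12]

For the query "electronics and price > 500", first fetch the ID list for electronics, [1,3,7,9], then fetch the ID list for price > 500, and return the intersection.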
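The build and query phases above can be sketched with a dictionary of posting lists. This is an illustrative toy, not the Milvus implementation. The record IDs per category come from the table above; the prices (ID × 100) are invented so that the intersection has a visible result.

```python
from collections import defaultdict


def build_inverted(records, field):
    """Build {value: [record IDs]} for one field (the posting table)."""
    postings = defaultdict(list)
    for rid in sorted(records):
        postings[records[rid][field]].append(rid)
    return postings


# IDs per category as in the table; prices are hypothetical (ID * 100).
by_category = {
    "electronics": [1, 3, 7, 9],
    "books": [2, 5, 8, 11],
    "clothing": [4, 6, 10, 12],
}
records = {
    rid: {"category": cat, "price": rid * 100.0}
    for cat, ids in by_category.items()
    for rid in ids
}

postings = build_inverted(records, "category")
expensive = {rid for rid, rec in records.items() if rec["price"] > 500}

# AND = set intersection of the two ID lists.
result = sorted(set(postings["electronics"]) & expensive)
print(postings["electronics"])  # [1, 3, 7, 9]
print(result)                   # [7, 9]
```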

2.3 Complete Example, Explained Line by Line

import random
import time
from pymilvus import MilvusClient, DataType, CollectionSchema, FieldSchema

# 1. Connect to Milvus.
client = MilvusClient(uri="http://localhost:19530")

# 2. Define the collection: several field types that suit an inverted index.
collection_name = "products_inverted"
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),  # primary key
    FieldSchema(name="product_id", dtype=DataType.INT64),  # product ID
    FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=50),  # category (string)
    FieldSchema(name="price", dtype=DataType.DOUBLE),  # price (numeric, supports range queries)
    # Tag array for containment queries; VARCHAR elements need max_length.
    FieldSchema(name="tags", dtype=DataType.ARRAY, element_type=DataType.VARCHAR, max_capacity=10, max_length=50),
    FieldSchema(name="metadata", dtype=DataType.JSON),  # JSON field (supports path queries)
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128),  # vector field
]

# 3. Create the collection.
schema = CollectionSchema(fields=fields, description="Products with inverted index")
client.create_collection(collection_name=collection_name, schema=schema)
print("✅ Collection created")

# 4. Generate test data that resembles real product data.
categories = ["electronics", "books", "clothing", "home", "sports"]
tags_pool = ["new", "sale", "popular", "featured", "limited"]
data = []
for i in range(1500):
    data.append({
        "product_id": 5000 + i,  # product IDs start at 5000
        "category": random.choice(categories),
        "price": round(random.uniform(10.0, 1000.0), 2),  # 10-1000, two decimals
        "tags": random.sample(tags_pool, random.randint(1, 3)),  # 1-3 random tags
        "metadata": {
            "brand": f"brand_{random.randint(1, 20)}",  # one of 20 brands
            "rating": round(random.uniform(3.0, 5.0), 1),  # rating 3.0-5.0
            "in_stock": random.choice([True, False]),
        },
        "vector": [random.random() for _ in range(128)],
    })

# 5. Insert in batches to avoid timeouts.
batch_size = 300
for i in range(0, len(data), batch_size):
    client.insert(collection_name=collection_name, data=data[i:i + batch_size])
print("✅ Test data inserted")

# 6. Create inverted indexes for the different field types.
index_params = client.prepare_index_params()
# 6.1 category: exact string match.
index_params.add_index(field_name="category", index_type="INVERTED", index_name="inverted_category")
# 6.2 price: numeric range queries.
index_params.add_index(field_name="price", index_type="INVERTED", index_name="inverted_price")
# 6.3 tags: array containment queries.
index_params.add_index(field_name="tags", index_type="INVERTED", index_name="inverted_tags")
# 6.4 A specific path inside the JSON field.
index_params.add_index(
    field_name="metadata",
    index_type="INVERTED",
    index_name="inverted_metadata_brand",
    params={
        "json_path": "metadata[\"brand\"]",  # key path inside the JSON (brand)
        "json_cast_type": "varchar",         # index the value as varchar
    },
)
# The vector field needs an index too, otherwise the collection cannot be loaded.
index_params.add_index(field_name="vector", index_type="AUTOINDEX", metric_type="COSINE", index_name="vector_idx")
# Index building runs in the background and may take a few seconds on large datasets.
client.create_index(collection_name=collection_name, index_params=index_params)
client.load_collection(collection_name=collection_name)  # queries require a loaded collection
print("✅ Inverted indexes created")

# 7. Query examples covering the main use cases.
print("\n🔍 Inverted index queries:")

# 7.1 Exact match (category == electronics).
print("1. Exact category match:")
results = client.query(
    collection_name=collection_name,
    filter="category == 'electronics'",
    output_fields=["product_id", "category", "price"],
    limit=5,
)
for result in results:
    print(f" Product ID: {result['product_id']}, Category: {result['category']}, Price: {result['price']}")

# 7.2 Numeric range (price between 100 and 200).
print("\n2. Price range:")
results = client.query(
    collection_name=collection_name,
    filter="price >= 100.0 and price <= 200.0",
    output_fields=["product_id", "category", "price"],
    limit=5,
)
for result in results:
    print(f" Product ID: {result['product_id']}, Category: {result['category']}, Price: {result['price']}")

# 7.3 Array containment (tags contain 'sale').
print("\n3. Tag containment:")
results = client.query(
    collection_name=collection_name,
    filter="array_contains(tags, 'sale')",
    output_fields=["product_id", "tags", "price"],
    limit=5,
)
for result in results:
    print(f" Product ID: {result['product_id']}, Tags: {result['tags']}, Price: {result['price']}")

# 7.4 JSON path (brand == brand_5).
print("\n4. JSON brand query:")
results = client.query(
    collection_name=collection_name,
    filter="metadata['brand'] == 'brand_5'",
    output_fields=["product_id", "metadata"],
    limit=5,
)
for result in results:
    print(f" Product ID: {result['product_id']}, Brand: {result['metadata']['brand']}")

# 7.5 Combined query (electronics + price > 500 + tags contain 'new').
print("\n5. Combined query:")
results = client.query(
    collection_name=collection_name,
    filter="category == 'electronics' and price > 500.0 and array_contains(tags, 'new')",
    output_fields=["product_id", "category", "price", "tags"],
    limit=5,
)
for result in results:
    print(f" Product ID: {result['product_id']}, Category: {result['category']}, Price: {result['price']}, Tags: {result['tags']}")

# 8. Rough timing comparison. A single run on 1500 rows is only indicative.
print("\n⏱️ Timing comparison:")
# 8.1 Unindexed path: metadata['rating'] has no index.
start_time = time.perf_counter()
results = client.query(
    collection_name=collection_name,
    filter="metadata['rating'] > 4.5",
    output_fields=["product_id"],
    limit=10,
)
no_index_time = time.perf_counter() - start_time
# 8.2 Indexed field: category has an inverted index.
start_time = time.perf_counter()
results = client.query(
    collection_name=collection_name,
    filter="category == 'electronics'",
    output_fields=["product_id"],
    limit=10,
)
with_index_time = time.perf_counter() - start_time
print(f" Without index: {no_index_time:.4f}s")
print(f" With index: {with_index_time:.4f}s")
print(f" Speedup: {no_index_time / with_index_time:.1f}x")

# 9. Clean up. Releasing first keeps drop_index working on versions that require it.
client.release_collection(collection_name=collection_name)
client.drop_index(collection_name=collection_name, index_name="inverted_category")
client.drop_index(collection_name=collection_name, index_name="inverted_price")
client.drop_index(collection_name=collection_name, index_name="inverted_tags")
client.drop_index(collection_name=collection_name, index_name="inverted_metadata_brand")
client.drop_collection(collection_name=collection_name)
print("\n🧹 Cleanup finished")

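2.4 Best Practices and Caveats

2.4.1 When to Create the Index

Timing	Recommendation	Use case	Benefit
After loading data	✅ Strongly recommended	Bulk import of historical data	The index is built more efficiently and avoids the overhead of building while inserting
On an empty collection	⚠️ Temporary scenarios only	Real-time incremental inserts of small amounts of data	No need to wait for the import to finish; the index is built as data arrives

2.4.2 Naming and Maintenance

Index naming: idx_<field>_<index type> (e.g. idx_category_inverted), which makes indexes easy to identify and manage later.
Multiple fields: create a separate index for each frequently queried field rather than a composite index; Milvus combines conditions on several indexed fields automatically.
High-cardinality fields: INVERTED is a good choice for high-cardinality fields (such as "product ID" or "phone number"), because its structure does not degrade as the number of distinct values grows.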

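3. NGRAM Index (Milvus v2.6.2+)

3.1 Core Concepts and Use Cases

The NGRAM index is designed for fuzzy text queries. It splits text into fixed-length substrings (n-grams) and supports prefix, suffix, infix, and wildcard matching.

Problem it solves: a conventional inverted index is built for exact matching and cannot handle LIKE queries efficiently. The NGRAM index fills that gap.

Supported query types:

Query type	Example	Typical use
Prefix match	title LIKE "vector%"	Find document titles that start with "vector"
Suffix match	name LIKE "%learning"	Find keywords that end with "learning"
Infix match	content LIKE "%neural%"	Find articles whose content contains "neural"
Wildcard match	text LIKE "%ar%int%"	Find text that contains "ar" followed later by "int" (e.g. "artificial intelligence")

3.2 How It Works

3.2.1 N-gram Decomposition

An n-gram decomposition splits text into every run of N consecutive characters, including spaces and other special characters:

Parameters: min_gram (shortest substring length) and max_gram (longest substring length) control the granularity.
Example: input text "AI database", min_gram=2, max_gram=3:
- 2-grams: "AI", "I ", " d", "da", "at", "ta", "ab", "ba", "as", "se"
- 3-grams: "AI ", "I d", " da", "dat", "ata", "tab", "aba", "bas", "ase"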
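The decomposition is easy to reproduce with string slicing. The function below is a small sketch for checking what a given min_gram/max_gram setting produces; the function name is made up for this example, and it only mirrors the character-level splitting described here.

```python
def ngrams(text, min_gram, max_gram):
    """Return {n: [all substrings of length n]} for n in [min_gram, max_gram]."""
    return {
        n: [text[i:i + n] for i in range(len(text) - n + 1)]
        for n in range(min_gram, max_gram + 1)
    }


grams = ngrams("AI database", min_gram=2, max_gram=3)
print(grams[2])
# ['AI', 'I ', ' d', 'da', 'at', 'ta', 'ab', 'ba', 'as', 'se']
print(grams[3])
# ['AI ', 'I d', ' da', 'dat', 'ata', 'tab', 'aba', 'bas', 'ase']
```

A string of length L yields L - n + 1 grams of length n, so widening the min_gram..max_gram range increases index size roughly linearly with the number of gram lengths.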

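3.2.2 Fuzzy Query Matching Flow

Index build: the VARCHAR field or JSON text field is decomposed into n-grams, and a "substring → record IDs" mapping is built.
Query:
- The literal part of the pattern (e.g. "vector" in "%vector%") is decomposed into n-grams in the same way;
- The record ID lists for those n-grams are looked up and intersected to obtain candidate records;
- Candidates are checked against the original LIKE pattern, and only the records that truly match are returned.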
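The flow above can be sketched as a two-step lookup: an n-gram posting table narrows the candidates, then each candidate is tested against the full pattern. This is a simplified illustration under several assumptions: a single gram length n=2, only the % wildcard (the single-character _ wildcard is ignored), and a final verification step against the original pattern. The sample documents are invented.

```python
import re
from collections import defaultdict


def grams_of(text, n):
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs, n=2):
    """Map each n-gram to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for g in grams_of(text, n):
            index[g].add(doc_id)
    return index


def like_to_regex(pattern):
    """Translate a LIKE pattern with % wildcards into an anchored regex."""
    parts = [re.escape(p) for p in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$", re.S)


def like_query(docs, index, pattern, n=2):
    # Step 1: intersect posting lists of every n-gram in the literal parts.
    candidates = set(docs)
    for literal in pattern.split("%"):
        for g in grams_of(literal, n):
            candidates &= index.get(g, set())
    # Step 2: verify candidates against the original pattern.
    rx = like_to_regex(pattern)
    return sorted(d for d in candidates if rx.match(docs[d]))


docs = {
    1: "vector database",
    2: "neural network",
    3: "artificial intelligence",
    4: "data vectorization",
}
index = build_index(docs)
print(like_query(docs, index, "%vector%"))  # [1, 4]
print(like_query(docs, index, "vector%"))   # [1]
print(like_query(docs, index, "%ar%int%"))  # [3]
```

The verification step matters: for "vector%", documents 1 and 4 both contain every bigram of "vector", but only document 1 starts with it.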

3.3 Complete Example, Explained Line by Line

import random
import time
from pymilvus import MilvusClient, DataType, CollectionSchema, FieldSchema

# 1. Connect to Milvus.
client = MilvusClient(uri="http://localhost:19530")

# 2. Define the collection: text fields plus a JSON field.
collection_name = "documents_ngram"
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),  # primary key
    FieldSchema(name="doc_id", dtype=DataType.INT64),  # document ID
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=200),  # title (short text)
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=1000),  # content (long text)
    FieldSchema(name="metadata", dtype=DataType.JSON),  # JSON metadata (author and other text)
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128),  # vector field (hybrid search)
]

# 3. Create the collection.
schema = CollectionSchema(fields=fields, description="Documents with NGRAM index")
client.create_collection(collection_name=collection_name, schema=schema)
print("✅ Collection created")

# 4. Generate test text that resembles technical documents.
def generate_sample_text():
    """Return (title, topic); the title embeds a technical topic."""
    topics = ["machine learning", "vector database", "artificial intelligence",
              "neural network", "deep learning", "data science"]
    prefixes = ["Introduction to", "Advanced", "Basic concepts of", "Research on"]
    suffixes = ["techniques", "applications", "fundamentals", "algorithms"]
    topic = random.choice(topics)
    # e.g. "Introduction to vector database techniques"
    title = f"{random.choice(prefixes)} {topic} {random.choice(suffixes)}"
    return title, topic

def generate_content(keyword):
    """Return a content sentence that contains the keyword."""
    contents = [
        f"This document discusses various aspects of {keyword} and its practical implementations.",
        f"A comprehensive guide to understanding {keyword} principles and methodologies.",
        f"Exploring the latest developments in {keyword} technology and future trends.",
        f"Fundamental concepts and advanced topics in {keyword} for professionals.",
        f"Case studies and real-world applications of {keyword} in modern systems.",
    ]
    return random.choice(contents)

# Generate 1000 documents. The topic is used as the keyword; taking the third
# word of the title would return "of" or "learning" for some prefixes.
data = []
for i in range(1000):
    title, keyword = generate_sample_text()
    data.append({
        "doc_id": 10000 + i,
        "title": title,
        "content": generate_content(keyword),
        "metadata": {
            "author": f"author_{random.randint(1, 50)}",  # author_1 to author_50
            "language": random.choice(["en", "zh", "es", "fr"]),
            "keywords": [keyword, "technology", "research"],
        },
        "vector": [random.random() for _ in range(128)],
    })

# 5. Insert in batches.
batch_size = 200
for i in range(0, len(data), batch_size):
    client.insert(collection_name=collection_name, data=data[i:i + batch_size])
print("✅ Test data inserted")

# 6. Create NGRAM indexes with parameters suited to each text field.
index_params = client.prepare_index_params()
# 6.1 title: short text, 2 to 4 characters.
index_params.add_index(
    field_name="title",
    index_type="NGRAM",
    index_name="ngram_title",
    min_gram=2,  # a small min_gram suits short text
    max_gram=4,  # 2-4 character substrings cover prefix and suffix matching
)
# 6.2 content: long text, 3 to 5 characters.
index_params.add_index(
    field_name="content",
    index_type="NGRAM",
    index_name="ngram_content",
    min_gram=3,  # a larger min_gram yields fewer substrings and a smaller index
    max_gram=5,
)
# 6.3 The author path inside the JSON field.
index_params.add_index(
    field_name="metadata",
    index_type="NGRAM",
    index_name="ngram_metadata_author",
    min_gram=2,
    max_gram=4,
    params={
        "json_path": "metadata[\"author\"]",  # path of the author key
        "json_cast_type": "varchar",          # treat the value as varchar
    },
)
# The vector field needs an index too, otherwise the collection cannot be loaded.
index_params.add_index(field_name="vector", index_type="AUTOINDEX", metric_type="COSINE", index_name="vector_idx")
client.create_index(collection_name=collection_name, index_params=index_params)
client.load_collection(collection_name=collection_name)  # queries require a loaded collection
print("✅ NGRAM indexes created")

# 7. LIKE query examples.
print("\n🔍 LIKE queries using the NGRAM index:")

# 7.1 Prefix match. Every generated title starts with one of the prefixes,
# so 'Advanced%' is used here; 'vector%' would return nothing on this data.
print("1. Prefix match - titles starting with 'Advanced':")
results = client.query(
    collection_name=collection_name,
    filter="title LIKE 'Advanced%'",  # % at the end
    output_fields=["doc_id", "title"],
    limit=5,
)
for result in results:
    print(f" Doc ID: {result['doc_id']}, Title: {result['title']}")

# 7.2 Suffix match. Every generated title ends with one of the suffixes.
print("\n2. Suffix match - titles ending with 'algorithms':")
results = client.query(
    collection_name=collection_name,
    filter="title LIKE '%algorithms'",  # % at the start
    output_fields=["doc_id", "title"],
    limit=5,
)
for result in results:
    print(f" Doc ID: {result['doc_id']}, Title: {result['title']}")

# 7.3 Infix match (title contains 'neural').
print("\n3. Infix match - titles containing 'neural':")
results = client.query(
    collection_name=collection_name,
    filter="title LIKE '%neural%'",  # % on both sides
    output_fields=["doc_id", "title"],
    limit=5,
)
for result in results:
    print(f" Doc ID: {result['doc_id']}, Title: {result['title']}")

# 7.4 Content field (content contains 'database').
print("\n4. Content match - content containing 'database':")
results = client.query(
    collection_name=collection_name,
    filter="content LIKE '%database%'",
    output_fields=["doc_id", "content"],
    limit=3,
)
for result in results:
    content = result["content"]
    preview = content[:80] + "..." if len(content) > 80 else content  # first 80 characters
    print(f" Doc ID: {result['doc_id']}, Preview: {preview}")

# 7.5 Multiple wildcards (content contains 'ar' followed later by 'int').
print("\n5. Wildcard match - content containing 'ar' followed by 'int':")
results = client.query(
    collection_name=collection_name,
    filter="content LIKE '%ar%int%'",
    output_fields=["doc_id", "content"],
    limit=3,
)
for result in results:
    content = result["content"]
    preview = content[:80] + "..." if len(content) > 80 else content
    print(f" Doc ID: {result['doc_id']}, Preview: {preview}")

# 7.6 JSON field (author contains 'thor_').
print("\n6. JSON match - authors containing 'thor_':")
results = client.query(
    collection_name=collection_name,
    filter="metadata['author'] LIKE '%thor_%'",
    output_fields=["doc_id", "metadata"],
    limit=5,
)
for result in results:
    print(f" Doc ID: {result['doc_id']}, Author: {result['metadata']['author']}")

# 7.7 Vector search + NGRAM filter (nearest vectors among titles containing 'machine').
print("\n🎯 Vector search + NGRAM filter:")
query_vector = [[random.random() for _ in range(128)]]
search_results = client.search(
    collection_name=collection_name,
    data=query_vector,
    filter="title LIKE '%machine%'",
    limit=3,
    output_fields=["doc_id", "title"],
)
print("🔍 Nearest vectors among documents containing 'machine':")
for hit in search_results[0]:
    print(f" Doc ID: {hit['entity']['doc_id']}, Title: {hit['entity']['title']}, Similarity: {hit['distance']:.4f}")

# 8. Rough timing comparison. A single run on 1000 rows is only indicative.
print("\n⏱️ NGRAM timing comparison:")
# 8.1 No NGRAM index: metadata['language'] is not indexed.
start_time = time.perf_counter()
results = client.query(
    collection_name=collection_name,
    filter="metadata['language'] LIKE '%en%'",
    output_fields=["doc_id"],
    limit=10,
)
no_ngram_time = time.perf_counter() - start_time
# 8.2 With NGRAM index: title is indexed.
start_time = time.perf_counter()
results = client.query(
    collection_name=collection_name,
    filter="title LIKE '%learning%'",
    output_fields=["doc_id"],
    limit=10,
)
with_ngram_time = time.perf_counter() - start_time
print(f" Without NGRAM index: {no_ngram_time:.4f}s")
print(f" With NGRAM index: {with_ngram_time:.4f}s")
print(f" Speedup: {no_ngram_time / with_ngram_time:.1f}x")

# 9. Clean up. Releasing first keeps drop_index working on versions that require it.
client.release_collection(collection_name=collection_name)
client.drop_index(collection_name=collection_name, index_name="ngram_title")
client.drop_index(collection_name=collection_name, index_name="ngram_content")
client.drop_index(collection_name=collection_name, index_name="ngram_metadata_author")
client.drop_collection(collection_name=collection_name)
print("\n🧹 Cleanup finished")

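3.4 Key Settings and Caveats

3.4.1 Character Handling (Common Pitfalls)

Property	Description	Practical effect
Language-agnostic	Splitting is character-based and needs no language dictionary	Works for Chinese, English, Spanish, and other languages. Chinese text is split by character rather than by word (e.g. with n=2, "机器学习" becomes "机器", "器学", "学习")
Spaces are kept	A space is an ordinary character in the decomposition	When searching for "vector database", n-grams such as "r " and " d" span the space, so the match is more precise
Case-sensitive	Original case is preserved and queries must match it exactly	If the indexed field holds "Vector", the query "vector%" does not match. Normalize case consistently (lower-casing on ingest is recommended)

3.4.2 Choosing min_gram and max_gram

Scenario	Recommended setting	Reason
General text queries	min_gram=2, max_gram=3	Balances match precision against index size and suits most Chinese and English text
Short text (keywords, titles)	min_gram=1, max_gram=2	Short text produces few substrings, so fine-grained splitting improves match coverage
Long text (document bodies, descriptions)	min_gram=3, max_gram=5	Fewer substrings overall, lower index storage, and fewer spurious candidates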

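4. RTREE Index (Milvus 2.6.4+)

4.1 Core Concepts and Use Cases

The RTREE index is designed for geometric (spatial) data. It supports queries on spatial data in Well-Known Text (WKT) format and addresses the problem of evaluating spatial relationships.

Supported geometry types:

Geometry type	WKT example	Typical use
Point (POINT)	POINT (30 10)	Geographic coordinates (e.g. shop location, user position)
Line (LINESTRING)	LINESTRING (30 10, 10 30, 40 40)	Roads, rivers, route tracks
Polygon (POLYGON)	POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))	Administrative regions, business districts, geofences

Core spatial queries: containment, intersection, and distance range, all of which can be combined with attribute conditions.

4.2 How It Works

4.2.1 Core Idea: Minimum Bounding Rectangle (MBR)

The RTREE index builds a minimum bounding rectangle for each geometry (the smallest axis-aligned rectangle that fully encloses it). Spatial queries are first reduced to rectangle-versus-rectangle checks, which greatly lowers the computation cost:

MBR of a point: the point itself (a degenerate rectangle);
MBR of a line: the smallest rectangle that contains all of its vertices;
MBR of a polygon: the smallest rectangle that contains the whole polygon.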
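Computing an MBR only needs the minimum and maximum of the x and y coordinates. The sketch below applies this to the three WKT examples from section 4.1. The regex-based coordinate extraction is a shortcut that works for these simple strings; it is not a general WKT parser, and the function names are invented for this example.

```python
import re

COORD = re.compile(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)")


def wkt_coords(wkt):
    """Extract (x, y) pairs from a simple 2D WKT string."""
    return [(float(x), float(y)) for x, y in COORD.findall(wkt)]


def mbr(wkt):
    """Return the minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    coords = wkt_coords(wkt)
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


print(mbr("POINT (30 10)"))                            # (30.0, 10.0, 30.0, 10.0)
print(mbr("LINESTRING (30 10, 10 30, 40 40)"))         # (10.0, 10.0, 40.0, 40.0)
print(mbr("POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))"))  # (0.0, 0.0, 10.0, 10.0)
```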

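4.2.2 Query Flow

Index build: MBRs are grouped recursively into a tree (the root node covers the MBRs of all its children, and leaf nodes hold the MBRs of the original geometries).
Query:

Take the query geometry (e.g. a polygon or line) and build its MBR;
Traverse the RTREE and quickly select the candidate nodes whose MBR overlaps the query MBR;
Run the exact spatial predicate (e.g. ST_CONTAINS, ST_INTERSECTS) on the original geometries of those candidates;
Return the results that satisfy the predicate.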
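The filter-then-refine pattern in the query flow can be shown without a real tree. In the sketch below a flat scan over MBRs stands in for the RTREE traversal, and a ray-casting point-in-polygon test stands in for the exact predicate. The triangle and the three points are invented test data.

```python
def mbr(coords):
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def overlaps(a, b):
    """True if two (min_x, min_y, max_x, max_y) rectangles overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def point_in_polygon(pt, ring):
    """Ray casting over a closed ring (first vertex repeated at the end)."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Query geometry: a triangle whose MBR is the square (0, 0)-(10, 10).
triangle = [(0, 0), (10, 0), (0, 10), (0, 0)]
points = {"A": (2, 2), "B": (9, 9), "C": (20, 20)}

query_box = mbr(triangle)
# Filter: cheap rectangle test (a flat scan stands in for the tree traversal).
candidates = sorted(k for k, p in points.items() if overlaps(mbr([p]), query_box))
# Refine: exact geometry test on the candidates only.
hits = [k for k in candidates if point_in_polygon(points[k], triangle)]

print(candidates)  # ['A', 'B']
print(hits)        # ['A']
```

Point B passes the rectangle filter but fails the exact test, which is why the refine step is needed; point C is discarded by the cheap filter and never reaches the expensive test.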

4.3 Complete Example, Explained Line by Line

import random
import time
from pymilvus import MilvusClient, DataType, CollectionSchema, FieldSchema

# 1. Connect to Milvus.
client = MilvusClient(uri="http://localhost:19530")

# 2. Define the collection, including a GEOMETRY field.
collection_name = "spatial_data_rtree"
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),  # primary key
    FieldSchema(name="location_id", dtype=DataType.INT64),  # location ID
    FieldSchema(name="geo", dtype=DataType.GEOMETRY),  # geometry field (WKT input)
    FieldSchema(name="name", dtype=DataType.VARCHAR, max_length=100),  # location name
    FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=50),  # location category
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128),  # vector field (hybrid search)
]

# 3. Create the collection.
schema = CollectionSchema(fields=fields, description="Spatial data with RTREE index")
client.create_collection(collection_name=collection_name, schema=schema)
print("✅ Collection created")

# 4. Generate spatial test data: points, lines, and polygons.
def generate_point():
    """Random point as WKT (longitude -180..180, latitude -90..90)."""
    x = round(random.uniform(-180, 180), 6)  # longitude
    y = round(random.uniform(-90, 90), 6)    # latitude
    return f"POINT ({x} {y})"

def generate_line_string():
    """Random two-point line as WKT."""
    points = []
    for _ in range(2):
        x = round(random.uniform(-180, 180), 6)
        y = round(random.uniform(-90, 90), 6)
        points.append(f"{x} {y}")
    return f"LINESTRING ({', '.join(points)})"

def generate_polygon():
    """Random closed rectangle as WKT."""
    center_x = random.uniform(-170, 170)  # center longitude
    center_y = random.uniform(-80, 80)    # center latitude
    width = random.uniform(1, 10)         # extent along longitude
    height = random.uniform(1, 10)        # extent along latitude
    # Four corners, then the first corner again to close the ring.
    coords = [
        f"{center_x - width/2} {center_y - height/2}",  # bottom left
        f"{center_x + width/2} {center_y - height/2}",  # bottom right
        f"{center_x + width/2} {center_y + height/2}",  # top right
        f"{center_x - width/2} {center_y + height/2}",  # top left
        f"{center_x - width/2} {center_y - height/2}",  # closing point
    ]
    return f"POLYGON (({', '.join(coords)}))"

categories = ["landmark", "building", "park", "water", "road", "forest"]
location_names = ["Central", "North", "South", "East", "West", "Main", "City"]
name_suffixes = ["St", "Ave", "Blvd", "Park"]

# Generate 800 records with a random geometry type each.
generators = {"point": generate_point, "line": generate_line_string, "polygon": generate_polygon}
data = []
for i in range(800):
    geo_type = random.choice(["point", "line", "polygon"])
    data.append({
        "location_id": 20000 + i,
        "geo": generators[geo_type](),
        "name": f"{random.choice(location_names)} {random.choice(name_suffixes)}",  # e.g. "Central Park"
        "category": random.choice(categories),
        "vector": [random.random() for _ in range(128)],
    })

# 5. Insert in batches.
batch_size = 200
for i in range(0, len(data), batch_size):
    client.insert(collection_name=collection_name, data=data[i:i + batch_size])
print("✅ Test data inserted")

# 6. Create the RTREE index on the geometry field.
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="geo",
    index_type="RTREE",
    index_name="rtree_geo",
    params={},  # no extra parameters; the defaults are used
)
# The vector field needs an index too, otherwise the collection cannot be loaded.
index_params.add_index(field_name="vector", index_type="AUTOINDEX", metric_type="COSINE", index_name="vector_idx")
client.create_index(collection_name=collection_name, index_params=index_params)
client.load_collection(collection_name=collection_name)  # queries require a loaded collection
print("✅ RTREE index created")

# 7. Spatial query examples. The geometry field is written as the first argument;
# ST_WITHIN(geo, polygon) is equivalent to ST_CONTAINS(polygon, geo).
print("\n🔍 Spatial queries using the RTREE index:")

# 7.1 Objects inside a polygon.
print("1. Within query - geometries inside the given polygon:")
test_polygon = "POLYGON ((-10 -10, 10 -10, 10 10, -10 10, -10 -10))"  # -10..10 on both axes
results = client.query(
    collection_name=collection_name,
    filter=f"ST_WITHIN(geo, '{test_polygon}')",
    output_fields=["location_id", "name", "category"],
    limit=5,
)
for result in results:
    print(f" Location ID: {result['location_id']}, Name: {result['name']}, Category: {result['category']}")

# 7.2 Objects that intersect a line.
print("\n2. Intersects query - geometries crossing the given line:")
test_line = "LINESTRING (-50 -50, 50 50)"  # from (-50,-50) to (50,50)
results = client.query(
    collection_name=collection_name,
    filter=f"ST_INTERSECTS(geo, '{test_line}')",
    output_fields=["location_id", "name", "category"],
    limit=5,
)
for result in results:
    print(f" Location ID: {result['location_id']}, Name: {result['name']}, Category: {result['category']}")

# 7.3 Objects within a distance of 100 from a point.
print("\n3. Distance query - geometries within distance 100 of the given point:")
test_point = "POINT (0 0)"  # the origin
results = client.query(
    collection_name=collection_name,
    filter=f"ST_DWITHIN(geo, '{test_point}', 100)",  # ST_DWITHIN(object1, object2, distance)
    output_fields=["location_id", "name", "category"],
    limit=5,
)
for result in results:
    print(f" Location ID: {result['location_id']}, Name: {result['name']}, Category: {result['category']}")

# 7.4 Objects that contain a point (e.g. a polygon around it).
print("\n4. Contains query - geometries containing the given point:")
small_point = "POINT (5 5)"
results = client.query(
    collection_name=collection_name,
    filter=f"ST_CONTAINS(geo, '{small_point}')",
    output_fields=["location_id", "name", "category"],
    limit=5,
)
for result in results:
    print(f" Location ID: {result['location_id']}, Name: {result['name']}, Category: {result['category']}")

# 7.5 Spatial + attribute condition (parks that contain the point).
print("\n5. Combined query - parks containing the given point:")
results = client.query(
    collection_name=collection_name,
    filter=f"category == 'park' and ST_CONTAINS(geo, '{small_point}')",
    output_fields=["location_id", "name", "category"],
    limit=5,
)
for result in results:
    print(f" Location ID: {result['location_id']}, Name: {result['name']}, Category: {result['category']}")

# 7.6 Vector search restricted to a geographic area.
print("\n🎯 Vector search + spatial filter:")
query_vector = [[random.random() for _ in range(128)]]
search_area = "POLYGON ((-20 -20, 20 -20, 20 20, -20 20, -20 -20))"  # -20..20 on both axes
search_results = client.search(
    collection_name=collection_name,
    data=query_vector,
    filter=f"ST_WITHIN(geo, '{search_area}')",  # search only inside the area
    limit=3,
    output_fields=["location_id", "name", "category"],
)
print(f"🔍 Nearest vectors inside {search_area}:")
for hit in search_results[0]:
    print(f" Location ID: {hit['entity']['location_id']}, Name: {hit['entity']['name']}, Similarity: {hit['distance']:.4f}")

# 8. Timing. The unindexed case (a full scan) is not measured here.
print("\n⏱️ RTREE timing:")
# 8.1 Simple query (within).
start_time = time.perf_counter()
results = client.query(
    collection_name=collection_name,
    filter=f"ST_WITHIN(geo, '{test_polygon}')",
    output_fields=["location_id"],
    limit=10,
)
with_rtree_time = time.perf_counter() - start_time
# 8.2 Complex query (intersects + attribute condition).
start_time = time.perf_counter()
results = client.query(
    collection_name=collection_name,
    filter=f"ST_INTERSECTS(geo, '{test_line}') and category == 'road'",
    output_fields=["location_id"],
    limit=10,
)
complex_rtree_time = time.perf_counter() - start_time
print(f" Simple RTREE query: {with_rtree_time:.4f}s")
print(f" Complex RTREE query: {complex_rtree_time:.4f}s")
print(" Note: without an RTREE index, spatial queries fall back to a full scan")

# 9. Spatial analysis: count objects per category inside the test polygon.
print("\n📊 Spatial analysis:")
categories_stats = {}
for category in categories:
    results = client.query(
        collection_name=collection_name,
        filter=f"category == '{category}' and ST_WITHIN(geo, '{test_polygon}')",
        output_fields=["location_id"],
        limit=1000,  # return at most 1000 rows
    )
    categories_stats[category] = len(results)
print(" Objects per category inside the test polygon:")
for category, count in categories_stats.items():
    print(f" {category}: {count}")

# 10. Clean up. Releasing first keeps drop_index working on versions that require it.
client.release_collection(collection_name=collection_name)
client.drop_index(collection_name=collection_name, index_name="rtree_geo")
client.drop_collection(collection_name=collection_name)
print("\n🧹 Cleanup finished")

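4.4 Key Caveats

4.4.1 Geometry Data Format

Geometry data must be supplied in WKT (Well-Known Text) format; other formats such as GeoJSON are not supported.
Polygons must be closed: the last vertex must equal the first (e.g. POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))).
Coordinate order: WKT uses "x y", that is, longitude first and latitude second. Many GPS tools and map APIs report latitude first, so check the order before building the WKT string.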
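The three format rules can be enforced when the WKT string is built. The helpers below are a hypothetical sketch (the names point_wkt and polygon_wkt are invented): keyword-only arguments make the longitude-first order explicit, and the polygon helper closes the ring if the caller forgot to.

```python
def point_wkt(*, lon, lat):
    """Build a POINT in WKT order: longitude (x) first, latitude (y) second."""
    return f"POINT ({lon} {lat})"


def polygon_wkt(vertices):
    """Build a single-ring POLYGON, closing the ring if needed."""
    ring = list(vertices)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # the last vertex must repeat the first
    if len(ring) < 4:
        raise ValueError("a polygon ring needs at least 3 distinct vertices")
    body = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON (({body}))"


print(point_wkt(lon=30, lat=10))
# POINT (30 10)
print(polygon_wkt([(0, 0), (10, 0), (10, 10), (0, 10)]))
# POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))
```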

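4.4.2 Spatial Functions

Core spatial functions supported by Milvus (they perform best with an RTREE index; without one the query falls back to a full scan):

ST_CONTAINS(A, B): true if A contains B (A is the container, B is the contained object);
ST_WITHIN(A, B): true if A lies inside B (the inverse of ST_CONTAINS);
ST_INTERSECTS(A, B): true if A and B intersect;
ST_DWITHIN(A, B, distance): true if the distance between A and B is less than or equal to the given value;
ST_EQUALS(A, B): true if A and B are exactly equal.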

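5. Index Management and Selection Guide

5.1 Dropping an Index

5.1.1 Generic Drop Code

# Generic way to drop an index
client.drop_index(
    collection_name="my_collection",  # the collection must already exist
    index_name="index_name",          # must exactly match the name used at creation
)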

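5.1.2 Version Compatibility

Milvus version	Requirement	Steps
v2.6.3 and earlier	Release the collection first	1. Run client.release_collection(collection_name="my_collection"); 2. Drop the index; 3. To keep using the collection, run client.load_collection(collection_name="my_collection")
v2.6.4 and later	Drop directly	No release needed; call drop_index directly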
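The two procedures in the table can be wrapped in one helper. This is a sketch under assumptions: the helper name drop_index_compat and its needs_release flag are invented, and the caller decides from the server version whether a release is needed. A small recording stub stands in for MilvusClient so the call order can be shown without a running server; with a real client, pass the MilvusClient instance instead.

```python
class RecordingClient:
    """Stand-in for a Milvus client; it only records which methods were called."""

    def __init__(self):
        self.calls = []

    def release_collection(self, collection_name):
        self.calls.append("release_collection")

    def drop_index(self, collection_name, index_name):
        self.calls.append("drop_index")

    def load_collection(self, collection_name):
        self.calls.append("load_collection")


def drop_index_compat(client, collection_name, index_name, needs_release):
    """Drop an index, releasing and reloading the collection when required.

    needs_release should be True for servers at v2.6.3 or earlier.
    """
    if needs_release:
        client.release_collection(collection_name=collection_name)
    client.drop_index(collection_name=collection_name, index_name=index_name)
    if needs_release:
        client.load_collection(collection_name=collection_name)


old_server = RecordingClient()
drop_index_compat(old_server, "my_collection", "index_name", needs_release=True)
print(old_server.calls)  # ['release_collection', 'drop_index', 'load_collection']

new_server = RecordingClient()
drop_index_compat(new_server, "my_collection", "index_name", needs_release=False)
print(new_server.calls)  # ['drop_index']
```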

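5.2 Index Selection Matrix

5.2.1 Matching Scenarios to Indexes

Scenario	Recommended index	Alternative	Not recommended	Rationale
Low-cardinality fields (cardinality < 500): boolean / enum / a few categories	BITMAP	INVERTED	NGRAM/RTREE	Bitmap indexes are the most space-efficient here, and bitwise operations make queries very fast
High-cardinality fields: unique IDs / phone numbers / many categories	INVERTED	-	BITMAP	Inverted indexes have no cardinality limit and support exact match and range queries
Fuzzy text queries: prefix / suffix / infix / wildcard	NGRAM	-	BITMAP/INVERTED	NGRAM is the index designed to accelerate LIKE queries; the other indexes cannot handle them efficiently
Spatial queries: relationships between points / lines / polygons	RTREE	-	All non-spatial indexes	The only index that accelerates spatial functions and geometric relationship checks
JSON field queries: exact match (e.g. brand = brand_5)	INVERTED	-	BITMAP	Inverted indexes support exact queries on a JSON path
JSON field queries: fuzzy match (e.g. author contains 'thor')	NGRAM	-	INVERTED	Fuzzy matching on JSON text needs an NGRAM index
Numeric range queries: price / rating / time range	INVERTED	-	BITMAP (at high cardinality)	Inverted indexes handle numeric ranges well at any cardinality
Array containment queries: tags contain / size contains	INVERTED	-	BITMAP (when array elements are high-cardinality)	Inverted indexes support array containment natively with no extra configuration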

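5.2.2 Data Type and Index Compatibility Matrix

Data type	BITMAP	INVERTED	NGRAM	RTREE	Notes
BOOL	✅	✅	❌	❌	BITMAP performs better
INT8/16/32/64	✅ (low cardinality)	✅ (any cardinality)	❌	❌	Use BITMAP for low cardinality, INVERTED for high cardinality
VARCHAR	✅ (low cardinality)	✅ (exact match)	✅ (fuzzy match)	❌	Use NGRAM for fuzzy queries
ARRAY (elements of a supported type)	✅ (low-cardinality elements)	✅ (any element cardinality)	❌	❌	Use INVERTED for array containment queries
FLOAT/DOUBLE	❌	✅	❌	❌	Of these four, only INVERTED applies, and it supports numeric range queries
JSON	❌	✅ (exact path queries)	✅ (fuzzy path queries)	❌	json_path and json_cast_type must be specified
GEOMETRY	❌	❌	❌	✅	Only RTREE is supported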
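The two matrices can be condensed into a small lookup function. This is a hypothetical helper (the name recommend_index and its parameters are invented for illustration); it encodes only the rules stated in the tables above, including the cardinality < 500 guideline, and is not an official Milvus selection API.

```python
def recommend_index(dtype, cardinality=None, fuzzy=False):
    """Suggest a scalar index type following the selection matrix in this section.

    dtype: Milvus data type name, e.g. "VARCHAR" or "INT64".
    cardinality: approximate number of distinct values, if known.
    fuzzy: True when the field is queried with LIKE patterns.
    """
    dtype = dtype.upper()
    if dtype == "GEOMETRY":
        return "RTREE"
    if dtype in ("FLOAT", "DOUBLE"):
        return "INVERTED"
    if dtype == "JSON":
        return "NGRAM" if fuzzy else "INVERTED"
    if dtype == "ARRAY":
        return "INVERTED"  # containment queries
    if dtype == "VARCHAR" and fuzzy:
        return "NGRAM"
    if dtype == "BOOL":
        return "BITMAP"
    if cardinality is not None and cardinality < 500:
        return "BITMAP"
    return "INVERTED"


print(recommend_index("INT8", cardinality=3))         # BITMAP
print(recommend_index("VARCHAR", fuzzy=True))         # NGRAM
print(recommend_index("INT64", cardinality=1000000))  # INVERTED
print(recommend_index("GEOMETRY"))                    # RTREE
```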