Milvus: Data Fields - Primary Field and AutoID (Part 5)

news 2025/11/4 9:39:00

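1. Primary Field Overview

1.1 What Is a Primary Field?

The primary field is the unique identifier of each entity in a Milvus Collection, analogous to the primary key in a traditional database. It plays a key role in data operations:

Unique identification: gives every entity its own identifier
Data management: supports insert, update, query, and delete operations
Disambiguation: lets each operation target exactly the intended entity

1.2 Primary Field Requirements

Every Collection must have exactly one primary field
Primary field values cannot be null
The data type is specified when the Collection is created and cannot be changed afterwards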
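
The requirements above, together with the uniqueness described in 1.1, can be checked on the client side before any data reaches Milvus. The following is a minimal sketch in plain Python; `check_primary_keys` is a hypothetical helper written for this article, not part of pymilvus, and it only inspects the batch it is given, not what is already stored in the Collection.

```python
from collections import Counter


def check_primary_keys(rows, pk_field):
    """Return a list of problems found in the primary-key column of `rows`."""
    problems = []
    keys = []
    for i, row in enumerate(rows):
        value = row.get(pk_field)
        if value is None:
            # Covers both a missing column and an explicit null
            problems.append(f"row {i}: missing or null '{pk_field}'")
        else:
            keys.append(value)
    # Report every key that occurs more than once in this batch
    for key, count in Counter(keys).items():
        if count > 1:
            problems.append(
                f"duplicate '{pk_field}' value {key!r} appears {count} times"
            )
    return problems


rows = [
    {"product_id": "PROD-001", "category": "book"},
    {"product_id": None, "category": "toy"},
    {"product_id": "PROD-001", "category": "electronic"},
]
for problem in check_primary_keys(rows, "product_id"):
    print(problem)
```

An empty result means the batch itself is consistent; it says nothing about collisions with entities inserted earlier.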

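2. Supported Data Types

The primary field can be of type INT64 or VARCHAR.

Note: a VARCHAR field must define the max_length attribute, which limits the maximum number of bytes.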
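
Because max_length counts bytes rather than characters, non-ASCII IDs use up the budget faster than their visible length suggests. The sketch below illustrates the difference; `fits_varchar` is a hypothetical helper, and it assumes the limit is measured on the UTF-8 encoding of the string.

```python
def fits_varchar(value: str, max_length: int) -> bool:
    """True if `value` fits in `max_length` bytes when UTF-8 encoded (assumed encoding)."""
    return len(value.encode("utf-8")) <= max_length


ascii_id = "PROD-001"   # ASCII only: one byte per character
mixed_id = "产品-001"    # two CJK characters at three bytes each, plus four ASCII bytes

print(len(ascii_id), len(ascii_id.encode("utf-8")))
print(len(mixed_id), len(mixed_id.encode("utf-8")))
print(fits_varchar(mixed_id, 8))
```

When IDs may contain non-ASCII text, it is safer to size max_length from the encoded length than from the character count.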

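3. Automatic ID vs. Manual ID

3.1 Mode Comparison

Mode	Description	Recommended when
Automatic ID	Milvus generates the unique identifier	Most cases, where IDs do not need to be managed by hand
Manual ID	The user supplies the unique ID	IDs must stay consistent with an external system or an existing dataset

3.2 Which One to Choose

If you are unsure which mode to use, start with automatic IDs. Doing so:

Simplifies data ingestion
Guarantees ID uniqueness
Reduces manual bookkeeping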

4. Automatic ID in Practice

4.1 Creating a Collection with AutoID Enabled

from pymilvus import MilvusClient, DataType

# Connect to the Milvus server
client = MilvusClient(uri="http://localhost:19530")

# Create the schema
schema = client.create_schema()

# Define the primary field and enable AutoID
schema.add_field(
    field_name="id",           # Primary field name
    is_primary=True,           # Mark it as the primary field
    auto_id=True,              # Enable automatic ID generation
    datatype=DataType.INT64    # Data type is INT64
)

# Define the other fields
schema.add_field(
    field_name="embedding",
    datatype=DataType.FLOAT_VECTOR,
    dim=4  # Vector dimension
)
schema.add_field(
    field_name="category",
    datatype=DataType.VARCHAR,
    max_length=1000  # VARCHAR requires a maximum length
)

# Drop any existing Collection with this name, then create it
if client.has_collection("demo_autoid"):
    client.drop_collection("demo_autoid")
client.create_collection(
    collection_name="demo_autoid",
    schema=schema
)
print("Created a Collection with AutoID enabled")

4.2 Inserting Data (Automatic ID)

# Prepare the data. Note: do not include the primary field column!
data = [
    {"embedding": [0.1, 0.2, 0.3, 0.4], "category": "book"},
    {"embedding": [0.2, 0.3, 0.4, 0.5], "category": "toy"},
    {"embedding": [0.3, 0.4, 0.5, 0.6], "category": "electronic"}
]

# Insert the data
res = client.insert(collection_name="demo_autoid", data=data)

# Inspect the automatically generated IDs
print("Generated IDs:", res.get("ids"))
# Example output: [461526052788333649, 461526052788333650, 461526052788333651]

# Verify the insert
print(f"Inserted {len(res.get('ids'))} records")
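
Rows exported from another system often already carry an ID column, which conflicts with the rule above that AutoID data must not include the primary field. A minimal sketch of stripping that column first; `without_primary_key` is a hypothetical helper and `exported` is made-up sample data, neither comes from pymilvus.

```python
def without_primary_key(rows, pk_field="id"):
    """Return copies of `rows` with the primary-field column removed."""
    return [{k: v for k, v in row.items() if k != pk_field} for row in rows]


# Made-up rows that still carry the IDs of the system they came from
exported = [
    {"id": 101, "embedding": [0.1, 0.2, 0.3, 0.4], "category": "book"},
    {"id": 102, "embedding": [0.2, 0.3, 0.4, 0.5], "category": "toy"},
]

data = without_primary_key(exported)
print(data)
```

The helper builds new dictionaries, so the original rows keep their IDs in case they are needed for a mapping table later.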

4.3 Using Upsert to Avoid Duplicates

# Use upsert to update existing data or insert new data
upsert_data = [
    {"embedding": [0.4, 0.5, 0.6, 0.7], "category": "clothing"}
]
upsert_res = client.upsert(collection_name="demo_autoid", data=upsert_data)
print("IDs generated by upsert:", upsert_res.get("ids"))

5. Manual ID in Practice

5.1 Creating a Collection with AutoID Disabled

from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")
schema = client.create_schema()

# Define the primary field and disable AutoID
schema.add_field(
    field_name="product_id",     # Primary field name
    is_primary=True,             # Mark it as the primary field
    auto_id=False,               # Disable AutoID; IDs must be supplied manually
    datatype=DataType.VARCHAR,   # Use the VARCHAR type
    max_length=100               # VARCHAR requires a maximum length
)

# Define the other fields
schema.add_field(
    field_name="embedding",
    datatype=DataType.FLOAT_VECTOR,
    dim=4
)
schema.add_field(
    field_name="category",
    datatype=DataType.VARCHAR,
    max_length=1000
)

# Drop any existing Collection with this name, then create it
if client.has_collection("demo_manual_ids"):
    client.drop_collection("demo_manual_ids")
client.create_collection(
    collection_name="demo_manual_ids",
    schema=schema
)
print("Created a Collection that requires manual IDs")

5.2 Inserting Data (Manual ID)

# Prepare the data. The primary field must be included!
data = [
    {
        "product_id": "PROD-001",  # Manually supplied ID
        "embedding": [0.1, 0.2, 0.3, 0.4],
        "category": "book"
    },
    {
        "product_id": "PROD-002",  # Manually supplied ID
        "embedding": [0.2, 0.3, 0.4, 0.5],
        "category": "toy"
    },
    {
        "product_id": "PROD-003",  # Manually supplied ID
        "embedding": [0.3, 0.4, 0.5, 0.6],
        "category": "electronic"
    }
]

# Insert the data
res = client.insert(collection_name="demo_manual_ids", data=data)

# Inspect the returned IDs (identical to the ones supplied)
print("Returned IDs:", res.get("ids"))
# Example output: ['PROD-001', 'PROD-002', 'PROD-003']
print(f"Inserted {len(res.get('ids'))} records")

5.3 Caveats for Manual IDs

# ⚠️ Wrong: inserting an ID that already exists
duplicate_data = [
    {"product_id": "PROD-001", "embedding": [0.5, 0.6, 0.7, 0.8], "category": "duplicate"}
]
try:
    # PROD-001 already exists. Depending on the Milvus version, this call may
    # raise an error or may store a second entity under the same ID.
    client.insert(collection_name="demo_manual_ids", data=duplicate_data)
except Exception as e:
    print(f"Insert failed: {e}")

# ✅ Right: use upsert whenever the ID may already exist
safe_data = [
    {"product_id": "PROD-001", "embedding": [0.5, 0.6, 0.7, 0.8], "category": "updated"}
]
upsert_res = client.upsert(collection_name="demo_manual_ids", data=safe_data)
print("Upsert succeeded")
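
When the same manual ID can appear more than once inside a single batch, it may help to collapse the batch before calling upsert, so that the outcome does not depend on how the server treats repeated keys in one request. A minimal sketch, assuming the last row for a key is the one that should survive; `last_write_wins` is a hypothetical helper, not a pymilvus function.

```python
def last_write_wins(rows, pk_field):
    """Collapse `rows` to one row per primary key, keeping the last row seen."""
    latest = {}
    for row in rows:
        latest[row[pk_field]] = row  # a later row replaces an earlier one
    return list(latest.values())


batch = [
    {"product_id": "PROD-001", "category": "book"},
    {"product_id": "PROD-002", "category": "toy"},
    {"product_id": "PROD-001", "category": "updated"},
]

for row in last_write_wins(batch, "product_id"):
    print(row)
```

The result keeps each key at the position where it first appeared, which makes the collapsed batch easy to compare against the original.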

6. Advanced Usage

6.1 Data Migration: Preserving Existing IDs

# In a data migration you may need to keep the original IDs.
# A Collection property allows custom IDs to be inserted even when AutoID is enabled.

# First, create a Collection with AutoID enabled
schema = client.create_schema()
schema.add_field(field_name="id", is_primary=True, auto_id=True, datatype=DataType.INT64)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=4)
schema.add_field(field_name="category", datatype=DataType.VARCHAR, max_length=1000)

if client.has_collection("migration_demo"):
    client.drop_collection("migration_demo")
client.create_collection(collection_name="migration_demo", schema=schema)

# Enable the allow_insert_auto_id property
client.alter_collection_properties(
    collection_name="migration_demo",
    properties={"allow_insert_auto_id": True}
)
print("allow_insert_auto_id enabled; custom IDs can now be inserted into the AutoID Collection")

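6.2 Multi-Cluster Configuration

In an environment with several Milvus clusters, each cluster needs its own cluster ID so that AutoID values stay globally unique:

# milvus.yaml configuration file
common:
  clusterID: 3   # Must be unique across all clusters (range: 0-7)

Configuration notes:

clusterID ranges from 0 to 7, so at most 8 clusters are supported
Every cluster must use a different clusterID
Milvus handles the bit reversal internally so that IDs from different clusters do not overlap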
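
The rules above are easy to violate when milvus.yaml files are maintained separately per cluster. A minimal sketch of a deployment-time check; `check_cluster_ids` and the cluster names are hypothetical, and the mapping is assumed to have been collected from each cluster's configuration by some other means.

```python
def check_cluster_ids(cluster_ids):
    """Validate a mapping of cluster name -> clusterID (range 0-7, no repeats)."""
    errors = []
    seen = {}  # clusterID -> first cluster name that claimed it
    for name, cid in cluster_ids.items():
        if not isinstance(cid, int) or not 0 <= cid <= 7:
            errors.append(f"{name}: clusterID {cid!r} is outside 0-7")
        elif cid in seen:
            errors.append(f"{name}: clusterID {cid} already used by {seen[cid]}")
        else:
            seen[cid] = name
    return errors


# Made-up deployment with one duplicate and one out-of-range value
clusters = {"us-east": 0, "eu-west": 3, "ap-south": 3, "lab": 8}
for error in check_cluster_ids(clusters):
    print(error)
```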

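7. How AutoID Works Internally

7.1 The 64-Bit Structure of an AutoID

[sign_bit][cluster_id][physical_ts][logical_ts]

sign_bit: reserved
cluster_id: identifies the cluster that generated the ID (0-7, see 6.2)
physical_ts: millisecond timestamp at which the ID was generated
logical_ts: counter that distinguishes IDs generated within the same millisecond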
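
To make the layout more concrete, the sketch below splits an ID into the four fields. The field widths are an assumption, not something this article states: 1 sign bit, 3 bits for cluster_id (enough for 0-7), 42 bits for physical_ts and 18 bits for logical_ts. The bit reversal mentioned in 6.2 is not modelled, so the cluster bits are returned raw; `split_auto_id` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Assumed widths (1 + 3 + 42 + 18 = 64 bits); not confirmed by this article
CLUSTER_BITS = 3
PHYSICAL_BITS = 42
LOGICAL_BITS = 18


def split_auto_id(auto_id: int) -> dict:
    """Split a non-negative 64-bit ID into the four fields of the assumed layout."""
    logical_ts = auto_id & ((1 << LOGICAL_BITS) - 1)
    physical_ts = (auto_id >> LOGICAL_BITS) & ((1 << PHYSICAL_BITS) - 1)
    cluster_bits = (auto_id >> (LOGICAL_BITS + PHYSICAL_BITS)) & ((1 << CLUSTER_BITS) - 1)
    sign_bit = (auto_id >> 63) & 1
    return {
        "sign_bit": sign_bit,
        "cluster_bits": cluster_bits,  # raw bits, bit reversal not applied
        "physical_ts": physical_ts,    # milliseconds since the Unix epoch
        "logical_ts": logical_ts,
    }


# First ID from the example output in 4.2
parts = split_auto_id(461526052788333649)
print(parts)
print(datetime.fromtimestamp(parts["physical_ts"] / 1000, tz=timezone.utc))
```

Under these assumptions the three consecutive IDs from 4.2 share the same physical_ts and differ only in logical_ts, which matches the role of the counter.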

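7.2 AutoID with a VARCHAR Primary Field

Even when the primary field uses the VARCHAR data type, Milvus still generates numeric IDs:

They are stored as numeric strings
Their maximum length is 20 characters (the uint64 range)
Global uniqueness is still guaranteed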
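
The 20-character limit follows directly from the uint64 range, as the short check below shows. `looks_like_varchar_auto_id` is a hypothetical helper that tests whether a string could be such an ID; it checks the format only and cannot tell whether Milvus actually generated the value.

```python
UINT64_MAX = 2**64 - 1

print(UINT64_MAX)            # 18446744073709551615
print(len(str(UINT64_MAX)))  # 20


def looks_like_varchar_auto_id(value: str) -> bool:
    """True if `value` is an ASCII decimal string within the uint64 range."""
    return (
        value.isascii()
        and value.isdigit()
        and len(value) <= 20
        and int(value) <= UINT64_MAX
    )


print(looks_like_varchar_auto_id("461526052788333649"))
print(looks_like_varchar_auto_id("PROD-001"))
```

A max_length of at least 20 on the VARCHAR primary field is therefore enough to hold any generated ID.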

8. Troubleshooting

8.1 Common Errors and Fixes

# Error: the primary field was supplied in an AutoID Collection
try:
    wrong_data = [{"id": 1, "embedding": [0.1, 0.2, 0.3, 0.4], "category": "book"}]
    client.insert(collection_name="demo_autoid", data=wrong_data)
except Exception as e:
    print(f"Error: {e}")  # Fails: in AutoID mode the primary field must not be supplied

# Error: the primary field is missing in a manual ID Collection
try:
    missing_id_data = [{"embedding": [0.1, 0.2, 0.3, 0.4], "category": "book"}]
    client.insert(collection_name="demo_manual_ids", data=missing_id_data)
except Exception as e:
    print(f"Error: {e}")  # Fails: in manual ID mode the primary field is required
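
Both mistakes can be caught before the request is sent. The following is a minimal sketch of such a pre-flight check; `check_rows_for_mode` is a hypothetical helper, and it deliberately ignores the allow_insert_auto_id property from 6.1, under which supplying an ID in an AutoID Collection is intentional.

```python
def check_rows_for_mode(rows, pk_field, auto_id):
    """Return one message per row whose primary-field usage does not match the ID mode."""
    problems = []
    for i, row in enumerate(rows):
        has_pk = pk_field in row
        if auto_id and has_pk:
            problems.append(f"row {i}: remove '{pk_field}', AutoID generates it")
        elif not auto_id and not has_pk:
            problems.append(f"row {i}: add '{pk_field}', manual ID mode requires it")
    return problems


auto_rows = [{"id": 1, "embedding": [0.1, 0.2, 0.3, 0.4]}]
manual_rows = [{"embedding": [0.1, 0.2, 0.3, 0.4]}]

print(check_rows_for_mode(auto_rows, "id", auto_id=True))
print(check_rows_for_mode(manual_rows, "product_id", auto_id=False))
```

Running the check before insert turns a server-side exception into a readable list of the rows that need fixing.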

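After working through this document, you should be able to use Milvus primary fields with confidence, choose between automatic and manual IDs according to your business requirements, and configure the primary field correctly in more complex scenarios.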
