
Get Started with Elasticsearch Semantic Search in 10 Minutes (Tutorial for Both Serverless and Cloud / Local Deployments)

news 2025/8/4 15:20:59

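1. Scenario & Terminology at a Glance

Concept	Explanation
Semantic search	Retrieves documents that are close in meaning to a natural-language query, rather than relying on pure keyword matching
Vector / Embedding	Maps text to an N-dimensional array; similar texts end up closer together
ELSER	Elastic's default sparse vector model (Elastic Learned Sparse EncodeR); retrieves across domains without fine-tuning
`semantic_text`	Field type built into ES 8.15+; stores the vectors and runs model inference automatically, with no manual tokenization or vocabulary handling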
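
To make "similar texts end up closer together" concrete, here is a minimal Python sketch of how two sparse vectors can be compared with a dot product over their shared tokens. The token weights are invented toy values, not real ELSER output, and the actual scoring inside Elasticsearch may differ in detail.

```python
def sparse_dot(query: dict[str, float], doc: dict[str, float]) -> float:
    """Dot product over the tokens that both sparse vectors share."""
    return sum(weight * doc[token] for token, weight in query.items() if token in doc)

# Toy token weights, invented purely for illustration.
query = {"park": 1.2, "big": 0.9, "large": 0.7}
doc_a = {"park": 1.0, "large": 1.1, "wilderness": 0.8}  # on-topic document
doc_b = {"recipe": 1.3, "pasta": 1.1}                   # unrelated document

score_a = sparse_dot(query, doc_a)
score_b = sparse_dot(query, doc_b)
print(score_a > score_b)  # the on-topic document scores higher
```

Note that doc_a matches on "large" even though it never contains "big": a learned sparse model expands text into related tokens, which is what lets it go beyond exact keyword overlap.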

2. Environment Setup

Serverless (recommended for newcomers)

Go to Elasticsearch → Home
Select “Semantic search workflow” → Create a semantic-optimized index
A vector index and sample data are created automatically

Elastic Cloud / local cluster

# Kibana DevTools / Console
PUT /semantic-index

Requires the superuser role (local) or a developer / admin role (Cloud)
The first time ELSER is loaded, the model downloads in the background (1-5 minutes)
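
If you would rather script this step than type it into Console, the sketch below builds the equivalent HTTP request using only the Python standard library. The endpoint URL and API key are hypothetical placeholders, and the helper name build_request is made up for this example; the request is built but not sent.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical placeholders: replace with your own endpoint and API key.
ES_URL = "https://localhost:9200"
API_KEY = "<your-api-key>"

def build_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) an HTTP request that mirrors a Console command."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"ApiKey {API_KEY}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(ES_URL + path, data=data, headers=headers, method=method)

# Equivalent of: PUT /semantic-index
create_index = build_request("PUT", "/semantic-index")
print(create_index.get_method(), create_index.full_url)
# To actually send it: urllib.request.urlopen(create_index)
```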

3. Create the Vector Index & Mapping

PUT /semantic-index/_mapping
{
  "properties": {
    "content": {               # any field name
      "type": "semantic_text"  # embedding model defaults to ELSER
    }
  }
}
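
The `#` comments in the request above are a Console convenience; strict JSON sent from a script cannot contain them. As a small sketch, the same mapping body can be generated in Python like this (the function name semantic_text_mapping is an assumption made for this example):

```python
import json

def semantic_text_mapping(field_name: str) -> dict:
    """Mapping body that declares one semantic_text field with default settings."""
    return {"properties": {field_name: {"type": "semantic_text"}}}

mapping = semantic_text_mapping("content")
# Body for: PUT /semantic-index/_mapping
print(json.dumps(mapping, indent=2))
```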

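Why choose semantic_text?

Sparse vector parameters are filled in with defaults
Long text is split automatically (chunking)
The same match / knn query syntax keeps working at search time

4. Bulk-Index Sample Documents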

POST /_bulk?pretty
{ "index": { "_index": "semantic-index" } }
{ "content": "Yellowstone National Park …" }
{ "index": { "_index": "semantic-index" } }
{ "content": "Yosemite National Park …" }
{ "index": { "_index": "semantic-index" } }
{ "content": "Rocky Mountain National Park …" }

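If the request times out with 408/504: wait for the model to finish loading, then retry
The text is automatically chunked and encoded as sparse vectors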
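
For scripted ingestion, the bulk body is newline-delimited JSON: one action line followed by one document line, with a trailing newline at the end. The sketch below builds such a body and lists the status codes worth retrying; the helper names and the exponential backoff schedule are assumptions for illustration, not a recommended policy.

```python
import json

# The timeout status codes mentioned above; retry only on these.
RETRYABLE_STATUSES = {408, 504}

def build_bulk_body(index: str, texts: list[str], field: str = "content") -> str:
    """Build an NDJSON _bulk body: an action line, then a document line, per text."""
    lines = []
    for text in texts:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps({field: text}, ensure_ascii=False))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"

def backoff_delays(attempts: int, base_seconds: float = 2.0) -> list[float]:
    """Hypothetical retry schedule: the wait doubles after every failed attempt."""
    return [base_seconds * (2 ** attempt) for attempt in range(attempts)]

body = build_bulk_body(
    "semantic-index",
    ["Yellowstone National Park …", "Yosemite National Park …"],
)
print(body)
print(backoff_delays(3))
```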

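5. Browsing in Discover & Opening ES|QL

Kibana left-hand navigation → Discover
Select semantic-index
Click “⬆️” on any document to expand it and view its fields
Top bar → Try ES|QL opens the query editor

6. Semantic Search with ES|QL

6.1 A First Query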

FROM semantic-index
| WHERE content: "what's the biggest park?"
| LIMIT 10

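The content: syntax automatically uses semantic matching (ELSER sparse vectors), because content is a semantic_text field
LIMIT controls the number of rows returned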
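
When the search text comes from user input, the query string has to be assembled safely. Here is a minimal sketch that quotes the text as an ES|QL string literal by escaping backslashes and double quotes. The helper names are made up for this example, and the sketch assumes that the index and field names are trusted identifiers.

```python
def esql_string(value: str) -> str:
    """Quote a Python string as a double-quoted ES|QL string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def semantic_query(index: str, field: str, text: str, limit: int = 10) -> str:
    """Assemble the introductory query; index and field are assumed to be trusted."""
    return "\n".join([
        f"FROM {index}",
        f"| WHERE {field}: {esql_string(text)}",
        f"| LIMIT {limit}",
    ])

query = semantic_query("semantic-index", "content", "what's the biggest park?")
print(query)
```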

6.2 Show Relevance Scores and Sort by Them

FROM semantic-index METADATA _score
| WHERE content: "best spot for rappelling"
| KEEP content, _score
| SORT _score DESC
| LIMIT 10

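Command	Purpose
`METADATA _score`	Exposes the Lucene relevance score
`KEEP`	Controls which columns are shown
`SORT _score DESC`	Sorts by score, descending

Check the results: Rocky Mountain National Park… usually ranks first, making it the best match for rappelling.

7. Running the Whole Query via the REST API (for scripts / automation)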

POST /_query?format=txt
{
  "query": """
    FROM semantic-index METADATA _score
    | WHERE content: "best spot for rappelling"
    | KEEP content, _score
    | SORT _score DESC
    | LIMIT 10
  """
}
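
The triple-quoted string is Console syntax; from a script, the query is sent as an ordinary JSON string. The sketch below serializes the request body and converts a JSON response (without `format=txt`) from its columns/values layout into one dict per row. The sample response is a placeholder written for this example, with a made-up score, and the helper names are assumptions.

```python
import json

def esql_request_body(query: str) -> str:
    """Serialize an ES|QL query as the JSON body for POST /_query."""
    return json.dumps({"query": query})

def rows_from_response(response: dict) -> list[dict]:
    """Turn the columns/values layout of a JSON response into one dict per row."""
    names = [column["name"] for column in response["columns"]]
    return [dict(zip(names, values)) for values in response["values"]]

query = "\n".join([
    "FROM semantic-index METADATA _score",
    '| WHERE content: "best spot for rappelling"',
    "| KEEP content, _score",
    "| SORT _score DESC",
    "| LIMIT 10",
])
body = esql_request_body(query)

# Placeholder response for illustration only; the score is not a real result.
# Real responses carry extra keys per column, such as the column type.
sample_response = {
    "columns": [{"name": "content"}, {"name": "_score"}],
    "values": [["Rocky Mountain National Park …", 1.0]],
}
rows = rows_from_response(sample_response)
print(rows[0]["content"])
```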

Delete the sample data:

DELETE /semantic-index

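8. FAQ

Symptom	Fix
`model download timeout`	Check Stack Management → Trained Models and confirm that the ELSER status is `started`
Slow vector search	Make sure the index's `default_pipeline` setting does no blocking processing; in production, consider tuning the index refresh interval (`index.refresh_interval`)
Want to combine keyword + semantic search	Create two fields, `text` + `semantic_text`, and combine them with a `bool` query or with logical operators in an ES|QL `WHERE` clause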
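
As a sketch of the last row, the following builds a Query DSL body that combines a keyword `match` clause with a `semantic` clause inside a `bool` query. The field name content_text for the plain `text` field is hypothetical, and the function name is made up for this example; relative weighting of the two clauses is left at the defaults.

```python
import json

def hybrid_query(text: str, keyword_field: str, semantic_field: str) -> dict:
    """bool/should query: a document can match lexically, semantically, or both."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {keyword_field: text}},
                    {"semantic": {"field": semantic_field, "query": text}},
                ]
            }
        }
    }

# content_text is a hypothetical `text` field holding the same content.
body = hybrid_query("best spot for rappelling", "content_text", "content")
# Body for: POST /semantic-index/_search
print(json.dumps(body, indent=2))
```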