ES Book Search - Part 1
Preface
The goal is to build a book search system: extract the content of e-books chapter by chapter, then support keyword search that returns the matching chapter content.
ES Deployment
1. Prerequisites
Project requirements:
- OS: Linux ARM aarch64
- Elasticsearch: 7.17.x (7.17.10 recommended)
- IK analyzer: v7.17.0
- JDK: 8 or 11, both are supported by ES 7.x
- User: run ES as a dedicated es user
- Memory: ≥ 2 GB for Elasticsearch
Installing Elasticsearch
The host is an ARM machine running the Kylin OS.

Step 1: Confirm the current system architecture:
```bash
uname -m
# aarch64
```
Check the current Java version:
```bash
java -version
```
Step 2: Download the Elasticsearch 7.17.x ARM build:
```bash
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.17.10-linux-aarch64.tar.gz
tar -zxvf elasticsearch-7.17.10-linux-aarch64.tar.gz
mv elasticsearch-7.17.10 /usr/local/elasticsearch
```
Then copy the files to the server.

Step 3: Create a dedicated user; ES refuses to run as root:
```bash
useradd es
mkdir -p /usr/local/elasticsearch/{data,logs}
chown -R es:es /usr/local/elasticsearch
```
Step 4: Configure Elasticsearch:
```bash
vi /usr/local/elasticsearch/config/elasticsearch.yml
```
```yaml
cluster.name: es-cluster
node.name: node-1
path.data: /usr/local/elasticsearch/data
path.logs: /usr/local/elasticsearch/logs
network.host: 0.0.0.0
http.port: 9200
discovery.type: single-node
xpack.security.enabled: false
```
System parameter tuning:
```bash
# Virtual memory
echo "vm.max_map_count=262144" >> /etc/sysctl.conf
sysctl -p

# File handles
echo "es soft nofile 65536" >> /etc/security/limits.conf
echo "es hard nofile 65536" >> /etc/security/limits.conf
```
Step 5: Install the IK analyzer v7.17.0 (offline install). Note that the host runs Elasticsearch 7.17.10, so I wanted IK 7.17.10, but no such release exists, and a version mismatch keeps ES from starting. The workaround: download the IK source, change the version in its pom.xml, and rebuild the package (detailed steps at the end of this section).
Release download:
https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.17.0/elasticsearch-analysis-ik-7.17.0.zip

Install:
```bash
cd /usr/local/elasticsearch
sudo -u es ./bin/elasticsearch-plugin install file:///opt/elasticsearch-analysis-ik-7.17.0.zip
```
Step 6: Start Elasticsearch:
```bash
su - es
/usr/local/elasticsearch/bin/elasticsearch -d
```
One problem here: ES needs JDK 11 to start, while the server only has JDK 8, so ES is started with its bundled JDK. That failed with:
```
could not find java in bundled JDK at /usr/local/elasticsearch/jdk/bin/java
```
Fix (adding execute permission was enough):
```bash
chmod +x /usr/local/elasticsearch/jdk/bin/java
```
Then switch to the es user and start again:
```bash
su - es
cd /usr/local/elasticsearch
bin/elasticsearch -d
```
Verify the IK analyzer
Test:
```bash
curl -X POST "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国成立了"
}'
```
Expected output (tokens abridged):
```json
{
  "tokens": [
    { "token": "中华人民共和国" },
    { "token": "中华" },
    { "token": "人民" },
    { "token": "共和国" },
    { "token": "成立" }
  ]
}
```
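The same check can be run from Java once the backend is wired up, via the high-level REST client's analyze API (a small sketch; `client` is the RestHighLevelClient bean defined in the config section below):
```java
// Run a sample sentence through ik_max_word and print each token.
// AnalyzeRequest / AnalyzeResponse come from org.elasticsearch.client.indices.
AnalyzeRequest request = AnalyzeRequest.withGlobalAnalyzer("ik_max_word", "中华人民共和国成立了");
AnalyzeResponse response = client.indices().analyze(request, RequestOptions.DEFAULT);
for (AnalyzeResponse.AnalyzeToken token : response.getTokens()) {
    System.out.println(token.getTerm());
}
```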
Detailed steps for building the IK Chinese analyzer (7.x branch) from source:
```bash
# 1️⃣ Clone the repository
git clone https://github.com/medcl/elasticsearch-analysis-ik.git
cd elasticsearch-analysis-ik

# 2️⃣ Switch to the 7.x branch
git checkout 7.x

# 3️⃣ Edit pom.xml so the version matches your ES
vim pom.xml
```
Find this property (change it to your Elasticsearch version if it differs):
```xml
<elasticsearch.version>7.17.10</elasticsearch.version>
```
Build:
```bash
mvn clean package -DskipTests
```
The built zip lands in the project's target directory; install it:
```bash
sudo -u es bin/elasticsearch-plugin install file:///usr/local/elasticsearch/elasticsearch-analysis-ik-7.17.10.zip
```
Backend code:
Dependencies:
```xml
<!-- Elasticsearch 7.17.10 -->
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.17.10</version>
</dependency>
<!-- Apache POI for .docx files -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.3</version>
</dependency>
<!-- PDF parsing -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.29</version>
</dependency>
<!-- .doc files -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>5.2.3</version>
</dependency>
<!-- Full OOXML schemas; fixes the missing DocumentDocument class -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-full</artifactId>
    <version>5.2.3</version>
</dependency>
<!-- XMLBeans -->
<dependency>
    <groupId>org.apache.xmlbeans</groupId>
    <artifactId>xmlbeans</artifactId>
    <version>5.1.1</version>
</dependency>
<!-- POI transitive dependencies -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-collections4</artifactId>
    <version>4.4</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.21</version>
</dependency>
<!-- Jackson -->
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
</dependency>
```
ES settings in application.yml:
```yaml
elasticsearch:
  host: 192.168.183.138
  port: 9200
  scheme: http
  index: book
```
ES config class (note that @Bean must be active here, otherwise the RestHighLevelClient cannot be injected into the controller below):
```java
@Data
@Configuration
@ConfigurationProperties(prefix = "elasticsearch")
public class ElasticsearchConfig {

    private String host;
    private Integer port;

    @Bean(destroyMethod = "close")
    public RestHighLevelClient restHighLevelClient() {
        RestClientBuilder builder = RestClient.builder(new HttpHost(host, port, "http"));
        return new RestHighLevelClient(builder);
    }
}
```
Entity classes for the parsed content:
```java
@Data
public class ChapterDoc {
    private String id;
    private String bookId;
    private String bookTitle;
    private int chapterIndex;
    private String chapterTitle;
    private String content;
    private List<ChapterImage> images = new ArrayList<>();
    private LocalDateTime createdAt = LocalDateTime.now();
}

@Data
public class ChapterImage {
    private String url;
    private int position;
}
```
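One thing to watch: ChapterDoc uses LocalDateTime, which plain jackson-databind cannot serialize out of the box. If chapters are turned into JSON before indexing (as in the RestHighLevelClient sketch further down), the mapper needs the jackson-datatype-jsr310 module, roughly like this (`chapterDoc` is a hypothetical instance):
```java
// ObjectMapper that can handle java.time types such as LocalDateTime.
// JavaTimeModule comes from the jackson-datatype-jsr310 artifact.
ObjectMapper mapper = new ObjectMapper();
mapper.registerModule(new JavaTimeModule());
mapper.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS); // ISO-8601 strings
String json = mapper.writeValueAsString(chapterDoc); // chapterDoc: hypothetical instance
```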
File parsing service, DocParser:
```java
@Service
public class DocParser {

    public List<ChapterDoc> parseFile(File file, String bookId, String bookTitle) throws Exception {
        String lower = file.getName().toLowerCase();
        if (lower.endsWith(".docx") || lower.endsWith(".doc")) {
            // Note: XWPFDocument only reads .docx; binary .doc files would need HWPF (the plain poi artifact)
            return parseDocx(file, bookId, bookTitle);
        } else if (lower.endsWith(".pdf")) {
            return parsePdf(file, bookId, bookTitle);
        } else {
            throw new RuntimeException("Unsupported file type");
        }
    }

    private List<ChapterDoc> parseDocx(File file, String bookId, String bookTitle) throws Exception {
        List<ChapterDoc> chapters = new ArrayList<>();
        try (FileInputStream fis = new FileInputStream(file);
             XWPFDocument doc = new XWPFDocument(fis)) {
            Files.createDirectories(Paths.get("/data/book_images"));
            StringBuilder sb = new StringBuilder();
            ChapterDoc currentChapter = null;
            int chapterIdx = 1;
            int imagePosition = 0;
            for (IBodyElement elem : doc.getBodyElements()) {
                if (!(elem instanceof XWPFParagraph)) continue;
                XWPFParagraph p = (XWPFParagraph) elem;
                String text = p.getText();
                if (text == null) continue;
                // A paragraph like "第三章 ..." starts a new chapter
                boolean isHeading = text.matches("^第[\\d一二三四五六七八九十]+章.*");
                if (isHeading) {
                    if (currentChapter != null) {
                        currentChapter.setContent(sb.toString());
                        chapters.add(currentChapter);
                        sb.setLength(0);
                    }
                    currentChapter = new ChapterDoc();
                    currentChapter.setId(bookId + "_" + chapterIdx++);
                    currentChapter.setBookId(bookId);
                    currentChapter.setBookTitle(bookTitle);
                    currentChapter.setChapterIndex(chapterIdx - 1);
                    currentChapter.setChapterTitle(text);
                    currentChapter.setImages(new ArrayList<>());
                    imagePosition = 0;
                }
                sb.append(text).append("\n");
                // Pictures live inside runs, not as body elements,
                // so collect them from the paragraph's runs
                for (XWPFRun run : p.getRuns()) {
                    for (XWPFPicture pic : run.getEmbeddedPictures()) {
                        XWPFPictureData picData = pic.getPictureData();
                        if (picData != null && currentChapter != null) {
                            byte[] bytes = picData.getData();
                            String ext = picData.suggestFileExtension();
                            String path = "/data/book_images/" + UUID.randomUUID() + "." + ext;
                            Files.write(Paths.get(path), bytes);
                            ChapterImage ci = new ChapterImage();
                            ci.setUrl("/images/" + Paths.get(path).getFileName());
                            ci.setPosition(imagePosition++);
                            currentChapter.getImages().add(ci);
                        }
                    }
                }
            }
            if (currentChapter != null) {
                currentChapter.setContent(sb.toString());
                chapters.add(currentChapter);
            }
        }
        return chapters;
    }

    private List<ChapterDoc> parsePdf(File file, String bookId, String bookTitle) throws Exception {
        List<ChapterDoc> chapters = new ArrayList<>();
        try (PDDocument pdf = PDDocument.load(file)) {
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(pdf);
            // Split the full text at chapter headings such as "第3章" / "第三章"
            Pattern pattern = Pattern.compile("(?m)^第[\\d一二三四五六七八九十]+章");
            String[] parts = pattern.split(text);
            for (int i = 0; i < parts.length; i++) {
                ChapterDoc doc = new ChapterDoc();
                doc.setId(bookId + "_" + (i + 1));
                doc.setBookId(bookId);
                doc.setBookTitle(bookTitle);
                doc.setChapterIndex(i + 1);
                doc.setChapterTitle("第" + (i + 1) + "章");
                doc.setContent(parts[i]);
                doc.setImages(new ArrayList<>());
                chapters.add(doc);
            }
        }
        return chapters;
    }
}
```
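A quick usage sketch (the file path, book ID, and title are made up for illustration):
```java
// Hypothetical example: parse one .docx book and print its chapters
DocParser parser = new DocParser();
List<ChapterDoc> chapters =
        parser.parseFile(new File("/data/books/demo.docx"), "book-001", "Demo Book");
for (ChapterDoc c : chapters) {
    System.out.println(c.getChapterIndex() + " " + c.getChapterTitle()
            + " (" + c.getImages().size() + " images)");
}
```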
5️⃣ EsIndexer.java (bulk indexing via the Java API Client)
```java
@Service
public class EsIndexer {

    private final ElasticsearchClient client;

    public EsIndexer(ElasticsearchClient client) {
        this.client = client;
    }

    public void bulkIndex(List<ChapterDoc> chapters) throws IOException {
        BulkRequest.Builder br = new BulkRequest.Builder();
        for (ChapterDoc doc : chapters) {
            br.operations(op -> op.index(idx -> idx
                    .index("book_index")
                    .id(doc.getId())
                    .document(doc)));
        }
        client.bulk(br.build());
    }
}
```
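Note that EsIndexer targets the new Java API Client (`co.elastic.clients.elasticsearch.ElasticsearchClient`) and the index `book_index`, while the config class and controller use the RestHighLevelClient and search the index `book`. If you prefer to stay on the high-level REST client throughout, a bulk-index sketch could look like this (assuming the ObjectMapper set up after the entity classes, and unifying on the `book` index):
```java
// Bulk-index chapters with the RestHighLevelClient instead of the Java API Client.
// BulkRequest / IndexRequest come from org.elasticsearch.action.{bulk,index}.
public void bulkIndexRhlc(RestHighLevelClient client, ObjectMapper mapper,
                          List<ChapterDoc> chapters) throws IOException {
    BulkRequest bulk = new BulkRequest();
    for (ChapterDoc doc : chapters) {
        bulk.add(new IndexRequest("book")
                .id(doc.getId())
                .source(mapper.writeValueAsString(doc), XContentType.JSON));
    }
    BulkResponse resp = client.bulk(bulk, RequestOptions.DEFAULT);
    if (resp.hasFailures()) {
        throw new IllegalStateException(resp.buildFailureMessage());
    }
}
```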
Controller:
```java
@RestController
public class FileController {

    private final RestHighLevelClient client;

    // The upload paths and record service were not declared in the original
    // snippet; these declarations are assumptions for completeness
    @Value("${upload.file-path}")
    private String filePath;
    @Value("${upload.pic-path}")
    private String picPath;
    @Value("${upload.video-path}")
    private String videoPath;
    @Autowired
    private YstsSearchFileParsingRecordService recordService; // hypothetical service type

    public FileController(RestHighLevelClient client) {
        this.client = client;
    }

    // Upload endpoint: receive a file and save it under the configured path
    @PostMapping("/upload")
    public String upload(@RequestParam("file") MultipartFile file,
                         @RequestParam Integer type) throws Exception {
        if (file.isEmpty()) {
            return "Upload failed: file is empty";
        }
        String uploadDir = null;
        String dateDir = new SimpleDateFormat("yyyyMMdd").format(new Date());
        if (type == 1) {
            uploadDir = filePath + dateDir + "\\";
        } else if (type == 2) {
            uploadDir = picPath + dateDir + "\\";
        } else if (type == 3) {
            uploadDir = videoPath + dateDir + "\\";
        }
        // Make sure the directory exists
        File dir = new File(uploadDir);
        if (!dir.exists()) {
            dir.mkdirs();
        }
        // Save the file to the server directory
        File dest = new File(uploadDir + file.getOriginalFilename());
        YstsSearchFileParsingRecord record = new YstsSearchFileParsingRecord();
        record.setFileName(file.getOriginalFilename());
        try {
            file.transferTo(dest);
            if (!recordService.add(record) && type == 1) {
                // check whether the file has already been parsed
            }
            return "Upload success: " + dest.getAbsolutePath();
        } catch (IOException e) {
            e.printStackTrace();
            return "Upload failed: " + e.getMessage();
        }
    }

    @GetMapping("/search")
    public List<Map<String, Object>> search(@RequestParam String keyword) throws IOException {
        // Build the search request
        SearchRequest searchRequest = new SearchRequest("book");
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        // Multi-field match
        MultiMatchQueryBuilder multiMatchQuery =
                QueryBuilders.multiMatchQuery(keyword, "chapterTitle", "sectionTitle", "content");
        sourceBuilder.query(multiMatchQuery);
        // Highlighting
        HighlightBuilder highlightBuilder = new HighlightBuilder()
                .preTags("<em>")
                .postTags("</em>")
                .field("content"); // highlighted field
        sourceBuilder.highlighter(highlightBuilder);
        // Number of results
        sourceBuilder.size(20);
        searchRequest.source(sourceBuilder);
        // Execute the query
        SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
        List<Map<String, Object>> resultList = new ArrayList<>();
        for (SearchHit hit : response.getHits().getHits()) {
            Map<String, Object> source = hit.getSourceAsMap();
            Map<String, Object> map = new HashMap<>();
            map.put("chapterTitle", source.get("chapterTitle"));
            map.put("sectionTitle", source.get("sectionTitle"));
            // Highlighted content
            Map<String, HighlightField> highlightFields = hit.getHighlightFields();
            if (highlightFields != null && highlightFields.containsKey("content")) {
                String highlightText = Arrays.toString(highlightFields.get("content").fragments());
                map.put("highlight", highlightText);
            } else {
                map.put("highlight", source.get("content"));
            }
            // Images
            map.put("images", source.get("images"));
            resultList.add(map);
        }
        return resultList;
    }
}
```
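If results should be limited to a single book, the multi-match can be wrapped in a bool query with a term filter on the `bookId` keyword field, e.g. (a sketch; the `bookId` variable would be an extra request parameter, which the original does not have):
```java
// Hypothetical variant: only search within one book
BoolQueryBuilder query = QueryBuilders.boolQuery()
        .must(QueryBuilders.multiMatchQuery(keyword, "chapterTitle", "sectionTitle", "content"))
        .filter(QueryBuilders.termQuery("bookId", bookId)); // bookId: hypothetical parameter
sourceBuilder.query(query);
```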
Index creation with the IK analyzers:
```json
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik_max": { "tokenizer": "ik_max_word" },
        "ik_smart": { "tokenizer": "ik_smart" }
      }
    }
  },
  "mappings": {
    "properties": {
      "bookId": { "type": "keyword" },
      "bookTitle": { "type": "text", "analyzer": "ik_max", "search_analyzer": "ik_smart" },
      "chapterIndex": { "type": "integer" },
      "chapterTitle": { "type": "text", "analyzer": "ik_max", "search_analyzer": "ik_smart" },
      "sectionIndex": { "type": "integer" },
      "sectionTitle": { "type": "text", "analyzer": "ik_max", "search_analyzer": "ik_smart" },
      "content": { "type": "text", "analyzer": "ik_max", "search_analyzer": "ik_smart" },
      "imageUrls": { "type": "keyword" },
      "pageFrom": { "type": "integer" },
      "pageTo": { "type": "integer" },
      "createdTime": { "type": "date" }
    }
  }
}
```
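One way to apply these settings and mappings from the application is via the high-level REST client (a sketch; `indexJson` is assumed to hold the JSON above, and the index name `book` matches the controller's SearchRequest):
```java
// Create the "book" index with the settings and mappings above, if it is missing.
// GetIndexRequest / CreateIndexRequest come from org.elasticsearch.client.indices.
public void createBookIndex(RestHighLevelClient client, String indexJson) throws IOException {
    boolean exists = client.indices().exists(new GetIndexRequest("book"), RequestOptions.DEFAULT);
    if (!exists) {
        CreateIndexRequest request = new CreateIndexRequest("book");
        request.source(indexJson, XContentType.JSON); // settings + mappings in one body
        client.indices().create(request, RequestOptions.DEFAULT);
    }
}
```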