当前位置：首页 > wzjs >正文

中国城乡住房和建设部网站首页网页制作与网站开发感想

wzjs 2025/9/5 23:41:23

中国城乡住房和建设部网站首页,网页制作与网站开发感想,网站建设优化推广系统,手机版网站怎样做推广需求背景最近有一个需求需要建设一个知识库文档检索系统，这些知识库物料附件的文档居多，有较多文档格式如：PDF, Open Office, MS Office等，需要将这些格式的文件转化成文本格式，写入elasticsearch 的全文检索索引&am…

需求背景

最近有一个需求需要建设一个知识库文档检索系统，这些知识库物料附件的文档居多，有较多文档格式如：PDF, Open Office, MS Office等，需要将这些格式的文件转化成文本格式，写入elasticsearch 的全文检索索引，方便搜索。我这里介绍一种工具不考虑文件原来格式，但能方便将转化的文档写入到对应的es 索引，并且支持OCR识别扫描版本的pdf文档。

FSCrawler介绍

使用官方文档
github：https://github.com/dadoonet/fscrawler/tree/master

主要功能

本地文件系统（挂载盘）爬取和创建文件索引，更新已经存在的和删除旧的文档；
远程文件系统爬取例如：SSH/FTP等；
允许 REST 接口方式上传你的二进制文件到 es。

下载和安装

docker 下载

docker pull dadoonet/fscrawler

运行

docker run -it --rm \-v ~/.fscrawler:/root/.fscrawler \-v ~/tmp:/tmp/es:ro \dadoonet/fscrawler fscrawler job_name

/root/.fscrawler 是程序的工作目录，会读取该目录下的_settings.yaml 文件，如果不存在会默认创建一个；
/tmp/es:ro 是待爬取的文件目录，该文件夹下的文件会被读取，写入es 对应的索引，其中索引可以在_settings.yaml 中指定，如果不指定会默认创建索引名为，启动任务的名称job_name_folder。

实践说明

我们下面创建一个爬取任务 job_name 为例进行功能说明。

创建工作目录

我创建一个工作目录如下：

/data/workspace/app/fscrawler

我在这个文件夹下创建了一个文件目录：./es 用于存放我需要索引的文件，_settings.yaml用于配置爬取的任务。

`_settings.yaml` 文件配置

为了验证各种文件格式，

我配置了支持ocr 识别，支持中文和英文字符识别；
为了能快速验证，文件目录检查时间我设置为1min；
配置了一个es 数据库用于存储解析后的数据，但没指定索引。
_settings.yaml 文件如下（具体字段意义见注释说明）：

---
name: "job_name" # job name
fs:url: "/tmp/es" # 要索引文件路径update_rate: "1m" # 更新频率excludes: # 排除文件- "*/~*"json_support: false # 是否支持jsonfilename_as_id: false # 是否将文件名作为idadd_filesize: true # 是否添加文件大小remove_deleted: true # 是否删除已删除的文件add_as_inner_object: false # 是否将文件内容作为内部对象store_source: false # 是否存储源文件index_content: true # 是否索引内容attributes_support: false # 是否支持属性raw_metadata: false # 是否原始元数据xml_support: false # 是否支持xmlindex_folders: true # 是否索引文件夹lang_detect: false # 是否检测语言continue_on_error: false # 是否继续错误add_as_inner_object: true # 是否将文件内容作为内部对象ocr:language: "chi_sim+eng" # 识别语言enabled: true # 是否启用ocrpdf_strategy: "ocr_and_text" # pdf策略follow_symlinks: false # 是否跟随符号链接
elasticsearch:nodes:- url: "http://xxx.xxx.xxx.xxx:9200"username: xxxpassword: xxxxbulk_size: 100flush_interval: "5s"byte_size: "10mb"ssl_verification: truepush_templates: true

启动任务

docker run -it --rm \-v /data/workspace/app/fscrawler:/root/.fscrawler \-v /data/workspace/app/fscrawler/es:/tmp/es:ro \dadoonet/fscrawler fscrawler job_name02:41:04,665 INFO  [f.console] ,----------------------------------------------------------------------------------------------------.
|       ,---,.  .--.--.     ,----..                                     ,--,           2.10-SNAPSHOT |
|     ,'  .' | /  /    '.  /   /   \                                  ,--.'|                         |
|   ,---.'   ||  :  /`. / |   :     :  __  ,-.                   .---.|  | :               __  ,-.   |
|   |   |   .';  |  |--`  .   |  ;. /,' ,'/ /|                  /. ./|:  : '             ,' ,'/ /|   |
|   :   :  :  |  :  ;_    .   ; /--` '  | |' | ,--.--.       .-'-. ' ||  ' |      ,---.  '  | |' |   |
|   :   |  |-, \  \    `. ;   | ;    |  |   ,'/       \     /___/ \: |'  | |     /     \ |  |   ,'   |
|   |   :  ;/|  `----.   \|   : |    '  :  / .--.  .-. | .-'.. '   ' .|  | :    /    /  |'  :  /     |
|   |   |   .'  __ \  \  |.   | '___ |  | '   \__\/: . ./___/ \:     ''  : |__ .    ' / ||  | '      |
|   '   :  '   /  /`--'  /'   ; : .'|;  : |   ," .--.; |.   \  ' .\   |  | '.'|'   ;   /|;  : |      |
|   |   |  |  '--'.     / '   | '/  :|  , ;  /  /  ,.  | \   \   ' \ |;  :    ;'   |  / ||  , ;      |
|   |   :  \    `--'---'  |   :    /  ---'  ;  :   .'   \ \   \  |--" |  ,   / |   :    | ---'       |
|   |   | ,'               \   \ .'         |  ,     .-./  \   \ |     ---`-'   \   \  /             |
|   `----'                  `---`            `--`---'       '---"                `----'              |
+----------------------------------------------------------------------------------------------------+
|                                        You know, for Files!                                        |
|                                     Made from France with Love                                     |
|                           Source: https://github.com/dadoonet/fscrawler/                           |
|                          Documentation: https://fscrawler.readthedocs.io/                          |
`----------------------------------------------------------------------------------------------------'02:41:04,683 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [560.9mb/8.7gb=6.25%], RAM [19.6gb/35gb=56.2%], Swap [0b/0b=0.0].
02:41:04,860 WARN  [f.p.e.c.f.s.Elasticsearch] username is deprecated. Use apiKey instead.
02:41:04,860 WARN  [f.p.e.c.f.s.Elasticsearch] password is deprecated. Use apiKey instead.
02:41:04,869 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
02:41:04,869 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/share/fscrawler/lib/log4j-slf4j-impl-2.22.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
02:41:05,230 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.8.1
02:41:05,266 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.8.1
02:41:05,331 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [job_name] for [/tmp/es] every [1m]
02:42:05,430 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.

启动成功后，如果是首次启动，没有指定索引和settings.yaml 文件的话会自动创建，当日志会给出警告。

启动成功后文件目录发生了变化：

自动创建_default文件夹，该文件夹下会生成es版本6，7，8对应的全文索引的默认的schema；
自动创建一个_status.json文件，用来检查上次运行的时间和文件变更信息，如下：

{"name" : "job_name","lastrun" : "2024-02-21T07:55:58.851263972","indexed" : 0,"deleted" : 0
}

本地索引目录添加文件

fscrawler配置的每间隔1分钟进行一次文件同步操作，这里需要注意：
1、如果需要启动时将历史文件全量同步的话，需要在启动fscrawler之前就将文件放入settings.yaml配置字段 url对应的文件路径，我们配置的是/tmp/es,容器映射的文件目录是：/data/workspace/app/fscrawler/es 。

2、后续启动后创建了_status.json文件，文件里面的字段lastrun表示上次同步运行的时间，如果文件的修改时间在这个时间之前，是不会同步更新的，新增的文件修改时间必须是在这个时间之后才会同步。

在kibana验证同步效果(我的kibana 版本 8.8.1)

我在es目录下创建了多个格式的文件，包括txt、 doc、docx、ppt、pdf 和扫描件pdf 总共28个文件，文件目录如下：
在这里插入图片描述

在es中查看索引文档

在kibana创建可视化搜索应用，左侧菜单->Enterprise->Search Application；
点击创建并选择下拉选择的索引，输入应用名称即可
展示效果，可以看到文件内容和文件类型，pdf的扫描件效果也不错。

支持rest接口上传文件

启动时新增参数--rest 即可启动rest接口上传文件（注意：我本地为了测试方便使用--net=host -p 8080:8080 来映射主机端口到容器）

docker run -it --net=host -p 8080:8080 --rm \-v /data/workspace/app/fscrawler:/root/.fscrawler \-v /data/workspace/app/fscrawler/es:/tmp/es:ro \dadoonet/fscrawler fscrawler job_name --rest

启动后可以查看fscrawler 状态和配置

// http://127.0.0.1:8080/fscrawler{"ok": true,"version": "2.10-SNAPSHOT","elasticsearch": "8.8.1","settings": {"name": "job_name","fs": {"url": "/tmp/es","update_rate": "1m","excludes": ["*/~*"],"json_support": false,"filename_as_id": false,"add_filesize": true,"remove_deleted": true,"add_as_inner_object": true,"store_source": false,...

上传文件

~/workspace » echo "This is my text" > test.txt                                                                     
~/workspace » curl -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_document"                                   ernestxwli@VM-142-118-tencentos
{"ok":true,"filename":"test.txt","url":"http://xxx.xxx.xxx.xxx:9200/job_name/_doc/dd18bf3a8ea2a3e53e2661c7fb53534"}%

上传文件并添加额外标签
写入文件额外信息json到tags.txt, 内容如下{"external":{"tenantId": 24,"projectId": 34,"description":"these are additional tags"}}

curl -F "file=@test.txt" -F "tags=@tags.txt" "http://127.0.0.1:8080/fscrawler/_document"

查看es索引文档信息

可以看到业务可以根据自己的需要对文件进行字段扩展，以便满足业务需求，当然还提供了其他接口来处理文件的删除等，详见官方文档。

总结

FSCrawler 提供了一站式的集成方案用来解决各种文档数据转化并存储到es数据库，也有一定的灵活性来自定义拓展字段，可以作为一种文档转换存储工具的选择之一。

文章转载自：

http://VBTp8sQQ.gygfx.cn
http://85eCrsV6.gygfx.cn
http://gWBcITht.gygfx.cn
http://gHvjNYsm.gygfx.cn
http://e5uzNHVo.gygfx.cn
http://sw9TQzVH.gygfx.cn
http://bpvQbwcY.gygfx.cn
http://mx4dvfFo.gygfx.cn
http://Q48BTdY5.gygfx.cn
http://dKJ39Ba6.gygfx.cn
http://IsNXT3RJ.gygfx.cn
http://x0VKaFdO.gygfx.cn
http://hTB5obCj.gygfx.cn
http://G7CuR3xD.gygfx.cn
http://odno3EDW.gygfx.cn
http://CxGzhjYn.gygfx.cn
http://npWYmUuB.gygfx.cn
http://koOn8gSE.gygfx.cn
http://ndw4F8ms.gygfx.cn
http://f2G3geyH.gygfx.cn
http://251Ijhrm.gygfx.cn
http://77ZgQ4fJ.gygfx.cn
http://bZ8rB3k7.gygfx.cn
http://0IqPZHQ7.gygfx.cn
http://BWdLcuJD.gygfx.cn
http://Im7fTPxU.gygfx.cn
http://fz8gaVcG.gygfx.cn
http://btvW5ctv.gygfx.cn
http://dDRMgtTx.gygfx.cn
http://FX1q4lVC.gygfx.cn

查看全文

http://www.dtcms.com/wzjs/610639.html

自己如何做购物网站深圳网站建设在哪里找

网站建设和管理是教什么科目wordpress网站相册

网站的站点建设分为24小时自助平台业务下单

汽车维修保养网站模板wordpress安装下载

专利减缓在哪个网站上做网站建设初学者教程

网站页面做专题的步骤网络运营师

顺企网南昌网站建设wordpress文章相关推荐

兰州哪家网站做推广效果好网站开发课程心得

企业门户网站是什么意思网站定制开发微信运营

保定建站软件中铁三局招聘文员要求身材好

侯马做网站私密浏览器免费

化妆品网站源码宝山做网站价格

网站建设需求报告四大软件外包公司

加盟招商网站建设品牌推广方案100例

苏宿工业园区网站建设成功案例印刷报价网站源码下载

兰州网站排名外包黄骅招聘

湖南英文网站建设线上营销公司

服务器访问不了网站吉林省长春市长春网站建设哪家好

温州做网站老师潍坊做网站的企业

长泰建设局网站帮忙注册公司多少钱

门户网站开发维护合同与网站签约

网站怎么做推广和宣传wordpress头像被墙

云南高端网站建设公司做网站要钱的吗

网站semseo先做哪个网络平台推广方案

做ppt网站有哪些内容吗湖北网中建设工程有限公司

网站建设服务内容费用wordpress移动底部菜单插件

用dw设计网站模板下载地址wordpress 显示当前位置

寻找建设网站客户做卖图片的网站能赚钱吗

学做网站卖东西去哪学手机网页端

水果电商网站建设相关文献成都app开发多少钱