
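Parsing Unstructured HTML with unstructured in the LLM Era

In the LLM era, reading and parsing text is an essential data preprocessing step.

unstructured is a tool for extracting and normalizing unstructured data. It supports many formats, including HTML, Word, and PDF, and converts them into a uniform set of structured elements: text fragments, titles, tables, metadata, and so on.

This post walks through parsing HTML with unstructured in a Linux conda environment; later posts will cover PDF and Word files.

1 Installing unstructured

This post uses a conda-based Python environment and assumes conda is already installed.

1.1 unstructured

The Python version is 3.12. The install command for unstructured is shown below; the full version is chosen here.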

pip install "unstructured[local-inference]" -i https://pypi.tuna.tsinghua.edu.cn/simple

1.2 nltk_data

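Note that unstructured depends on nltk_data, so install nltk_data before actually testing unstructured.

Because of network restrictions in mainland China, the download code below fails to run.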

import nltk
nltk.download('punkt')

So nltk_data has to be downloaded directly from its GitHub repository; see Problem 2 in the Problems section for the procedure.

https://github.com/nltk/nltk_data

2 Verifying the installation

Run the following code and check whether it raises an error.

import unstructured

Normally it should not raise any error.
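
A slightly more informative check can be sketched with the standard library alone: it reports which version of each package is installed, without importing the package itself. The helper name `installed_version` and the package list are assumptions made for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages mentioned in this post; adjust the list as needed.
for name in ("unstructured", "numpy", "nltk"):
    print(name, installed_version(name) or "not installed")
```

Because nothing is imported, this check still works when a broken binary dependency would make `import unstructured` itself fail.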

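3 Parsing test

This section extracts the text of an HTML web page and then formats it. Sample code follows.

3.1 Text output

from unstructured.partition.html import partition_html
# Fetch the page and partition it into elements
elements = partition_html(url="https://apnews.com/article/national-guard-trump-chicago-portland-court-b5d227814d775159eb9c3814779b3ae3", include_page_breaks=False)
# Text output
print(len(elements))  # number of parsed elements
for e in elements:
    print("--")
    print(e.text)

The output is shown below; unstructured successfully extracted the text of the web page.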

103
--
Federal court to weigh Trump’s deployment of National Guard troops in Chicago area
--
1 of 4 |
--
President Donald Trump on Wednesday said the Illinois governor and Chicago mayor, both Democrats, should be jailed as they oppose his deployment of National Guard troops for his immigration and crime crackdown in the nation’s third-largest city.
--
2 of 4 |
--
A protester is arrested by police and federal officers outside a U.S. Immigration and Customs Enforcement facility in Portland, Ore., Monday, Oct. 6, 2025. (AP Photo/Ethan Swope)
--
3 of 4 |
--
Military personnel in uniform, with the Texas National Guard patch on, are seen at the U.S. Army Reserve Center, Tuesday, Oct. 7, 2025, in Elwood, Ill., a suburb of Chicago. (AP Photo/Erin Hooley)
--
4 of 4 |
--
Military personnel in uniform, with the Texas National Guard patch on, are seen at the U.S. Army Reserve Center, Tuesday, Oct. 7, 2025, in Elwood, Ill., a suburb of Chicago. (AP Photo/Erin Hooley)
--
Federal court to weigh Trump’s deployment of National Guard troops in Chicago area
--
1 of 4
--
President Donald Trump on Wednesday said the Illinois governor and Chicago mayor, both Democrats, should be jailed as they oppose his deployment of National Guard troops for his immigration and crime crackdown in the nation’s third-largest city.
--

                    Share

......
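
The output above contains repeated entries (the headline and the photo captions appear more than once) as well as strings padded with whitespace. A minimal sketch of one possible cleanup step follows. It works on plain strings, so it does not depend on unstructured; the helper name `dedupe_texts` is hypothetical and the sample strings are abbreviated placeholders.

```python
def dedupe_texts(texts):
    """Strip whitespace, drop empty strings and exact repeats, keep first-seen order."""
    seen = set()
    result = []
    for text in texts:
        cleaned = text.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


# Abbreviated placeholder strings modeled on the output above
sample = ["Federal court to weigh ...", "1 of 4 |", "Federal court to weigh ...", "   Share   "]
print(dedupe_texts(sample))
```

With the elements from the code above, the call would be `dedupe_texts(e.text for e in elements)`. Whether exact-match de-duplication is appropriate depends on the downstream task.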

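3.2 Formatted output

Next, try formatting the web page data as input for downstream tasks, such as JSON or LLM training data.

from unstructured.partition.html import partition_html
elements = partition_html(url="https://apnews.com/article/national-guard-trump-chicago-portland-court-b5d227814d775159eb9c3814779b3ae3", include_page_breaks=False)
# Text output: category and text of the first five elements
print("text output:")
for e in elements[:5]:
    print("--")
    print(e.category, e.text)
# Formatted output: convert the elements to a list of dicts
from unstructured.staging.base import convert_to_dict
json_data = convert_to_dict(elements)
print("json output:")
print(json_data[:2])  # print the first two elements

Sample output: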


text output:
--
UncategorizedText Federal court to weigh Trump’s deployment of National Guard troops in Chicago area
--
UncategorizedText 1 of 7 |
--
NarrativeText President Donald Trump on Wednesday said the Illinois governor and Chicago mayor, both Democrats, should be jailed as they oppose his deployment of National Guard troops for his immigration and crime crackdown in the nation’s third-largest city.
--
UncategorizedText 2 of 7 |
--
NarrativeText Aerial footage showed a large group of demonstrators marching through downtown Chicago Wednesday night, in the wake of President Donald Trump’s deployment of federal troops to an Army training center outside the city.
json output:
[{'type': 'UncategorizedText', 'element_id': '1dd8d9f7bfda127df3536a1ce2fcb90f', 'metadata': {'filetype': 'text/html', 'page_number': 1, 'url': 'https://apnews.com/article/national-guard-trump-chicago-portland-court-b5d227814d775159eb9c3814779b3ae3'}, 'text': 'Federal court to weigh Trump’s deployment of National Guard troops in Chicago area'}, {'type': 'UncategorizedText', 'element_id': '1aebd08eb85a4bac63de68d21d334b7a', 'metadata': {'filetype': 'text/html', 'page_number': 1, 'url': 'https://apnews.com/article/national-guard-trump-chicago-portland-court-b5d227814d775159eb9c3814779b3ae3', 'emphasized_text_contents': ['1 of 7\xa0|', '|'], 'emphasized_text_tags': ['span', 'span']}, 'text': '1 of 7\xa0|'}]
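
The list of dicts can be turned into JSON Lines for a downstream task. The sketch below assumes only the `type` and `text` keys visible in the output above; the helper name `to_jsonl`, its `keep_types` parameter, and the hand-written sample records are illustrative, not part of unstructured.

```python
import json


def to_jsonl(records, keep_types=("NarrativeText",)):
    """Serialize element dicts of the selected types as JSON Lines (one object per line)."""
    lines = []
    for record in records:
        if record.get("type") in keep_types:
            row = {"type": record["type"], "text": record["text"]}
            # ensure_ascii=False keeps non-ASCII characters readable in the output
            lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines)


# Hand-written records in the same shape as the output above (values abbreviated)
sample = [
    {"type": "UncategorizedText", "element_id": "a1", "metadata": {}, "text": "1 of 7 |"},
    {"type": "NarrativeText", "element_id": "b2", "metadata": {}, "text": "Aerial footage showed ..."},
]
print(to_jsonl(sample))
```

Filtering on `NarrativeText` drops fragments such as "1 of 7 |". Which categories to keep is a judgment call that depends on the page and the task.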

Problems

Problem 1: numpy version mismatch

module compiled against ABI version 0x1000009 but this version of numpy is 0x2000000

The module was compiled against numpy 1.x, but the installed numpy is 2.x, so numpy has to be downgraded to 1.x.

pip install numpy==1.26 -i https://pypi.tuna.tsinghua.edu.cn/simple

Reference: https://juejin.cn/post/7398087074971090959
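
To see which numpy is installed before deciding whether to downgrade, a small check along these lines may help. It is a sketch only: the helper name `needs_downgrade` is hypothetical, and it flags only the 2.x case described above. Whether a downgrade is really required depends on how the failing module was compiled.

```python
from importlib import metadata


def needs_downgrade(version):
    """Return True if the numpy version string is 2.x or newer."""
    major = int(version.split(".")[0])
    return major >= 2


try:
    current = metadata.version("numpy")
    status = "2.x or newer, downgrade may be needed" if needs_downgrade(current) else "1.x"
    print("numpy", current, "->", status)
except metadata.PackageNotFoundError:
    print("numpy is not installed")
```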

Problem 2: nltk data download

Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource

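Download nltk_data from its GitHub repository

https://github.com/nltk/nltk_data

Rename the packages directory in it to nltk_data, then copy it to the /home/user directory.

The default location of nltk_data is the user's home directory, /home/user in this example.

Then run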

import nltk
nltk.download('punkt')

Normally, the following output is shown

[nltk_data] Downloading package punkt to /home/dev/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True

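Running the code below still raises an error

from nltk.tokenize import word_tokenize
from nltk.text import Text
input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
tokens = word_tokenize(input_str)
print(tokens)

word_tokenize is an NLTK utility that splits text into words. It uses an NLTK data package named punkt, which contains language-specific tokenization rules. punkt identifies word boundaries by recognizing punctuation, whitespace, and other separators, and breaks the text into words. word_tokenize requires that the punkt data has already been downloaded. According to the error message, the failure occurs because the data was downloaded but never unzipped.

Go to the /home/dev/nltk_data/tokenizers directory copied earlier and unzip the punkt_tab.zip file in it

cd /home/dev/nltk_data/tokenizers

unzip punkt_tab.zip

Then run the previous code again.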

https://blog.csdn.net/2301_81199775/article/details/139939837
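
The same unzip step comes up for every resource that NLTK reports as missing, so it can be scripted. The following is a minimal sketch using only the standard library. It assumes each archive unpacks into a directory named after the archive (as punkt_tab.zip does), and the function name `extract_archives` is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archives(nltk_data_dir):
    """Extract every .zip under nltk_data_dir next to the archive.

    Archives whose target directory already exists are skipped.
    Returns the list of archives that were extracted.
    """
    extracted = []
    # sorted() materializes the list first, so new directories do not affect the walk
    for archive in sorted(Path(nltk_data_dir).rglob("*.zip")):
        target = archive.with_suffix("")  # e.g. tokenizers/punkt_tab
        if target.exists():
            continue
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
        extracted.append(archive)
    return extracted
```

Calling `extract_archives("/home/dev/nltk_data")` would unpack punkt_tab.zip along with any other archive that has not been extracted yet. That uses more disk space than unzipping only the resources named in the error messages.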

Problem 3: averaged perceptron tagger

Resource averaged_perceptron_tagger_eng not found.
  Please use the NLTK Downloader to obtain the resource:

Change to the taggers directory and unzip the averaged_perceptron_tagger_eng archive.

cd /home/user/nltk_data/taggers

unzip averaged_perceptron_tagger_eng.zip

Then run the code again.

References

---

unstructured

https://github.com/Unstructured-IO/unstructured.git

numpy version error

https://juejin.cn/post/7398087074971090959

[NLP series] Installing nltk_data and the punkt library (tested and working)

https://blog.csdn.net/2301_81199775/article/details/139939837

The unstructured library: processing and preprocessing unstructured data

https://blog.csdn.net/u013172930/article/details/147831892
