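Parsing unstructured HTML with unstructured in the LLM era
In the LLM era, reading and parsing text is a critical data-preprocessing step.
unstructured is a tool for extracting and normalizing unstructured data. It supports HTML, Word, PDF and other formats, and converts them into uniform structured elements such as text fragments, titles, tables and metadata.
This post walks through parsing HTML with unstructured in a Linux conda environment. Later posts will cover PDF and Word files.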
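1 Installing unstructured
This walkthrough uses a conda-based Python environment and assumes conda is already installed.
1.1 unstructured
The Python version is 3.12. The install command for unstructured is shown below; the full version is installed here.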
pip install "unstructured[local-inference]" -i https://pypi.tuna.tsinghua.edu.cn/simple
1.2 nltk_data
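Note that unstructured depends on nltk_data, so install nltk_data before actually testing unstructured.
Because of network restrictions in mainland China, the following download code is likely to fail.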
import nltk
nltk.download('punkt')
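In that case, download the data directly from the official nltk_data repository. The procedure is described under Issue 2 in the Issues section at the end.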
https://github.com/nltk/nltk_data
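Before moving on, it can help to check whether the NLTK resources are already unpacked on disk. The following is a minimal sketch that uses only the standard library and does not import nltk. The function name missing_resources and the REQUIRED tuple are illustrative, and the two resource paths are simply the ones named in the errors under Issues 2 and 3; the set that a given unstructured version actually needs is an assumption.

```python
from pathlib import Path

# Resource directories named in the errors described under Issues 2 and 3.
# Assumption: other unstructured/NLTK versions may need a different set.
REQUIRED = ("tokenizers/punkt_tab", "taggers/averaged_perceptron_tagger_eng")


def missing_resources(nltk_root, required=REQUIRED):
    """Return the required resource paths that are not unpacked under nltk_root."""
    root = Path(nltk_root)
    return [rel for rel in required if not (root / rel).is_dir()]


if __name__ == "__main__":
    # ~/nltk_data is one of the locations NLTK searches by default.
    print(missing_resources(Path.home() / "nltk_data"))
```

An empty list means both directories exist. A resource that is present only as a .zip file is still reported as missing, which matches the situation described under Issue 2.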
2 Verifying the installation
Run the following code and check whether it raises an error.
import unstructured
If the installation succeeded, the import completes without an error.
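A slightly more informative check is to print the installed versions of the packages involved. The sketch below relies on importlib.metadata from the standard library; the helper name installed_version is illustrative, and the package list is just the set discussed in this post.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Packages mentioned in this post; None means "not installed".
    for name in ("unstructured", "nltk", "numpy"):
        print(name, installed_version(name))
```

Recording these versions is also useful when comparing against the numpy problem described under Issue 1.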
3 Parsing test
This section extracts the text of an HTML web page and then formats it. The example code is shown below.
3.1 Text output
from unstructured.partition.html import partition_html

# Fetch the page and split it into elements
elements = partition_html(url="https://apnews.com/article/national-guard-trump-chicago-portland-court-b5d227814d775159eb9c3814779b3ae3", include_page_breaks=False)

# Text output
for e in elements:
    print("--")
    print(e.text)
The output is shown below; unstructured successfully extracts the text of the web page.
103
--
Federal court to weigh Trump’s deployment of National Guard troops in Chicago area
--
1 of 4 |
--
President Donald Trump on Wednesday said the Illinois governor and Chicago mayor, both Democrats, should be jailed as they oppose his deployment of National Guard troops for his immigration and crime crackdown in the nation’s third-largest city.
--
2 of 4 |
--
A protester is arrested by police and federal officers outside a U.S. Immigration and Customs Enforcement facility in Portland, Ore., Monday, Oct. 6, 2025. (AP Photo/Ethan Swope)
--
3 of 4 |
--
Military personnel in uniform, with the Texas National Guard patch on, are seen at the U.S. Army Reserve Center, Tuesday, Oct. 7, 2025, in Elwood, Ill., a suburb of Chicago. (AP Photo/Erin Hooley)
--
4 of 4 |
--
Military personnel in uniform, with the Texas National Guard patch on, are seen at the U.S. Army Reserve Center, Tuesday, Oct. 7, 2025, in Elwood, Ill., a suburb of Chicago. (AP Photo/Erin Hooley)
--
Federal court to weigh Trump’s deployment of National Guard troops in Chicago area
--
1 of 4
--
President Donald Trump on Wednesday said the Illinois governor and Chicago mayor, both Democrats, should be jailed as they oppose his deployment of National Guard troops for his immigration and crime crackdown in the nation’s third-largest city.
--
Share
......
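The output above contains repeated text (the headline and the first caption each appear twice) and short fragments such as "1 of 4 |". A minimal post-processing sketch is shown below. It is not part of unstructured: the helper name clean_texts and the min_chars threshold of 20 are assumptions for illustration, and the function only relies on each element having a .text attribute.

```python
from types import SimpleNamespace


def clean_texts(elements, min_chars=20):
    """Keep each distinct text once, in order, skipping very short fragments."""
    seen = set()
    kept = []
    for el in elements:
        text = (el.text or "").strip()
        # Drop fragments below the (assumed) length threshold and exact repeats.
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


if __name__ == "__main__":
    # Stand-ins for the objects returned by partition_html; only .text is used.
    sample = [
        SimpleNamespace(text="Federal court to weigh deployment of troops"),
        SimpleNamespace(text="1 of 4 |"),
        SimpleNamespace(text="Federal court to weigh deployment of troops"),
    ]
    print(clean_texts(sample))
```

With real data the call would be clean_texts(elements), where elements is the list returned by partition_html. A length threshold is a blunt filter and can also drop short but meaningful headings, so the value should be tuned per site.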
3.2 Formatted output
Next, try formatting the web page data as input for downstream tasks, for example JSON or LLM training data.
from unstructured.partition.html import partition_html
from unstructured.staging.base import convert_to_dict

elements = partition_html(url="https://apnews.com/article/national-guard-trump-chicago-portland-court-b5d227814d775159eb9c3814779b3ae3", include_page_breaks=False)

# Text output: category and text of the first five elements
for e in elements[:5]:
    print("--")
    print(e.category, e.text)

# Formatted output: convert the elements to a list of dicts
json_data = convert_to_dict(elements)
print(json_data[:2])  # print the first two elements
Example output:
text output:
--
UncategorizedText Federal court to weigh Trump’s deployment of National Guard troops in Chicago area
--
UncategorizedText 1 of 7 |
--
NarrativeText President Donald Trump on Wednesday said the Illinois governor and Chicago mayor, both Democrats, should be jailed as they oppose his deployment of National Guard troops for his immigration and crime crackdown in the nation’s third-largest city.
--
UncategorizedText 2 of 7 |
--
NarrativeText Aerial footage showed a large group of demonstrators marching through downtown Chicago Wednesday night, in the wake of President Donald Trump’s deployment of federal troops to an Army training center outside the city.
json output:
[{'type': 'UncategorizedText', 'element_id': '1dd8d9f7bfda127df3536a1ce2fcb90f', 'metadata': {'filetype': 'text/html', 'page_number': 1, 'url': 'https://apnews.com/article/national-guard-trump-chicago-portland-court-b5d227814d775159eb9c3814779b3ae3'}, 'text': 'Federal court to weigh Trump’s deployment of National Guard troops in Chicago area'}, {'type': 'UncategorizedText', 'element_id': '1aebd08eb85a4bac63de68d21d334b7a', 'metadata': {'filetype': 'text/html', 'page_number': 1, 'url': 'https://apnews.com/article/national-guard-trump-chicago-portland-court-b5d227814d775159eb9c3814779b3ae3', 'emphasized_text_contents': ['1 of 7\xa0|', '|'], 'emphasized_text_tags': ['span', 'span']}, 'text': '1 of 7\xa0|'}]
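To turn the list of dicts above into something a training pipeline can read, one option is JSON Lines (one JSON object per line). The sketch below uses only the standard json module and assumes records shaped like the output above, with 'type', 'text' and 'metadata' keys. The helper name to_jsonl and the 'source' field are illustrative choices, not part of unstructured.

```python
import json


def to_jsonl(records, keep_types=None):
    """Serialize element dicts to JSON Lines, optionally keeping only some types."""
    lines = []
    for rec in records:
        if keep_types is not None and rec.get("type") not in keep_types:
            continue
        row = {"type": rec.get("type"), "text": rec.get("text", "")}
        url = rec.get("metadata", {}).get("url")
        if url:
            row["source"] = url  # hypothetical field name for the page URL
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        {"type": "NarrativeText", "text": "hello", "metadata": {"url": "https://example.com"}},
        {"type": "UncategorizedText", "text": "1 of 7 |", "metadata": {}},
    ]
    print(to_jsonl(sample, keep_types={"NarrativeText"}))
```

With real data the call would be to_jsonl(json_data, keep_types={"NarrativeText"}), which keeps body paragraphs and drops fragments such as "1 of 7 |".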
Issues
Issue 1: numpy version mismatch
module compiled against ABI version 0x1000009 but this version of numpy is 0x2000000
The failing module was compiled against numpy 1.x, but the installed numpy is 2.x, so numpy has to be downgraded to 1.x.
pip install numpy==1.26 -i https://pypi.tuna.tsinghua.edu.cn/simple
Reference: https://juejin.cn/post/7398087074971090959
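To confirm whether the downgrade is needed, or whether it took effect, the installed numpy major version can be read without importing numpy itself. This is a minimal sketch; the helper names major_version and numpy_needs_downgrade are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version_string.split(".")[0])


def numpy_needs_downgrade():
    """True if numpy 2.x or later is installed, False for 1.x, None if absent."""
    try:
        return major_version(version("numpy")) >= 2
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(numpy_needs_downgrade())
```

Reading the version from package metadata avoids triggering the ABI error that importing a mismatched extension module would raise.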
Issue 2: nltk data download
Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource
Download nltk_data from the official nltk_data repository:
https://github.com/nltk/nltk_data
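Rename its packages directory to nltk_data and copy it into the home directory, /home/user, so that it becomes /home/user/nltk_data.
By default NLTK searches the nltk_data directory under the user's home directory.
Then run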
import nltk
nltk.download('punkt')
If everything is in place, the output looks like this:
[nltk_data] Downloading package punkt to /home/dev/nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
However, running the code below still raises an error:
from nltk.tokenize import word_tokenize
from nltk.text import Text
input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon,We have to play basketball tomorrow."
tokens = word_tokenize(input_str)
print(tokens)
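word_tokenize is an NLTK utility that splits text into words. It first splits the text into sentences using the Punkt tokenizer data (punkt, or punkt_tab in recent NLTK releases), which holds language-specific rules for detecting boundaries around punctuation, whitespace and other separators, and then splits each sentence into words and punctuation. word_tokenize therefore depends on that data file already being available on disk. As the error message indicates, the failure occurs because the tokenizer data was copied into place but never unzipped.
Go to the /home/dev/nltk_data/tokenizers directory that was just copied into place and unzip punkt_tab.zip there.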
cd /home/dev/nltk_data/tokenizers
unzip punkt_tab.zip
Then run the previous code again.
https://blog.csdn.net/2301_81199775/article/details/139939837
Issue 3: averaged perceptron tagger
Resource averaged_perceptron_tagger_eng not found.
Please use the NLTK Downloader to obtain the resource:
Change to the taggers directory and unzip averaged_perceptron_tagger_eng.zip.
cd /home/user/nltk_data/taggers
unzip averaged_perceptron_tagger_eng.zip
Then run the code again.
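Issues 2 and 3 both come down to an archive that was copied but not unpacked. Instead of unzipping each file by hand, the step can be sketched in Python with the standard zipfile module. The helper name unzip_nltk_archives is illustrative, and the code assumes the layout used above: archives sit one level below the nltk_data root (for example tokenizers/punkt_tab.zip) and each archive contains a top-level folder named after the resource.

```python
import zipfile
from pathlib import Path


def unzip_nltk_archives(nltk_root):
    """Extract every <category>/<name>.zip that has no <category>/<name> directory yet."""
    extracted = []
    for archive in sorted(Path(nltk_root).glob("*/*.zip")):
        target = archive.with_suffix("")  # e.g. tokenizers/punkt_tab
        if target.exists():
            continue  # already unpacked, leave it alone
        with zipfile.ZipFile(archive) as zf:
            # Extracting into the archive's own directory mirrors running
            # `unzip <name>.zip` inside that directory (assumed layout).
            zf.extractall(archive.parent)
        extracted.append(archive.name)
    return extracted
```

A call such as unzip_nltk_archives("/home/user/nltk_data") returns the names of the archives it unpacked. It unpacks every category under the root, not only tokenizers and taggers, which uses more disk space than the two manual unzip commands.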
reference
---
unstructured
https://github.com/Unstructured-IO/unstructured.git
numpy version error
https://juejin.cn/post/7398087074971090959
[NLP series] Installing nltk_data and the punkt library (verified to work)
https://blog.csdn.net/2301_81199775/article/details/139939837
The unstructured library: processing and preprocessing unstructured data
https://blog.csdn.net/u013172930/article/details/147831892