当前位置: 首页 > news >正文

如何用python读取大的xml文件,示例为1.9G的xml文件

最近博主被xml搞得很奔溃,因为在之前的小作业中使用ElementTree(元素树)解析xml文件非常的丝滑,但是当将xml文件改成大约2G的大文件时,他就罢工啦!
xml的文件格式如下图所示:
在这里插入图片描述
之前的解析代码如下:

def load_xml(self, file):'''function: use the file in XML formatArgs:file (str): path to the xml fileReturns:         '''   # with open(file, 'r', encoding='utf-8') as f:#     xml = f.read()#load the xml file# root = ET.fromstring(xml) #root.tag is rowtree = ET.parse(file)root = tree.getroot()for child in root:docID = child.find('DOCNO').text #get docID# print(docID)content = child.find('TITLE').text +child.find('ARTIST').text +child.find('YEAR').text +child.find('LYRICS').text + child.find('GENRE').text #get content# print("content",content)self.docs_df.loc[docID] = content# use a dataframe to store each doc:docID,content(headline+text)  

之前的报错信息:

Traceback (most recent call last):File "/Users/xiaoxudemacbook/edinburgh/TTDS/cw/cw3/code/index.py", line 115, in <module>index = Index('data.xml', 'englishST.txt')File "/Users/xiaoxudemacbook/edinburgh/TTDS/cw/cw3/code/index.py", line 21, in __init__self.load_xml(xml_path)File "/Users/xiaoxudemacbook/edinburgh/TTDS/cw/cw3/code/index.py", line 34, in load_xmlroot = ET.fromstring(xml) #root.tag is documentFile "/Users/xiaoxudemacbook/opt/anaconda3/lib/python3.9/xml/etree/ElementTree.py", line 1344, in XMLparser.feed(text)
xml.etree.ElementTree.ParseError: out of memory: line 1, column 0

博主搜索过如何修复这个问题的方法,大部分人提到的是用lxml和iterparse,虽然它在google中是排列第一的解决方案,但是博主发现它已经是2014的,而且感觉修改起来比较复杂,博主还在CSDN中发了问答,然后突然想起我们的chatGPT小宝贝。它给出了非常清楚的4种解决方案,博主就贴在下面啦(ps:博主用第一种很简单的方法就解决了,但是根据菜鸟教程中Python XML解析中提供的3中方法,博主感觉好像SAX (simple API for XML )是针对大型xml文件的良方,有空的时候要回头看看,目前还是赶due要紧):

This error message suggests that the XML file being parsed by the load_xml() function is too large to fit in memory. When an XML file is parsed using the ET.fromstring() function, the entire contents of the file are loaded into memory as a string, which can be a problem for very large files.

To resolve this error, you can try one or more of the following solutions:

1. Use the ET.parse() function instead of ET.fromstring() to parse the XML file.

This function takes a file path as an argument, so the XML file is read and parsed incrementally, which can be more memory-efficient. Here’s an example of how you can modify the code:
makefile
Copy code

Replace this line:

root = ET.fromstring(xml)

With this line:

tree = ET.parse(xml_path)
root = tree.getroot()

2. Split the XML file into smaller files before parsing them

If the XML file contains a large number of documents, you can split it into smaller files that each contain a subset of the documents. This way, you can parse each file separately and avoid loading the entire XML file into memory at once.

3. Increase the amount of memory available to the Python process.

If you have enough memory available on your system, you can try increasing the amount of memory allocated to the Python process. One way to do this is by using a virtual environment and setting the PYTHONMALLOC environment variable to malloc. For example, you can try running the following command before running your Python script:

PYTHONMALLOC=malloc python index.py

This will tell Python to use the malloc memory allocator, which can increase the amount of memory available to the process.

4. Use a streaming XML parser.

If the XML file is too large to fit in memory even after trying the above solutions, you can use a streaming XML parser such as xml.sax or lxml.etree. These parsers allow you to process the XML file incrementally, without loading the entire file into memory at once. However, using a streaming parser can be more complex and require more code changes than the other solutions.

附:菜鸟教程Python XML解析:https://www.runoob.com/python/python-xml.html

相关文章:

  • HarmonyOS - UIObserver(无感监听)
  • Windows 安装以及配置Docker全流程 - Docker Toolbox
  • vibe coding 2025工具全景图
  • python打卡day51@浙大疏锦行
  • 如何利用测试Agent自动分析覆盖率报告与缺陷趋势
  • “十五五”时期智慧城市赋能全国一体化数据市场建设:战略路径与政策建议[ 注:本建议基于公开政策文件与行业实践研究,数据引用截至2025年6月11日。]
  • 剑指offer21——反转链表
  • 力扣上C语言编程题:最大子数组和(涉及数组)
  • Qwen3-Embedding-8B:文本嵌入界的“卷王”,多语言检索新标杆!
  • 嵌入式学习笔记 - C语言访问地址的方式,以及指针的进一步理解
  • 简单的五子棋实现简介
  • 6.11本日总结
  • typescript中的泛型
  • 字符串|数组|计算常见函数整理-竞赛专用(从比赛真题中总结的,持续更新中)
  • 使用CSDN作为Markdown编辑器图床
  • 【Python-Day 25】玩转数字:精通 math 与 random 模块,从数学运算到随机抽样
  • 图文教程——Deepseek最强平替工具免费申请教程——国内edu邮箱可用
  • 亚马逊Woot黑五策略,快速提升亚马逊业绩
  • LeetCode - 136. 只出现一次的数字
  • vue3 + ant 实现 tree默认展开,筛选对应数据打开,简单~直接cv
  • 芜湖做网站公司/南宁seo关键词排名
  • 做网站一天能接多少单/洛阳seo网站
  • wordpress app下载模板下载/广州seo
  • 门户网站静态页面/网站推广优化的原因
  • 聊城手机网站建设电话/微信朋友圈推广文案
  • 如何给给公司建立网站/百度广告联盟