Problems parsing a very large XML file with Python

Date: 2023-01-26

Problem description

I have a large xml file (about 84MB) which is in this form:

<books>
    <book>...</book>
    ....
    <book>...</book>
</books>

My goal is to extract every single book and get its properties. I tried to parse it (as I did with other xml files) as follows:

from xml.dom.minidom import parse, parseString

fd = "myfile.xml"
parser = parse(fd)
## other python code here

but the code seems to fail in the parse instruction. Why is this happening and how can I solve this?

I should point out that the file may contain greek, spanish and arabic characters.

This is the output I got in IPython:

In [2]: fd = "myfile.xml"

In [3]: parser = parse(fd)
Killed

I would like to point out that the computer freezes during the execution, so this may be related to memory consumption as stated below.

Recommended answer

I would strongly recommend using a SAX parser here. I wouldn't recommend using minidom on any XML document larger than a few megabytes; I've seen it use about 400MB of RAM reading in an XML document that was about 10MB in size. I suspect the problems you are having are being caused by minidom requesting too much memory.

Python comes with an XML SAX parser. To use it, do something like the following.

from xml.sax.handler import ContentHandler
from xml.sax import parse

class MyContentHandler(ContentHandler):
    # override various ContentHandler methods as needed...
    pass


handler = MyContentHandler()
parse("mydata.xml", handler)

Your ContentHandler subclass will override various methods in ContentHandler (such as startElement, startElementNS, endElement, endElementNS, or characters). These handle events generated by the SAX parser as it reads your XML document in.
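As a minimal sketch of overriding those methods for the `<books>`/`<book>` structure in the question (assuming, purely for illustration, that each `<book>` carries attributes and a `<title>` child element — the real file's layout may differ):

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

# Collects every <book> as a dict of its attributes plus the text of
# its child elements. <title> below is an assumed child element.
class BookHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self.books = []     # one dict per completed <book>
        self.current = None # dict for the <book> currently open
        self.field = None   # name of the child element currently open

    def startElement(self, name, attrs):
        if name == "book":
            self.current = dict(attrs)  # keep the book's attributes
        elif self.current is not None:
            self.field = name
            self.current.setdefault(name, "")

    def characters(self, content):
        # May be called several times per text node, so append.
        if self.current is not None and self.field:
            self.current[self.field] += content

    def endElement(self, name):
        if name == "book":
            self.books.append(self.current)
            self.current = None
        elif name == self.field:
            self.field = None

handler = BookHandler()
parseString(b'<books><book id="1"><title>Odyssey</title></book></books>',
            handler)
print(handler.books)  # [{'id': '1', 'title': 'Odyssey'}]
```

parseString is used here only to keep the example self-contained; for the 84MB file you would call parse(filename, handler) exactly as shown above, and memory use stays flat because only the current book is held at any time.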

SAX is a more 'low-level' way to handle XML than DOM; in addition to pulling out the relevant data from the document, your ContentHandler will need to do work keeping track of what elements it is currently inside. On the upside, however, as SAX parsers don't keep the whole document in memory, they can handle XML documents of potentially any size, including those larger than yours.
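The "keeping track of what elements it is currently inside" part can be done with a simple stack of open tag names; this hypothetical tracker just records every path it visits, to show the idea:

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

# A stack of open element names tells the handler where in the
# document it currently is; here we record each path for illustration.
class PathTracker(ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = []  # names of currently open elements
        self.paths = []  # every element path visited, in document order

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.paths.append("/".join(self.stack))

    def endElement(self, name):
        self.stack.pop()

tracker = PathTracker()
parseString(b"<books><book><title/></book></books>", tracker)
print(tracker.paths)  # ['books', 'books/book', 'books/book/title']
```

In a real handler you would test the stack (e.g. is the current path `books/book/title`?) to decide which characters events belong to which property.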

I haven't tried using other DOM parsers such as lxml on XML documents of this size, but I suspect that lxml would still take considerable time and use a considerable amount of memory to parse your XML document. That could slow down your development if you have to wait for it to read in an 84MB XML document every time you run your code.

Finally, I don't believe the Greek, Spanish and Arabic characters you mention will cause a problem.
