当前位置: 首页 > 面试题库 >

使用lxml.etree.iterparse解析损坏的XML

孟德曜
2023-03-14
问题内容

我试图以一种内存高效的方式(例如从磁盘延迟流式传输而不是将整个文件加载到内存中)来使用lxml解析巨大的xml文件。不幸的是,文件包含一些坏的ascii字符,破坏了默认的解析器。如果我设置了restore
= True,则解析器可以工作,但是iterparse方法不使用restore参数或自定义解析器对象。有谁知道如何使用iterparse解析损坏的xml?

#this works, but loads the whole file into memory
parser = lxml.etree.XMLParser(recover=True) #recovers from bad characters.
tree = lxml.etree.parse(filename, parser)

#how do I do the equivalent with iterparse?  (using iterparse so the file can be streamed lazily from disk)
context = lxml.etree.iterparse(filename, tag='RECORD')
#record contains 6 elements that I need to extract the text from

谢谢你的帮助!

编辑-这是我遇到的编码错误类型的示例:

In [17]: data
Out[17]: '\t<articletext>&lt;p&gt;The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.&lt;/p&gt;&lt;p&gt;A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.&lt;/p&gt;&lt;p&gt;In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.&lt;/p&gt;&lt;p&gt;Retired portrait photographer.  Main hobby - quartet singing.&lt;/p&gt;</articletext>\n'

In [18]: lxml.etree.from
lxml.etree.fromstring      lxml.etree.fromstringlist

In [18]: lxml.etree.fromstring(data)
---------------------------------------------------------------------------
XMLSyntaxError                            Traceback (most recent call last)

/mnt/articles/<ipython console> in <module>()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)()

/usr/lib/python2.5/site-packages/lxml-2.2.4-py2.5-linux-i686.egg/lxml/etree.so in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)()

XMLSyntaxError: PCDATA invalid Char value 30, line 1, column 1190

In [19]: chardet.detect(data)
Out[19]: {'confidence': 1.0, 'encoding': 'ascii'}

如您所见,chardet认为这是一个ascii文件,但是在此示例中间有一个“ \ x1e”,这使lxml引发异常。


问题答案:

我通过使用File类对象界面创建一个类来解决了这个问题。类的read()方法从文件中读取一行,并替换所有“不良字符”,然后再将该行返回iterparse。

#psudo code

class myFile(object):
    def __init__(self, filename):
        self.f = open(filename)

    def read(self, size=None):
        return self.f.next().replace('\x1e', '').replace('some other bad character...' ,'')


#iterparse
context = lxml.etree.iterparse(myFile('bigfile.xml', tag='RECORD')

我不得不编辑myFile类几次,添加了更多的replace()调用,这些调用使lxml变得更加困难。我认为lxml的SAX解析也可以正常工作(似乎支持restore选项),但是这种解决方案确实很吸引人!



 类似资料:
  • 我正在尝试创建一个zip文件,以便能够通过http发送多个文件。 我的问题是,生成的Zip文件在发送之前和之后都“损坏”。问题是我无法找到我做错了什么,因为我在控制台中没有收到任何错误。 那么,有人有一个想法文件我生成的zip文件损坏? 这是我的代码: 谢谢你的帮助!

  • 我试图从一个网站下载所有pdf文件,但创建的每个pdf都已损坏。。。

  • 我编写的下载文件的方法总是产生损坏的文件。 我通过adb访问这些文件,将它们传输到我的sccard,在那里我看到它们似乎有合适的大小,但没有根据例如Linux命令的类型。 你知道丢失了什么以及如何修复它吗? 谢谢。 代码的简单版本(但错误相同) 日志:< code > file . length:2485394 | content length:1399242 问题是,我从我的API单例中获得了,

  • SLF4J:未能加载类“org.slf4j.impl.StatibloggerBinder”。 SLF4J:默认为无操作(NOP)记录器实现 SLF4J:有关更多细节,请参见http://www.slf4j.org/codes.html#staticloggerbinder。 这里是main

  • 问题内容: 我使用Eclipse在Windows 7中创建了一个jar文件。当我尝试打开jar文件时,它说jar文件无效或损坏。谁能建议我为什么jar文件无效? 问题答案: 当您在Windows资源管理器中双击一个JAR文件时,会发生这种情况,但是JAR本身实际上不是 可执行的 JAR。真正的可执行JAR至少应具有带有方法的类,并在中引用它。 在Eclispe中,您需要将项目导出为 Runnabl

  • 问题内容: 一些背景: 在使用Macports的Mac OS X 10.6上,我已经DYLD_LIBRARY_PATH在.bash_profile中进行了设置。 问题: 运行时出现java -version此错误: VM初始化期间发生错误 无法加载本机库:libjava.jnilib 通过一个有用的论坛线程,我发现问题是由于设置了’/ opt / local / lib’目录中的某些文件而导致了麻