Chapter 9. XML Processing
- 9.1. Overview
- 9.2. Packages
- 9.3. Parsing XML
- 9.4. Unicode
- 9.5. Searching for elements
- 9.6. Accessing element attributes
- 9.7. Segue
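9.1. Overview
These next two chapters are about XML processing in Python. It helps if you already know what an XML document looks like: that it is made up of structured tags, and that those tags form a hierarchy of elements. If none of that makes sense to you, there are many XML tutorials that explain the basics.
If you are not particularly interested in XML, you should still read these chapters. They cover several important topics, including Python packages, Unicode, command-line arguments, and how to use getattr for method dispatching.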
Being a philosophy major is not required, although if you have ever had the misfortune of being subjected to the writings of Immanuel Kant, you will appreciate the example program a lot more than if you majored in something useful, like computer science.
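There are two basic ways to work with XML. One is called SAX ("Simple API for XML"). It reads the XML a little at a time and calls a method for each element it finds. (If you read Chapter 8, HTML Processing, this should sound familiar, because that is how sgmllib works.) The other is called DOM ("Document Object Model"). It reads in the entire XML document at once and builds an internal representation of it out of Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter deals only with the DOM.
The following is a complete Python program that generates pseudo-random output based on a context-free grammar defined in an XML format. Don't worry if you don't understand what that means yet; the next two chapters examine both the program's input and its output in depth.
Example 9.1. kgp.py
If you have not already done so, you can download this and the other examples used in this book.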
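To make the contrast concrete, here is a minimal sketch that lists the tag names of a small document both ways. The sample document and the helper names (`SAMPLE`, `TagCollector`, `sax_tag_names`, `dom_tag_names`) are assumptions made up for this illustration; only `xml.sax` and `xml.dom.minidom` come from the standard library. The sketch is written for a current Python 3 interpreter.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, invented for this sketch.
SAMPLE = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"


class TagCollector(xml.sax.ContentHandler):
    """SAX style: the parser calls startElement once per opening tag."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.tags = []

    def startElement(self, name, attrs):
        # Called by the parser as soon as it has read an opening tag.
        self.tags.append(name)


def sax_tag_names(text):
    """Feed the document to a SAX parser and collect tag names as they arrive."""
    handler = TagCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.tags


def dom_tag_names(text):
    """DOM style: build the whole tree first, then query it."""
    root = minidom.parseString(text).documentElement
    names = [root.tagName]
    # "*" matches every descendant element, in document order.
    names.extend(e.tagName for e in root.getElementsByTagName("*"))
    return names


if __name__ == "__main__":
    print(sax_tag_names(SAMPLE))
    print(dom_tag_names(SAMPLE))
```

Both functions see the same elements. The difference is that the SAX handler only ever sees one event at a time, while the DOM version can walk back and forth over the finished tree.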
"""Kant Generator for Python Generates mock philosophy based on a context-free grammar Usage: python kgp.py [options] [source] Options: -g ..., --grammar=... use specified grammar file or URL -h, --help show this help -d show debugging information while parsing Examples: kgp.py generates several paragraphs of Kantian philosophy kgp.py -g husserl.xml generates several paragraphs of Husserl kpg.py "<xref id='paragraph'/>" generates a paragraph of Kant kgp.py template.xml reads from template.xml to decide what to generate """ from xml.dom import minidom import random import toolbox import sys import getopt _debug = 0 class NoSourceError(Exception): pass class KantGenerator: """generates mock philosophy based on a context-free grammar""" def __init__(self, grammar, source=None): self.loadGrammar(grammar) self.loadSource(source and source or self.getDefaultSource()) self.refresh() def _load(self, source): """load XML input source, return parsed XML document - a URL of a remote XML file ("http://diveintopython.org/kant.xml") - a filename of a local XML file ("~/diveintopython/common/py/kant.xml") - standard input ("-") - the actual XML document, as a string """ sock = toolbox.openAnything(source) xmldoc = minidom.parse(sock).documentElement sock.close() return xmldoc def loadGrammar(self, grammar): """load context-free grammar""" self.grammar = self._load(grammar) self.refs = {} for ref in self.grammar.getElementsByTagName("ref"): self.refs[ref.attributes["id"].value] = ref def loadSource(self, source): """load source""" self.source = self._load(source) def getDefaultSource(self): """guess default source of the current grammar The default source will be one of the <ref>s that is not cross-referenced. This sounds complicated but it's not. Example: The default source for kant.xml is "<xref id='section'/>", because 'section' is the one <ref> that is not <xref>'d anywhere in the grammar. 
In most grammars, the default source will produce the longest (and most interesting) output. """ xrefs = {} for xref in self.grammar.getElementsByTagName("xref"): xrefs[xref.attributes["id"].value] = 1 xrefs = xrefs.keys() standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs] if not standaloneXrefs: raise NoSourceError, "can't guess source, and no source specified" return '<xref id="%s"/>' % random.choice(standaloneXrefs) def reset(self): """reset parser""" self.pieces = [] self.capitalizeNextWord = 0 def refresh(self): """reset output buffer, re-parse entire source file, and return output Since parsing involves a good deal of randomness, this is an easy way to get new output without having to reload a grammar file each time. """ self.reset() self.parse(self.source) return self.output() def output(self): """output generated text""" return "".join(self.pieces) def randomChildElement(self, node): """choose a random child element of a node This is a utility method used by do_xref and do_choice. """ choices = [e for e in node.childNodes if e.nodeType == e.ELEMENT_NODE] chosen = random.choice(choices) if _debug: sys.stderr.write('%s available choices: %s\n' % \ (len(choices), [e.toxml() for e in choices])) sys.stderr.write('Chosen: %s\n' % chosen.toxml()) return chosen def parse(self, node): """parse a single XML node A parsed XML document (from minidom.parse) is a tree of nodes of various types. Each node is represented by an instance of the corresponding Python class (Element for a tag, Text for text data, Document for the top-level document). The following statement constructs the name of a class method based on the type of node we're parsing ("parse_Element" for an Element node, "parse_Text" for a Text node, etc.) and then calls the method. 
""" parseMethod = getattr(self, "parse_%s" % node.__class__.__name__) parseMethod(node) def parse_Document(self, node): """parse the document node The document node by itself isn't interesting (to us), but its only child, node.documentElement, is: it's the root node of the grammar. """ self.parse(node.documentElement) def parse_Text(self, node): """parse a text node The text of a text node is usually added to the output buffer verbatim. The one exception is that <p class='sentence'> sets a flag to capitalize the first letter of the next word. If that flag is set, we capitalize the text and reset the flag. """ text = node.data if self.capitalizeNextWord: self.pieces.append(text[0].upper()) self.pieces.append(text[1:]) self.capitalizeNextWord = 0 else: self.pieces.append(text) def parse_Element(self, node): """parse an element An XML element corresponds to an actual tag in the source: <xref id='...'>, <p chance='...'>, <choice>, etc. Each element type is handled in its own method. Like we did in parse(), we construct a method name based on the name of the element ("do_xref" for an <xref> tag, etc.) and call the method. """ handlerMethod = getattr(self, "do_%s" % node.tagName) handlerMethod(node) def parse_Comment(self, node): """parse a comment The grammar can contain XML comments, but we ignore them """ pass def do_xref(self, node): """handle <xref id='...'> tag An <xref id='...'> tag is a cross-reference to a <ref id='...'> tag. <xref id='sentence'/> evaluates to a randomly chosen child of <ref id='sentence'>. """ id = node.attributes["id"].value self.parse(self.randomChildElement(self.refs[id])) def do_p(self, node): """handle <p> tag The <p> tag is the core of the grammar. It can contain almost anything: freeform text, <choice> tags, <xref> tags, even other <p> tags. If a "class='sentence'" attribute is found, a flag is set and the next word will be capitalized. 
If a "chance='X'" attribute is found, there is an X% chance that the tag will be evaluated (and therefore a (100-X)% chance that it will be completely ignored) """ keys = node.attributes.keys() if "class" in keys: if node.attributes["class"].value == "sentence": self.capitalizeNextWord = 1 if "chance" in keys: chance = int(node.attributes["chance"].value) doit = (chance > random.randrange(100)) else: doit = 1 if doit: for child in node.childNodes: self.parse(child) def do_choice(self, node): """handle <choice> tag A <choice> tag contains one or more <p> tags. One <p> tag is chosen at random and evaluated; the rest are ignored. """ self.parse(self.randomChildElement(node)) def usage(): print __doc__ def main(argv): grammar = "kant.xml" try: opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="]) except getopt.GetoptError: usage() sys.exit(2) for opt, arg in opts: if opt in ("-h", "--help"): usage() sys.exit() elif opt == '-d': global _debug _debug = 1 elif opt in ("-g", "--grammar"): grammar = arg source = "".join(args) k = KantGenerator(grammar, source) print k.output() if __name__ == "__main__": main(sys.argv[1:])
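The parse and parse_Element methods in the listing pick a handler by building a method name and looking it up with getattr. The following is a minimal standalone sketch of that dispatch idea, not the book's code: the class `NodeDescriber`, the function `describe_children`, and the sample XML are assumptions invented for illustration, and the sketch targets a current Python 3 interpreter.

```python
from xml.dom import minidom


class NodeDescriber:
    """Dispatch on the class name of each minidom node via getattr."""

    def describe(self, node):
        # Build a method name such as "describe_Element" or "describe_Text",
        # then look it up on this instance and call it.
        method = getattr(self, "describe_%s" % node.__class__.__name__)
        return method(node)

    def describe_Element(self, node):
        return "element <%s>" % node.tagName

    def describe_Text(self, node):
        return "text: %s" % node.data

    def describe_Comment(self, node):
        return "comment"


def describe_children(text):
    """Parse text and describe each direct child of the root element."""
    root = minidom.parseString(text).documentElement
    describer = NodeDescriber()
    return [describer.describe(child) for child in root.childNodes]


if __name__ == "__main__":
    print(describe_children("<p>and<!-- note --><xref id='x'/></p>"))
```

Supporting a new node type means adding one more `describe_...` method; the dispatching code itself does not change. A node type with no matching method raises AttributeError, so a real program would define a handler for each type it expects to meet.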
Example 9.2. toolbox.py
"""Miscellaneous utility functions""" def openAnything(source): """URI, filename, or string --> stream This function lets you define parsers that take any input source (URL, pathname to local or network file, or actual data as a string) and deal with it in a uniform manner. Returned object is guaranteed to have all the basic stdio read methods (read, readline, readlines). Just .close() the object when you're done with it. Examples: >>> from xml.dom import minidom >>> sock = openAnything("http://localhost/kant.xml") >>> doc = minidom.parse(sock) >>> sock.close() >>> sock = openAnything("c:\\inetpub\\wwwroot\\kant.xml") >>> doc = minidom.parse(sock) >>> sock.close() >>> sock = openAnything("<ref id='conjunction'><text>and</text><text>or</text></ref>") >>> doc = minidom.parse(sock) >>> sock.close() """ if hasattr(source, "read"): return source if source == '-': import sys return sys.stdin # try to open with urllib (if source is http, ftp, or file URL) import urllib try: return urllib.urlopen(source) except (IOError, OSError): pass # try to open with native open function (if source is pathname) try: return open(source) except (IOError, OSError): pass # treat source as string import StringIO return StringIO.StringIO(str(source))
Run kgp.py by itself, and it parses the default XML-based grammar in kant.xml and prints several paragraphs' worth of philosophy in the style of Immanuel Kant.
例 9.3. Sample output of kgp.py
[you@localhost kgp]$ python kgp.py
As is shown in the writings of Hume, our a priori concepts, in reference to ends, abstract from all content of knowledge; in the study of space, the discipline of human reason, in accordance with the principles of philosophy, is the clue to the discovery of the Transcendental Deduction. The transcendental aesthetic, in all theoretical sciences, occupies part of the sphere of human reason concerning the existence of our ideas in general; still, the never-ending regress in the series of empirical conditions constitutes the whole content for the transcendental unity of apperception. What we have alone been able to show is that, even as this relates to the architectonic of human reason, the Ideal may not contradict itself, but it is still possible that it may be in contradictions with the employment of the pure employment of our hypothetical judgements, but natural causes (and I assert that this is the case) prove the validity of the discipline of pure reason. As we have already seen, time (and it is obvious that this is true) proves the validity of time, and the architectonic of human reason, in the full sense of these terms, abstracts from all content of knowledge. I assert, in the case of the discipline of practical reason, that the Antinomies are just as necessary as natural causes, since knowledge of the phenomena is a posteriori. The discipline of human reason, as I have elsewhere shown, is by its very nature contradictory, but our ideas exclude the possibility of the Antinomies. We can deduce that, on the contrary, the pure employment of philosophy, on the contrary, is by its very nature contradictory, but our sense perceptions are a representation of, in the case of space, metaphysics. The thing in itself is a representation of philosophy. Applied logic is the clue to the discovery of natural causes.
However, what we have alone been able to show is that our ideas, in other words, should only be used as a canon for the Ideal, because of our necessary ignorance of the conditions. [...snip...]
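This is, of course, complete gibberish. Well, not complete gibberish. It is syntactically and grammatically correct (although very verbose; Kant wasn't what you'd call a get-to-the-point kind of guy). Some of it may actually be true (or at least the sort of thing Kant would have agreed with), some of it is blatantly false, and most of it is simply incoherent. But all of it is in the style of Immanuel Kant.
Let me repeat: this is much, much funnier if you are now or have ever been a philosophy major.
The interesting thing about this program is that nothing in it is specific to Kant. All of the content in the previous example comes from the context-free grammar file, kant.xml. If you tell the program to use a different grammar file (which you can specify on the command line), the output is completely different.
Example 9.4. Simpler output from kgp.py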
[you@localhost kgp]$ python kgp.py -g binary.xml
00101001
[you@localhost kgp]$ python kgp.py -g binary.xml
10110100
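You will take a closer look at the structure of the grammar file later in this chapter. For now, all you need to know is that the grammar file defines the structure of the output, and that kgp.py reads through the grammar and makes random decisions about which words to plug in where.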
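As a rough preview of that idea, the sketch below expands a tiny grammar by picking a random child wherever an xref points at a ref. The grammar text is a hypothetical stand-in written for this illustration (the real binary.xml may be laid out differently), and the names `GRAMMAR`, `load_refs`, `expand`, and `generate` are likewise assumptions rather than anything in kgp.py. It targets a current Python 3 interpreter.

```python
import random
from xml.dom import minidom

# Hypothetical grammar: a "bit" is 0 or 1, a "byte" is eight bits.
GRAMMAR = (
    "<grammar>"
    + '<ref id="bit"><p>0</p><p>1</p></ref>'
    + '<ref id="byte"><p>' + '<xref id="bit"/>' * 8 + "</p></ref>"
    + "</grammar>"
)


def load_refs(text):
    """Map each <ref> id to its element, as loadGrammar does in the listing."""
    root = minidom.parseString(text).documentElement
    return dict((ref.attributes["id"].value, ref)
                for ref in root.getElementsByTagName("ref"))


def expand(node, refs, rng):
    """Return the text generated by one node of the grammar."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    if node.nodeType != node.ELEMENT_NODE:
        return ""  # ignore comments and other node types
    if node.tagName == "xref":
        # The random decision: choose one child element of the target <ref>.
        target = refs[node.attributes["id"].value]
        choices = [e for e in target.childNodes
                   if e.nodeType == e.ELEMENT_NODE]
        return expand(rng.choice(choices), refs, rng)
    # Any other element: expand its children in order.
    return "".join(expand(child, refs, rng) for child in node.childNodes)


def generate(ref_id, seed=None):
    """Expand <xref id='ref_id'/> against GRAMMAR; seed makes runs repeatable."""
    rng = random.Random(seed)
    refs = load_refs(GRAMMAR)
    source = minidom.parseString('<xref id="%s"/>' % ref_id).documentElement
    return expand(source, refs, rng)


if __name__ == "__main__":
    print(generate("byte"))
```

Each run prints eight random binary digits. Swapping in a different grammar string changes the output completely without touching the code, which is the point the chapter makes about kant.xml and binary.xml.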