lxml.etree

2018-09-28 血刃飘香

Translated from: https://lxml.de/tutorial.html
lxml.etree provides the interface defined by the original ElementTree API, plus a few simple enhancements.

Basics

>>> from lxml import etree
>>> root = etree.Element("root")
>>> print(root.tag)
root

Adding child elements:

 root.append(etree.Element("child1"))
 child2 = etree.SubElement(root, "child2")
 child3 = etree.SubElement(root, "child3")

Printing:

>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <child1/>
  <child2/>
  <child3/>
</root>

Child elements as a list

 child = root[0]
 root.index(root[1])   # lxml.etree only!
 root.insert(0, etree.Element("child0"))
 root[1:3]
 for child in root:  print(child.tag)
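
The list-style operations above can be collected into one small runnable sketch. The variable names (`first`, `position`, `middle`, `count`) are only illustrative and are not part of the original tutorial.

```python
from lxml import etree

root = etree.Element("root")
for name in ("child1", "child2", "child3"):
    etree.SubElement(root, name)

# Insert a new element at the front, as with list.insert().
root.insert(0, etree.Element("child0"))

first = root[0].tag                        # indexing
position = root.index(root[1])             # index lookup (lxml.etree only)
middle = [c.tag for c in root[1:3]]        # slicing returns a list of elements
count = len(root)                          # number of direct children

print(first, position, middle, count)
```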

Special case (can be skipped):
An element can normally exist in only one place at a time (unlike in the original ElementTree API).

>>> for child in root:  print(child.tag)
child0
child1
child2
child3
>>> root[0] = root[-1]       # this moves the element in lxml.etree!
>>> for child in root: print(child.tag)
child3
child1
child2

Note that in the original ElementTree, a single Element object can sit in any number of places in any number of trees, which allows for the same copy operation as with lists. The obvious drawback is that modifications to such an Element will apply to all places where it appears in a tree, which may or may not be intended.
The upside of this difference is that an Element in lxml.etree always has exactly one parent, which can be queried through the getparent() method. This is not supported in the original ElementTree.

>>> root is root[0].getparent()   # lxml.etree only!
True

If you want to copy an element to a different position in lxml.etree, consider creating an independent deep copy using the copy module from Python's standard library: from copy import deepcopy.
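
A minimal sketch of the deep-copy approach, assuming a small hypothetical tree with children named `first` and `second`: appending the copy leaves the original element in place instead of moving it.

```python
from copy import deepcopy
from lxml import etree

root = etree.Element("root")
first = etree.SubElement(root, "first")
etree.SubElement(root, "second")

# Appending `first` itself would move it to the end;
# appending a deep copy keeps the original where it is.
root.append(deepcopy(first))

tags = [child.tag for child in root]
print(tags)
```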

Attributes as a dict

 root = etree.Element("root", interesting="totally")
 root.get("hello")
 root.set("hello", "Huhu")
 sorted(root.keys())
 for name, value in sorted(root.items()):  print('%s = %r' % (name, value))

The attrib member of an Element fully supports the dict interface.

>>> attributes = root.attrib
>>> print(attributes["interesting"])
totally
>>> print(attributes.get("no-such-attribute"))
None
>>> attributes["hello"] = "Guten Tag"
>>> print(attributes["hello"])
Guten Tag
>>> print(root.get("hello"))
Guten Tag

Note that changes made through attrib are applied to the underlying Element, and vice versa. Use dict(root.attrib) to get an independent dictionary.
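
The difference between the live attrib view and an independent copy can be sketched as follows. The attribute name `lang` and the variable `snapshot` are made up for this illustration.

```python
from lxml import etree

root = etree.Element("root", interesting="totally")

# dict() takes a snapshot that is detached from the element.
snapshot = dict(root.attrib)

root.set("hello", "Huhu")        # change through the element
root.attrib["lang"] = "de"       # change through the live attrib view

print(sorted(root.keys()))       # all three attributes are on the element
print(snapshot)                  # the snapshot still holds only the first one
```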

Text in elements

>>> root = etree.Element("root")
>>> root.text = "TEXT"
>>> etree.tostring(root)
b'<root>TEXT</root>'

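Data-centric XML usually carries text only in leaf nodes, but in hypertext documents text can also appear between elements.
The tail member supports this case (can be skipped):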

>>> html = etree.Element("html")
>>> body = etree.SubElement(html, "body")
>>> body.text = "TEXT"
>>> br = etree.SubElement(body, "span")
>>> br.tail = "TAIL"
>>> br.text = "MIDDLE"
>>> etree.tostring(html)
b'<html><body>TEXT<span>MIDDLE</span>TAIL</body></html>'
>>> body.text, br.text, br.tail
('TEXT', 'MIDDLE', 'TAIL')

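Note that .text holds only the text directly after the element's opening tag, up to its first child, while .tail holds the text that directly follows the element's closing tag.
Calling tostring with method="text" extracts all the text contained in the XML: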

>>> etree.tostring(html, method="text")
b'TEXTMIDDLETAIL'
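
As an alternative sketch (not shown in the text above), the same text pieces can be collected one by one with the element's itertext() method, which yields .text and .tail values in document order. The variable names are illustrative.

```python
from lxml import etree

html = etree.fromstring(
    "<html><body>TEXT<span>MIDDLE</span>TAIL</body></html>"
)

# itertext() walks the subtree and yields each non-empty text chunk.
pieces = list(html.itertext())
joined = "".join(pieces)

print(pieces)
print(joined)
```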

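The iter method

An element's .iter() method returns an iterator that walks the whole subtree in document order (note that this differs from iterating over the element itself, which visits only its direct children).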

>>> root = etree.Element("root")
>>> etree.SubElement(root, "child").text = "Child 1"
>>> etree.SubElement(root, "child").text = "Child 2"
>>> etree.SubElement(root, "another").text = "Child 3"
>>> etree.SubElement(root[1], "grandchild").text = "Grandchild"
>>> print(etree.tostring(root, pretty_print=True).decode())
<root>
  <child>Child 1</child>
  <child>Child 2<grandchild>Grandchild</grandchild></child>
  <another>Child 3</another>
</root>

>>> for element in root.iter():
...     print("%s - %s" % (element.tag, element.text))
root - None
child - Child 1
child - Child 2
grandchild - Grandchild
another - Child 3

When one or more tag names are passed to iter(), only elements with those tags are yielded.

>>> for element in root.iter("child"):
...     print("%s - %s" % (element.tag, element.text))
child - Child 1
child - Child 2
>>> for element in root.iter("another", "child"):
...     print("%s - %s" % (element.tag, element.text))
child - Child 1
child - Child 2
another - Child 3
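
To make the contrast with direct iteration concrete, here is a small sketch on a tree of the same shape as above (text content omitted). The names `direct` and `recursive` are illustrative only.

```python
from lxml import etree

root = etree.XML(
    "<root><child/><child><grandchild/></child><another/></root>"
)

# Iterating the element itself visits only its direct children.
direct = [el.tag for el in root]

# iter() visits the element and all descendants in document order.
recursive = [el.tag for el in root.iter()]

print(direct)
print(recursive)
```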

The ElementTree class

The ElementTree class holds the complete document information, such as the DOCTYPE and DTD.

>>> root = etree.XML('''\
... <?xml version="1.0"?>
... <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "parsnips"> ]>
... <root>
...   <a>&tasty;</a>
... </root>
... ''')
>>> tree = etree.ElementTree(root)
>>> print(tree.docinfo.xml_version)
1.0
>>> print(tree.docinfo.doctype)
<!DOCTYPE root SYSTEM "test">
>>> tree.docinfo.public_id = '-//W3C//DTD XHTML 1.0 Transitional//EN'
>>> tree.docinfo.system_url = 'file://local.dtd'
>>> print(tree.docinfo.doctype)
<!DOCTYPE root PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "file://local.dtd">

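An ElementTree is also what the parse() function returns when it parses a file.
Note the difference between serializing an ElementTree and serializing its root element: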

>>> print(etree.tostring(tree).decode())
<!DOCTYPE root PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "file://local.dtd" [
<!ENTITY tasty "parsnips">
]>
<root>
<a>parsnips</a>
</root>
>>> tree.getroot() is root
True
>>> print(etree.tostring(root).decode())
<root>
<a>parsnips</a>
</root>

Parsing from strings and files

fromstring()

The fromstring() function parses a string of XML into an element (the return type is lxml.etree._Element, the same as for etree.Element).

>>> some_xml_data = "<root>data</root>"
>>> root = etree.fromstring(some_xml_data)
>>> etree.tostring(root)
b'<root>data</root>'

XML()

The XML() function behaves essentially like fromstring() and also returns an Element.

>>> root = etree.XML("<root>data</root>")
>>> etree.tostring(root)
b'<root>data</root>'

There is also an HTML() function, which automatically adds the html and body elements if the input string lacks them (it also returns an Element).

>>> root = etree.HTML("<p>data</p>")
>>> etree.tostring(root)
b'<html><body><p>data</p></body></html>'

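Note: the return value of HTML() is still serialized as standard XML by default; pass method='html' to tostring() to get HTML output.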

>>> root = etree.HTML('<head/><p>Hello<br/>World</p>')
>>> etree.tostring(root)
b'<html><head/><body><p>Hello<br/>World</p></body></html>'
>>> etree.tostring(root, method='html') 
b'<html><head></head><body><p>Hello<br>World</p></body></html>'

parse()

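The parse() function is mainly used for parsing complete documents, whereas the string parsing functions above are mainly used for parsing document fragments.
Note: parse() returns an ElementTree object, not an Element object.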

>>> from io import BytesIO
>>> tree = etree.parse(BytesIO(b"<root>data</root>"))
>>> etree.tostring(tree)
b'<root>data</root>'
>>> etree.tostring(tree.getroot())
b'<root>data</root>'

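parse() accepts any of the following as its source:

an open file or file-like object (opening it in binary mode is recommended)
a file name string
an HTTP or FTP URL

Note that parsing from a file name or URL is usually faster than parsing from a file object.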
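
A rough sketch of the first two source types, assuming a throwaway file in a temporary directory (the file name `example.xml` is hypothetical). URL sources are left out because they need network access.

```python
import os
import tempfile
from lxml import etree

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.xml")   # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"<root>data</root>")

    # Source given as a file name string.
    tree_from_name = etree.parse(path)

    # Source given as an open file object, in binary mode.
    with open(path, "rb") as f:
        tree_from_file = etree.parse(f)

name_result = etree.tostring(tree_from_name)
file_result = etree.tostring(tree_from_file)
print(name_result, file_result)
```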

Parser objects

Not covered here yet; see the original link.
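
As a brief illustration of what a parser object is for, the sketch below configures an XMLParser to drop whitespace-only text between elements and passes it to XML(). This is one option among many and only hints at the material in the original tutorial.

```python
from lxml import etree

# remove_blank_text discards whitespace-only text between elements.
parser = etree.XMLParser(remove_blank_text=True)   # lxml.etree only
root = etree.XML("<root>  <a/>   <b>  </b>     </root>", parser)

result = etree.tostring(root)
print(result)
```

Whitespace inside the leaf element `<b>` is kept, since text in leaf elements is treated as data.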

Incremental parsing

Not covered here yet; see the original link.
lxml.etree provides two ways for incremental step-by-step parsing. One is through file-like objects, where it calls the read() method repeatedly. The second way is through a feed parser interface, given by the feed(data) and close() methods.
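
A minimal sketch of the feed parser interface, assuming the data arrives in arbitrary string chunks (the chunk boundaries here are chosen only to show that they need not align with tags):

```python
from lxml import etree

parser = etree.XMLParser()

# Feed the document piece by piece, e.g. as it arrives from a socket.
for chunk in ("<roo", "t><", "a/", "><", "/root>"):
    parser.feed(chunk)

# close() finishes parsing and returns the root element.
root = parser.close()

result = etree.tostring(root)
print(result)
```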

Event-driven parsing

Not covered here yet; see the original link.
Sometimes, all you need from a document is a small fraction somewhere deep inside the tree, so parsing the whole tree into memory, traversing it and dropping it can be too much overhead. lxml.etree supports this use case with two event-driven parser interfaces.
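
One of these interfaces is iterparse(), sketched below on a small in-memory document. By default it reports an "end" event each time an element has been parsed completely. The sample document is made up for illustration.

```python
from io import BytesIO
from lxml import etree

source = BytesIO(b"<root><a>data</a><b/></root>")

# Collect (event, tag) pairs as the parser finishes each element.
events = [
    (event, element.tag)
    for event, element in etree.iterparse(source)
]

print(events)
```

Elements are reported innermost first, so the root element's "end" event comes last.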

Namespaces