BeautifulSoup Documentation Study Notes 4: Output

2020-03-13 JA_Cobra

Output

Pretty-printing

The prettify() method formats a BeautifulSoup parse tree and returns it as a Unicode string, with each XML/HTML tag on its own line.

Example:

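>>> from bs4 import BeautifulSoup
>>> markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
>>> # Explicit parser (assumes lxml is installed); lxml adds the <html>/<body> wrapper shown below
>>> soup = BeautifulSoup(markup, "lxml")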
>>> soup.prettify()
'<html>\n <body>\n  <a href="http://example.com/">\n   I linked to\n   <i>\n    example.com\n   </i>\n  </a>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <body>
  <a href="http://example.com/">
   I linked to
   <i>
    example.com
   </i>
  </a>
 </body>
</html>

prettify() can be called on the BeautifulSoup object itself or on any of its tag nodes.
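As a minimal sketch of calling prettify() on a single tag rather than on the whole document: the example below assumes the built-in "html.parser" backend, which is a choice made here for illustration and not something the original notes specify. That parser does not wrap the fragment in <html>/<body>, so the output differs slightly from the transcript above.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# Assumption: "html.parser" (ships with Python) is used so the sketch
# needs no extra parser package; it adds no <html>/<body> wrapper.
soup = BeautifulSoup(markup, "html.parser")

# prettify() on the whole parse tree
whole_doc = soup.prettify()

# prettify() on one tag node formats only that subtree
inner_tag = soup.i.prettify()

print(inner_tag)
# <i>
#  example.com
# </i>
```

Calling prettify() on a tag is handy when only one part of a large page needs to be inspected.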

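Compact output

If you just want the string, with no extra formatting, call Python's str() (or unicode(), which exists only in Python 2) directly on a BeautifulSoup object or a tag object: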

>>> str(soup)
'<html><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
 
>>> unicode(soup.a)
u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
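The unicode() call above only works on Python 2. The following is a minimal sketch for Python 3, again assuming the "html.parser" backend (an assumption of this example, not of the original notes). It shows that str() and decode() return text, while encode() returns bytes.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# Assumption: "html.parser" is used, so no <html>/<body> wrapper is added.
soup = BeautifulSoup(markup, "html.parser")

as_text = str(soup.a)               # Python 3 str (Unicode text)
as_decoded = soup.a.decode()        # same text, via the tag's own method
as_bytes = soup.a.encode("utf-8")   # bytes, for writing to a binary file or socket

print(as_text)
# <a href="http://example.com/">I linked to <i>example.com</i></a>
```

Use encode() when the destination expects bytes, and str() or decode() for ordinary text handling.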

`get_text()`

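If you only want the text inside a tag, use the get_text() method. It collects all the text in the tag, including text inside its descendant tags, and returns the result as a single Unicode string: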

>>> markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
>>> soup = BeautifulSoup(markup, "html.parser")
 
>>> soup.get_text()
'\nI linked to example.com\n'
>>> soup.i.get_text()
'example.com'

You can pass a string to use as the separator between the pieces of text:

>>> soup.get_text("|")
'\nI linked to |example.com|\n'

You can also strip the leading and trailing whitespace from each piece of text:

>>> soup.get_text("|", strip=True)
'I linked to|example.com'
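The effect of strip=True is easier to see on markup that contains whitespace-only text between tags. The sketch below uses a hypothetical list fragment and assumes the "html.parser" backend; neither comes from the original notes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the newlines between tags become their own text nodes.
markup = "<ul>\n<li>apple</li>\n<li>banana</li>\n</ul>"
soup = BeautifulSoup(markup, "html.parser")

# The separator alone is inserted between every text node,
# including the whitespace-only ones.
raw = soup.get_text("|")

# With strip=True each piece is stripped first, and pieces that
# end up empty are dropped before joining.
clean = soup.get_text("|", strip=True)

print(repr(raw))    # '\n|apple|\n|banana|\n'
print(repr(clean))  # 'apple|banana'
```

Combining a separator with strip=True is usually what you want when turning a block of HTML into a single readable line.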
