The BeautifulSoup4 Library

2019-07-28, by 叶扬风起

bs4 (full name BeautifulSoup) is one of the libraries most commonly used for writing Python crawlers. Its main job is parsing HTML markup.

https://www.cnblogs.com/gl1573/p/9480022.html

I. Initialization

···
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html>A Html Text</html>", "html.parser")
print(soup.prettify())  # prettify is a method, so the parentheses are required
···
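
To make the point about the parentheses concrete, here is a minimal sketch. The sample markup is made up for illustration; "html.parser" is the parser bundled with Python, so nothing beyond bs4 is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from a real page.
soup = BeautifulSoup("<html><body><p>A Html Text</p></body></html>", "html.parser")

pretty = soup.prettify()      # calling the method returns the indented markup as a str
not_called = soup.prettify    # without parentheses this is only the bound method

print(type(pretty).__name__)  # str
print(callable(not_called))   # True
print(pretty)
```

In an interactive session the bare `soup.prettify` echoes a representation that happens to include the markup, which is probably why it can look as if it worked.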

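II. Searching (common commands)

1. find and find_all

Both methods search the descendants of the current tag and test each one against the filter. find returns the first match (or None if nothing matches); find_all returns a list of all matches.

Syntax (since Beautiful Soup 4.4.0 the text argument is also available under the name string):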

find(name=None, attrs={}, recursive=True, text=None, **kwargs) 
find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
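
The difference between the two signatures can be sketched as follows. The list markup and the variable names are made up for illustration; only the find and find_all parameters shown above are used.

```python
from bs4 import BeautifulSoup

# Illustrative markup: three list items, the last one containing a nested <b>.
html = "<ul><li>one</li><li>two</li><li><b>three</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")                  # first match only
all_items = soup.find_all("li")          # every match, as a list-like ResultSet
limited = soup.find_all("li", limit=2)   # stop after two matches
missing = soup.find("table")             # nothing matches, so None

# recursive=False restricts the search to direct children of the tag searched.
direct_b = soup.ul.find_all("b", recursive=False)  # <b> is a grandchild of <ul>
nested_b = soup.ul.find_all("b")                    # found among the descendants

print(first.text)                     # one
print(len(all_items), len(limited))   # 3 2
print(missing)                        # None
print(len(direct_b), len(nested_b))   # 0 1
```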

Passing a filter through the attrs parameter:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
print(data_soup.find_all(attrs={"data-foo": "value"}))

Searching by class_:

css_soup = BeautifulSoup('<p class="body bold strikeout"></p>', 'html.parser')
print(css_soup.find_all("p", class_="strikeout"))
print(css_soup.find_all("p", class_="body"))

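2. Other search methods

find_parents()            returns all matching ancestors

find_parent()             returns the nearest matching ancestor (the direct parent when no filter is given)

find_next_siblings()      returns all matching siblings after the node

find_next_sibling()       returns the first matching sibling after the node

find_previous_siblings()  returns all matching siblings before the node

find_previous_sibling()   returns the first matching sibling before the node

find_all_next()           returns all matching nodes after the node

find_next()               returns the first matching node after the node

find_all_previous()       returns all matching nodes before the node

find_previous()           returns the first matching node before the node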
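
A short sketch of how these methods behave relative to one starting node. The markup, the ids, and the variable names are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one <div> holding three sibling paragraphs.
html = '<div id="box"><p id="a">A</p><p id="b">B</p><p id="c">C</p></div>'
soup = BeautifulSoup(html, "html.parser")
a = soup.find("p", id="a")
b = soup.find("p", id="b")   # the middle paragraph is the starting point

parent_id = b.find_parent("div")["id"]                       # box
ancestor_names = [t.name for t in b.find_parents("div")]     # ['div']
next_id = b.find_next_sibling("p")["id"]                     # c
prev_id = b.find_previous_sibling("p")["id"]                 # a
later_ids = [p["id"] for p in a.find_next_siblings("p")]     # ['b', 'c']
after_ids = [p["id"] for p in b.find_all_next("p")]          # ['c']
before_ids = [p["id"] for p in b.find_all_previous("p")]     # ['a']
first_before = b.find_previous("p")["id"]                    # a

print(parent_id, ancestor_names, next_id, prev_id)
print(later_ids, after_ids, before_ids, first_before)
```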

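III. Objects

1. Tag: a Tag object corresponds to a tag in the original XML or HTML document. Accessing a tag by name (for example soup.b) returns None if no such tag exists, and the first one if several exist.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b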

2. Attributes: a tag's attributes are stored in a dictionary.

type(tag.attrs) 
# <class 'dict'>
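
Because attrs is an ordinary dictionary, attributes can be read, changed, and deleted through the tag itself. The following is a minimal sketch; the link markup and the title value are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1" href="http://example.com/1">one</a>', "html.parser")
tag = soup.a

before = dict(tag.attrs)        # copy of the attribute dictionary
href = tag["href"]              # dictionary-style access on the tag
title = tag.get("title")        # None; tag["title"] would raise KeyError

tag["title"] = "first"          # add or change an attribute
del tag["id"]                   # remove an attribute

print(before)                   # {'id': 'link1', 'href': 'http://example.com/1'}
print(href, title)              # http://example.com/1 None
print(tag.attrs)                # {'href': 'http://example.com/1', 'title': 'first'}
```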

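3. Multi-valued attributes: the most common multi-valued attribute is class. A multi-valued attribute is returned as a list.

soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')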
print(soup.p['class']) 
# ['body', 'strikeout'] 
print(soup.p.attrs) 
# {'class': ['body', 'strikeout']}

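4. Text: the text attribute returns all the strings inside the tag, concatenated into a single string.

5. Other methods: tag.has_attr('id') # returns whether the tag has an id attribute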
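
The two items above can be sketched together. The markup is made up for illustration. get_text is not mentioned in the original text; it is the method form of text and accepts a separator and a strip flag.

```python
from bs4 import BeautifulSoup

html = '<p id="intro">Hello <b>bold</b> world</p><p>plain</p>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

joined = first.text                               # strings of the tag and its descendants
separated = first.get_text("|", strip=True)       # strip each piece, join with "|"

print(joined)                    # Hello bold world
print(separated)                 # Hello|bold|world
print(first.has_attr("id"))      # True
print(second.has_attr("id"))     # False
```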

NavigableString: strings usually sit inside tags, and Beautiful Soup wraps each such string in the NavigableString class. A NavigableString cannot itself contain other tags.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
s = soup.b.string
print(s)
# Extremely bold 
print(type(s)) 
# <class 'bs4.element.NavigableString'>

Comment: a Comment object usually represents a comment in the document.

soup = BeautifulSoup("<b><!--This is a comment--></b>", 'html.parser')
comment = soup.b.string 
print(comment) 
# This is a comment 
print(type(comment)) 
# <class 'bs4.element.Comment'>

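IV. Traversal

(1) Child nodes

1. The contents attribute: returns a list of all child nodes, including NavigableString nodes. A newline between tags is itself a NavigableString and therefore counts as a child. NavigableString nodes have no contents attribute because they have no children.

soup = BeautifulSoup("""<div>
<span>test</span>
</div>
""", 'html.parser')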
element = soup.div.contents 
print(element) 
# ['\n', <span>test</span>, '\n']

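2. The children attribute: essentially the same as contents, except that it returns an iterator over the child nodes rather than a list.

3. The descendants attribute: returns an iterator over all of the tag's descendants (children, grandchildren, and so on).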
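
A small sketch of the difference between children and descendants, using made-up markup without any whitespace between tags so that no extra text nodes appear.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one <b>two</b></p><span>three</span></div>", "html.parser")
div = soup.div

children = list(div.children)         # direct children only: <p> and <span>
descendants = list(div.descendants)   # every nested node, in document order

child_names = [c.name for c in children]
print(child_names)                    # ['p', 'span']
print(len(descendants))               # 6: <p>, 'one ', <b>, 'two', <span>, 'three'
```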

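4. The string attribute: if a tag has exactly one child, .string is available on the tag itself and gives the same result as the .string of that only child.

If a tag has more than one child, it is ambiguous which child .string should refer to, so .string is None.

soup = BeautifulSoup("""<div>
<p><span><b>test</b></span></p>
</div>
""", 'html.parser')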
element = soup.p.string 
print(element) 
# test 
print(type(element)) 
# <class 'bs4.element.NavigableString'>
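
The None case is not covered by the example above, so here is a minimal sketch of it with made-up markup. The strings generator and get_text are shown as one possible way to still reach the text; they are not mentioned in the original.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>bold</b></p>", "html.parser")
p = soup.p

whole = p.string              # None: <p> has two children ('Hello ' and <b>)
inner = p.b.string            # 'bold': <b> has exactly one child
pieces = list(p.strings)      # every string under <p>, one by one
text = p.get_text()           # all strings concatenated

print(whole)                  # None
print(inner)                  # bold
print(pieces)                 # ['Hello ', 'bold']
print(text)                   # Hello bold
```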

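(2) Parent nodes

1. The parent attribute: returns the parent of an element (a Tag or a NavigableString). The parent of the document's top-level node is the BeautifulSoup object, and the parent of the BeautifulSoup object is None.

2. The parents attribute: iterates over all of an element's ancestors, up to and including the BeautifulSoup object.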
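
Both attributes can be sketched on a tiny made-up document. The BeautifulSoup object reports its name as "[document]", which is how it shows up at the end of the ancestor list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>text</b></p></body></html>", "html.parser")
b = soup.b

tag_parent = b.parent.name                    # parent of the <b> tag
string_parent = b.string.parent.name          # parent of the string inside <b>
ancestors = [t.name for t in b.parents]       # walks up to the BeautifulSoup object

print(tag_parent)                   # p
print(string_parent)                # b
print(ancestors)                    # ['p', 'body', 'html', '[document]']
print(soup.html.parent is soup)     # True
print(soup.parent)                  # None
```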

(3) Sibling nodes

next_sibling and previous_sibling: next_sibling returns the following sibling node and previous_sibling returns the preceding one. An example shows it best; keep in mind that newlines and indentation between tags are text nodes and count as siblings too.

soup = BeautifulSoup("""<div>
    <p>test 1</p><b>test 2</b><h>test 3</h></div>
""", 'html.parser')
print(soup.b.next_sibling)
# <h>test 3</h>
print(soup.b.previous_sibling)
# <p>test 1</p>
print(soup.h.next_sibling)
# None

V. Searching with CSS selectors

from bs4 import BeautifulSoup
html = '''
<html>
<head><title>标题</title></head>
<body>
<p class="title" name="dromouse"><b>标题</b></p>
<div name="divlink">
<p>
<a href="http://example.com/1" class="sister" id="link1">链接1</a>
<a href="http://example.com/2" class="sister" id="link2">链接2</a>
<a href="http://example.com/3" class="sister" id="link3">链接3</a>
</p>
</div>
<p></p>
<div name='dv2'></div>
</body>
</html>
'''
soup = BeautifulSoup(html, 'lxml') 

# Select by tag name
print(soup.select('title')) 
# [<title>标题</title>] 

# Select by tag name, descending level by level
print(soup.select("html head title")) 
# [<title>标题</title>] 

# Select by class
print(soup.select('.sister')) 
# [<a class="sister" href="http://example.com/1" id="link1">链接1</a>, 
# <a class="sister" href="http://example.com/2" id="link2">链接2</a>, 
# <a class="sister" href="http://example.com/3" id="link3">链接3</a>] 

# Select by id
print(soup.select('#link1, #link2')) 
# [<a class="sister" href="http://example.com/1" id="link1">链接1</a>, 
# <a class="sister" href="http://example.com/2" id="link2">链接2</a>] 

# Combined selector (descendant)
print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/1" id="link1">链接1</a>]

# Select direct children
print(soup.select("head > title"))
# [<title>标题</title>]

print(soup.select("p > #link1"))
# [<a class="sister" href="http://example.com/1" id="link1">链接1</a>]

print(soup.select("p > a:nth-of-type(2)"))
# [<a class="sister" href="http://example.com/2" id="link2">链接2</a>]

# nth-of-type is a CSS pseudo-class; here it picks the second <a> among its siblings

# Select sibling nodes (searching forward)
print(soup.select("#link1 ~ .sister")) 
# [<a class="sister" href="http://example.com/2" id="link2">链接2</a>, 
# <a class="sister" href="http://example.com/3" id="link3">链接3</a>] 

print(soup.select("#link1 + .sister")) 
# [<a class="sister" href="http://example.com/2" id="link2">链接2</a>] 

# Select by attribute value
print(soup.select('a[href="http://example.com/1"]')) 

# ^= matches attribute values that start with the given string
print(soup.select('a[href^="http://example.com/"]')) 

# *= matches attribute values that contain the given string
print(soup.select('a[href*=".com/"]')) 

# Select tags that have the given attribute
print(soup.select('[name]')) 

# Select the first matching element
print(soup.select_one(".sister"))
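
In practice the selected tags are usually processed further rather than printed. The following is a minimal sketch under a few assumptions: the markup is a trimmed-down, made-up version of the sample above, "html.parser" is used so that lxml need not be installed, and the "#nope" id is hypothetical and deliberately absent.

```python
from bs4 import BeautifulSoup

html = (
    '<p>'
    '<a href="http://example.com/1" class="sister" id="link1">one</a> '
    '<a href="http://example.com/2" class="sister" id="link2">two</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# select returns a list of tags, so ordinary list processing applies.
hrefs = [a["href"] for a in soup.select("a.sister")]

# select_one returns a single tag, or None when nothing matches.
first = soup.select_one(".sister")
missing = soup.select_one("#nope")

print(hrefs)            # ['http://example.com/1', 'http://example.com/2']
print(first["id"])      # link1
print(missing)          # None
```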