bs4

2019-08-16  不咸的Yan

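bs4 (Beautiful Soup 4) is a convenient library for parsing web pages. It is efficient and supports several parsers.
The two most commonly used are html.parser from the Python standard library and the lxml parser.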

# Python standard library
BeautifulSoup(html, 'html.parser')

# lxml
BeautifulSoup(html, 'lxml')

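The built-in html.parser runs at a moderate speed. In old Python versions (the Beautiful Soup documentation names those before 2.7.3 and 3.2.2) it tolerates malformed documents poorly.
The lxml parser is fast, but it depends on an external C library that has to be installed separately.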
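
Since lxml may not be installed everywhere, one way to handle the choice of parser is to try lxml first and fall back to html.parser. The following is only a sketch: make_soup is a hypothetical helper name, not part of bs4, and it relies on Beautiful Soup raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Hypothetical helper: prefer lxml, fall back to the standard-library parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python itself
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p class='title'><b>The Dormouse's story</b></p>")
print(soup.b.get_text())  # The Dormouse's story
```

Either parser yields the same text here; they differ mainly in speed and in how they repair badly broken markup.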

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')

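soup.prettify()   # returns the parsed document as an indented string; tags left unclosed in the source have already been completed by the parser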
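
To see the completion at work, the sketch below feeds a deliberately truncated fragment to the parser. The fragment and the variable names are made up for illustration, and html.parser is used so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Deliberately malformed: neither <p> nor <b> is ever closed
broken = "<p class='title'><b>The Dormouse's story"
soup = BeautifulSoup(broken, "html.parser")

pretty = soup.prettify()  # a str with one tag per line, indented by nesting depth
print(pretty)
```

The printed markup contains the closing </b> and </p> tags even though the input did not. The repair happens while parsing; prettify() only formats the result.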

Selectors

Standard selectors
find_all(name, attrs, recursive, string, limit, **kwargs)

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")   # the second positional argument is matched against the CSS class: <p> tags whose class is "title"
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all('div', class_='top')
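# Note: class is a reserved keyword in Python, so the keyword argument for the CSS class is spelled class_ with a trailing underscore; writing class='top' is a syntax error.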

soup.find_all("a", limit=2)
# [<a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>,
# <a id="link2" class="sister" href="http://example.com/lacie">Lacie</a>]
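
The calls above can be tried against a small document. This is a minimal sketch: the HTML is a shortened version of the "three sisters" sample that the outputs above come from, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                              # filter on the tag name
by_class = soup.find_all("a", class_="sister")            # CSS class via the class_ keyword
by_attrs = soup.find_all("a", attrs={"class": "sister"})  # the same filter written as a dict
by_id = soup.find_all(id="link2")                         # other keywords filter on attributes
first_two = soup.find_all("a", limit=2)                   # stop after two matches

print([a["id"] for a in by_name])   # ['link1', 'link2', 'link3']
print([a["id"] for a in by_attrs])  # ['link1', 'link2', 'link3']
print(by_id[0].get_text())          # Lacie
print(len(first_two))               # 2
```

The attrs dictionary is an alternative to class_ and is also useful for attribute names that are not valid Python identifiers.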

find(name, attrs, recursive, string, **kwargs)
How it differs from find_all

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

find_all returns a list, and an empty list when nothing matches.
find returns the first match directly, and None when nothing matches.
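
The difference matters most when the target is missing: looping over an empty list is harmless, while calling a method on None raises AttributeError. A minimal sketch of guarding against that, with a made-up one-line document and illustrative variable names:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

print(soup.find_all("table"))  # []
print(soup.find("table"))      # None

# Check for None before touching the result of find()
title = soup.find("title")
title_text = title.get_text() if title is not None else ""

table = soup.find("table")
table_text = table.get_text() if table is not None else ""

print(repr(title_text))  # "The Dormouse's story"
print(repr(table_text))  # ''
```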

CSS selectors

soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]
soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
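
select() always returns a list. When only the first match is needed, select_one() returns a single tag, or None if nothing matches. The sketch below combines a class selector with a child combinator; the document is a shortened version of the sample used above and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# <a class="sister"> tags that are direct children of <p class="story">
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]
print(hrefs)

first = soup.select_one("a.sister")     # first match only
missing = soup.select_one("a.brother")  # no such class in this document

print(first["id"])  # link1
print(missing)      # None
```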

Extracting tag content

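from bs4 import BeautifulSoup

html = '''<a href="http://www.baidu.com/">百度</a>
<a href="http://www.163.com/">网易</a>
<a href="http://www.sina.com/">新浪</a>'''
links = BeautifulSoup(html, 'html.parser').find_all('a')   # a list of three <a> tags

for i in links:
    print(i.get_text())    # get_text() returns the text inside the tag
    print(i.get('href'))   # get('attr') returns the attribute value
    print(i['href'])       # subscript shorthand, same result

Output:

百度
http://www.baidu.com/
http://www.baidu.com/
网易
http://www.163.com/
http://www.163.com/
新浪
http://www.sina.com/
http://www.sina.com/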
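
The two ways of reading an attribute give the same value when the attribute exists, but behave differently when it is missing: tag.get('href') returns None, while tag['href'] raises KeyError. A small sketch that collects link text and URLs into dictionaries; the second <a> tag without an href and the names records and with_href are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<a href="http://www.baidu.com/">百度</a><a>no link</a>'
soup = BeautifulSoup(html, "html.parser")

records = []
for a in soup.find_all("a"):
    records.append({
        "text": a.get_text(strip=True),  # strip=True trims surrounding whitespace
        "href": a.get("href"),           # None for the tag that has no href
    })
print(records)

# href=True keeps only the tags that actually carry the attribute
with_href = soup.find_all("a", href=True)
print(len(with_href))  # 1
```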
