Python Scraping from Beginner to Giving Up, Part 6: The BeautifulSoup Library
BeautifulSoup, the "beautiful soup", is a Python library for extracting data from HTML.
-
Installing BeautifulSoup
BeautifulSoup is not part of the Python standard library, so it has to be installed separately:

| System  | Command                     |
| ------- | --------------------------- |
| Windows | pip install BeautifulSoup4  |
| Mac     | pip3 install BeautifulSoup4 |
BeautifulSoup works with the parser from the Python standard library; for faster and more robust parsing, you can install the lxml parser:

| System  | Command          |
| ------- | ---------------- |
| Windows | pip install lxml |
The lxml package provides both an HTML parser and an XML parser. The parsers are selected as follows:

| Parser                  | Usage                              |
| ----------------------- | ---------------------------------- |
| Python standard library | BeautifulSoup(html, "html.parser") |
| lxml HTML parser        | BeautifulSoup(html, "lxml")        |
| lxml XML parser         | BeautifulSoup(xml, "xml")          |
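The parsers are interchangeable for well-formed documents; where they differ is in how they repair broken HTML. A minimal sketch comparing them (this assumes lxml is installed, and the snippet is made up purely for illustration):

```python
from bs4 import BeautifulSoup

broken_html = "<p>Hello<p>World"   # deliberately unclosed tags

# Each parser repairs the markup in its own way; print both to compare.
print(BeautifulSoup(broken_html, "html.parser"))
print(BeautifulSoup(broken_html, "lxml"))   # lxml also wraps the result in <html><body>
```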
-
A first look at BeautifulSoup
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.baidu.com')
r.encoding = r.apparent_encoding      # let requests guess the encoding from the response body
soup = BeautifulSoup(r.text, 'lxml')  # parse the HTML with the lxml parser
print(type(soup))
print(soup)
The code above converts the HTML source into a <class 'bs4.BeautifulSoup'> object. Next, let's look at the methods and attributes this object provides for extracting tags and their values.
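Before moving on, note that the parsed object can also be pretty-printed or navigated by tag name. A small sketch, continuing from the soup object built above:

```python
# soup is the object created in the snippet above
print(soup.prettify()[:200])   # re-indented HTML, first 200 characters only
print(soup.title)              # the first <title> tag in the document
print(soup.title.text)         # just the text inside that tag
```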
-
BeautifulSoup in action
The HTML below, html_doc, is a short passage based on "Alice's Adventures in Wonderland"
(referred to as the Alice document from here on):
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
print(soup.p)
print(soup.find('p'))
print('')
print(soup.find('p',class_="story"))
print('')
print(soup.find_all('a'))
print('')
print(soup.find('a',id = "link3"))
print(soup.find_all('a')[1])
print('')
print(type(soup.find('a')))
print(type(soup.find_all('a')))
for link in soup.find_all('a'):
    print(type(link))
Output:
<p class="title"><b>The Dormouse's story</b></p>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<class 'bs4.element.Tag'>
<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
From the output, soup.p and soup.find('p') return the same thing: the first <p> tag. The find() method, however, can also filter by attributes for a more precise match. For example, soup.find('p', class_="story") returns the first tag whose class attribute is "story"; because class is a Python keyword, the keyword argument must be written as class_.
find() returns the first tag that matches, while find_all() returns all matching tags.
find_all() returns a list-like structure. It is not an actual list but a <class 'bs4.element.ResultSet'>; you can still treat it like a list, indexing it or iterating over it with a for loop.
find() returns a bs4.element.Tag object, and find_all() returns a collection of bs4.element.Tag objects.
Both find() and find_all() can locate tags by tag name alone or by tag name plus attributes.
A bs4.element.Tag object can be searched again for tags nested inside it; a bs4.element.ResultSet has to be iterated over with a for loop first.
Those are BeautifulSoup's methods for extracting tags directly; the sketch below ties them together.
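A short sketch against the Alice document above, to make the Tag-versus-ResultSet distinction concrete (the variable names are only illustrative):

```python
# soup is the object built from html_doc above
story = soup.find('p', class_="story")   # find() gives a single Tag
print(story.find('a'))                   # a Tag can be searched again: the first <a> inside it

links = story.find_all('a')              # find_all() gives a ResultSet
print(links[1])                          # index it like a list -> the Lacie link
for link in links:                       # or iterate over it
    print(link.text)
```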
-
Extracting the tags before and after a Tag
Again using the Alice document:
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html,'lxml')
# First grab a tag to start from
link1 = soup.a
# The first 'a' tag after link1
link2 = link1.find_next('a')
print(link2)
print('')
# The first 'a' tag after link2
link3 = link2.find_next('a')
print(link3)
print('')
# All 'a' tags after link1
links = link1.find_all_next('a')
print(links)
print('')
# All tags after link1; this also picks up tags we do not want
links = link1.find_all_next()
print(links)
print('')
# The first 'a' tag before link3
link_2 = link3.find_previous('a')
print(link_2)
print('')
# All 'a' tags before link3
link_s = link3.find_all_previous('a')
print(link_s)
Output:
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
From the output:
find_next() returns the first matching tag that appears after the current tag in the document, and find_all_next() returns all matching tags after it.
find_previous() returns the first matching tag that appears before the current tag, and find_all_previous() returns all matching tags before it.
Note that these four methods are not limited to siblings: as the find_all_next() call with no arguments shows, they walk the rest of the document, which is why a later <p> tag is picked up as well.
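If you only want true siblings (tags at the same level, under the same parent), BeautifulSoup also provides find_next_sibling()/find_next_siblings() and find_previous_sibling()/find_previous_siblings(). A minimal sketch, continuing from the variables above:

```python
# link1 and soup come from the snippet above
print(link1.find_next_sibling('a'))    # next <a> at the same level -> the Lacie link
print(link1.find_next_siblings('a'))   # all following sibling <a> tags -> Lacie and Tillie

story = soup.find('p', class_="story")
print(story.find_next_sibling('p'))    # the next <p> sibling -> <p class="story">...</p>
```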
-
Extracting content from a Tag
Again using the Alice document:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
title = soup.find('p',class_ = 'title')
print(type(title.text),'——',title.text)
link1 = soup.find('a',id="link1")
print(type(link1.text),'——',link1.text)
link3 = soup.find('a',id="link3")
print(type(link3['href']),'——',link3['href'])
p2 = soup.find_all('p')[2]
print(type(p2.text),'——',p2.text)
links = soup.find_all('a')
for link in links:
    print(link.text,'——',link['href'])
Output:
<class 'str'> —— The Dormouse's story
<class 'str'> —— Elsie
<class 'str'> —— http://example.com/tillie
<class 'str'> —— ...
Elsie —— http://example.com/elsie
Lacie —— http://example.com/lacie
Tillie —— http://example.com/tillie
From the output:
Tag.text extracts the text inside a Tag.
Tag['attribute name'] extracts the value of an attribute.
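Two closely related tools are worth knowing: Tag.get_text() behaves like Tag.text, and Tag.get('attribute name') returns None instead of raising a KeyError when the attribute is missing. A small sketch, continuing from the code above:

```python
link1 = soup.find('a', id="link1")
print(link1.get_text())    # same result as link1.text -> Elsie
print(link1.get('href'))   # http://example.com/elsie
print(link1.get('title'))  # the attribute does not exist -> None, no exception raised
```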
-
The basic approach to extracting data with BeautifulSoup:
1. Use find() to grab the large, enclosing tag in the HTML, i.e. the parent tag.
2. Use find_all() on that parent to collect its child tags; the result is list-like.
3. Iterate over the result with a for loop to visit each child tag.
4. Use Tag.text and Tag['attribute name'] to pull out the data you want (see the sketch below).
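A minimal sketch applying these four steps to the Alice document (the tag names and attributes are those of html_doc above; for a real site you would substitute your own):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')      # html_doc as defined earlier

# 1. find() the parent tag
story = soup.find('p', class_="story")

# 2. find_all() the child tags inside it
links = story.find_all('a')

# 3. iterate over the children, and 4. take out the text and attribute values
for link in links:
    print(link.text, link['href'])
```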
For more on BeautifulSoup, see the official documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
For more articles in this series, follow the links below:
Python Scraping from Beginner to Giving Up, Part 1: Understanding Web Crawlers
Python Scraping from Beginner to Giving Up, Part 2: HTML Basics
Python Scraping from Beginner to Giving Up, Part 3: The Basic Workflow of a Crawler
Python Scraping from Beginner to Giving Up, Part 4: Requests Library Basics
Python Scraping from Beginner to Giving Up, Part 5: Advanced Usage of the Requests Library
Python Scraping from Beginner to Giving Up, Part 6: The BeautifulSoup Library
Python Scraping from Beginner to Giving Up, Part 7: Regular Expressions
Python Scraping from Beginner to Giving Up, Part 8: XPath
Python Scraping from Beginner to Giving Up, Part 9: JSON Parsing
Python Scraping from Beginner to Giving Up, Part 10: The selenium Library
Python Scraping from Beginner to Giving Up, Part 11: Sending Email on a Schedule
Python Scraping from Beginner to Giving Up, Part 12: Coroutines
Python Scraping from Beginner to Giving Up, Part 13: Scrapy Concepts and Workflow
Python Scraping from Beginner to Giving Up, Part 14: Getting Started with Scrapy