Python Scraping from Beginner to Giving Up, Part 6: The BeautifulSoup Library
BeautifulSoup, the "beautiful soup", is a Python library for extracting data from HTML.
-
Installing BeautifulSoup
BeautifulSoup is not part of the Python standard library, so it has to be installed separately:

| System  | Command                     |
| ------- | --------------------------- |
| Windows | pip install BeautifulSoup4  |
| Mac     | pip3 install BeautifulSoup4 |
BeautifulSoup works with the parser from the Python standard library; for faster and more robust parsing, you can install the lxml parser:

| System  | Command          |
| ------- | ---------------- |
| Windows | pip install lxml |
The lxml package provides both an HTML parser and an XML parser. The parsers are selected as follows:

| Parser                  | Usage                              |
| ----------------------- | ---------------------------------- |
| Python standard library | BeautifulSoup(html, "html.parser") |
| lxml HTML parser        | BeautifulSoup(html, "lxml")        |
| lxml XML parser         | BeautifulSoup(xml, "xml")          |
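The parsers are interchangeable for well-formed documents; where they differ is in how they repair broken HTML. A minimal sketch comparing them (this assumes lxml is installed, and the snippet is made up purely for illustration):

```python
from bs4 import BeautifulSoup

broken_html = "<p>Hello<p>World"   # deliberately unclosed tags

# Each parser repairs the markup in its own way; print both to compare.
print(BeautifulSoup(broken_html, "html.parser"))
print(BeautifulSoup(broken_html, "lxml"))   # lxml also wraps the result in <html><body>
```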
-
A first look at BeautifulSoup
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.baidu.com')
r.encoding = r.apparent_encoding      # let requests guess the encoding from the response body
soup = BeautifulSoup(r.text, 'lxml')  # parse the HTML with the lxml parser
print(type(soup))
print(soup)
The code above converts the HTML source into a <class 'bs4.BeautifulSoup'> object. Next, let's look at the methods and attributes this object provides for extracting tags and their values.
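Before moving on, note that the parsed object can also be pretty-printed or navigated by tag name. A small sketch, continuing from the soup object built above:

```python
# soup is the object created in the snippet above
print(soup.prettify()[:200])   # re-indented HTML, first 200 characters only
print(soup.title)              # the first <title> tag in the document
print(soup.title.text)         # just the text inside that tag
```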
-
BeautifulSoup in action
The HTML below, html_doc, is a short passage based on "Alice's Adventures in Wonderland"
(referred to as the Alice document from here on):
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
print(soup.p)
print(soup.find('p'))
print('')
print(soup.find('p',class_="story"))
print('')
print(soup.find_all('a'))
print('')
print(soup.find('a',id = "link3"))
print(soup.find_all('a')[1])
print('')
print(type(soup.find('a')))
print(type(soup.find_all('a')))
for link in soup.find_all('a'):
    print(type(link))
Output:
<p class="title"><b>The Dormouse's story</b></p>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<class 'bs4.element.Tag'>
<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
From the output, soup.p and soup.find('p') return the same thing: the first <p> tag. The find() method, however, can also filter by attributes for a more precise match. For example, soup.find('p', class_="story") returns the first tag whose class attribute is "story"; because class is a Python keyword, the keyword argument must be written as class_.
find() returns the first tag that matches, while find_all() returns all matching tags.
find_all() returns a list-like structure. It is not an actual list but a <class 'bs4.element.ResultSet'>; you can still treat it like a list, indexing it or iterating over it with a for loop.
find() returns a bs4.element.Tag object, and find_all() returns a collection of bs4.element.Tag objects.
Both find() and find_all() can locate tags by tag name alone or by tag name plus attributes.
A bs4.element.Tag object can be searched again for tags nested inside it; a bs4.element.ResultSet has to be iterated over with a for loop first.
Those are BeautifulSoup's methods for extracting tags directly; the sketch below ties them together.
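A short sketch against the Alice document above, to make the Tag-versus-ResultSet distinction concrete (the variable names are only illustrative):

```python
# soup is the object built from html_doc above
story = soup.find('p', class_="story")   # find() gives a single Tag
print(story.find('a'))                   # a Tag can be searched again: the first <a> inside it

links = story.find_all('a')              # find_all() gives a ResultSet
print(links[1])                          # index it like a list -> the Lacie link
for link in links:                       # or iterate over it
    print(link.text)
```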
-
Extracting the tags before and after a Tag
Again using the Alice document:
from bs4 import BeautifulSoup
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html,'lxml')
# First grab a tag to start from
link1 = soup.a
# The first 'a' tag after link1
link2 = link1.find_next('a')
print(link2)
print('')
# The first 'a' tag after link2
link3 = link2.find_next('a')
print(link3)
print('')
# All 'a' tags after link1
links = link1.find_all_next('a')
print(links)
print('')
# All tags after link1; this also picks up tags we do not want
links = link1.find_all_next()
print(links)
print('')
# The first 'a' tag before link3
link_2 = link3.find_previous('a')
print(link_2)
print('')
# All 'a' tags before link3
link_s = link3.find_all_previous('a')
print(link_s)
Output:
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
From the output:
find_next() returns the first matching tag that appears after the current tag in the document, and find_all_next() returns all matching tags after it.
find_previous() returns the first matching tag that appears before the current tag, and find_all_previous() returns all matching tags before it.
Note that these four methods are not limited to siblings: as the find_all_next() call with no arguments shows, they walk the rest of the document, which is why a later <p> tag is picked up as well.
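If you only want true siblings (tags at the same level, under the same parent), BeautifulSoup also provides find_next_sibling()/find_next_siblings() and find_previous_sibling()/find_previous_siblings(). A minimal sketch, continuing from the variables above:

```python
# link1 and soup come from the snippet above
print(link1.find_next_sibling('a'))    # next <a> at the same level -> the Lacie link
print(link1.find_next_siblings('a'))   # all following sibling <a> tags -> Lacie and Tillie

story = soup.find('p', class_="story")
print(story.find_next_sibling('p'))    # the next <p> sibling -> <p class="story">...</p>
```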
-
Extracting content from a Tag
Again using the Alice document:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
title = soup.find('p',class_ = 'title')
print(type(title.text),'——',title.text)
link1 = soup.find('a',id="link1")
print(type(link1.text),'——',link1.text)
link3 = soup.find('a',id="link3")
print(type(link3['href']),'——',link3['href'])
p2 = soup.find_all('p')[2]
print(type(p2.text),'——',p2.text)
links = soup.find_all('a')
for link in links:
    print(link.text,'——',link['href'])
Output:
<class 'str'> —— The Dormouse's story
<class 'str'> —— Elsie
<class 'str'> —— http://example.com/tillie
<class 'str'> —— ...
Elsie —— http://example.com/elsie
Lacie —— http://example.com/lacie
Tillie —— http://example.com/tillie
From the output:
Tag.text extracts the text inside a Tag.
Tag['attribute name'] extracts the value of an attribute.
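Two closely related tools are worth knowing: Tag.get_text() behaves like Tag.text, and Tag.get('attribute name') returns None instead of raising a KeyError when the attribute is missing. A small sketch, continuing from the code above:

```python
link1 = soup.find('a', id="link1")
print(link1.get_text())    # same result as link1.text -> Elsie
print(link1.get('href'))   # http://example.com/elsie
print(link1.get('title'))  # the attribute does not exist -> None, no exception raised
```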
-
The basic approach to extracting data with BeautifulSoup:
1. Use find() to grab the large, enclosing tag in the HTML, i.e. the parent tag.
2. Use find_all() on that parent to collect its child tags; the result is list-like.
3. Iterate over the result with a for loop to visit each child tag.
4. Use Tag.text and Tag['attribute name'] to pull out the data you want (see the sketch below).
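A minimal sketch applying these four steps to the Alice document (the tag names and attributes are those of html_doc above; for a real site you would substitute your own):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')      # html_doc as defined earlier

# 1. find() the parent tag
story = soup.find('p', class_="story")

# 2. find_all() the child tags inside it
links = story.find_all('a')

# 3. iterate over the children, and 4. take out the text and attribute values
for link in links:
    print(link.text, link['href'])
```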
For more on BeautifulSoup, see the official documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
For more articles in this series, follow the links below:
Python Scraping from Beginner to Giving Up, Part 1: Understanding Web Crawlers
Python Scraping from Beginner to Giving Up, Part 2: HTML Basics
Python Scraping from Beginner to Giving Up, Part 3: The Basic Workflow of a Crawler
Python Scraping from Beginner to Giving Up, Part 4: Requests Library Basics
Python Scraping from Beginner to Giving Up, Part 5: Advanced Usage of the Requests Library
Python Scraping from Beginner to Giving Up, Part 6: The BeautifulSoup Library
Python Scraping from Beginner to Giving Up, Part 7: Regular Expressions
Python Scraping from Beginner to Giving Up, Part 8: XPath
Python Scraping from Beginner to Giving Up, Part 9: JSON Parsing
Python Scraping from Beginner to Giving Up, Part 10: The selenium Library
Python Scraping from Beginner to Giving Up, Part 11: Sending Email on a Schedule
Python Scraping from Beginner to Giving Up, Part 12: Coroutines
Python Scraping from Beginner to Giving Up, Part 13: Scrapy Concepts and Workflow
Python Scraping from Beginner to Giving Up, Part 14: Getting Started with Scrapy