
Python Crawler from Beginner to Giving Up, Part 6: The BeautifulSoup Library

2019-07-20

BeautifulSoup, "the delicious soup", is a Python library for extracting data from HTML.

BeautifulSoup is not part of the Python standard library, so it has to be installed separately:

Windows: pip install beautifulsoup4
Mac:     pip3 install beautifulsoup4

BeautifulSoup can use the parser from the Python standard library; if you want better parsing, you can also install the lxml parser:

Windows: pip install lxml
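
To confirm that both packages are available, a minimal check like the one below can help (just a sketch, assuming the installs above succeeded):

import bs4
import lxml

# both modules expose a __version__ string
print(bs4.__version__)
print(lxml.__version__)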

The lxml package provides both an lxml HTML parser and an lxml XML parser. They are used as follows:

Parser                     Usage
Python standard library    BeautifulSoup(html, "html.parser")
lxml HTML parser           BeautifulSoup(html, "lxml")
lxml XML parser            BeautifulSoup(xml, "xml")
For example, fetch a page with Requests and parse it with the lxml parser:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.baidu.com')
r.encoding = r.apparent_encoding        # use the encoding guessed from the response body
soup = BeautifulSoup(r.text, 'lxml')    # parse the HTML with the lxml parser
print(type(soup))
print(soup)

The code above turns the HTML source into a <class 'bs4.BeautifulSoup'> object. Next, let's look at the methods and attributes this object offers for extracting tags and their values.

The HTML snippet below, html_doc, is a passage from Alice's Adventures in Wonderland (referred to as the 'Alice' document from here on):

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# the first <p> tag, two equivalent ways
print(soup.p)
print(soup.find('p'))
print('')
# the first <p> tag whose class is "story"
print(soup.find('p', class_="story"))
print('')
# all <a> tags
print(soup.find_all('a'))
print('')
# match by attribute value, and index into a find_all() result
print(soup.find('a', id="link3"))
print(soup.find_all('a')[1])
print('')
# the types returned by find() and find_all()
print(type(soup.find('a')))
print(type(soup.find_all('a')))
# iterating a ResultSet yields Tag objects
for link in soup.find_all('a'):
    print(type(link))

Output:

<p class="title"><b>The Dormouse's story</b></p>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

<class 'bs4.element.Tag'>
<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>

From the output, soup.p and soup.find('p') return the same thing: both fetch the first <p> tag. The find() method, however, can also take attributes for more precise matching. For example, soup.find('p', class_="story") returns the first tag whose class attribute is "story"; because class is a Python keyword, the keyword argument has to be written class_.

find() returns the first tag that matches, while find_all() returns all matching tags.

find_all() returns a list-like structure. It is not actually a list but a <class 'bs4.element.ResultSet'>; even so, it supports the usual list operations such as indexing and iteration with a for loop.

find() returns a bs4.element.Tag object; find_all() returns a collection of bs4.element.Tag objects.

Both find() and find_all() can locate tags by tag name alone, or by tag name plus attributes.

A bs4.element.Tag object can itself be searched again for nested tags; a bs4.element.ResultSet has to be iterated with a for loop first.

That covers BeautifulSoup's direct tag-extraction methods; a small sketch of the last two points follows.
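
To illustrate searching within a Tag and iterating a ResultSet, here is a minimal sketch that reuses the 'Alice' soup object from the example above (the variable names are just for illustration):

# `soup` is the BeautifulSoup object built from the 'Alice' html_doc above
story_p = soup.find('p', class_="story")   # a bs4.element.Tag
first_link = story_p.find('a')             # a Tag can be searched again
print(first_link.text)                     # Elsie

links = soup.find_all('a')                 # a bs4.element.ResultSet
print(links[0])                            # indexing works like a list
for link in links:                         # so does a for loop
    print(link.get('id'))                  # link1, link2, link3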

Again, using the 'Alice' document:

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')

# first, grab a tag to start from
link1 = soup.a

# get the first 'a' tag that comes after link1
link2 = link1.find_next('a')
print(link2)
print('')

# get the first 'a' tag that comes after link2
link3 = link2.find_next('a')
print(link3)
print('')

# get all 'a' tags that come after link1
links = link1.find_all_next('a')
print(links)
print('')

# get all tags after link1; this also picks up tags we don't need
links = link1.find_all_next()
print(links)
print('')

# get the first 'a' tag before link3
link_2 = link3.find_previous('a')
print(link_2)
print('')

# get all 'a' tags before link3
link_s = link3.find_all_previous('a')
print(link_s)

Output:

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

As the output shows:
find_next() returns the first matching tag that comes after the current one in document order, and find_all_next() returns all matching tags that come after it.

find_previous() returns the first matching tag that comes before the current one, and find_all_previous() returns all matching tags that come before it.

Note that these methods are not limited to sibling tags: they walk the whole document in parse order, which is why link1.find_all_next() above also picked up the final <p class="story"> tag.
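
If you only want to search among sibling tags, BeautifulSoup also provides find_next_sibling() and find_previous_sibling(). A minimal sketch of the difference, assuming the soup object built from the 'Alice' html above:

link1 = soup.find('a', id="link1")

# find_next() walks everything after link1 in document order
print(link1.find_next('p'))          # <p class="story">...</p>, which is not a sibling of link1

# find_next_sibling() only considers link1's siblings inside the same parent <p>
print(link1.find_next_sibling('a'))  # <a ... id="link2">Lacie</a>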

Again, using the 'Alice' document:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# .text extracts the text inside a tag
title = soup.find('p', class_='title')
print(type(title.text), '——', title.text)

link1 = soup.find('a', id="link1")
print(type(link1.text), '——', link1.text)

# ['attribute name'] extracts the value of an attribute
link3 = soup.find('a', id="link3")
print(type(link3['href']), '——', link3['href'])

p2 = soup.find_all('p')[2]
print(type(p2.text), '——', p2.text)

# iterate over all <a> tags, printing each one's text and href
links = soup.find_all('a')
for link in links:
    print(link.text, '——', link['href'])

Output:

<class 'str'> —— The Dormouse's story
<class 'str'> —— Elsie
<class 'str'> —— http://example.com/tillie
<class 'str'> —— ...
Elsie —— http://example.com/elsie
Lacie —— http://example.com/lacie
Tillie —— http://example.com/tillie

As the output shows:
Tag.text extracts the text inside a Tag
Tag['attribute name'] extracts the value of an attribute

Putting it all together, a typical extraction workflow looks like this (see the sketch after this list):

1. Use find() to locate the large, outer parent tag in the HTML.
2. Use find_all() to extract the child tags inside it, returned as a list-like ResultSet.
3. Iterate over the result with a for loop to get each child tag.
4. Use Tag.text and Tag['attribute name'] to pull out the data you want.
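
A minimal sketch of that workflow, again using the 'Alice' document; the choice of <p class="story"> as the parent tag is just for illustration:

from bs4 import BeautifulSoup

# html_doc is the 'Alice' document shown above
soup = BeautifulSoup(html_doc, 'lxml')

# 1. find() the parent tag
story = soup.find('p', class_="story")

# 2. find_all() the child tags inside it (a list-like ResultSet)
links = story.find_all('a')

# 3. iterate with a for loop, and 4. extract text and attribute values
for link in links:
    print(link.text, link['href'])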


For more on BeautifulSoup, see the official documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html


Read more articles in this series via the links below:

Python Crawler from Beginner to Giving Up, Part 1: Getting to Know Crawlers
Python Crawler from Beginner to Giving Up, Part 2: HTML Basics
Python Crawler from Beginner to Giving Up, Part 3: The Basic Workflow of a Crawler
Python Crawler from Beginner to Giving Up, Part 4: Requests Library Basics
Python Crawler from Beginner to Giving Up, Part 5: Advanced Usage of the Requests Library
Python Crawler from Beginner to Giving Up, Part 6: The BeautifulSoup Library
Python Crawler from Beginner to Giving Up, Part 7: Regular Expressions
Python Crawler from Beginner to Giving Up, Part 8: XPath
Python Crawler from Beginner to Giving Up, Part 9: JSON Parsing
Python Crawler from Beginner to Giving Up, Part 10: The Selenium Library
Python Crawler from Beginner to Giving Up, Part 11: Sending Emails on a Schedule
Python Crawler from Beginner to Giving Up, Part 12: Coroutines
Python Crawler from Beginner to Giving Up, Part 13: Scrapy Concepts and Workflow
Python Crawler from Beginner to Giving Up, Part 14: Getting Started with Scrapy
