[Web Scraping from Scratch] A Detailed Guide to the BeautifulSoup Library

2019-03-21, by 大菜鸟_

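Recap

Last time, while introducing regular expressions, I shared a hands-on scraping exercise: collecting every book on the Douban homepage along with its link, author, and publication date. In that exercise we parsed the page source with regular expressions, and those expressions turned out to be fairly complex. That brings us to today's protagonist, BeautifulSoup: a flexible and convenient web page parsing library that is efficient and supports several parsers. With BeautifulSoup you can extract information from web pages without writing any regular expressions.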

I. Installing BeautifulSoup

pip install beautifulsoup4

II. Usage

1. Parsers

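Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 / 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents; commonly used	Requires the external C-based lxml library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires the external C-based lxml library
html5lib	BeautifulSoup(markup, "html5lib")	Best error tolerance; parses documents the way a browser does; produces HTML5-format documents	Slow; requires the external html5lib package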
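
Since lxml and html5lib are separate installs, a script may want to fall back to the built-in parser when the preferred one is missing. The sketch below is one possible way to do that; make_soup and its preferred argument are names made up for this illustration, while FeatureNotFound is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper for illustration; not part of bs4.
    """
    for parser in preferred:
        try:
            return parser, BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# The <b> tag is never closed; either parser repairs it.
parser_used, soup = make_soup("<p>Hello, <b>world</p>")
print(parser_used)    # "lxml" if it is installed, otherwise "html.parser"
print(soup.b.string)  # world
```

Different parsers can repair the same broken markup in different ways, so it is worth pinning one parser for a given project rather than relying on whichever happens to be installed.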

2. Basic usage

Below is an incomplete piece of HTML: neither the body tag nor the html tag is closed.

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

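Now parse the HTML above with the lxml parser:

from bs4 import BeautifulSoup  # import the library
soup = BeautifulSoup(html, 'lxml')  # create the BeautifulSoup object and choose the parser
print(soup.prettify())  # pretty-print the markup; missing tags are filled in as part of error tolerance
print(soup.title.string)  # print the content of the title tag

Below is the markup after the missing tags were filled in, followed by the title content. You can see that both the html and body tags have been closed: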

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

3. Tag selectors

(1) Selecting elements

We keep using the HTML above.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

The result is:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

Only one p tag is printed, even though the HTML contains three.
This is how tag selectors behave: when several tags match, only the first one is returned.
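
To make the "first match only" behaviour concrete, here is a small sketch that compares attribute-style access with find_all(). It uses a shortened version of the sample HTML and the built-in html.parser so that it runs without lxml; both choices are assumptions made for this example.

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p            # attribute access: the first <p> only
all_p = soup.find_all("p")  # every <p>, in document order

print(len(all_p))           # 3
print(first_p == all_p[0])  # True
print([p.get("class") for p in all_p])  # [['title'], ['story'], ['story']]
```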

(2) Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

Output:

dromouse
dromouse
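
Indexing a tag with an attribute name raises KeyError when the attribute is absent, so the dict-style get() method is often the safer choice. The sketch below illustrates this on a one-line snippet invented for the example; it also shows that class is returned as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

print(p.attrs)             # every attribute of the tag, as a dict
print(p["class"])          # ['title', 'big'] (class is multi-valued, hence a list)
print(p["name"])           # dromouse
print(p.get("id"))         # None, whereas p["id"] would raise KeyError
print(p.get("id", "n/a"))  # n/a
```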

(3) Getting content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)

Output:

The Dormouse's story
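
One thing to watch for: .string only has a value when a tag holds a single string, possibly nested inside a single child tag. For a tag with several children it is None, and get_text() is the usual alternative. A minimal sketch, using a shortened two-paragraph snippet assumed for this example:

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once <a>Lacie</a> and <a>Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

print(title_p.string)      # The Dormouse's story (a single nested string)
print(story_p.string)      # None (several children, so no single string)
print(story_p.get_text())  # Once Lacie and Tillie
```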

(4) Nested selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)

Output:

The Dormouse's story

(5) Getting child and descendant nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

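The output is a list. (The sample below was captured from a slightly different version of the HTML, in which the first p tag directly contains the text and the three links, and the first link wraps a <span>. With the HTML shown earlier, soup.p is the title paragraph, so its contents would be just [<b>The Dormouse's story</b>].)

['\n            Once upon a time there were three little sisters; and their names were\n            ', 
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>,
 '\n'
, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
, ' \n            and\n            '
, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
, '\n            and they lived at the bottom of a well.\n        ']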

Another way to get the children is the children attribute, which returns an iterator rather than a list:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

Output (captured from a variant of the HTML in which the first p tag directly contains the text and the links):

<list_iterator object at 0x1064f7dd8>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
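
The heading of this subsection also mentions descendant nodes. Those are available through the descendants attribute, which walks every level of nesting, whereas children and contents stop at the direct children. A minimal sketch on a one-line snippet assumed for this example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hi <a href="#"><span>Elsie</span></a></p>', "html.parser")

children = list(soup.p.children)        # direct children only
descendants = list(soup.p.descendants)  # children, grandchildren, and so on

print(len(children))     # 2: the text 'Hi ' and the <a> tag
print(len(descendants))  # 4: 'Hi ', <a>, <span>, and the text 'Elsie'
for node in descendants:
    print(repr(node))
```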

(6) Getting the parent node

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

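The program prints the p tag, that is, the parent node of the first a tag (this output was captured from a slightly different version of the HTML, in which the first link wraps a <span>):

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>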

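Similar attributes include:

parents: iterates over all ancestor nodes of the current tag
next_siblings: iterates over the sibling nodes that come after the current tag
previous_siblings: iterates over the sibling nodes that come before the current tag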
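
The following sketch shows these navigation attributes on a small snippet assumed for this example. Note that siblings include text nodes as well as tags, so the example filters with isinstance(..., Tag) when only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<div><p><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    ' and <a id="link3">Tillie</a></p></div>'
)
soup = BeautifulSoup(html, "html.parser")
middle = soup.find_all("a")[1]  # the second link, id="link2"

# parents walks upward until it reaches the BeautifulSoup object itself
ancestor_names = [parent.name for parent in middle.parents]
print(ancestor_names)  # ['p', 'div', '[document]']

# Siblings are nodes, so text such as ' and ' shows up as well
print(list(middle.next_siblings))

next_ids = [n["id"] for n in middle.next_siblings if isinstance(n, Tag)]
prev_ids = [n["id"] for n in middle.previous_siblings if isinstance(n, Tag)]
print(next_ids)  # ['link3']
print(prev_ids)  # ['link1']
```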

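Those are the tag selectors. They are fast, but on their own they cannot cover everything we need when parsing HTML, so BeautifulSoup provides some other methods as well.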

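4. Standard selectors

find_all( name , attrs , recursive , text , **kwargs )
Searches the document by tag name, attributes, or content. (In Beautiful Soup 4.4.0 and later, the text argument is also available under the name string.)
All the examples below use this test HTML: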

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

(1) Searching by tag name, i.e. name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

This prints all the ul tags:

 [<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

The calls above can be nested further:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
    # You can go one step further and read an attribute of an li: ul.find_all('li')[0]['class']

(2) Searching by attribute

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
# class is a reserved word in Python, so the keyword argument is spelled class_
print(soup.find_all(class_='element'))

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
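
Attributes can also be passed as a dictionary through the attrs parameter, which helps when the attribute name is not a valid Python keyword argument. The sketch below uses a trimmed copy of the test HTML and adds a data-kind attribute purely for illustration (it is not in the original HTML). It also shows that a class search matches any one of a tag's classes.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1" data-kind="main"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# data-kind cannot be written as a keyword argument, so use attrs
by_attrs = soup.find_all(attrs={"data-kind": "main"})
# class_ matches if ANY of the tag's classes equals the value
by_class = soup.find_all(class_="list")
# a tag name and an attribute filter can be combined
by_both = soup.find_all("ul", class_="list-small")

print([ul["id"] for ul in by_attrs])  # ['list-1']
print([ul["id"] for ul in by_class])  # ['list-1', 'list-2']
print([ul["id"] for ul in by_both])   # ['list-2']
```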

(3) Searching by text content, i.e. text

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

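Output:

['Foo', 'Foo']

What comes back are strings rather than tags, so this is of limited use for locating elements; it is mostly used for matching content.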
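
The text filter also accepts a compiled regular expression, and each matched string still knows its enclosing tag through .parent, which is one way to get back to the elements. A minimal sketch, assuming Beautiful Soup 4.4.0 or later (where the argument can be written as string) and a short list invented for the example:

```python
import re

from bs4 import BeautifulSoup

html = (
    '<ul><li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="element">Food</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# Match every string that starts with "Foo"
matches = soup.find_all(string=re.compile(r"^Foo"))
print(matches)  # ['Foo', 'Food']

# Step from each matched string back to the tag that contains it
parent_names = [s.parent.name for s in matches]
print(parent_names)  # ['li', 'li']
```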

find( name , attrs , recursive , text , **kwargs )
Similar to find_all(), except that find() returns only a single element: the first match.
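
A practical difference worth keeping in mind is what each method returns when nothing matches: find() gives None while find_all() gives an empty list. The sketch below, on a snippet assumed for this example, shows the guard that is usually needed before using the result of find().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul id="list-1"><li>Foo</li><li>Bar</li></ul>', "html.parser")

first_li = soup.find("li")       # first match
missing = soup.find("table")     # no match
empty = soup.find_all("table")   # no match

print(first_li.string)  # Foo
print(missing)          # None
print(empty)            # []

# Guard against None before touching attributes of the result
caption = missing.get_text() if missing is not None else ""
print(repr(caption))    # ''
```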

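find_parents() find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.

find_next_siblings() find_next_sibling()
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling node.

find_previous_siblings() find_previous_sibling()
find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling node.

find_all_next() find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

find_all_previous() find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.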
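
These methods accept the same filters as find_all(), only the direction of the search changes. Here is a short sketch that starts from the middle li of a list; the snippet is a trimmed version of the test HTML, assumed for this example.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel-body"><ul id="list-1">'
    "<li>Foo</li><li>Bar</li><li>Jay</li>"
    "</ul></div>"
)
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")  # start from the middle item

print(bar.find_parent("div")["class"])         # ['panel-body']
print(bar.find_next_sibling("li").string)      # Jay
print(bar.find_previous_sibling("li").string)  # Foo

after = [li.string for li in bar.find_all_next("li")]
before = [li.string for li in bar.find_all_previous("li")]
print(after)   # ['Jay']
print(before)  # ['Foo']
```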

CSS selectors

Pass a CSS selector straight to select() to make the selection:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
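# Select elements with class panel-heading inside elements with class panel; a class selector is prefixed with '.'
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))  # tag selectors: li tags inside ul tags
print(soup.select('#list-2 .element'))  # '#' selects by id: elements with class element inside the element whose id is list-2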
print(type(soup.select('ul')[0]))

Output:

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

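select() calls can also be nested, although there is little need: separating the selectors with a space, as above, already achieves the nesting.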

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
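
Two more selector features tend to be handy: select_one(), which returns only the first match (or None), and the child combinator >, which restricts a match to direct children. The sketch below uses a trimmed copy of the test HTML, assumed for this example.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A ul that carries both classes; select_one returns a single Tag or None
small = soup.select_one("ul.list.list-small")
print(small["id"])  # list-2

# '>' matches direct children only
direct = [li.get_text() for li in soup.select("#list-1 > li")]
print(direct)  # ['Foo', 'Bar']

print(soup.select_one("table"))  # None
```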

Once the elements have been selected, here is how to get their attributes and content:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])  # or: print(ul.attrs['id'])

Getting the content:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
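
Putting attribute access and get_text() together, the scraped values can be collected into an ordinary Python structure instead of being printed. This is only a sketch; the items dictionary and the trimmed HTML are assumptions made for the example.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Map each list's id attribute to the text of its items
items = {
    ul["id"]: [li.get_text() for li in ul.select("li")]
    for ul in soup.select("ul")
}
print(items)  # {'list-1': ['Foo', 'Bar', 'Jay'], 'list-2': ['Foo', 'Bar']}
```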

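Summary

Prefer the lxml parser, and fall back to html.parser when necessary
Tag selectors have weak filtering power but are fast
Use find() and find_all() to match a single result or multiple results
If you are comfortable with CSS selectors, select() is a convenient choice
Remember the common ways of getting attribute values and text

For more on using BeautifulSoup, see its official documentation.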
