Web Scraping, Lecture 5: The BeautifulSoup Web Page Parsing Library

2018-08-21, by 谢谢_d802

BeautifulSoup

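BeautifulSoup is a flexible, convenient library for parsing web pages. It processes documents efficiently and supports several parsers, and it lets you extract information from a page without writing regular expressions.

Installing BeautifulSoup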

pip3 install beautifulsoup4

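Using BeautifulSoup

Parsers

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup,"html.parser")	Built into Python; decent speed; lenient with malformed documents	Less lenient in Python versions before 2.7.3 and 3.2.2
lxml HTML parser	BeautifulSoup(markup,"lxml")	Very fast; lenient with malformed documents	Requires the external lxml package (a C extension)
lxml XML parser	BeautifulSoup(markup,"xml")	Very fast; the only XML parser supported	Requires the external lxml package (a C extension)
html5lib	BeautifulSoup(markup,"html5lib")	Most lenient; parses pages the way a web browser does; produces HTML5-style documents	Very slow; requires the external html5lib package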
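
Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when the preferred one is missing. The following is a minimal sketch of that idea; the helper name make_soup is hypothetical, and it assumes that bs4 raises FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


# Deliberately unclosed tags: either parser repairs them.
soup = make_soup("<p>Hello<b>world")
bold_text = soup.b.string
print(bold_text)
```

Different parsers can build slightly different trees from the same broken markup, so it is worth using the same parser in development and in production.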

Basic usage

import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.baidu.com').text
soup = BeautifulSoup(response,'lxml')
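print(soup.prettify())  # prettify() pretty-prints the parse tree with indentation; the parser has already closed any unclosed tags
print(soup.title.string)  # the text of the <title> tag inside <head>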
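
If the printed title comes out garbled, one likely cause is that requests guessed the wrong text encoding for response.text. A possible workaround, sketched below under that assumption, is to hand BeautifulSoup the raw bytes (response.content when using requests) and let it detect the encoding from the page itself. The helper name title_from_bytes and the sample page are made up for illustration, and the sketch uses html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup


def title_from_bytes(raw_html):
    """Return the <title> text of a page given as undecoded bytes."""
    # Given bytes, BeautifulSoup looks for a declared charset and decodes itself.
    soup = BeautifulSoup(raw_html, "html.parser")
    return soup.title.string if soup.title else None


# Stand-in for response.content: a UTF-8 page that declares its charset.
raw = (
    '<html><head><meta charset="utf-8">'
    "<title>百度一下</title></head><body></body></html>"
).encode("utf-8")

page_title = title_from_bytes(raw)
print(page_title)
```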

Tag selectors
Selecting elements

import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id =""link1><!---Elsa---></a>
    <a href="http://www.baidu.com" class="sister" id =""link2>Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id =""link3>Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
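print(soup.title)  # the <title> element, printed together with its tags
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.head)  # the <head> element
print(soup.p)  # only the first <p> tag found
print(soup.p.name)  # the tag's name, which here is simply 'p'

Getting the tag name
See the example above (soup.p.name).

Getting attributes
This works somewhat like jQuery.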


import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id =""link1><!---Elsa---></a>
    <a href="http://www.baidu.com" class="sister" id =""link2>Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id =""link3>Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])  # value of the name attribute on the first <p>: 'dropmouse'. soup.p.attrs is the dict of all attributes: {'class': ['title'], 'name': 'dropmouse'}
print(soup.p['name'])  # also 'dropmouse'; shorthand for the line above
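
Indexing a tag with an attribute it does not have raises KeyError, so the dict-style get() is often safer when scraping real pages. A small sketch follows; the sample markup is made up, and html.parser is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dropmouse">x</p>', "html.parser")
p = soup.p

classes = p["class"]                  # class is multi-valued, so this is a list
missing_id = p.get("id")              # None instead of a KeyError
fallback_id = p.get("id", "no-id")    # explicit default value
has_name = p.has_attr("name")         # membership test

print(classes, missing_id, fallback_id, has_name)
```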

Getting content
For example, to get the content of a p tag:

import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id =""link1><!---Elsa---></a>
    <a href="http://www.baidu.com" class="sister" id =""link2>Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id =""link3>Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
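print(soup.p.string)  # .string gives the text inside the selected tag, without any HTML markup

Nested selection
A bs4.element.Tag can in turn be used to select child tags inside that Tag. For example: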

import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="title" name="dropmouse"><b>The doc story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id =""link1><!---Elsa---></a>
    <a href="http://www.baidu.com" class="sister" id =""link2>Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id =""link3>Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.body.p.string)  # chained access, again similar to jQuery

Child and descendant nodes

import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id =""link1><!---Elsa---></a>
    <a href="http://www.baidu.com" class="sister" id =""link2>Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id =""link3>Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
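print(soup.p.contents)  # a list of all direct children of the first <p>, including text nodes that contain newlines
print(soup.p.string)  # None: this <p> has several children (text and <a> tags), so .string is ambiguous and returns None

Another way to get child nodes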
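
When .string returns None because a tag has several children, the text can still be collected with get_text() or the stripped_strings generator. A short sketch with made-up markup (html.parser is used so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = '<p class="story">Once <a href="#">Lacie</a> and <a href="#">Tille</a></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

single = p.string                    # None: <p> has more than one child
joined = p.get_text()                # all text nodes concatenated
pieces = list(p.stripped_strings)    # each text node, whitespace-stripped

print(single)
print(joined)
print(pieces)
```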

import requests
from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id =""link1><!---Elsa---></a>
    <a href="http://www.baidu.com" class="sister" id =""link2>Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id =""link3>Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)  # an iterator over the direct children
for i,child in enumerate(soup.p.children):
    print(i,child)

Output:
<list_iterator object at 0x7fda5c186c88>
0 Once upon a time there were three little sisters;and their names lll
  
1 <a class="sister" href="http://www.baidu.com" id="" link1=""><!---Elsa---></a>
2 

3 <a class="sister" href="http://www.baidu.com" id="" link2="">Lacie</a>
4  and
    
5 <a class="sister" href="http://www.baidu.com" id="" link3="">Tille</a>
6 ;
    and They lived at the bottom of a well.

Descendant nodes

import requests
from bs4 import BeautifulSoup
#response = requests.get('http://www.baidu.com').text
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id =""link1>
        <span>Elsle</span>
    </a>
    <a href="http://www.baidu.com" class="sister" id =""link2>Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id =""link3>Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)

This returns every descendant node under the first p found:

<generator object descendants at 0x7f0b04eceaf0>
0 Once upon a time there were three little sisters;and their names lll
    
1 <a class="sister" href="http://www.baidu.com" id="" link1="">
<span>Elsle</span>
</a>
2 

3 <span>Elsle</span>
4 Elsle
5 

6 

7 <a class="sister" href="http://www.baidu.com" id="" link2="">Lacie</a>
8 Lacie
9  and
    
10 <a class="sister" href="http://www.baidu.com" id="" link3="">Tille</a>
11 Tille
12 ;
    and They lived at the bottom of a well.

Parent and ancestor nodes

import requests
from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id =""link1>
        <span>Elsle</span>
    </a>
    <a href="http://www.baidu.com" class="sister" id =""link2>Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id =""link3>Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)

Output: this finds the first a tag, takes its parent node, and prints that whole p tag with everything inside it.

<p class="story">Once upon a time there were three little sisters;and their names lll
    <a class="sister" href="http://www.baidu.com" id="" link1="">
<span>Elsle</span>
</a>
<a class="sister" href="http://www.baidu.com" id="" link2="">Lacie</a> and
    <a class="sister" href="http://www.baidu.com" id="" link3="">Tille</a>;
    and They lived at the bottom of a well.</p>

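Ancestor nodes

soup.a.parents  # the ancestors of the first <a> found, returned as a generator that walks upward level by level: p, body, html, and finally the BeautifulSoup object itself

Sibling nodes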
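
To see the chain of ancestors concretely, the generator can be consumed into a list of tag names. A minimal sketch with made-up markup (html.parser is used so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>Elsle</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the first <a>; the last entry is the BeautifulSoup object,
# whose name is the special string '[document]'.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```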

import requests
from bs4 import BeautifulSoup
html = """
    <html><head><title>This is a test Html code</title></head>
    <body>
    <p class="story">Once upon a time there were three little sisters;and their names lll
    <a href="http://www.baidu.com" class="sister" id =""link1>
        <span>Elsle</span>
    </a>
    <a href="http://www.baidu.com" class="sister" id =""link2>Lacie</a> and
    <a href="http://www.baidu.com" class="sister" id =""link3>Tille</a>;
    and They lived at the bottom of a well.</p>
    <p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))  # all siblings after the first <a>
print(list(enumerate(soup.a.previous_siblings)))  # all siblings before the first <a>
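
Sibling iteration yields text nodes as well as tags, which is easy to overlook. The sketch below, using made-up markup and html.parser, keeps only the Tag siblings:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p><a id="link1">Elsa</a>, <a id="link2">Lacie</a> and <a id="link3">Tille</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

all_siblings = list(first.next_siblings)   # two text nodes and two <a> tags
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
sibling_ids = [t["id"] for t in tag_siblings]

print(len(all_siblings), sibling_ids)
```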

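The tag selectors above make it hard to pick out one specific element (they usually return only the first match), so BeautifulSoup also provides standard selectors which, like CSS selectors, can search the document by tag name, attributes, or text content.

Standard selectors

find_all(name,attrs,recursive,text,**kwargs)

name: the tag name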

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))  # find_all returns a list; here, every <ul> element together with its contents
print(type(soup.find_all('ul')[0]))

Output:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">That's ok</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">FOO</li>
<li class="element">BAR</li>
</ul>]
<class 'bs4.element.Tag'>

Each item in the list returned by find_all is a bs4.element.Tag, so you can call find_all on it again and search level by level:

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

Output: the li elements under each ul.

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">That's ok</li>]
[<li class="element">FOO</li>, <li class="element">BAR</li>]

attrs: find_all(attrs={'name':'element'}) finds every element whose name attribute has the value element.

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
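print(soup.find_all(attrs={"class":"list"}))  # common attributes such as class and id can also be passed as keyword arguments: class_="list", id="list-1"
print(soup.find_all(attrs={"id":"list-1"}))

text: find_all(text="Foo") matches text nodes. In Beautiful Soup 4.4.0 and later the same argument is also available under the name string.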
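
The keyword forms mentioned in the comment can be sketched as follows. The markup is a trimmed copy of the example above, and html.parser is used so the snippet runs without lxml. Note that class_ matches when any one of a tag's classes equals the given value.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">FOO</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all(class_="list")           # both <ul> tags carry the class "list"
by_id = soup.find_all(id="list-1")                # only the first <ul>
small = soup.find_all("ul", class_="list-small")  # tag name and class combined

print([ul["id"] for ul in by_class])
print([ul["id"] for ul in by_id])
print([ul["id"] for ul in small])
```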

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text="Foo"))

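Return value: ['Foo']
This returns the matching text nodes rather than the tags that contain them, so on its own it does not locate elements; it mainly tells you whether the target text was found, which is of limited use.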
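
Text matching becomes more useful once it is combined with a tag name or a regular expression. The sketch below assumes Beautiful Soup 4.4.0 or later, where the argument is called string; the markup is a trimmed copy of the example above, parsed with html.parser.

```python
import re

from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li><li class="element">FOO</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern matches text nodes case-insensitively.
text_nodes = soup.find_all(string=re.compile("foo", re.IGNORECASE))
# Each matched text node still knows which tag contains it.
parent_names = [node.parent.name for node in text_nodes]
# Passing a tag name together with string returns the tags themselves.
tags = soup.find_all("li", string="Foo")

print(text_nodes, parent_names, tags)
```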

find(name,attrs,recursive,text,**kwargs)

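Returns the first matching element, or None if nothing matches; find_all, by contrast, returns a list of all matching elements.
Usage mirrors find_all(), so the examples above carry over.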
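
Because find() returns None on a miss, chaining attribute access straight onto its result can raise AttributeError. A minimal sketch of guarding against that, with made-up markup and html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li></ul>", "html.parser")

first_li = soup.find("li")       # the first match only
missing = soup.find("table")     # no <table> in this document, so None

# Check for None before touching the result.
table_text = missing.get_text() if missing is not None else ""

print(first_li.string, missing, repr(table_text))
```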

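find_parents(), find_parent()

These work like find_all() and find() but search upward: find_parents() returns all matching ancestor nodes, and find_parent() returns the nearest one (the direct parent when no filter is given).

find_next_siblings(), find_next_sibling()

Return all following sibling nodes that match, and the first following sibling that matches, respectively.

find_previous_siblings(), find_previous_sibling()

Return all preceding sibling nodes that match, and the first preceding sibling that matches, respectively.

find_all_next(), find_next()

Return all matching nodes that come after the current node in the document, and the first such node, respectively.

find_all_previous(), find_previous()

Return all matching nodes that come before the current node in the document, and the first such node, respectively.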
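
The directional variants can be tried on a small document. This is an illustrative sketch with made-up markup, parsed with html.parser; the starting node is the middle li.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><ul id="list-1"><li>Foo</li><li>Bar</li><li>Baz</li></ul><p>after</p></div>'
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")

parent_div_id = bar.find_parent("div")["id"]                # nearest <div> ancestor
next_li = bar.find_next_sibling("li").string                # sibling right after
prev_li = bar.find_previous_sibling("li").string            # sibling right before
later_li = [li.string for li in bar.find_all_next("li")]    # every later <li> in the document
earlier_li = bar.find_previous("li").string                 # closest earlier <li>

print(parent_div_id, next_li, prev_li, later_li, earlier_li)
```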

CSS selectors

Pass a CSS selector string directly to select() to make a selection.

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
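print(soup.select('.pannel .pannel-heading'))  # elements with class pannel-heading inside an element with class pannel
print(soup.select('ul li'))  # every <li> nested under a <ul>, contents included
print(soup.select('#list-2 .element'))  # elements with class element under the element whose id is list-2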

Output:

[<div class="pannel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">That's ok</li>, <li class="element">FOO</li>, <li class="element">BAR</li>]
[<li class="element">FOO</li>, <li class="element">BAR</li>]
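
A few more selector forms are sketched below: select_one() for a single match, the child combinator, and an attribute selector. The markup is a trimmed copy of the example above, parsed with html.parser. In recent Beautiful Soup releases, selector support comes from the soupsieve package that is installed alongside bs4.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">FOO</li><li class="element">BAR</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_small = soup.select_one("#list-2 > li").get_text()               # first match only
small_items = [li.get_text() for li in soup.select("ul.list-small li")]
by_attr = [li.get_text() for li in soup.select('ul[id="list-1"] > li')]
nothing = soup.select_one("table")                                      # None when nothing matches

print(first_small, small_items, by_attr, nothing)
```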

Getting attributes


import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])  # the value of each ul's id attribute
    print(ul.attrs['id'])  # same result as the line above; either form works for any attribute

Getting text with get_text()

import requests
from bs4 import BeautifulSoup
html = """
<div class="pannel">
    <div class="pannel-heading">
        <h4>Hello</h4>
    </div>
    <div class="pannel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">That's ok</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">FOO</li>
            <li class="element">BAR</li>
        </ul>
    </div>
</div>            
"""
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())

Output:

Foo
Bar
That's ok
FOO
BAR
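
Called on a container, get_text() concatenates every text node, including the whitespace between tags. Its separator and strip parameters help tidy the result, as in this sketch with made-up markup parsed by html.parser:

```python
from bs4 import BeautifulSoup

html = "<ul>\n<li>Foo</li>\n<li>Bar</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

raw_text = ul.get_text()                               # keeps the newlines between tags
clean_text = ul.get_text(separator=", ", strip=True)   # strips each piece, then joins them

print(repr(raw_text))
print(clean_text)
```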

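Summary

The lxml parser is recommended; fall back to html.parser or html5lib when necessary.
Tag selectors are fast but offer weak filtering.
Use find() and find_all() to match a single result or multiple results.
If you are comfortable with CSS selectors, use select().
Remember the common ways to get attributes and text.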