Web Scraping Notes 2: Parsing

2018-12-26 三流之路

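BeautifulSoup parses content that has already been fetched. It is a library for parsing, traversing, and maintaining the "tag tree".

Install the module with pip install beautifulsoup4, and import it with import bs4.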

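BeautifulSoup converts any HTML input to Unicode and encodes output as UTF-8 by default. Python 3 strings are Unicode and its default source encoding is UTF-8, so no extra conversion is needed.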
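
As a rough illustration of this encoding behaviour, the sketch below feeds GBK-encoded bytes to BeautifulSoup and gets Unicode strings back. The sample markup and the explicit from_encoding hint are assumptions made for this example, not part of the original notes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a tiny page encoded as GBK bytes
raw = '<html><head><title>你好</title></head></html>'.encode('gbk')

# from_encoding tells BeautifulSoup how to decode the incoming bytes
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

# Text inside the tree is Unicode (NavigableString is a str subclass)
title_text = soup.title.string

# encode() on a tag returns bytes, UTF-8 by default
title_bytes = soup.title.encode()

print(type(title_text), title_text)
print(title_bytes)
```

If the source encoding is unknown, from_encoding can be left out and BeautifulSoup will try to detect it.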

The BeautifulSoup object

A BeautifulSoup object corresponds to the tag tree of the document.

Parsing a remote web page:

import requests, bs4

res = requests.get('https://www.baidu.com/')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, features="html.parser")
print(type(soup)) # <class 'bs4.BeautifulSoup'>
print(soup.prettify())  # print the HTML document

Parsing a local file:

from bs4 import BeautifulSoup

# Use a context manager so the file is closed once parsing is done
with open('/Users/silas/Downloads/GDT DEV guide.4.16.html') as file:
    soup = BeautifulSoup(file, "html.parser")
print(type(soup))  # <class 'bs4.BeautifulSoup'>
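
To try local-file parsing without depending on a particular path, here is a minimal sketch that writes a throwaway HTML file and parses it. The file name sample.html, its content, and the explicit utf-8 encoding are assumptions made for this example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample file, created only for this example
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.html')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('<html><head><title>Local page</title></head></html>')

    # BeautifulSoup reads the whole file object inside the constructor,
    # so the file can be closed right after parsing
    with open(path, encoding='utf-8') as f:
        local_soup = BeautifulSoup(f, 'html.parser')

print(local_soup.title.string)
```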

Parsers

Parser	Usage	Requirement
Python's built-in HTML parser (used via bs4)	`BeautifulSoup(mk,'html.parser')`	Install the bs4 library
lxml's HTML parser	`BeautifulSoup(mk,'lxml')`	`pip install lxml`
lxml's XML parser	`BeautifulSoup(mk,'xml')`	`pip install lxml`
html5lib's parser	`BeautifulSoup(mk,'html5lib')`	`pip install html5lib`
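
Since lxml and html5lib are optional installs, one possible way to choose a parser is to try them in order and fall back to html.parser. This is only a sketch: make_soup and its preference order are hypothetical names and choices made for this example, while FeatureNotFound is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=('lxml', 'html5lib', 'html.parser')):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name)
        except FeatureNotFound:
            # This parser is not installed, try the next one
            continue
    raise RuntimeError('no usable parser found')


soup = make_soup('<p>hello</p>')
print(soup.p.string)
```

Different parsers may build slightly different trees from invalid HTML, so it can be safer to pin one parser once a script depends on the exact tree shape.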

Tag elements

Finding elements with the select method:

soup.select('div')  # all div elements
soup.select('#author')  # the element whose id attribute is author
soup.select('div #author')  # rules can be combined: the element with id author inside a div
soup.select('.notice')  # elements whose class attribute is notice
soup.select('div span')  # all span elements anywhere inside a div
soup.select('div > span')  # all span elements that are direct children of a div
soup.select('input[name]')  # input elements that have a name attribute
soup.select('input[type="button"]')  # input elements whose type attribute equals button
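
To see what these selectors match, here is a small self-contained sketch. The HTML snippet is made up for this example; the selectors are the ones listed above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written only to exercise the selectors
html = """
<div id="box">
  <span class="notice">direct child</span>
  <p><span>nested</span></p>
  <p id="author">Silas</p>
</div>
<form>
  <input name="q" type="text">
  <input type="button" value="Go">
</form>
"""
demo = BeautifulSoup(html, 'html.parser')

# Number of elements matched by each selector
counts = {
    'div': len(demo.select('div')),
    '#author': len(demo.select('#author')),
    'div #author': len(demo.select('div #author')),
    '.notice': len(demo.select('.notice')),
    'div span': len(demo.select('div span')),
    'div > span': len(demo.select('div > span')),
    'input[name]': len(demo.select('input[name]')),
    'input[type="button"]': len(demo.select('input[type="button"]')),
}
for selector, count in counts.items():
    print(selector, count)
```

The difference between 'div span' and 'div > span' shows up in the counts: the first also matches the span nested inside the p element.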

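The select() method returns a list of Tag objects, which is how BeautifulSoup represents an HTML element.

A Tag can be passed to str() to show the HTML markup it represents.
Call getText() on a Tag to get the text inside the tag.
Pass an attribute name to a Tag's get() method to get the value of that attribute.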

>>> tags = soup.select('input[name]')
>>> len(tags)  # length of the list
3
>>> type(tags[0])
<class 'bs4.element.Tag'>
>>> str(tags[0])
'<input class="search_key" id="cbsearchtxt" name="q" onkeypress="if(event.keyCode==13){document.getElementById(\'cbsearchsub\').click();return false;}" size="30" type="text"/>'
>>> tags[0].getText()  # self-closing tag, so there is no content
''
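
The same three calls can be tried on a tag that does have text and attributes. This is a minimal sketch on made-up markup; the id, href, and text are assumptions for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this example
html = '<p>By <a id="author" href="/users/silas">Silas</a></p>'
doc = BeautifulSoup(html, 'html.parser')

link = doc.select('#author')[0]

as_html = str(link)          # the markup the Tag represents
text = link.getText()        # the text inside the tag
href = link.get('href')      # value of an existing attribute
missing = link.get('title')  # get() returns None when the attribute is absent

print(as_html)
print(text, href, missing)
```

Because get() returns None for a missing attribute instead of raising, it is often more convenient than indexing the tag with link['title'].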

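A Tag can also be fetched directly by its tag name, and its content read from there.

tag = soup.p  # get the first p tag
tag.name  # tag name
tag.attrs  # dictionary of attributes
tag.string  # string content of the tag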

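Basic element	Description
Tag	A tag, the most basic unit of information; `<>` and `</>` mark its start and end
Name	The name of the tag; the name of `<p>…</p>` is `p`. Format: `<tag>.name`
Attributes	The attributes of the tag, organized as a dictionary. Format: `<tag>.attrs`
NavigableString	The non-attribute string inside a tag, i.e. the string between `<>…</>`. Format: `<tag>.string`
Comment	The comment part of the string inside a tag; a special kind of NavigableString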

>>> import requests
>>> from bs4 import BeautifulSoup
>>> res = requests.get('https://www.baidu.com/')
>>> res.status_code
200
>>> res.encoding = res.apparent_encoding
>>> soup = BeautifulSoup(res.text, 'html.parser')
>>> soup.title
<title>百度一下，你就知道</title>

Tag

soup.a looks up the <a> tag and returns the first one:

>>> tag = soup.a
>>> tag
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>

Name

>>> tag.name
'a'
>>> tag.parent.name
'div'

Attributes

>>> tag.attrs  # a dictionary
{'href': 'http://news.baidu.com', 'name': 'tj_trnews', 'class': ['mnav']}
>>> tag.attrs['class']
['mnav']

NavigableString

>>> tag.string
'新闻'
>>> type(tag.string)
<class 'bs4.element.NavigableString'>
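
One detail worth knowing about .string: it only returns a value when the tag has a single child. The sketch below, on made-up markup, shows the None case and uses get_text() to collect all the text instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the p tag has two children, a string and a b tag
snippet = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')

inner = snippet.b.string        # b has exactly one string child
outer = snippet.p.string        # p has several children, so this is None
all_text = snippet.p.get_text() # concatenates every string under p

print(repr(inner), repr(outer), repr(all_text))
```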

Comment

>>> newSoup = BeautifulSoup("<b><!-- This is a comment --></b><p>This is not a comment</p>", "html.parser")
>>> newSoup.b.string
' This is a comment '
>>> type(newSoup.b.string)  # fetched the same way; only the type differs
<class 'bs4.element.Comment'>
>>> newSoup.p.string
'This is not a comment'
>>> type(newSoup.p.string)
<class 'bs4.element.NavigableString'>
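
Because a comment and ordinary text are both read through .string, code that only wants visible text usually has to check the type. A minimal sketch of that check, reusing the markup from the session above (the variable names are assumptions for this example):

```python
from bs4 import BeautifulSoup, Comment, NavigableString

markup = "<b><!-- This is a comment --></b><p>This is not a comment</p>"
comment_soup = BeautifulSoup(markup, "html.parser")

# Comment is a subclass of NavigableString, so test for Comment first
comments = [node for node in comment_soup.descendants
            if isinstance(node, Comment)]
plain_text = [node for node in comment_soup.descendants
              if isinstance(node, NavigableString)
              and not isinstance(node, Comment)]

print(comments)
print(plain_text)
```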

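Traversing the tag tree

Downward traversal

Attribute	Description
.contents	List of child nodes; all child nodes of `<tag>` are stored in a list
.children	Iterator over the child nodes; like .contents, used to loop over child nodes
.descendants	Iterator over all descendant nodes (children, grandchildren, and so on), used for looping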

>>> soup.head
<head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>百度一下，你就知道</title></head>

>>> soup.head.contents  # child nodes stored in a list
[<meta content="text/html;charset=utf-8" http-equiv="content-type"/>, <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>, <meta content="always" name="referrer"/>, <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>, <title>百度一下，你就知道</title>]
>>> soup.head.contents[2]
<meta content="always" name="referrer"/>

>>> for child in soup.head.children:
    print(child)  # each direct child node

<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下，你就知道</title>

>>> for child in soup.head.descendants:
    print(child)  # recursive iteration, level by level

<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下，你就知道</title>
百度一下，你就知道
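
The difference between the three attributes is easier to see on a tiny document. This is a sketch on made-up markup: .contents and .children stop at the direct children, while .descendants also walks into the title and yields its string.

```python
from bs4 import BeautifulSoup

# Hypothetical head element with two children
small = BeautifulSoup(
    '<head><meta charset="utf-8"/><title>Demo</title></head>',
    'html.parser',
)

direct = small.head.contents               # a list: [<meta>, <title>]
via_children = list(small.head.children)   # same nodes, from an iterator
everything = list(small.head.descendants)  # <meta>, <title>, then 'Demo'

print(len(direct), len(via_children), len(everything))
```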

Upward traversal

Attribute	Description
.parent	The parent tag of the node
.parents	Iterator over the node's ancestor tags, used to loop over ancestors

>>> soup.title.parent
<head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>百度一下，你就知道</title></head>
>>> soup.parent  # soup itself has no parent tag, so nothing is printed

Iterating over all ancestor tags

>>> for parent in soup.a.parents:
    if parent is None:  # defensive check only
        print(parent)
    else:
        print(parent.name)  # the last item is the soup object itself, named '[document]'

div
div
div
div
body
html
[document]
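
The same loop can be used to collect the path from a tag up to the document root. A minimal sketch on made-up markup; the variable names are assumptions for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup
nested = BeautifulSoup(
    '<html><body><div><p><a href="#">link</a></p></div></body></html>',
    'html.parser',
)

# Names of every ancestor of the first a tag, nearest first
ancestor_names = [parent.name for parent in nested.a.parents]
print(ancestor_names)

# The BeautifulSoup object is the root, so its own parent is None
root_parent = nested.parent
print(root_parent)
```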

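Sibling traversal

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	Iterator over all following sibling nodes in HTML document order
.previous_siblings	Iterator over all preceding sibling nodes in HTML document order

Sibling traversal happens among the nodes under the same parent. A sibling may be a NavigableString rather than a tag.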

>>> soup.a.next_sibling
' '
>>> soup.a.next_sibling.next_sibling
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
>>> soup.a.previous_sibling
' '
>>> soup.a.previous_sibling.previous_sibling  # nothing before it (None), so nothing is printed

# iterate over the following siblings
for sibling in soup.a.next_siblings:
    print(sibling)

# iterate over the preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)
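
Since siblings can be whitespace strings, a common need is to skip them and keep only tags. The sketch below shows two ways to do that on made-up markup: an isinstance check against Tag, and find_next_sibling(), which only returns tags.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup: three links separated by single spaces
row = BeautifulSoup('<p><a>one</a> <a>two</a> <a>three</a></p>', 'html.parser')
first = row.a

raw_next = first.next_sibling  # the whitespace string between the links

# Keep only the siblings that are real tags
next_tags = [s for s in first.next_siblings if isinstance(s, Tag)]
next_texts = [t.string for t in next_tags]

# find_next_sibling skips strings and returns the next matching tag
second = first.find_next_sibling('a')

print(repr(raw_next), next_texts, second.string)
```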

Pretty-printing HTML

The prettify() method adds newlines (\n) and indentation to HTML tags and their content.

>>> soup.prettify()
'<!DOCTYPE html>\n<!--STATUS OK-->\n<html>\n <head>\n  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>\n  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>\n  <meta content="always" name="referrer"/>\n  <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>\n  <title>\n   百度一下，你就知道\n  </title>\n </head>\n <body link="#0000cc">\n  <div id="wrapper">\n
...

>>> print(soup.prettify())
<!DOCTYPE html>
<!--STATUS OK-->
<html>
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="always" name="referrer"/>
  <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
  <title>
   百度一下，你就知道
  </title>
 </head>
 <body link="#0000cc">
  <div id="wrapper">
   <div id="head">
    <div class="head_wrapper">
     <div class="s_form">
...

prettify() can also be called on a single tag

>>> print(soup.a.prettify())
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">
 新闻
</a>
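
For comparison, here is a small sketch that prints the same made-up document with str() and with prettify(). The extra whitespace that prettify() inserts becomes part of the output, so it is better suited to reading a document than to rewriting one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this example
page = BeautifulSoup('<div><a href="/news">News</a></div>', 'html.parser')

compact = str(page)       # markup on one line, as parsed
pretty = page.prettify()  # one tag or string per line, indented by depth

print(compact)
print(pretty)
```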