Web Scraping 17: The BS4 Parser

2022-09-14 _百草_

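BeautifulSoup is usually called BS4 (4 is the major version number).
Like lxml, it is an HTML/XML parser whose main job is parsing HTML/XML documents and extracting data from them.
To use BeautifulSoup, import it from the bs4 package.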

Drawback: slower than regular expressions and XPath
Advantage: simple to use

1. Installation

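pip install beautifulsoup4 (pip install bs4 also works; bs4 on PyPI is a thin wrapper that pulls in beautifulsoup4)
lxml is an optional third-party parser rather than a hard dependency; to use it, install it separately: pip install lxml
Python ships with a built-in parser, html.parser, but it is somewhat slower
The html5lib parser is another option: pip install html5lib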
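
Which of these parsers is usable depends on what is installed. A minimal sketch of picking the first available one follows; make_soup and its preference order are hypothetical names chosen for illustration, not part of bs4:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    Hypothetical helper for illustration; not part of bs4.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>hello</p>")
print(used, soup.p.get_text())
```

Since html.parser is part of the standard library, the loop should always end with a working parser.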

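“Parser fault tolerance” means that when the document being parsed contains errors or is malformed, the parser can still repair it and build a usable, well-formed parse tree.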
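
To make the idea concrete, here is a small sketch using the built-in html.parser; the broken fragments are made up for illustration, and other parsers may repair the same input differently:

```python
from bs4 import BeautifulSoup

# The <p> and <b> tags are never closed in this fragment
broken = "<p>Hello <b>world"
soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)       # serializing the tree adds the missing closing tags
print(repaired)
print(soup.b.get_text())   # the unclosed <b> is still an ordinary Tag

# A stray closing tag with no matching opener is ignored by html.parser
stray = str(BeautifulSoup("<a></p>", "html.parser"))
print(stray)
```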

2. The BS4 parse object

Create the parser object

from bs4 import BeautifulSoup
soup = BeautifulSoup(content, "html.parser")  # content: the markup to parse; "html.parser": the parser to use, which can also be "lxml" or "html5lib"

For an external file: soup = BeautifulSoup(open("xiaomi.html", "r", encoding="utf-8"), "lxml")  # create the parse object
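
When parsing from a file, a with block closes the file handle once parsing is done. The sketch below writes a throwaway file first so it can run anywhere; the file name sample.html and its content are assumptions for illustration only:

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file, created only so the example is self-contained
    path = Path(tmp) / "sample.html"
    path.write_text(
        "<html><head><title>Demo page</title></head><body></body></html>",
        encoding="utf-8",
    )

    # BeautifulSoup accepts an open file object and reads it during construction
    with open(path, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

# The tree stays usable after the file has been closed
title = soup.title.string
print(title)
```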

3. Common BS4 syntax

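Beautiful Soup converts a document into a tree structure, which makes it quick to traverse or search the HTML document.
Every node is a Python object of one of four types: Tag, NavigableString, BeautifulSoup, or Comment; the first two are the most commonly used.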

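Tag: the tag class; every tag in an HTML document can be treated as a Tag object
NavigableString: the string class; it represents the text inside a tag, which you read with text, string, or strings.
BeautifulSoup: represents the entire content of an HTML document and can be treated as a special Tag object
Comment: comments and similar special strings in an HTML document; it is a special kind of NavigableString.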
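
The four node types can be seen side by side in a short sketch; the one-line HTML fragment is made up for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup('<p class="intro">Hello<!-- a note --></p>', "html.parser")

p = soup.p                             # a Tag
text_node, comment_node = p.contents   # a NavigableString and a Comment

print(type(soup).__name__, type(p).__name__)
print(type(text_node).__name__, type(comment_node).__name__)

# BeautifulSoup is a Tag subclass; Comment is a NavigableString subclass
print(isinstance(soup, Tag), isinstance(comment_node, NavigableString))
```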

1) Tag nodes

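find_all(name, attrs, recursive, string, limit): finds every element that matches the filters and returns them in a list
1. find_all("tag", attribute): all elements with the given tag name and the given attribute
2. find_all(attribute): all elements with the given attribute
  Parameters:
  name: the tag name to match; text nodes (string objects) are ignored automatically
  attrs: searches tags by attribute name and value; because class is a Python keyword, use class_ instead
  recursive: by default find_all searches all descendants of the tag; set recursive=False to search only its direct children
  string: searches the text content of the document (named text before Beautiful Soup 4.4.0); accepts a string, a list, True, or a compiled regular expression
  limit: caps the number of results returned
find(name, attrs, recursive, string): same as find_all(), but returns only the first match (or None when nothing matches), so it has no limit parameter
For example, soup.a is equivalent to soup.find("a")
ele.get_text(): gets the element's text
ele["attribute"]: gets an attribute value
ele.attrs: gets all attributes as a dict,
e.g. {'data-listener': 'gameForm', 'href': '/game/62357809', 'class': ['clearfix']}
ele.name: gets the tag's name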
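
The parameters above can be tried on a small inline document. This is a minimal sketch; the HTML, the id nav, and the class names are invented for illustration:

```python
import re

from bs4 import BeautifulSoup

html = """
<div id="nav">
  <a class="item hot" href="https://example.com/a">First</a>
  <p><a class="item" href="/b">Second</a></p>
  <a href="/c">Third</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
nav = soup.find("div", id="nav")

all_links = nav.find_all("a")                      # every <a> descendant
direct_links = nav.find_all("a", recursive=False)  # direct children only
first_two = nav.find_all("a", limit=2)             # stop after two matches
items = nav.find_all("a", class_="item")           # class list contains "item"
absolute = nav.find_all("a", href=re.compile(r"^https?://"))
by_text = nav.find_all(string="Second")            # returns text nodes, not tags

first = nav.find("a")                              # first match only
print(first.name, first["href"], first.get_text())
print(first.attrs)
print(nav.find("span"))                            # no match, so find() gives None
```

Note that class is a multi-valued attribute, so class_="item" also matches the first link, whose class list is ['item', 'hot'].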

import re
from bs4 import BeautifulSoup

url = "./baidu.html"
with open(url, "r", encoding="utf-8") as f:
    content = f.read()

# Create a BeautifulSoup parse object
soup = BeautifulSoup(content, "html.parser")  # , from_encoding="utf-8")
# Passing from_encoding together with an already-decoded str triggers this warning:
#   warnings.warn("You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be
#   ignored.")

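soup.find_all(True)  # True matches anything, so this finds every tag

# Get all links
links = soup.find_all("a")  # same as soup("a")
for link in links:
    # Print each link's tag name, href, and text
    # print(link)
    try:
        print(link.name, link["href"], link.get_text())  # get_text() returns the element's text; link["attribute"] returns an attribute value
    except KeyError as e:
        # An <a> tag without an href attribute raises KeyError
        print(e)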

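# Get links with a specific URL
links2 = soup.find_all("a", href="http://")  # returns a list of whole elements whose href is exactly "http://"
print(links2)

# Regular expression
links3 = soup.find_all("a", href=re.compile(r"http"))  # href contains "http"
print(links3)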

# Get text
print("-"*30)
li4 = soup.find_all("li", class_="hotsearch-item")  # li tags with the given class name
for i in li4:
    print(i.get_text())

# Get tags that have a given attribute
li5 = soup.find_all("a", href=True)
print(li5)
li6 = soup.find_all("div", id="bottom_layer")[0].find_all("a", href=True)
print(li6)
print("-"*50)
# Get all tags that have an id attribute but no class attribute
id7 = soup.find_all(id=True, class_=False)
print(id7)

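# Use a regular expression on text nodes
# The pattern is matched against each text node, which excludes the enclosing tag's own < and >
email_re = re.compile(r">.+\d+.+<")
res = soup.find(string=email_re)  # string= was named text= before Beautiful Soup 4.4.0
print(res)  # None when no text node matches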
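
As an alternative to wrapping link["href"] in try/except, a tag's get() method returns None (or a default) when the attribute is missing. A minimal sketch, with a made-up two-link fragment:

```python
from bs4 import BeautifulSoup

html = '<a href="/home">Home</a><a name="top">Anchor</a>'
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for link in soup.find_all("a"):
    # Tag.get() works like dict.get(): no KeyError when the attribute is absent
    hrefs.append(link.get("href"))
print(hrefs)

# A default value can stand in for the missing attribute
with_default = [link.get("href", "#") for link in soup.find_all("a")]
print(with_default)

# Filtering with href=True skips tags that lack the attribute altogether
with_href = [a["href"] for a in soup.find_all("a", href=True)]
print(with_href)
```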

4. Traversing all nodes

The contents attribute of a tag: returns the direct children as a list

body_tag = soup.body
# Output all child nodes as a list
all_tags = body_tag.contents  # <class 'list'>
for ele in all_tags:
    print(ele)  # includes "\n" newline text nodes

The children attribute of a tag: returns an iterator over the direct children

# body_tag.children, by contrast, is an iterator (<class 'list_iterator'> in the version used here)
for child in body_tag.children:
    print(type(child))  # <class 'bs4.element.NavigableString'> or <class 'bs4.element.Tag'>
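
Because whitespace between tags shows up as text nodes, it is often useful to filter children by type. The sketch below also uses descendants, which walks the whole subtree rather than only direct children; the HTML is made up for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body>\n<h1>Title</h1>\n<p>Text <b>bold</b></p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# contents is a list that includes the "\n" text nodes between tags
print(len(body.contents))

# Keep only Tag children, dropping the whitespace strings
child_tags = [c.name for c in body.children if isinstance(c, Tag)]
print(child_tags)

# descendants also reaches nested tags such as <b>
descendant_tags = [d.name for d in body.descendants if isinstance(d, Tag)]
print(descendant_tags)
```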

5. CSS selectors

Note: the select() method takes a CSS selector string and returns a list of matching elements.

from bs4 import BeautifulSoup

with open("Python BS4.html", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, "html.parser")

# ID selector; returns a list
res = soup.select("#footer")
# print(res)

# Class selector
res = soup.select(".info-box")
# print(res)

# Tag selector
res = soup.select("h1")
# print(res)  # [<h1>Python BS4解析库用法详解</h1>]

# Selector list: selector1,selector2
res = soup.select("h1,#sidebar-toggle")
# print(res)
# [<span class="toggle-btn" id="sidebar-toggle" toggle-target="#sidebar">目录 <span class="iconfont"></span></span>, <h1>Python BS4解析库用法详解</h1>]

# Descendant selector: E F
res = soup.select("#header a")
# print(res)

# Child selector: E > F
res = soup.select("#header>a")
# print(res)

# Adjacent sibling selector: E + F
res = soup.select("span[class='toggle-btn']+a")
print(res)
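
The difference between the descendant, child, and adjacent sibling selectors is easier to see on a small inline document. A minimal sketch with invented markup follows; select_one() is the single-result counterpart of select():

```python
from bs4 import BeautifulSoup

html = """
<div id="header">
  <a href="/">Home</a>
  <span class="toggle-btn">Menu</span><a href="/docs">Docs</a>
  <p><a href="/deep">Deep</a></p>
</div>
<h1>Title</h1>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = soup.select("#header a")          # any depth below #header
children = soup.select("#header > a")           # direct children only
adjacent = soup.select("span.toggle-btn + a")   # the <a> right after the span
grouped = soup.select("h1, .toggle-btn")        # either selector may match
heading = soup.select_one("h1")                 # first match, or None

print([a.get_text() for a in descendants])
print([a.get_text() for a in children])
print([a.get_text() for a in adjacent])
print(sorted(tag.name for tag in grouped))
print(heading.get_text())
```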