01_7 Web Page Scraping: BS4

2018-07-30 siyu8023

BeautifulSoup

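Beautiful Soup is an excellent third-party Python library for extracting the data we care about from HTML or XML files, and it lets you choose which parser to use.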

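Part 1: Installing BeautifulSoup

1.1 Installing bs4 on macOS

Once Python 3 is installed, install the package directly from the command line with pip3 or easy_install. (easy_install is deprecated in current setuptools, so pip3 is the safer choice.)

1. Install with pip3

pip3 install beautifulsoup4

2. Install with easy_install

easy_install beautifulsoup4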

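1.2 Installing bs4 on Windows

Download the bs4 package into the local Python directory first, then run the install command.

The following method comes from an online post:

    Python 3.4.3 does not support BeautifulSoup very well, and most tutorials online cover installation on Python 2.7; following them took a lot of trial and error. In the end I found the solution at [https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)

     1. Download
        [https://www.crummy.com/software/BeautifulSoup/bs4/download/ ](https://www.crummy.com/software/BeautifulSoup/bs4/download/)
     2. Extract to D:\python34, i.e. the Python installation directory
     3. Open cmd and change to D:\python34\beautifulsoup4-4.4.1 (my install path), which contains the setup.py file
     4. In cmd, run ```python setup.py install```
     If that does not work, Beautiful Soup's license allows you to bundle the BS4 code with your project, so it can be used without installing.
     5. Test the installation
         5.1 Type python to enter the Python interpreter
         5.2 Type from bs4 import BeautifulSoup to check that the import succeeds.
      The Chinese BeautifulSoup documentation is linked here as well; enjoy the scraping journey. [https://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html#Quick%20Start](https://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html#Quick%20Start)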

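Following the method above

1. Download and install

After downloading the bs4 package, extract it into the local Python directory.
Note: in cmd, where python prints the local Python path.

2. Install with python setup.py install

Running python setup.py install failed at first because the folder I had changed into contained no setup.py file. I redid steps 1 and 2: downloaded the latest bs4 package (version 4.6) and changed into the folder that contains setup.py.

[Screenshot: bs4 package extracted into the local Python directory]

Running python setup.py install then starts the installation normally.

[Screenshot: installing bs4, part 1]

[Screenshot: installing bs4, part 2]

3. Start python and run from bs4 import BeautifulSoup to check that it works

[Screenshot: error raised by from bs4 import BeautifulSoup]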

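The fix I found for this error: because I am on Python 3.6, the library code has to be converted from Python 2 to Python 3.
Specifically: put the bs4 folder and 2to3.py together in Lib, change to Lib in cmd, and run 2to3.py bs4 -w.

[Screenshot: 2to3.py bs4 -w]

That run still failed. On checking, I found I had made a mistake when placing the bs4 folder into the Lib folder. The correct steps:

--1. The bs4 folder from the extracted beautifulsoup4 directory: C:\Users\xx\AppData\Local\Programs\Python\Python36\beautifulsoup4-4.6.0\beautifulsoup4-4.6.0\bs4
--2. 2to3.py: C:\Users\xx\AppData\Local\Programs\Python\Python36\Tools\scripts
Copy both into the Lib folder under the Python installation directory: C:\Users\xx\AppData\Local\Programs\Python\Python36\Lib
Change to the D:\Python37-32\Lib directory and run the command 2to3.py bs4 -w

[Screenshot: bs4 folder inside the bs4 package]

[Screenshot: 2to3.py file]

Start python and run from bs4 import BeautifulSoup to check that it works.

[Screenshot: installation complete, bs4 works normally]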
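
The import check can also be run as a small script. This is a minimal sketch: check_bs4_install is a hypothetical helper name, and it uses Python's built-in "html.parser" so that no extra parser package is assumed.

```python
import bs4
from bs4 import BeautifulSoup


def check_bs4_install():
    """Parse a tiny document as a smoke test and return the bs4 version."""
    soup = BeautifulSoup("<p>ok</p>", "html.parser")
    if soup.p.string != "ok":
        raise RuntimeError("bs4 imported but did not parse as expected")
    return bs4.__version__


print("bs4 version:", check_bs4_install())
```

If the import fails, Python raises ModuleNotFoundError before the helper runs, which is the same signal as the interactive check.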

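Part 2: BeautifulSoup Syntax

1. Learn just enough of the basics
2. Study 2: done
3. Study 3: still to learn, searching the document tree and find_all
4. Study 4: good practices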

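2.1. Basic elements of the BeautifulSoup class

Tag
A Tag object corresponds to a tag in the original XML or HTML document. It is the most basic unit of information, opened with <> and closed with </>.

<Tag>.name : Name, the name of the tag
<Tag>.attrs : Attributes, the tag's attributes, returned as a dictionary
<Tag>.string : NavigableString, the non-attribute string inside the tag
<Tag>.string : Comment, the comment inside the tag (returned when the comment is the tag's only child)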
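
The elements listed above can be inspected on a one-line document. This is a minimal sketch; the markup and the id value "intro" are made up for illustration, and "html.parser" is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

html = '<p class="title" id="intro"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                       # a Tag object
print(type(tag).__name__)          # Tag
print(tag.name)                    # p
print(tag.attrs)                   # dict of attributes; class is stored as a list
print(tag.string)                  # The Dormouse's story
print(type(tag.string).__name__)   # NavigableString
```

Note that tag.string works here even though the text sits inside a nested b tag, because each tag on the way down has exactly one child.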

1. Tag example
As covered earlier, soup.a or soup.p (soup.<tag name>) returns the first a tag or p tag in the document.

>>> tag1 = soup.a
>>> print(tag1)
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> tag2 = soup.p
>>> print(tag2)
<p class="title"><b>The Dormouse's story</b></p>

2. Name example
Similarly, soup.title.name returns the name of the title tag, i.e. the title inside <title>.

>>> tag1.name
'a'
>>> tag2.name
'p'

3. Attributes example

>>> tag1.attrs
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

# Here all attributes of tag1 (the a tag) are printed; the result is a dictionary.
>>> tag2.attrs
{'class': ['title']}

To get a single attribute, for example the value of its class, do this:

>>> tag2.attrs.get('class')
['title']
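
There are a few ways to read one attribute, and they differ in how they treat a missing attribute. The sketch below uses a made-up a tag with two class values to show the difference.

```python
from bs4 import BeautifulSoup

markup = '<a class="sister big" href="http://example.com/elsie" id="link1">Elsie</a>'
a = BeautifulSoup(markup, "html.parser").a

href = a["href"]                  # indexing; raises KeyError if the attribute is missing
link_id = a.get("id")             # .get() on the tag
missing = a.get("title")          # returns None instead of raising
classes = a.attrs.get("class")    # class is multi-valued, so this is a list
has_href = a.has_attr("href")

print(href, link_id, missing, classes, has_href)
```

Using .get() is the safer choice when scraping pages where an attribute may be absent on some tags.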

attrs can also be used to filter tags more precisely with find_all.
To get more than the first tag with a given name, use the methods described in "Searching the tree", such as find_all():

>>> soup.find_all('p')
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

4. NavigableString
NavigableString literally means a string that can be navigated.
It is one of the four object types in BeautifulSoup: Tag | NavigableString | BeautifulSoup | Comment

>>> tag1.string
'Elsie'
# the text content of tag1, i.e. the a tag #
>>> type(tag1.string)
<class 'bs4.element.NavigableString'>
# type() shows the object's class: NavigableString #
>>> tag2.string
"The Dormouse's story"
>>> type(tag2.string)
<class 'bs4.element.NavigableString'>

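5. Comment example (how it differs from NavigableString)
A Comment object is a special kind of NavigableString: it is a subclass that holds the text of an HTML comment. Because .string returns it just like ordinary text, only a type check tells the two apart. (I still found this part hard to follow.)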
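
A short experiment may make the relationship clearer. This sketch uses two made-up tags, one holding only a comment and one holding plain text, and compares what .string returns for each.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<b><!--Hey, buddy--></b><i>plain text</i>", "html.parser")

comment = soup.b.string   # the b tag's only child is a comment
text = soup.i.string      # the i tag's only child is ordinary text

print(type(comment).__name__)                 # Comment
print(type(text).__name__)                    # NavigableString
print(isinstance(comment, NavigableString))   # True: Comment subclasses NavigableString
print(isinstance(text, Comment))              # False
print(comment)                                # Hey, buddy (without the <!-- --> markers)
```

When scraping visible text, checking isinstance(value, Comment) is one way to skip comment content that .string would otherwise return.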

2.2 Attributes and print statements

2.2.1 Attributes

1. Tag selector

Add the following code to the quick-start script:

print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

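Purpose: soup.<tag name> returns that tag together with its content.
Note: if the document contains several tags with that name, only the first one is returned. For example, soup.p above returns only the first p tag even though the document has several; see the output of print(soup.p) in item 6 of section 2.2.2.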
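
The "first match only" behavior is easy to confirm. This is a minimal sketch on a made-up three-paragraph document; it compares soup.p with find() and find_all().

```python
from bs4 import BeautifulSoup

html = '<p class="title">first</p><p class="story">second</p><p class="story">third</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.p                   # attribute access stops at the first <p>
same_as_find = first == soup.find("p")
all_p = soup.find_all("p")       # every <p>, in document order

print(first)
print(same_as_find)
print([p.string for p in all_p])
```

To reach the second or third tag, index into the find_all() result, for example all_p[1].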

2. Getting the name

soup.title.name returns the name of the title tag, i.e. title:
print(soup.title.name)

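3. Getting attributes

The two forms below are equivalent, and both raise KeyError when the tag lacks the attribute. The html_doc in section 2.2.2 has no name attribute on its p tags, so class is used here:

print(soup.p.attrs['class'])
print(soup.p['class'])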

4. Getting the content

print(soup.p.string)

This returns the content of the first p tag:

The Dormouse's story

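5. Nested selection

See the output of print(soup.title.parent.name) in item 5 below.
Nesting relies on parent and child nodes: child.parent gives the parent node, so attributes can be chained to reach the parent's information.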
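
Parent and child navigation can be sketched on a small document. The markup below is a shortened, made-up version of the sample page, parsed with "html.parser".

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

nested = soup.head.title.string                       # walk down: head -> title -> text
parent_name = soup.title.parent.name                  # walk up one level
ancestors = [p.name for p in soup.b.parents]          # every ancestor of <b>
children = [c.name for c in soup.body.children]       # direct children of <body>

print(nested)
print(parent_name)
print(ancestors)
print(children)
```

The last ancestor is the BeautifulSoup object itself, which reports its name as [document].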

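6. Standard selectors

find_all: searches the document by tag name, attributes, or text content.
find_all() looks through all tag descendants of the current tag and returns those that match the filters. With recursive=False it checks direct children only.

Still to do: work through find_all examples.

find_all(name,attrs,recursive,text,**kwargs)

Since Beautiful Soup 4.4.0 the text argument is named string (text is the older name), and there is also a limit argument.

The remaining parameters are not covered here yet; for more, see 'https://www.cnblogs.com/zhaof/p/6930955.html'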
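
As a starting point for the find_all study item, here is a minimal sketch of the main filter types on the three-sisters sample document. It uses "html.parser" rather than html5lib only to avoid an extra dependency; the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_name = soup.find_all("a")                              # filter by tag name
by_class = soup.find_all("p", class_="story")             # class_ avoids the Python keyword
by_attrs = soup.find_all(attrs={"id": "link2"})           # filter by an attribute dict
by_regex = soup.find_all(href=re.compile("tillie"))       # attribute value matched by regex
by_string = soup.find_all(string="Elsie")                 # match text nodes, not tags
first_two = soup.find_all("a", limit=2)                   # stop after two matches
direct_only = soup.body.find_all("a", recursive=False)    # direct children of <body> only

print(len(by_name), len(by_class), by_attrs[0]["href"])
print(by_regex[0]["id"], by_string, len(first_two), direct_only)
```

direct_only is empty because every a tag sits inside a p tag rather than directly under body.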

2.2.2 Print statements

#-*- coding:utf-8 -*-

from bs4 import BeautifulSoup

html_doc = """ 
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p> 
"""

soup = BeautifulSoup(html_doc, 'html5lib')

print(soup.prettify())
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p["class"])
print(soup.a)
print(soup.find_all('a'))
print(soup.find(id='link3'))

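1. Output of print(soup.prettify())
prettify() produces formatted output: it re-indents the markup into a standard HTML layout. The HTML embedded in the .py file is not neatly formatted, and prettify() makes its structure easier to read.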

C:\Users\sylvia.li\PycharmProjects\untitled\venv\Scripts\python.exe F:/01_python/01_testsamples/bs4_04.py
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Process finished with exit code 0

2. Output of print(soup.title)
The content inside head from <title> through </title>

<title>The Dormouse's story</title>

Output of print(soup.head)
The content in the document from <head> through </head>

<head><title>The Dormouse's story</title></head>

3. Output of print(soup.title.name)
soup.title.name returns the name of the title tag, i.e. the title inside <title>

title

4. Output of print(soup.title.string)
The string inside the title tag, i.e. the text between > and < in <title>The Dormouse's story</title>

The Dormouse's story

5. Output of print(soup.title.parent.name)
title.parent is the parent of the title tag, which is the head tag
head.name returns the name of that head tag, i.e. the head inside <head>

head

6. Output of print(soup.p)
If the document has several such tags, only the first is returned. soup.p above returns just the first p tag, from <p through </p>, even though the document contains several.

<p class="title"><b>The Dormouse's story</b></p>

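7. Output of print(soup.p["class"])
The value of class on the first p tag, i.e. the value assigned in class="title". Because class can hold several values, bs4 returns it as a list, ['title'], rather than the plain string 'title'.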

['title']

8. Output of print(soup.a)
The first a tag, from <a through </a>

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

9. Output of print(soup.find_all('a'))
Finds every a tag and prints them as a list

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

10. Output of print(soup.find(id='link3'))
Prints the tag whose id is link3

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

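11. soup.select()

Practice script: bs4_04.py
[Goal]
Use soup.select() to search the HTML and retrieve the target content.
The key is locating the wanted content precisely, which is done with the selector string inside the parentheses:

11.1 By class
Content in the HTML can be located by class, i.e. the class name assigned to the class attribute
soup.select('.class-name')

My first attempt, written literally as '.class', raised the error below. The traceback shows the actual cause: the method name was misspelled as selecet. bs4 treats an unknown attribute such as soup.selecet as a search for a <selecet> tag, finds none, and returns None; calling None then raises the TypeError. A correctly spelled soup.select('.class') does not raise. It returns an empty list, because no element has a class named class.

[Screenshot: TypeError: 'NoneType' object is not callable]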

head
class:*************************
Traceback (most recent call last):
  File "F:/01_python/01_testsamples/bs4_04.py", line 31, in <module>
    print(soup.selecet('.class'))
TypeError: 'NoneType' object is not callable

Process finished with exit code 1
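
The traceback above can be reproduced on a one-tag document. This sketch assumes the behavior of bs4's attribute lookup (an unknown attribute name is treated as a tag search) and contrasts the misspelled call with a correctly spelled select().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="sister" id="link1">Elsie</a>', "html.parser")

# Misspelled name: bs4 looks for a <selecet> tag, finds none, and returns None.
lookup = soup.selecet
message = None
try:
    soup.selecet(".class")
except TypeError as exc:
    message = str(exc)

no_match = soup.select(".class")    # valid call, but no element has class="class"
matches = soup.select(".sister")    # the class name that actually exists

print(lookup)
print(message)
print(no_match)
print(matches)
```

An empty list from select() means the selector matched nothing, while a TypeError about NoneType usually points to a misspelled method or tag name.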

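To understand the format: select() here locates elements by the value of their class attribute. In the HTML above, class="sister", so the target class name is sister and the selector is ".sister".

print(soup.select(".target-class-name"))

Example:

# Look up by class name: in the HTML above class="sister", so the target class name is sister; the format is print(soup.select(".target-class-name"))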
print('sister:*************************')
print(soup.select(".sister"))

The example above prints every element whose class is sister:

sister:*************************
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

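Reference notes
1. Decide what you are looking for
2. Pick the selector syntax that matches the target's characteristics

11.2 By id
To look up by id, note the format print(soup.select("#target-id")). If the target is id=link3, write "#link3"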

print('id=link3:*************************')
print(soup.select("#link3"))

Output:

[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

11.3 By tag
Look up by tag name; make sure the tag name is the one you want

print('tag:*************************')
print(soup.select('a'))

Output:

tag:*************************
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

11.4 Combined lookup
To find the element with id y inside tag x, use the format print(soup.select("x #y"))

print('id=link2 inside p:*************************')
print(soup.select("p #link2"))

Output:

id=link2 inside p:*************************
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

To sum up, bs4 offers three ways to filter (or find) something:
1 Direct access (soup.<tag name>)
2 find_all and find
3 CSS selectors through the select() method
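
The three approaches can be compared side by side on the same markup. This is a minimal sketch on a shortened copy of the sample paragraph; select_one() is the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# One element, three ways
direct = soup.a                           # 1. direct access: first <a>
found = soup.find("a", id="link1")        # 2. find
selected = soup.select_one("a#link1")     # 3. CSS selector

# Many elements, two ways
hrefs_find = [a["href"] for a in soup.find_all("a", class_="sister")]
hrefs_select = [a["href"] for a in soup.select("p.story > a.sister")]

print(direct == found == selected)
print(hrefs_find == hrefs_select)
print(hrefs_select)
```

Direct access is the shortest but only reaches the first tag; find_all() and select() are the options once filters or several results are needed.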

2.2 BeautifulSoup: using and testing some common features

Click here: I really like this author's writing style, and it is worth reading with a calm mind.

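2.2.1 Parsing speed in bs4

1. After reading the documentation carefully, I found that the choice of parser is critical for speed!
2. Installing the chardet module
Without chardet installed, parsing a single page took 7 seconds, not counting the time to fetch it (look up the install steps yourself). Trying it out was a roller coaster: with chardet installed, the 7 seconds dropped to an instant.
PS: a few days later it was back to 7 seconds, and after uninstalling chardet it was instant again!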
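
Swings like these are easier to pin down with a timer than by feel. The sketch below is only a measuring harness under stated assumptions: timed_parse is a hypothetical helper, the page is synthetic, and the numbers will differ by machine and by which detection libraries are installed. It compares raw bytes (bs4 has to detect the encoding), bytes with from_encoding supplied, and an already decoded string.

```python
import time
from bs4 import BeautifulSoup

# Synthetic page: 2000 short paragraphs, encoded to bytes like a downloaded response
raw = ("<html><body>" + "<p class='row'>sample text</p>" * 2000 + "</body></html>").encode("utf-8")


def timed_parse(markup, **kwargs):
    """Parse markup with html.parser and return (soup, elapsed_seconds)."""
    start = time.perf_counter()
    soup = BeautifulSoup(markup, "html.parser", **kwargs)
    return soup, time.perf_counter() - start


soup_detect, t_detect = timed_parse(raw)                        # encoding detected by bs4
soup_known, t_known = timed_parse(raw, from_encoding="utf-8")   # encoding supplied
soup_text, t_text = timed_parse(raw.decode("utf-8"))            # decoded before parsing

print(f"bytes, detected encoding: {t_detect:.4f}s")
print(f"bytes, from_encoding:     {t_known:.4f}s")
print(f"str, already decoded:     {t_text:.4f}s")
```

If the first timing is far larger than the other two on a real page, encoding detection is the likely cost, and supplying the encoding or decoding first avoids it.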

2.2.2 Import

Since BeautifulSoup moved to version 4, the import statement has changed to:
from bs4 import BeautifulSoup

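2.2.3 Input text format

On the encoding of the document being parsed:
1. The official docs say that whatever encoding the input has, it is converted to Unicode
2. In practice I sometimes found that the input had to be passed in as Unicode to get correct results. My tests confirmed it: the input must be decoded first. In Python 3, opening the file in text mode with an explicit encoding does the decoding:

    html_doc = open('test-Zhilian-list-page-sm1.html', 'r', encoding='utf-8').read()
    # ^ This HTML file is the source of a Zhaopin (智联招聘) search results page; you can save one yourself and try it.

As in the figure below, in the simplest scraper example, to use bs4 the the_page value returned by read() must then be decoded with decode('utf-8').

[Screenshot: the simplest scraper example]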
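
Both decoding routes can be tried without a network connection. In this sketch the page bytes, the temporary file, and the variable names are stand-ins: page_bytes plays the role of what read() returns from a response, and the file plays the role of a saved page.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Stand-in for the bytes returned by a response's read()
page_bytes = "<html><head><title>职位搜索</title></head><body><p>北京</p></body></html>".encode("utf-8")

# Route 1: decode the bytes before handing them to bs4
the_page = page_bytes.decode("utf-8")
soup = BeautifulSoup(the_page, "html.parser")

# Route 2: a saved page, opened in text mode with an explicit encoding
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as fh:
    fh.write(page_bytes)
    path = fh.name
try:
    with open(path, "r", encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f.read(), "html.parser")
finally:
    os.remove(path)  # clean up the temporary file

print(soup.title.string)
print(soup_from_file.p.string)
```

Decoding with the wrong encoding raises UnicodeDecodeError or produces garbled text, so the encoding passed here should match what the site actually serves.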

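2.2.4 Parsers in bs4

Another big pitfall: since bs moved to version 4, you should name the parser explicitly when creating the object (otherwise bs4 picks the best parser it finds installed and prints a warning), for example:
soup = BeautifulSoup(html_doc, 'lxml')
The pitfalls:
1. The well-known lxml turned out to be a big trap here,
2. because in my tests it silently dropped tags in the HTML that were not written to the standard, no matter how much that information mattered
PS: this parser issue cost me at least several hours before I found the cause.

[Summary] Remember: choose html5lib! In my experience it was not much slower (the official documentation does rate it as very slow, so measure on your own pages), and at least it is very lenient and will not delete your content!
soup = BeautifulSoup(html_doc, 'html5lib')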
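
Before settling on a parser, it can help to feed the same broken markup to each one and compare the results. This is a minimal sketch: parse_with_each is a hypothetical helper, the "<a></p>" fragment is a deliberately invalid input, and lxml and html5lib are reported as missing rather than assumed to be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # an unmatched closing tag: each parser repairs this differently


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: repaired markup, or None if the parser is not installed}."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            results[name] = None
    return results


results = parse_with_each(broken)
for name, out in results.items():
    print(f"{name:12} -> {out if out is not None else 'not installed'}")
```

Running the same comparison on a fragment of the real target page shows directly whether a parser drops the tags that matter.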

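2.2.5 Output format in bs4
# The official docs describe prettify() output in terms of UTF-8, but called without
# arguments it returns a Unicode string (str in Python 3). Pass an encoding to
# prettify() to get encoded bytes instead.

output = soup.prettify('utf-8')
print(repr(output))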
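
The difference between the two call forms shows up in the return type. This is a minimal sketch on a one-tag document containing a non-ASCII character.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")

as_text = soup.prettify()           # no encoding: returns str
as_bytes = soup.prettify("utf-8")   # encoding given: returns bytes

print(type(as_text).__name__)
print(type(as_bytes).__name__)
print(repr(as_bytes))
```

Use the str form for printing or further text processing, and the bytes form when writing to a file opened in binary mode.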
